Sharing my recent pandas study notes.
Example
axes

    import pandas as pd
    import numpy as np

    # Create a Series with 4 random numbers
    s = pd.Series(np.random.randn(4))
    print("The axes are:")
    print(s.axes)

Output:
    The axes are:
    [RangeIndex(start=0, stop=4, step=1)]

The result is a compact representation of the labels 0 through 3, that is, [0, 1, 2, 3].
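To see what the compact RangeIndex notation stands for, it can be expanded into a plain list. This is only a minimal sketch; the variable name `labels` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# A Series of 4 random values gets the default RangeIndex(start=0, stop=4, step=1).
s = pd.Series(np.random.randn(4))

# Expanding the compact RangeIndex gives the explicit row labels.
labels = list(s.index)
print(labels)
print(s.axes)
```

The stop value is exclusive, as with Python's built-in range().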
values

    import pandas as pd
    import numpy as np

    # Create a Series with 4 random numbers
    s = pd.Series(np.random.randn(4))
    print(s)
    print("The actual data series is:")
    print(s.values)

Output:
    0    1.787373
    1   -0.605159
    2    0.180477
    3   -0.140922
    dtype: float64
    The actual data series is:
    [ 1.78737302 -0.605158851  0.18047664 -0.1409218]

Example
axes

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print("Row axis labels and column axis labels are:")
    print(df.axes)

Output:
    Row axis labels and column axis labels are:
    [RangeIndex(start=0, stop=7, step=1), Index([u'Age',u'Name',u'Rating'], dtype='object')]

values: returns the actual data in the DataFrame as a NumPy ndarray.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print("The actual data in our data frame is:")
    print(df.values)
    print(type(df.values))

Output:
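    The actual data in our data frame is:
    [[25 'Tom' 4.23]
     [26 'James' 3.24]
     [25 'Ricky' 3.98]
     [23 'Vin' 2.56]
     [30 'Steve' 3.2]
     [29 'Minsu' 4.6]
     [23 'Jack' 3.8]]
    <class 'numpy.ndarray'>

There are many methods for computing descriptive statistics and related operations on a DataFrame. Most of them are aggregations such as sum() and mean(), but some, such as cumsum(), produce an object of the same size. In general these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer: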
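- index (axis=0, default)
- columns (axis=1)

Important parameter of describe(): include

- object: summarizes string columns
- number: summarizes numeric columns
- all: summarizes all columns together

Example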
    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Minsu', 'Jack',
                            'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
         'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
         'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe())
    print(df.describe(include=['object']))
    print(df.describe(include='all'))

Output:
    [OUTPUT-1]
                 Age     Rating
    count  12.000000  12.000000
    mean   31.833333   3.743333
    std     9.232682   0.661628
    min    23.000000   2.560000
    25%    25.000000   3.230000
    50%    29.500000   3.790000
    75%    35.500000   4.132500
    max    51.000000   4.800000
    [OUTPUT-2]
             Name
    count      12
    unique     12
    top     Ricky
    freq        1
    [OUTPUT-3]
                  Age   Name     Rating
    count   12.000000     12  12.000000
    unique        NaN     12        NaN
    top           NaN  Ricky        NaN
    freq          NaN      1        NaN
    mean    31.833333    NaN   3.743333
    std      9.232682    NaN   0.661628
    min     23.000000    NaN   2.560000
    25%     25.000000    NaN   3.230000
    50%     29.500000    NaN   3.790000
    75%     35.500000    NaN   4.132500
    max     51.000000    NaN   4.800000

Custom operations can be performed by passing a function, together with the appropriate number of arguments, to pipe(). Tip: pipe() returns whatever the function returns; in the example here the original DataFrame is left unchanged and a new DataFrame is returned.

Example
    import pandas as pd
    import numpy as np

    def adder(ele1, ele2):
        return ele1 + ele2

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    print(df.pipe(adder, 2))

Output:
           col1      col2      col3
    0 -1.002507  0.378120 -0.190947
    1  0.596924 -0.199729 -0.532229
    2 -2.944083 -1.040721 -1.952132
    3 -0.423783 -0.593296 -0.461555
    4  1.799471  0.552055  0.826377
           col1      col2      col3
    0  0.997493  2.378120  1.809053
    1  2.596924  1.800271  1.467771
    2 -0.944083  0.959279  0.047868
    3  1.576217  1.406704  1.538445
    4  3.799471  2.552055  2.826377

Official documentation:
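Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing:

    func(g(h(df), arg1=a), arg2=b, arg3=c)

you can write:

    df.pipe(h).pipe(g, arg1=a).pipe(func, arg2=b, arg3=c)

An arbitrary function can be applied along an axis of a DataFrame with apply() (see the axis parameter). By default the operation is performed column by column, with each column passed to the function as an array-like Series.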
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds) Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
result_type: {'expand', 'reduce', 'broadcast', None}, default None. These only act when axis=1.

- expand: list-like results will be turned into columns.
- reduce: returns a Series if possible.
- broadcast: results will be broadcast to the original shape of the DataFrame; the original index and columns will be retained.

Example 1
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    print(df)
    print(df.apply(np.mean))          # column means (axis=0)
    print(df.apply(np.mean, axis=1))  # row means

Output:
    [OUTPUT-1]
           col1      col2      col3
    0  0.071918 -0.933174 -0.458476
    1 -0.512443  0.455947  0.552345
    2 -0.035369  0.563239 -2.477740
    3 -1.204645  1.383545 -1.124751
    4  0.872696 -0.702149  0.360365
    [OUTPUT-2]
    col1   -0.161569
    col2    0.153481
    col3   -0.629651
    dtype: float64
    [OUTPUT-3]
    0   -0.439911
    1    0.165283
    2   -0.649957
    3   -0.315284
    4    0.176971
    dtype: float64

Example 2
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # Range (max - min) of each column
    df_1 = df.apply(lambda x: x.max() - x.min())
    print(df)
    print(df_1)

Output:
           col1      col2      col3
    0 -0.030535  0.165539  1.048389
    1  1.679696 -0.845977 -0.597818
    2  0.638122 -2.360784 -1.897171
    3  0.529273 -1.702445 -0.679899
    4 -0.287787  0.065303  0.120348
    col1    1.967483
    col2    2.526324
    col3    2.945560
    dtype: float64

See the official documentation for further details.
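The result_type option of apply() described earlier is easiest to see on a function that returns a list per row. The following is a small sketch on made-up data; the names `as_series` and `expanded` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Each row returns a list; by default the lists are kept as objects in a Series.
as_series = df.apply(lambda row: [row['col1'], row['col1'] + row['col2']], axis=1)

# With result_type='expand', the list-like results are turned into columns.
expanded = df.apply(lambda row: [row['col1'], row['col1'] + row['col2']],
                    axis=1, result_type='expand')

print(as_series)
print(expanded)
```

Without result_type the caller gets one list per row; with 'expand' the same values arrive as a DataFrame with integer column labels.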
The DataFrame method applymap() and the analogous Series method map() accept any Python function that takes a single value and returns a single value.
Series.map(arg, na_action=None)
Map values of Series according to input correspondence.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

- arg: function, collections.abc.Mapping subclass or Series
- na_action: {None, 'ignore'}, default None. If 'ignore', propagate NaN values, without passing them to the mapping correspondence.

DataFrame.applymap(func)
Apply a function to a DataFrame elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame. Returns a DataFrame.

- func: callable (Python function, returns a single value from a single value)
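As a quick sketch of the arg and na_action parameters of Series.map described above (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])

# Mapping with a dict: values that are not keys of the dict become NaN.
mapped = s.map({'cat': 'kitten', 'dog': 'puppy'})

# Mapping with a function; na_action='ignore' leaves NaN untouched
# instead of passing it to the function.
described = s.map('I am a {}'.format, na_action='ignore')

print(mapped)
print(described)
```

As far as I know, recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favour of DataFrame.map, so on a new installation the elementwise DataFrame call may need the newer name.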
Example 1
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # Multiply every value of one column (a Series) by 100
    df_col1 = df['col1'].map(lambda x: x * 100)
    print(df)
    print(df_col1)

Output:
           col1      col2      col3
    0  0.654253 -1.337191  0.609194
    1 -1.319780 -0.525150 -1.183926
    2  1.325296  2.053831 -0.414354
    3  0.947637 -1.838234 -0.615808
    4 -2.769647  0.517323  1.485486
    0     65.425259
    1   -131.977998
    2    132.529611
    3     94.763670
    4   -276.964732
    Name: col1, dtype: float64

Example 2
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
    # Multiply every element of the DataFrame by 100
    df_1 = df.applymap(lambda x: x * 100)
    print(df)
    print(df_1)

Output:
           col1      col2      col3
    0 -1.782354  1.762709  2.508919
    1  0.068842 -0.120896  0.224085
    2 -0.159717 -0.589034  0.606623
    3  2.335862 -0.977052 -0.814304
    4  0.544693  2.109825  0.925263
             col1        col2        col3
    0 -178.235406  176.270857  250.891855
    1    6.884196  -12.089569   22.408473
    2  -15.971700  -58.903353   60.662266
    3  233.586210  -97.705154  -81.430437
    4   54.469309  210.982525   92.526264

DataFrame.reindex(**kwargs)
Conform Series/DataFrame to new index with optional filling logic.
- keywords for axes: array-like, optional
- method: {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}. Method to use for filling holes in the reindexed DataFrame.
- copy: bool, default True
- level: int or name
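The method parameter of reindex is not exercised elsewhere in these notes, so here is a minimal sketch of forward filling on a date index. The frame `prices` and the index `wider` are hypothetical names with made-up values.

```python
import pandas as pd

# Prices observed on three days; `method` requires a monotonic index.
prices = pd.DataFrame({'price': [100.0, 101.0, 102.0]},
                      index=pd.date_range('2020-10-01', periods=3, freq='D'))

# A wider index: one day earlier and one day later than the data.
wider = pd.date_range('2020-09-30', periods=5, freq='D')

no_fill = prices.reindex(wider)                   # new labels get NaN
ffilled = prices.reindex(wider, method='ffill')   # carry the last valid value forward

print(no_fill)
print(ffilled)
```

The first day stays NaN even with 'ffill', because there is no earlier observation to propagate.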
DataFrame.reindex supports two calling conventions:

- (index=index_labels, columns=column_labels, ...)
- (labels, axis={'index', 'columns'}, ...)

Official example
    # Create a DataFrame with some fictional data.
    import pandas as pd

    idx = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
    # index is a keyword argument of DataFrame(), not a key of the data dict
    df = pd.DataFrame({
        'http_status': [200, 200, 404, 404, 301],
        'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
        index=idx)

    new_idx = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
    df_1 = df.reindex(new_idx)
    df_2 = df.reindex(new_idx, fill_value=0)
    df_3 = df.reindex(['http_status', 'user_agent'], axis='columns')
    print(df)
    print(df_1)
    print(df_2)
    print(df_3)

Output:
    [df]
               http_status  response_time
    Firefox            200           0.04
    Chrome             200           0.02
    Safari             404           0.07
    IE10               404           0.08
    Konqueror          301           1.00
    [df_1]
                   http_status  response_time
    Safari               404.0           0.07
    Iceweasel              NaN            NaN
    Comodo Dragon          NaN            NaN
    IE10                 404.0           0.08
    Chrome               200.0           0.02
    [df_2]
                   http_status  response_time
    Safari                 404           0.07
    Iceweasel                0           0.00
    Comodo Dragon            0           0.00
    IE10                   404           0.08
    Chrome                 200           0.02
    [df_3]
               http_status  user_agent
    Firefox            200         NaN
    Chrome             200         NaN
    Safari             404         NaN
    IE10               404         NaN
    Konqueror          301         NaN

DataFrame.reindex_like(other, method=None, copy=True, limit=None, tolerance=None)
Official example
    import pandas as pd

    df1 = pd.DataFrame(
        [[24.3, 75.7, 'high'], [31, 87.8, 'high'], [22, 71.6, 'medium'], [35, 95, 'medium']],
        columns=['temp_celsius', 'temp_fahrenheit', 'windspeed'],
        index=pd.date_range(start='2020-10-09', end='2020-10-12', freq='D'))
    df2 = pd.DataFrame(
        [[28, 'low'], [30, 'low'], [35.1, 'medium']],
        columns=['temp_celsius', 'windspeed'],
        index=pd.date_range(start='2020-10-09', end='2020-10-11', freq='D'))
    print('df1:\n', df1)
    print('df2:\n', df2)

    # Give df2 the same row and column labels as df1
    df3 = df2.reindex_like(df1)
    print('df3:\n', df3)

Output:
    df1:
                 temp_celsius  temp_fahrenheit windspeed
    2020-10-09          24.3             75.7      high
    2020-10-10          31.0             87.8      high
    2020-10-11          22.0             71.6    medium
    2020-10-12          35.0             95.0    medium
    df2:
                 temp_celsius windspeed
    2020-10-09          28.0       low
    2020-10-10          30.0       low
    2020-10-11          35.1    medium
    df3:
                 temp_celsius  temp_fahrenheit windspeed
    2020-10-09          28.0              NaN       low
    2020-10-10          30.0              NaN       low
    2020-10-11          35.1              NaN    medium
    2020-10-12           NaN              NaN       NaN

Basic iteration over pandas objects depends on the type. When iterating over a Series, it is treated as array-like. Basic iteration produces:
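- Series: the values
- DataFrame: the column labels

Specific rules

Iterating over a DataFrame directly yields the column names. Other ways to iterate:

- iteritems(): iterate over (key, value) pairs (removed in pandas 2.0; use items() instead)
- iterrows(): iterate over (index, Series) pairs
- itertuples(): iterate over the rows as namedtuples

Example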
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'],
                      index=pd.date_range(start='2020-10-09', end='2020-10-12', freq='D'))
    print(df)

    # 1 - iterate over the column names
    for col in df:
        print(col)

    # 2 - iterate over (key, value) pairs
    # items() replaces iteritems(), which was removed in pandas 2.0
    for key, value in df.items():
        print(key, value)

    # 3 - iterate over (index, Series) pairs
    for i, s in df.iterrows():
        print(i, s)

    # 4 - iterate over the rows as namedtuples
    for row in df.itertuples():
        print(row)

Output:
A B C 2020-10-09 -1.030271 -0.906981 -0.659338 2020-10-10 0.767799 -2.445025 -0.174059 2020-10-11 2.696855 -0.277801 0.759680 2020-10-12 0.062061 -0.026966 -1.539570 # 1-迭代列名 A B C # 2-迭代(key,value) A 2020-10-09 -1.030271 2020-10-10 0.767799 2020-10-11 2.696855 2020-10-12 0.062061 Freq: D, Name: A, dtype: float64 B 2020-10-09 -0.906981 2020-10-10 -2.445025 2020-10-11 -0.277801 2020-10-12 -0.026966 Freq: D, Name: B, dtype: float64 C 2020-10-09 -0.659338 2020-10-10 -0.174059 2020-10-11 0.759680 2020-10-12 -1.539570 Freq: D, Name: C, dtype: float64 # 3-迭代(index, Series) 2020-10-09 00:00:00 A -1.030271 B -0.906981 C -0.659338 Name: 2020-10-09 00:00:00, dtype: float64 2020-10-10 00:00:00 A 0.767799 B -2.445025 C -0.174059 Name: 2020-10-10 00:00:00, dtype: float64 2020-10-11 00:00:00 A 2.696855 B -0.277801 C 0.759680 Name: 2020-10-11 00:00:00, dtype: float64 2020-10-12 00:00:00 A 0.062061 B -0.026966 C -1.539570 Name: 2020-10-12 00:00:00, dtype: float64 # 4-迭代namedtuples Pandas(Index=Timestamp('2020-10-09 00:00:00', freq='D'), A=-1.0302714533530397, B=-0.9069810243015767, C=-0.659338259231412) Pandas(Index=Timestamp('2020-10-10 00:00:00', freq='D'), A=0.7677989980168033, B=-2.445024835320876, C=-0.1740588651490762) Pandas(Index=Timestamp('2020-10-11 00:00:00', freq='D'), A=2.696855141735038, B=-0.2778005750782568, C=0.7596804345501883) Pandas(Index=Timestamp('2020-10-12 00:00:00', freq='D'), A=0.062061332307071566, B=-0.02696577075586974, C=-1.5395702171663705)Pandas有两类排序方式:
- .sort_index(): sort by label
- .sort_values(): sort by the actual values

DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.
- axis: {0/'index', 1/'columns'}
- ascending: bool or list of bools, default True
- inplace: bool, default False
- kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
- sort_remaining: bool, default True. If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
- ignore_index: bool, default False. If True, the resulting axis will be labeled 0, 1, ..., n-1.
- key: callable, optional. If not None, apply the key function to the index values before sorting.

Example
    import pandas as pd

    df = pd.DataFrame({'b': [1, 2, 3, 4], 'A': [4, 3, 2, 1]}, index=['A', 'C', 'b', 'd'])
    # Case-insensitive sort of the row labels
    df1 = df.sort_index(key=lambda x: x.str.lower())
    # Case-insensitive sort of the column labels
    df2 = df.sort_index(axis=1, key=lambda x: x.str.lower())
    print('df:\n', df)
    print('df1:\n', df1)
    print('df2:\n', df2)

Output:
    df:
        b  A
    A  1  4
    C  2  3
    b  3  2
    d  4  1
    df1:
        b  A
    A  1  4
    b  3  2
    C  2  3
    d  4  1
    df2:
        A  b
    A  4  1
    C  3  2
    b  2  3
    d  1  4

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
Sort by the values along either axis.
- by: str or list of str
- key: callable, optional. Apply the key function to the values before sorting.
- na_position: {'first', 'last'}, default 'last'. Puts NaNs at the beginning if 'first'; 'last' puts NaNs at the end.

Example
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
        'col2': [2, 1, 9, 8, 7, 4],
        'col3': [0, 1, 9, 4, 2, 3],
        'col4': ['a', 'B', 'a', 'D', 'c', 'd']
    })
    # Case-insensitive, descending sort by col4 then col1, with NaN placed first
    df1 = df.sort_values(by=['col4', 'col1'], ascending=False, na_position='first',
                         key=lambda col: col.str.lower())
    print('df:\n', df)
    print('df1:\n', df1)

Output:
    df:
       col1  col2  col3 col4
    0    A     2     0    a
    1    A     1     1    B
    2    B     9     9    a
    3  NaN     8     4    D
    4    D     7     2    c
    5    C     4     3    d
    df1:
       col1  col2  col3 col4
    3  NaN     8     4    D
    5    C     4     3    d
    4    D     7     2    c
    1    A     1     1    B
    2    B     9     9    a
    0    A     2     0    a

sort_values() offers three sorting algorithms through the kind parameter: mergesort, heapsort and quicksort. How each algorithm works is left for later study.
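One practical difference between the algorithms can be seen without knowing their internals: mergesort is stable, meaning rows with equal keys keep their original relative order. A small sketch on made-up data (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'], 'value': [1, 2, 3, 4]})

# mergesort is a stable algorithm: within each group, rows stay in
# the order in which they appeared in the original frame.
stable = df.sort_values(by='group', kind='mergesort')
print(stable)
```

According to the pandas documentation, the kind option only applies when sorting on a single column or label, which is why the sketch sorts by one column.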
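pandas provides a set of string functions that make it easy to operate on string data while ignoring NaN values. Almost all of them mirror Python's built-in string methods, and they are reached through the .str accessor of a Series.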
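A short sketch of how the string methods treat missing values; the sample Series is made up, and `na=False` is passed to contains() so that the boolean result does not depend on how a given pandas version represents a missing match.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom ', ' William Rick', np.nan, 'Alber@t'])

lowered = s.str.lower()                  # the NaN entry stays missing
lengths = s.str.len()                    # the NaN entry stays missing
stripped = s.str.strip()                 # remove surrounding whitespace
has_at = s.str.contains('@', na=False)   # treat missing values as "no match"

print(lowered)
print(lengths)
print(stripped)
print(has_at)
```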
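| No. | expression | description |
|---|---|---|
| 1 | lower() and upper() | convert case |
| 2 | len() | |
| 3 | strip() | remove newlines and spaces from both sides |
| 4 | split(' ') | split the string; splits on whitespace by default |
| 5 | cat(sep=' ') | concatenate the elements with the given separator |
| 6 | get_dummies() | |
| 7 | contains(pattern) | check whether the pattern is contained; returns booleans |
| 8 | replace(a, b) | replace a with b |
| 9 | repeat(value) | repeat each element the given number of times |
| 10 | count(pattern) | |
| 11 | startswith(pattern) and endswith(pattern) | |
| 12 | find(pattern) | return the position of the first occurrence of pattern |
| 13 | swapcase() | swap upper and lower case |
| 14 | islower(), isupper() and isnumeric() | |

Example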
    import pandas as pd
    import numpy as np

    # cat(sep=pattern)
    s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
    print('cat(sep=pattern)\n', s.str.cat(sep=' <> '))

    # get_dummies()
    print('get_dummies()\n', s.str.get_dummies())

Output:
    cat(sep=pattern)
     Tom <> William Rick <> John <> Alber@t
    get_dummies()
        Alber@t  John  Tom  William Rick
    0        0     0    1             0
    1        0     0    0             1
    2        0     1    0             0
    3        1     0    0             0

pandas provides an API to customize some aspects of its behavior:
- get_option(): look up the value of an option
- set_option(): change the current value
- reset_option(): reset to the default value
- describe_option()
- option_context()

Frequently used options
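| No. | param | default | description |
|---|---|---|---|
| 1 | display.max_rows | 60 | maximum number of rows displayed |
| 2 | display.max_columns | 20 | maximum number of columns displayed |
| 3 | display.min_rows | 10 | minimum number of rows displayed |
| 4 | display.precision | 6 | precision used when displaying decimal numbers |

option_context(): a context manager for temporarily setting options inside a with statement.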
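    import pandas as pd

    with pd.option_context('display.max_rows', 10):
        # Inside the block the temporary value applies
        print(pd.get_option('display.max_rows'))
    # Outside the block the previous value is restored
    print(pd.get_option('display.max_rows'))

Output: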
    10
    60

pandas currently supports three types of multi-axis indexing:
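| No. | expression | description | access methods |
|---|---|---|---|
| 1 | .loc[] | label-based | 1) a single scalar label; 2) a list of labels; 3) a slice object; 4) a boolean array |
| 2 | .iloc[] | integer-based | 1) an integer; 2) a list of integers; 3) a range of values (slice) |
| 3 | .ix[] | label- and integer-based (deprecated, and removed in pandas 1.0) | |

Attribute access: a column can be selected with the attribute operator `.`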
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    # Select column A through attribute access
    print(df.A)

Output:
    0    0.058104
    1   -1.791556
    2    1.314500
    3   -0.393179
    4   -0.123881
    5   -0.470718
    6    0.404181
    7    0.331097
    Name: A, dtype: float64

Output:
    0         NaN
    1    1.000000
    2    0.500000
    3    0.333333
    4    0.250000
    5   -0.200000
    dtype: float64
              A         B
    0 -0.265247 -0.457053
    1 -1.457512 -0.789424
    2 -0.001411  0.179580
    3  1.064182 -2.083008
    4 -1.826242 -0.793300
                A          B
    0         NaN        NaN
    1    4.494926   0.727203
    2   -0.999032  -1.227482
    3 -755.284391 -12.599323
    4   -2.716100  -0.619157
        A           B
    0 NaN   -0.766848
    1 NaN   -1.650977
    2 NaN   -9.941456
    3 NaN  174.946646
    4 NaN   -2.333735

Covariance

| object | expression | description |
|---|---|---|
| Series | s1.cov(s2) | covariance between two Series |
| DataFrame | df.cov() | covariance between all pairs of columns |

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(8, 6), columns=['a', 'b', 'c', 'd', 'e', 'f'])
    # Show two decimal places in the printed output
    pd.set_option('display.precision', 2)
    print(df)
    print(df.cov())

Output:
          a     b     c     d     e     f
    0  0.60  0.90  0.32  0.25  0.89 -0.40
    1  0.03 -0.62 -0.26 -0.66 -0.57 -0.23
    2 -0.80 -1.70  1.37  1.99 -0.39  1.40
    3  1.32 -0.04  0.08 -0.26 -1.55  0.61
    4  1.50  0.80 -0.15 -0.08 -0.04  2.41
    5  0.39 -0.53  1.28  0.13 -0.29  0.06
    6  1.33  0.12  0.89 -0.46 -0.09  2.04
    7 -0.26  0.20 -1.01  0.82 -0.02  1.17
          a     b         c     d         e         f
    a  0.70  0.48 -5.45e-02 -0.51 -6.33e-02  2.63e-01
    b  0.48  0.71 -3.56e-01 -0.37  2.52e-01  7.28e-02
    c -0.05 -0.36  6.78e-01  0.19 -2.60e-03  5.00e-03
    d -0.51 -0.37  1.93e-01  0.72  9.47e-02  1.41e-01
    e -0.06  0.25 -2.60e-03  0.09  4.62e-01 -3.29e-02
    f  0.26  0.07  5.00e-03  0.14 -3.29e-02  1.09e+00

Correlation

| object | expression | description |
|---|---|---|
| Series | s1.corr(s2) | linear correlation between two Series |
| DataFrame | df.corr() | correlation between every pair of columns |

method:
- pearson (default)
- spearman
- kendall

Ranking

Series.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
DataFrame.rank(axis=0, method=‘average’, numeric_only=None, na_option=‘keep’, ascending=True, pct=False)
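pct: whether or not to display the returned rankings in percentile form

For working with numerical data, pandas provides several variants of window statistics, such as rolling, expanding and exponentially weighted moving windows. These include sum, mean, median, variance, covariance, correlation and so on.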
For working with data, a number of window functions are provided for computing common window or rolling statistics. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis. The rolling() and expanding() functions can be used directly from DataFrameGroupby objects.
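The last sentence, about using rolling() and expanding() on GroupBy objects, can be sketched as follows. The data and the names `team` and `points` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
    'points': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Rolling mean over 2 observations, computed separately inside each team.
rolling_mean = df.groupby('team')['points'].rolling(window=2).mean()

# Expanding (cumulative) sum inside each team.
expanding_sum = df.groupby('team')['points'].expanding().sum()

print(rolling_mean)
print(expanding_sum)
```

The window restarts at each group boundary, so the first row of every team has no rolling mean even though the previous team's rows precede it in the frame.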
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
- window: the number of observations used for calculating the statistic.
- min_periods: minimum number of observations in window required to have a value (otherwise result is NA). The default is 1 for offset-based windows; for an integer window it defaults to the window size.
- center: set the labels at the center of the window
- win_type: provide a window type. If None, all points are evenly weighted.
- axis: default 0
- closed: make the interval closed on the 'right', 'left', 'both' or

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        np.random.randn(10, 4),
        index=pd.date_range(start='2020-10-01', periods=10),
        columns=['a', 'b', 'c', 'd']
    )
    print(df)
    print('rolling object is:\n', df.rolling(window=3))
    df1 = df.rolling(window=3).mean()         # 3-row rolling mean of each column
    df2 = df.mean()                           # plain column means, for comparison
    df3 = df.rolling(window=3, axis=1).sum()  # rolling sum across columns
    print(df1)
    print(df2)
    print(df3)

Output:
a b c d 2020-10-01 0.359021 0.975536 -0.410890 0.492394 2020-10-02 0.698119 -0.930699 0.280391 -0.556639 2020-10-03 -1.253823 -0.112911 -0.752055 -2.453324 2020-10-04 -1.598137 -0.279228 1.649673 -1.083345 2020-10-05 -0.194925 0.054404 0.329581 -0.240446 2020-10-06 0.345830 -0.431603 0.746752 -1.536757 2020-10-07 0.604672 -0.411267 -1.343458 0.029782 2020-10-08 -0.847642 -0.685667 0.274624 1.127408 2020-10-09 0.324988 -0.634140 0.426157 1.001843 2020-10-10 0.197737 1.788079 -1.134500 -0.051234 rolling object is: Rolling [window=3,center=False,axis=0] a b c d 2020-10-01 NaN NaN NaN NaN 2020-10-02 NaN NaN NaN NaN 2020-10-03 -0.065561 -0.022692 -0.294184 -0.839190 2020-10-04 -0.717947 -0.440946 0.392670 -1.364436 2020-10-05 -1.015628 -0.112578 0.409066 -1.259038 2020-10-06 -0.482411 -0.218809 0.908669 -0.953516 2020-10-07 0.251859 -0.262822 -0.089042 -0.582474 2020-10-08 0.034287 -0.509512 -0.107361 -0.126522 2020-10-09 0.027339 -0.577024 -0.214226 0.719678 2020-10-10 -0.108306 0.156091 -0.144573 0.692672 a -0.136416 b -0.066750 c 0.006628 d -0.327032 dtype: float64 a b c d 2020-10-01 NaN NaN 0.923668 1.057040 2020-10-02 NaN NaN 0.047811 -1.206947 2020-10-03 NaN NaN -2.118789 -3.318290 2020-10-04 NaN NaN -0.227692 0.287100 2020-10-05 NaN NaN 0.189060 0.143539 2020-10-06 NaN NaN 0.660979 -1.221608 2020-10-07 NaN NaN -1.150053 -1.724943 2020-10-08 NaN NaN -1.258685 0.716365 2020-10-09 NaN NaN 0.117005 0.793859 2020-10-10 NaN NaN 0.851316 0.602345DataFrame.expanding(min_periods=1, center=None, axis=0) Provide expanding transformations. Returns: a Window sub-classed for the particular operation
import pandas as pd import numpy as np df = pd.DataFrame( np.random.randn(10,4), index = pd.date_range(start='2020-10-01',periods=10), columns = ['a','b','c','d'] ) print(df) df1 = df.expanding(min_periods=3).sum() df2 = df.expanding(min_periods=2, axis=1).sum() print(df1) print(df2) a b c d 2020-10-01 0.303716 0.435884 -1.405003 -2.633900 2020-10-02 0.155709 -0.272056 -1.940426 0.539937 2020-10-03 -1.184132 -0.539030 0.496024 -0.224957 2020-10-04 1.709428 -1.639442 -0.509769 -0.643674 2020-10-05 -0.091363 -1.316263 0.863490 -1.228090 2020-10-06 0.140226 -0.439552 1.356944 -0.073533 2020-10-07 0.150266 -1.140866 -1.017271 1.922022 2020-10-08 1.184664 -1.242892 0.424909 2.071605 2020-10-09 0.416593 0.090358 -0.160895 0.172974 2020-10-10 -1.044040 -1.205647 0.274271 -1.460815 a b c d 2020-10-01 NaN NaN NaN NaN 2020-10-02 NaN NaN NaN NaN 2020-10-03 -0.724706 -0.375202 -2.849404 -2.318920 2020-10-04 0.984722 -2.014644 -3.359173 -2.962593 2020-10-05 0.893359 -3.330907 -2.495683 -4.190683 2020-10-06 1.033585 -3.770459 -1.138739 -4.264216 2020-10-07 1.183851 -4.911325 -2.156011 -2.342194 2020-10-08 2.368516 -6.154216 -1.731102 -0.270588 2020-10-09 2.785109 -6.063858 -1.891997 -0.097614 2020-10-10 1.741068 -7.269506 -1.617727 -1.558429 a b c d 2020-10-01 NaN 0.739601 -0.665402 -3.299302 2020-10-02 NaN -0.116347 -2.056773 -1.516836 2020-10-03 NaN -1.723162 -1.227137 -1.452094 2020-10-04 NaN 0.069986 -0.439783 -1.083457 2020-10-05 NaN -1.407625 -0.544136 -1.772225 2020-10-06 NaN -0.299326 1.057618 0.984086 2020-10-07 NaN -0.990600 -2.007871 -0.085849 2020-10-08 NaN -0.058227 0.366681 2.438287 2020-10-09 NaN 0.506951 0.346056 0.519030 2020-10-10 NaN -2.249688 -1.975417 -3.436232DataFrame.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None) Provide exponential weighted functions. Available EW functions: mean(), var(), corr(), cov() Exactly one parameter: com, span, halflife, or alpha must be provided.
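Since exactly one of com, span, halflife or alpha is given, it helps to see that they are different ways of stating the same smoothing factor. The sketch below assumes the conversions given in the pandas documentation, alpha = 1 / (1 + com) and alpha = 2 / (span + 1); the short Series is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0])

# com=0.5 corresponds to alpha = 1 / (1 + com) = 2/3.
by_com = s.ewm(com=0.5).mean()
by_alpha = s.ewm(alpha=2 / 3).mean()

# span=2 corresponds to alpha = 2 / (span + 1) = 2/3 as well.
by_span = s.ewm(span=2).mean()

# With the default adjust=True, the second value is the weighted average
# (x_1 + (1 - alpha) * x_0) / (1 + (1 - alpha)).
alpha = 2 / 3
manual_y1 = (1.0 + (1 - alpha) * 0.0) / (1 + (1 - alpha))

print(by_com)
print(manual_y1)
```

All three calls produce the same result, and the hand-computed second value agrees with pandas.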
- com: specify decay in terms of center of mass
- span: specify decay in terms of span
- halflife: specify decay in terms of half-life
- alpha: specify smoothing factor $\alpha$ directly
- min_periods: minimum number of observations in window required to have a value
- adjust: divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average)

When adjust is True, the EW function is calculated using weights $\omega_i=(1-\alpha)^i$. For example, the EW moving average of the series $[x_0, x_1, \cdots, x_t]$ would be:

$$y_t=\frac{x_t+(1-\alpha)x_{t-1}+(1-\alpha)^2x_{t-2}+\cdots+(1-\alpha)^tx_0}{1+(1-\alpha)+(1-\alpha)^2+\cdots+(1-\alpha)^t}$$

When adjust is False, the exponentially weighted function is calculated recursively:

$$y_0=x_0$$

$$y_t=(1-\alpha)y_{t-1}+\alpha x_t$$

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
    print(df)
    df1 = df.ewm(com=0.5).mean()
    print(df1)

    # Specifying times with a timedelta halflife when computing mean
    times = pd.date_range(start='2020-10-12', periods=5)
    df2 = df.ewm(halflife='4 days', times=pd.DatetimeIndex(times)).mean()
    print(df2)

Output:
         B
    0  0.0
    1  1.0
    2  2.0
    3  NaN
    4  4.0
              B
    0  0.000000
    1  0.750000
    2  1.615385
    3  1.615385
    4  3.670213
              B
    0  0.000000
    1  0.543214
    2  1.114950
    3  1.114950
    4  2.144696

Output:
A B C D 2020-10-14 -0.293505 0.245022 -1.020912 0.249647 2020-10-15 -0.370049 0.253555 0.431085 0.456073 2020-10-16 -0.526489 3.482717 -0.082019 0.169488 2020-10-17 1.310946 0.033661 -2.182524 -0.056167 2020-10-18 0.245885 0.027550 1.048386 0.091607 2020-10-19 1.241919 -0.629410 0.101962 0.744189 2020-10-20 -0.062255 -0.602408 0.768059 -0.771600 2020-10-21 -1.114848 -0.955441 1.308255 0.023233 2020-10-22 -0.059666 -0.860865 1.058340 -0.236879 2020-10-23 0.314197 0.995166 -0.703913 0.183485 Rolling [window=3,min_periods=1,center=False,axis=0]输出结果:
A B C D 2020-10-14 0.081500 0.472049 2.386231 -0.789650 2020-10-15 1.135092 0.288560 0.304266 0.600378 2020-10-16 -1.701276 0.851731 1.070779 0.278490 2020-10-17 1.021226 0.535126 -0.692457 0.386223 2020-10-18 0.118333 0.971667 -1.028922 1.159587 2020-10-19 -0.645824 -0.640767 0.427543 0.689348 2020-10-20 1.722740 -0.211914 -0.217482 -0.011830 2020-10-21 -0.394792 -0.047677 0.510992 0.741355 2020-10-22 -0.273230 -0.313764 0.712516 -0.489515 2020-10-23 0.389186 0.070262 -2.145313 -0.792609 A B C D 2020-10-14 NaN NaN NaN NaN 2020-10-15 NaN NaN NaN NaN 2020-10-16 -0.484684 1.612340 3.761276 0.089219 2020-10-17 0.455043 1.675417 0.682588 1.265092 2020-10-18 -0.561716 2.358524 -0.650599 1.824301 2020-10-19 0.493735 0.866026 -1.293835 2.235158 2020-10-20 1.195249 0.118986 -0.818860 1.837105 2020-10-21 0.682124 -0.900358 0.721053 1.418873 2020-10-22 1.054717 -0.573354 1.006026 0.240011 2020-10-23 -0.278837 -0.291179 -0.921806 -0.540769 2020-10-14 NaN 2020-10-15 NaN 2020-10-16 -0.484684 2020-10-17 0.455043 2020-10-18 -0.561716 2020-10-19 0.493735 2020-10-20 1.195249 2020-10-21 0.682124 2020-10-22 1.054717 2020-10-23 -0.278837 Freq: D, Name: A, dtype: float64 sum mean 2020-10-14 NaN NaN 2020-10-15 NaN NaN 2020-10-16 -0.484684 -0.161561 2020-10-17 0.455043 0.151681 2020-10-18 -0.561716 -0.187239 2020-10-19 0.493735 0.164578 2020-10-20 1.195249 0.398416 2020-10-21 0.682124 0.227375 2020-10-22 1.054717 0.351572 2020-10-23 -0.278837 -0.092946 A B sum mean sum mean 2020-10-14 NaN NaN NaN NaN 2020-10-15 NaN NaN NaN NaN 2020-10-16 -0.484684 -0.161561 1.612340 0.537447 2020-10-17 0.455043 0.151681 1.675417 0.558472 2020-10-18 -0.561716 -0.187239 2.358524 0.786175 2020-10-19 0.493735 0.164578 0.866026 0.288675 2020-10-20 1.195249 0.398416 0.118986 0.039662 2020-10-21 0.682124 0.227375 -0.900358 -0.300119 2020-10-22 1.054717 0.351572 -0.573354 -0.191118 2020-10-23 -0.278837 -0.092946 -0.291179 -0.097060 A B 2020-10-14 NaN NaN 2020-10-15 NaN NaN 2020-10-16 -0.161561 1.612340 
    2020-10-17  0.151681  1.675417
    2020-10-18 -0.187239  2.358524
    2020-10-19  0.164578  0.866026
    2020-10-20  0.398416  0.118986
    2020-10-21  0.227375 -0.900358
    2020-10-22  0.351572 -0.573354
    2020-10-23 -0.092946 -0.291179

In many situations we split the data into sets and apply some function to each subset. In the apply step, the following operations can be performed:
- Aggregation
- Transformation: perform some group-specific operation
- Filtration

    import pandas as pd

    ipl_data = {
        'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Kings',
                 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 798, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]
    }
    df = pd.DataFrame(ipl_data)
    print(df)

    # View the groups
    print(df.groupby('Team').groups)

    # Group by multiple columns
    print(df.groupby(['Team', 'Year']).groups)

    # Iterate over the groups
    grouped = df.groupby('Year')
    for name, group in grouped:
        print(name)
        print(group)

Output:
Team Rank Year Points 0 Riders 1 2014 876 1 Riders 2 2015 798 2 Devils 2 2014 863 3 Devils 3 2015 673 4 Kings 3 2014 741 5 Kings 4 2015 812 6 Kings 1 2016 756 7 Kings 1 2017 788 8 Riders 2 2016 694 9 Royals 4 2014 701 10 Royals 1 2015 804 11 Riders 2 2017 690 {'Devils': [2, 3], 'Kings': [4, 5, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals': [9, 10]} {('Devils', 2014): [2], ('Devils', 2015): [3], ('Kings', 2014): [4], ('Kings', 2015): [5], ('Kings', 2016): [6], ('Kings', 2017): [7], ('Riders', 2014): [0], ('Riders', 2015): [1], ('Riders', 2016): [8], ('Riders', 2017): [11], ('Royals', 2014): [9], ('Royals', 2015): [10]} #迭代遍历分组 2014 Team Rank Year Points 0 Riders 1 2014 876 2 Devils 2 2014 863 4 Kings 3 2014 741 9 Royals 4 2014 701 2015 Team Rank Year Points 1 Riders 2 2015 798 3 Devils 3 2015 673 5 Kings 4 2015 812 10 Royals 1 2015 804 2016 Team Rank Year Points 6 Kings 1 2016 756 8 Riders 2 2016 694 2017 Team Rank Year Points 7 Kings 1 2017 788 11 Riders 2 2017 690聚合、转换、过滤
    import numpy as np

    # Aggregation
    print(df.groupby('Team')['Points'].agg([np.sum, np.mean, np.std]))

    # Transformation
    grouped = df.groupby('Team')
    score = lambda x: (x - x.mean()) / x.std() * 10
    print(grouped.apply(lambda x: (x - x.mean()) / x.std() * 10))
    print(grouped.transform(score))

Output:
    # Aggregation
             sum    mean         std
    Team
    Devils  1536  768.00  134.350288
    Kings   3097  774.25   31.899582
    Riders  3058  764.50   89.582364
    Royals  1505  752.50   72.831998
    # Transformation
           Points       Rank  Team       Year
    0   12.446646 -15.000000   NaN -11.618950
    1    3.739575   5.000000   NaN  -3.872983
    2    7.071068  -7.071068   NaN  -7.071068
    3   -7.071068   7.071068   NaN   7.071068
    4  -10.423334   5.000000   NaN -11.618950
    5   11.834011  11.666667   NaN  -3.872983
    6   -5.721078  -8.333333   NaN   3.872983
    7    4.310401  -8.333333   NaN  11.618950
    8   -7.869853   5.000000   NaN   3.872983
    9   -7.071068   7.071068   NaN  -7.071068
    10   7.071068  -7.071068   NaN   7.071068
    11  -8.316369   5.000000   NaN  11.618950
             Rank       Year     Points
    0  -15.000000 -11.618950  12.446646
    1    5.000000  -3.872983   3.739575
    2   -7.071068  -7.071068   7.071068
    3    7.071068   7.071068  -7.071068
    4    5.000000 -11.618950 -10.423334
    5   11.666667  -3.872983  11.834011
    6   -8.333333   3.872983  -5.721078
    7   -8.333333  11.618950   4.310401
    8    5.000000   3.872983  -7.869853
    9    7.071068  -7.071068  -7.071068
    10  -7.071068   7.071068   7.071068
    11   5.000000  11.618950  -8.316369

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
- right: object to merge with
- how: {'left', 'right', 'outer', 'inner'}, default 'inner'
- on: column or index level names to join on.
- left_on: column or index level names to join on in the left DataFrame.
- right_on: column or index level names to join on in the right DataFrame.
- left_index: use the index from the left DataFrame as the join key.
- right_index: use the index from the right DataFrame as the join key.
- sort: sort the join keys lexicographically in the result DataFrame.
- suffixes: a length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively.
- copy: if False, avoid copy if possible.
- indicator: if True, adds a column to the output DataFrame called "_merge" with information on the source of each row. The column will have a Categorical type with the value of "left_only" for observations whose merge key only appears in the left DataFrame, "right_only" for observations whose merge key only appears in the right DataFrame, and "both" if the observation's merge key is found in both DataFrames.
- validate: if specified, checks if merge is of specified type.
- "one_to_one" or "1:1": check if merge keys are unique in both left and right datasets.
- "one_to_many" or "1:m": check if merge keys are unique in left dataset.
- "many_to_one" or "m:1": check if merge keys are unique in right dataset.
- "many_to_many" or "m:m": allowed, but does not result in checks.

concat()

pandas.concat()
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
- objs: a sequence or mapping of Series or DataFrame objects. If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected. Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.
- axis
- join: "inner" or "outer", default "outer"
- ignore_index: if True, the resulting axis will be labeled 0, ..., n-1.
- keys: if multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.
- levels: specific levels to use for constructing a MultiIndex.
- names: names for the levels in the resulting hierarchical index.
- verify_integrity: check whether the new concatenated axis contains duplicates.
- copy: if False, do not copy data unnecessarily.
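The notes list the merge() and concat() parameters without running them, so here is a small sketch exercising a few of them. The frames `left` and `right` and their columns are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# Outer join on 'key'; indicator=True adds a '_merge' column, and
# validate='1:1' checks that the keys are unique on both sides.
merged = left.merge(right, how='outer', on='key', indicator=True, validate='1:1')

# Stack two frames vertically; keys= builds a hierarchical index,
# while ignore_index=True relabels the rows 0..n-1 instead.
stacked = pd.concat([left, left], keys=['first', 'second'])
renumbered = pd.concat([left, left], ignore_index=True)

print(merged)
print(stacked)
print(renumbered)
```

The '_merge' column makes it easy to see which rows came from only one side of an outer join.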
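A useful shortcut for concatenation was the append method on Series and DataFrame instances, which is equivalent to concat along axis=0. For example:

    df_new = df1.append([df2, df3, df4])

Note that append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, write pd.concat([df1, df2, df3, df4]) instead.

A time delta (Timedelta) is a difference between two points in time, expressed in units such as days, hours, minutes, or seconds. It can be positive or negative. Timedelta objects can be created from several kinds of arguments.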
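    import pandas as pd

    # Create a Timedelta object from a string
    timediff = pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
    print(timediff)

    # Create a Timedelta object from an integer and a unit
    timediff = pd.Timedelta(6, unit='h')
    print(timediff)

    # Keyword offsets (weeks, days, hours, minutes, seconds,
    # milliseconds, microseconds, nanoseconds) can also be used
    timediff = pd.Timedelta(days=2)
    print(timediff)

Arithmetic: adding a time delta to a timestamp (or subtracting it) yields a new timestamp.

    import pandas as pd

    s = pd.Series(pd.date_range('2020-10-19', periods=3, freq='D'))
    td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
    df = pd.DataFrame(dict(A=s, B=td))
    df['C'] = df['A'] + df['B']
    print(df)

Output:

               A      B          C
    0 2020-10-19 0 days 2020-10-19
    1 2020-10-20 1 days 2020-10-21
    2 2020-10-21 2 days 2020-10-23

Categorical is a pandas data type. A categorical variable can take only a limited, and usually fixed, number of possible values. Besides having a fixed set of values, categorical data may have an order, but numerical operations cannot be performed on it. The categorical data type is useful in the following cases: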
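- A string variable contains only a few distinct values. Converting such a variable to a categorical saves memory.
- The lexical order of a variable differs from its logical order (for example "one", "two", "three"). After converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order.
- As a signal to other Python libraries that the column should be treated as a categorical variable (for example, to use suitable statistical methods or plot types).

    # Creating categorical objects
    # Method 1: specify dtype="category"
    import pandas as pd

    s = pd.Series(["a", "b", "c", "a"], dtype="category")
    print(s)

    # Method 2: use the standard pandas Categorical constructor
    # pandas.Categorical(values, categories, ordered)
    cat = pd.Categorical(['a', 'b', 'c', 'a', 'c'])
    print(cat)

    # Pass the categories argument and make the categories ordered
    cat2 = pd.Categorical(['a', 'b', 'c', 'a', 'c'], ['c', 'b', 'a'], ordered=True)
    print(cat2)

    df = pd.DataFrame({
        'col1': cat2,
        'col2': pd.date_range(start='2020-10-19', periods=5)
    })
    print(df)

    # Sorting follows the category order c < b < a, not alphabetical order
    df_sorted = df.sort_values(['col1', 'col2'], ignore_index=True)
    print(df_sorted)

Output:

    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    ['a', 'b', 'c', 'a', 'c']
    Categories (3, object): ['a', 'b', 'c']
    ['a', 'b', 'c', 'a', 'c']
    Categories (3, object): ['c' < 'b' < 'a']
      col1       col2
    0    a 2020-10-19
    1    b 2020-10-20
    2    c 2020-10-21
    3    a 2020-10-22
    4    c 2020-10-23
      col1       col2
    0    c 2020-10-21
    1    c 2020-10-23
    2    b 2020-10-20
    3    a 2020-10-19
    4    a 2020-10-22

Other operations on category variables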
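| expression | description |
| --- | --- |
| .describe() | Summarize the data |
| obj.cat.categories | Get the categories |
| obj.cat.ordered | Check whether the categories are ordered |
| obj.cat.rename_categories([]) | Rename the categories (older pandas versions also allowed assigning new values to series.cat.categories; that was removed in pandas 2.0) |
| obj.cat.add_categories([]) | Append new categories |
| obj.cat.remove_categories([]) | Remove categories |

Basic plotting

The plotting functionality on Series and DataFrame uses the plot() method of the matplotlib library.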
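    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.randn(10, 4),
                      index=pd.date_range('2020-10-20', periods=10),
                      columns=list('ABCD'))
    df.plot()
    plt.show()

Output: [figure: line plot with one line per column]

Besides the default line plot, the plotting methods support a handful of other plot styles, which can be selected with the kind keyword argument of plot(). These include:

- bar or barh: bar chart / horizontal bar chart
- hist: histogram
- box: box plot
- area: area plot
- scatter: scatter plot

Stacked bar chart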
Pass stacked=True to stack the bars:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
    df.plot.bar(stacked=True)
    plt.show()

Output: [figure: stacked bar chart]
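The kind keyword mentioned earlier and the accessor form used in the example are two spellings of the same call. The following is a minimal sketch that draws the chart both ways and compares the bar heights; it assumes matplotlib is installed, selects the non-interactive Agg backend so it can run without a display, and uses a fixed seed and variable names chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

# Same chart requested through the kind keyword and through the accessor.
ax_kind = df.plot(kind="bar", stacked=True)
ax_method = df.plot.bar(stacked=True)

# Each bar segment is a rectangle patch on the Axes.
heights_kind = [p.get_height() for p in ax_kind.patches]
heights_method = [p.get_height() for p in ax_method.patches]

print(len(heights_kind))
print(np.allclose(heights_kind, heights_method))
```

The accessor form tends to be easier to discover through tab completion, while the keyword form is convenient when the plot type is chosen at runtime.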
Horizontal bar chart

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
    df.plot.barh(stacked=True)
    plt.show()

Output: [figure: stacked horizontal bar chart]
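Histogram

Use the plot.hist() method to draw a histogram; the number of bins can be set with the bins argument.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        'a': np.random.randn(1000) + 1,
        'b': np.random.randn(1000),
        'c': np.random.randn(1000) - 1
    })
    df.plot.hist(bins=20)
    plt.show()

Output: [figure: histograms of a, b, and c on shared axes]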
Plotting a separate histogram for each column

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        'a': np.random.randn(1000) + 1,
        'b': np.random.randn(1000),
        'c': np.random.randn(1000) - 1
    })
    df.hist(bins=20)
    plt.show()

Output: [figure: one histogram subplot per column]
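The two histogram calls above differ in what they return: df.plot.hist() draws every column on one Axes, while df.hist() creates one subplot per column. The following is a minimal sketch that inspects the return values; it assumes matplotlib is installed, uses the non-interactive Agg backend so it runs without a display, and the seed, the alpha value, and the variable names are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed

import numpy as np
import pandas as pd
from matplotlib.axes import Axes

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.standard_normal(1000) + 1,
    "b": rng.standard_normal(1000),
    "c": rng.standard_normal(1000) - 1,
})

# One Axes object; the three histograms overlap, so alpha makes them readable.
overlay_ax = df.plot.hist(bins=20, alpha=0.5)

# An array of Axes objects, with one histogram per column.
grid = df.hist(bins=20)

print(type(overlay_ax).__name__)
print(grid.shape)
```

The overlay is handy for comparing distributions directly; the grid is easier to read when the columns have very different ranges.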
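Box plot

A box plot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
    print(df)
    df.plot.box()
    plt.show()

Output: [figure: box plot with one box per column]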
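The numbers summarized by each box can also be computed directly. The following is a minimal sketch, assuming matplotlib's default whisker rule of 1.5 times the interquartile range; the seed and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("ABCDE"))

# The box spans the first to third quartile, with a line at the median.
quartiles = df.quantile([0.25, 0.5, 0.75])
q1, median, q3 = quartiles.loc[0.25], quartiles.loc[0.5], quartiles.loc[0.75]

# Points beyond these fences are the ones a box plot draws as outliers
# (assuming the default whisker length of 1.5 * IQR).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_counts = df[(df < lower_fence) | (df > upper_fence)].count()

print(quartiles)
print(outlier_counts)
```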
Area plot

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.abs(np.random.randn(10, 4)), columns=list('ABCD'))
    print(df)
    df.plot.area()
    plt.show()

Output: [figure: stacked area plot]
Scatter plot

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(50, 4), columns=list('abcd'))
    print(df)
    df.plot.scatter(x='a', y='b')
    plt.show()

Output: [figure: scatter plot of b against a]
Pie chart

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.DataFrame(pd.Series([5, 10, 15, 20], index=list('abcd')), columns=['x'])
    print(df)
    df.plot.pie(subplots=True)
    plt.show()

Output: [figure: pie chart of column x]
The pandas I/O API is a set of top-level reader functions, called like pd.read_csv(), that return pandas objects. The two main functions for reading text files (flat files) are read_csv() and read_table(). Both use the same parsing code to intelligently convert tabular data into a DataFrame object.

Form 1:

    pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None)

Form 2:

    pandas.read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer', names=None, index_col=None, usecols=None)
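Because the two forms above differ only in the default separator, they can be compared on in-memory text. The following is a minimal sketch; the sample rows and variable names are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data: the same table as comma- and tab-separated text.
csv_text = "S.No,Name,Age\n1,Tom,28\n2,Lee,32\n"
tsv_text = csv_text.replace(",", "\t")

df_csv = pd.read_csv(io.StringIO(csv_text))               # default sep=','
df_tsv = pd.read_table(io.StringIO(tsv_text))             # default sep='\t'
df_tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv.equals(df_tsv))
print(df_tsv.equals(df_tsv_via_csv))
```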
Basic operations

    import pandas as pd
    import numpy as np

    # Basic operations: read the file and inspect the inferred dtypes
    df = pd.read_csv('temp.csv')
    print(df)
    print(df.dtypes)

Output:

       S.No    Name  Age       City  Salary
    0     1     Tom   28    Toronto   20000
    1     2     Lee   32   HongKong    3000
    2     3  Steven   43   Bay Area    8300
    3     4     Ram   38  Hyderabad    3900
    S.No       int64
    Name      object
    Age        int64
    City      object
    Salary     int64
    dtype: object

Specify the index column with the index_col parameter (passing a list)

    # Use the S.No column as the index
    df = pd.read_csv('temp.csv', sep=',', index_col=['S.No'])
    print(df)

Output:

            Name  Age       City  Salary
    S.No
    1        Tom   28    Toronto   20000
    2        Lee   32   HongKong    3000
    3     Steven   43   Bay Area    8300
    4        Ram   38  Hyderabad    3900

Convert data types with the dtype parameter (passing a dict)

    # Read the Salary column as float64 instead of the inferred int64
    df = pd.read_csv('temp.csv', dtype={'Salary': np.float64})
    print(df.dtypes)

Output:

    S.No        int64
    Name       object
    Age         int64
    City       object
    Salary    float64
    dtype: object

Skip rows with the skiprows parameter

    # Skip the first line (the original header) and supply new column names
    df = pd.read_csv('temp.csv', skiprows=1, names=list('abcde'))
    print(df)

Output:
       a       b   c          d      e
    0  1     Tom  28    Toronto  20000
    1  2     Lee  32   HongKong   3000
    2  3  Steven  43   Bay Area   8300
    3  4     Ram  38  Hyderabad   3900
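The usecols parameter appears in the signatures above but is not exercised by the examples. The following is a minimal sketch that combines it with index_col and dtype; the CSV text is reconstructed from the printed output of temp.csv and read from memory through io.StringIO, which is an assumption made here so the snippet runs without the file.

```python
import io

import pandas as pd

# Contents reconstructed from the printed output of temp.csv (assumption).
sample = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
"""

# Load only two columns, index the rows by Name, and read Salary as float.
df = pd.read_csv(
    io.StringIO(sample),
    usecols=["Name", "Salary"],
    index_col="Name",
    dtype={"Salary": "float64"},
)
print(df)
print(df.dtypes)
```

Restricting usecols can reduce memory use on wide files, because the skipped columns are never materialized.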
Source: 易百教程