Python数据处理(DataFrame)

it2023-01-19  58

DataFrame是什么

DataFrame是一个以命名列方式组织的分布式数据集。在概念上,它跟关系型数据库中的一张表或者1个Python(或者R)中的data frame一样,但是比他们更优化。DataFrame可以根据结构化的数据文件、hive表、外部数据库或者已经存在的RDD构造。

DataFrame的创建

从字典创建dataFrame import pandas as pd data = { 'state':['Ohio','Ohio','Ohio','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9] } frame = pd.DataFrame(data) #输出 pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002 通过读取excel创建DataFrame import pandas as pd df = pd.read_csv(file_name,sep=",",encoding="utf8") #dataframe写入到excel文件 df.to_csv(file_name) 参数描述header默认第一行为columns,如果指定header=None,则表明没有索引行,第一行就是数据index_col默认作为索引的为第一列,可以设为index_col为-1,表明没有索引列nrows表明读取的行数sep或delimiter分隔符,read_csv默认是逗号,而read_table默认是制表符\tencoding编码格式 数据库查询结果转化为DataFrame

以PostgreSQL数据库查询为例

import psycopg2 conn = psycopg2.connect(database="ietl", user="kiwi", password="xxxxxx", host="192.168.50.5", port="5432") cursor = conn.cursor() cursor.execute(sql) datas = cursor.fetchall() df = pd.DataFrame(datas)

DataFrame的操作

截取多行

import pandas as pd data = { 'state':['Ohio','Ohio','Ohio','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9] } frame = pd.DataFrame(data) #输出 pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002 df1 = frame[2:] #输出 state year pop 2 Ohio 2002 3.6 3 Nevada 2001 2.4 4 Nevada 2002 2.9

当然,在选取数据的时候,我们还可以根据逻辑条件来选取

df2 = frame[frame['pop']>2] #输出 state year pop 2 Ohio 2002 3.6 3 Nevada 2001 2.4 4 Nevada 2002 2.9

截取多列

df3 = frame[['year','pop']] print(df3) #输出 year pop 0 2000 1.5 1 2001 1.7 2 2002 3.6 3 2001 2.4 4 2002 2.9

修改数据

data = { 'state':['Ohio','Ohio','Ohio','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9] } frame2 = pd.DataFrame(data,index=['one','two','three','four','five'],columns=['year','state','pop','debt']) print(frame2) #输出 year state pop debt one 2000 Ohio 1.5 NaN two 2001 Ohio 1.7 NaN three 2002 Ohio 3.6 NaN four 2001 Nevada 2.4 NaN five 2002 Nevada 2.9 NaN

修改一列

frame2['debt']=16.5 print(frame2) #输出 year state pop debt one 2000 Ohio 1.5 16.5 two 2001 Ohio 1.7 16.5 three 2002 Ohio 3.6 16.5 four 2001 Nevada 2.4 16.5 five 2002 Nevada 2.9 16.5

也可以使用一个列表来修改,不过要保证列表的长度与DataFrame长度相同

frame2['debt']= np.arange(5) print(frame2) #输出 year state pop debt one 2000 Ohio 1.5 0 two 2001 Ohio 1.7 1 three 2002 Ohio 3.6 2 four 2001 Nevada 2.4 3 five 2002 Nevada 2.9 4

可以使用一个Series,此时会根据索引进行精确匹配

val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five']) frame2['debt'] = val print(frame2) #输出 year state pop debt one 2000 Ohio 1.5 NaN two 2001 Ohio 1.7 -1.2 three 2002 Ohio 3.6 NaN four 2001 Nevada 2.4 -1.5 five 2002 Nevada 2.9 -1.7

重设索引

data = { 'state':['Ohio','Ohio','Ohio','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9] } frame2 = pd.DataFrame(data,columns=['year','state','pop','debt']) print(frame2) #输出 year state pop debt 0 2000 Ohio 1.5 NaN 1 2001 Ohio 1.7 NaN 2 2002 Ohio 3.6 NaN 3 2001 Nevada 2.4 NaN 4 2002 Nevada 2.9 NaN frame2.set_index('year',inplace=True) print(frame2) #输出 state pop debt year 2000 Ohio 1.5 NaN 2001 Ohio 1.7 NaN 2002 Ohio 3.6 NaN 2001 Nevada 2.4 NaN 2002 Nevada 2.9 NaN

修改列名

方法1: df.rename(columns={0:'acct_id', 1:'acct_name', 2:'id_card'}, inplace = True) 方法2: df.columns = ['账户ID','账户名称','身份证号']

删除行

df1 = frame2.drop(2000,axis=0) #输出 state pop debt year 2001 Ohio 1.7 NaN 2002 Ohio 3.6 NaN 2001 Nevada 2.4 NaN 2002 Nevada 2.9 NaN

删除列

df2 = frame2.drop(['debt'],axis=1) #输出 state pop year 2000 Ohio 1.5 2001 Ohio 1.7 2002 Ohio 3.6 2001 Nevada 2.4 2002 Nevada 2.9

排序

按照index进行排序 构造输出 data = { 'state':['Ohio','Ohio','Ohio','Nevada','Nevada'], 'year':[2000,2001,2002,2001,2002], 'pop':[1.5,1.7,3.6,2.4,2.9] } frame2 = pd.DataFrame(data,columns=['year','state','pop','debt']) frame2.set_index('year',inplace=True) print(frame2) 输出 state pop year 2000 Ohio 1.5 2001 Ohio 1.7 2002 Ohio 3.6 2001 Nevada 2.4 2002 Nevada 2.9 df1 = frame2.sort_index() 输出 state pop year 2000 Ohio 1.5 2001 Ohio 1.7 2001 Nevada 2.4 2002 Ohio 3.6 2002 Nevada 2.9 按照列进行排序 df1 = frame2.sort_index(1,ascending=True) 输出 pop state year 2000 1.5 Ohio 2001 1.7 Ohio 2002 3.6 Ohio 2001 2.4 Nevada 2002 2.9 Nevada DataFrame也可以按照值进行排序: df2 = frame2.sort_values(by=['pop','state']) 输出 state pop year 2000 Ohio 1.5 2001 Ohio 1.7 2001 Nevada 2.4 2002 Nevada 2.9 2002 Ohio 3.6

合并列

df2 = query_sql(sql2) df2.rename(columns={0:'acct_id', 1:'sec_code', 2:'hold_vol',3:'hold_amt',4:'rank'}, inplace = True) df3 = query_sql(sql3) df3.rename(columns={0:'sec_code', 1:'sec_name', 2:'industry'}, inplace = True) 当两边合并字段不同时,可以使用left_on和right_on参数设置合并字段。当然这里合并字段都是key所以left_on和right_on参数值都是key。 df4 = pd.merge(df2[['rank','sec_code','acct_id','hold_vol','hold_amt']],df3[['sec_code','sec_name']],left_on='sec_code',right_on='sec_code')

去重

df4.drop_duplicates(inplace=True)
最新回复(0)