DataFrame是什么
DataFrame是一个以命名列方式组织的分布式数据集。在概念上,它跟关系型数据库中的一张表或者1个Python(或者R)中的data frame一样,但是比他们更优化。DataFrame可以根据结构化的数据文件、hive表、外部数据库或者已经存在的RDD构造。
DataFrame的创建
从字典创建dataFrame
import pandas
as pd
data
= {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame
= pd
.DataFrame
(data
)
pop state year
0 1.5 Ohio
2000
1 1.7 Ohio
2001
2 3.6 Ohio
2002
3 2.4 Nevada
2001
4 2.9 Nevada
2002
通过读取excel创建DataFrame
import pandas
as pd
df
= pd
.read_csv
(file_name
,sep
=",",encoding
="utf8")
df
.to_csv
(file_name
)
参数描述
header默认第一行为columns,如果指定header=None,则表明没有索引行,第一行就是数据index_col默认作为索引的为第一列,可以设为index_col为-1,表明没有索引列nrows表明读取的行数sep或delimiter分隔符,read_csv默认是逗号,而read_table默认是制表符\tencoding编码格式
数据库查询结果转化为DataFrame
以PostgreSQL数据库查询为例
import psycopg2
conn
= psycopg2
.connect
(database
="ietl", user
="kiwi", password
="xxxxxx", host
="192.168.50.5", port
="5432")
cursor
= conn
.cursor
()
cursor
.execute
(sql
)
datas
= cursor
.fetchall
()
df
= pd
.DataFrame
(datas
)
DataFrame的操作
截取多行
import pandas
as pd
data
= {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame
= pd
.DataFrame
(data
)
pop state year
0 1.5 Ohio
2000
1 1.7 Ohio
2001
2 3.6 Ohio
2002
3 2.4 Nevada
2001
4 2.9 Nevada
2002
df1
= frame
[2:]
state year pop
2 Ohio
2002 3.6
3 Nevada
2001 2.4
4 Nevada
2002 2.9
当然,在选取数据的时候,我们还可以根据逻辑条件来选取
df2
= frame
[frame
['pop']>2]
state year pop
2 Ohio
2002 3.6
3 Nevada
2001 2.4
4 Nevada
2002 2.9
截取多列
df3
= frame
[['year','pop']]
print(df3
)
year pop
0 2000 1.5
1 2001 1.7
2 2002 3.6
3 2001 2.4
4 2002 2.9
修改数据
data
= {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame2
= pd
.DataFrame
(data
,index
=['one','two','three','four','five'],columns
=['year','state','pop','debt'])
print(frame2
)
year state pop debt
one
2000 Ohio
1.5 NaN
two
2001 Ohio
1.7 NaN
three
2002 Ohio
3.6 NaN
four
2001 Nevada
2.4 NaN
five
2002 Nevada
2.9 NaN
修改一列
frame2
['debt']=16.5
print(frame2
)
year state pop debt
one
2000 Ohio
1.5 16.5
two
2001 Ohio
1.7 16.5
three
2002 Ohio
3.6 16.5
four
2001 Nevada
2.4 16.5
five
2002 Nevada
2.9 16.5
也可以使用一个列表来修改,不过要保证列表的长度与DataFrame长度相同
frame2
['debt']= np
.arange
(5)
print(frame2
)
year state pop debt
one
2000 Ohio
1.5 0
two
2001 Ohio
1.7 1
three
2002 Ohio
3.6 2
four
2001 Nevada
2.4 3
five
2002 Nevada
2.9 4
可以使用一个Series,此时会根据索引进行精确匹配
val
= pd
.Series
([-1.2,-1.5,-1.7],index
=['two','four','five'])
frame2
['debt'] = val
print(frame2
)
year state pop debt
one
2000 Ohio
1.5 NaN
two
2001 Ohio
1.7 -1.2
three
2002 Ohio
3.6 NaN
four
2001 Nevada
2.4 -1.5
five
2002 Nevada
2.9 -1.7
重设索引
data
= {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame2
= pd
.DataFrame
(data
,columns
=['year','state','pop','debt'])
print(frame2
)
year state pop debt
0 2000 Ohio
1.5 NaN
1 2001 Ohio
1.7 NaN
2 2002 Ohio
3.6 NaN
3 2001 Nevada
2.4 NaN
4 2002 Nevada
2.9 NaN
frame2
.set_index
('year',inplace
=True)
print(frame2
)
state pop debt
year
2000 Ohio
1.5 NaN
2001 Ohio
1.7 NaN
2002 Ohio
3.6 NaN
2001 Nevada
2.4 NaN
2002 Nevada
2.9 NaN
修改列名
方法
1:
df
.rename
(columns
={0:'acct_id', 1:'acct_name', 2:'id_card'}, inplace
= True)
方法
2:
df
.columns
= ['账户ID','账户名称','身份证号']
删除行
df1
= frame2
.drop
(2000,axis
=0)
state pop debt
year
2001 Ohio
1.7 NaN
2002 Ohio
3.6 NaN
2001 Nevada
2.4 NaN
2002 Nevada
2.9 NaN
删除列
df2
= frame2
.drop
(['debt'],axis
=1)
state pop
year
2000 Ohio
1.5
2001 Ohio
1.7
2002 Ohio
3.6
2001 Nevada
2.4
2002 Nevada
2.9
排序
按照index进行排序
构造输出
data
= {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame2
= pd
.DataFrame
(data
,columns
=['year','state','pop','debt'])
frame2
.set_index
('year',inplace
=True)
print(frame2
)
输出
state pop
year
2000 Ohio
1.5
2001 Ohio
1.7
2002 Ohio
3.6
2001 Nevada
2.4
2002 Nevada
2.9
df1
= frame2
.sort_index
()
输出
state pop
year
2000 Ohio
1.5
2001 Ohio
1.7
2001 Nevada
2.4
2002 Ohio
3.6
2002 Nevada
2.9
按照列进行排序
df1
= frame2
.sort_index
(1,ascending
=True)
输出
pop state
year
2000 1.5 Ohio
2001 1.7 Ohio
2002 3.6 Ohio
2001 2.4 Nevada
2002 2.9 Nevada
DataFrame也可以按照值进行排序:
df2
= frame2
.sort_values
(by
=['pop','state'])
输出
state pop
year
2000 Ohio
1.5
2001 Ohio
1.7
2001 Nevada
2.4
2002 Nevada
2.9
2002 Ohio
3.6
合并列
df2
= query_sql
(sql2
)
df2
.rename
(columns
={0:'acct_id', 1:'sec_code', 2:'hold_vol',3:'hold_amt',4:'rank'}, inplace
= True)
df3
= query_sql
(sql3
)
df3
.rename
(columns
={0:'sec_code', 1:'sec_name', 2:'industry'}, inplace
= True)
当两边合并字段不同时,可以使用left_on和right_on参数设置合并字段。当然这里合并字段都是key所以left_on和right_on参数值都是key。
df4
= pd
.merge
(df2
[['rank','sec_code','acct_id','hold_vol','hold_amt']],df3
[['sec_code','sec_name']],left_on
='sec_code',right_on
='sec_code')
去重
df4
.drop_duplicates
(inplace
=True)