Pandas

2024-10-16

Creating, Reading and Writing

There are two core objects in pandas: the DataFrame and the Series.

import pandas as pd

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
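To see how the two objects relate, the sketch below rebuilds the DataFrame above and pulls out one column, which comes back as a Series with the same row labels. The variable names `df` and `bob` are only for illustration.

```python
import pandas as pd

# Same data as the DataFrame example above.
df = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

# Selecting a single column returns a Series that keeps the row labels.
bob = df['Bob']
print(type(bob).__name__)   # Series
print(list(bob.index))      # ['Product A', 'Product B']
print(bob.name)             # Bob
```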

Reading data

wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()

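index_col=0: use the first column of the CSV as the index (the index the dataset already ships with) instead of letting pandas create a new one.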
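A minimal sketch of what index_col=0 changes, using a small in-memory CSV in place of the real file. The three sample rows are made up for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file whose first column is an unnamed saved index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n"

# Without index_col, the saved index shows up as an extra "Unnamed: 0" column.
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())    # ['Unnamed: 0', 'country', 'points']

# With index_col=0, the first column becomes the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['country', 'points']
print(indexed.shape)             # (3, 2)
```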

wine_reviews.shape

Indexing, Selecting & Assigning

reviews.country     # attribute-style access
reviews['country']  # dictionary-style access

Index-based selection

reviews.iloc[:, 0]          # first column, all rows
reviews.iloc[[0, 1, 2], 0]  # first column, rows 0 to 2
reviews.iloc[-5:]           # the last five rows

Label-based selection

reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
df.loc['Apples':'Potatoes']

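Note that iloc slices are half-open (the end position is excluded), while loc slices are closed on both sides (the end label is included).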
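The slicing difference is easy to check on a small frame. This is a sketch with made-up data; the same slice `0:3` selects three rows with iloc and four with loc.

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30, 40]}, index=[0, 1, 2, 3])

# iloc uses positions: rows 0, 1, 2 (the end position 3 is excluded).
by_position = df.iloc[0:3]

# loc uses labels: rows labelled 0, 1, 2 and 3 (the end label is included).
by_label = df.loc[0:3]

print(len(by_position))  # 3
print(len(by_label))     # 4
```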

Manipulating the index

reviews.set_index("title")
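A small sketch of set_index on made-up data. It returns a new DataFrame indexed by the chosen column and leaves the original untouched, so the result has to be assigned to be kept.

```python
import pandas as pd

# Hypothetical two-row stand-in for the reviews data.
reviews = pd.DataFrame({
    'title': ['Wine A 2013', 'Wine B 2011'],
    'points': [87, 90],
})

# set_index returns a new DataFrame; `reviews` keeps its default index.
by_title = reviews.set_index('title')

print(by_title.loc['Wine B 2011', 'points'])  # 90
print(list(reviews.columns))                  # ['title', 'points']
```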

Conditional selection

reviews.country == 'Italy'  # produces a Series of True/False booleans based on the country of each record
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]  # commonly used to filter out missing values; the counterpart is isnull()
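The sketch below runs the same kinds of filters on a four-row frame with made-up values. The parentheses around each condition matter because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    'country': ['Italy', 'France', 'Italy', 'US'],
    'points': [92, 88, 85, 95],
    'price': [30.0, None, 15.0, 60.0],
})

# A comparison yields a boolean Series, one entry per row.
mask = reviews.country == 'Italy'
print(mask.tolist())  # [True, False, True, False]

both = reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
either = reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
priced = reviews.loc[reviews.price.notnull()]

print(both.index.tolist())    # [0]
print(either.index.tolist())  # [0, 2, 3]
print(priced.index.tolist())  # [0, 2, 3]
```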

Assigning data

reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

0         129971
1         129970
           ...
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Summary Functions and Maps

Summary functions

reviews.points.describe()

count    129971.000000
mean         88.447138
             ...
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

reviews.points.mean()
reviews.taster_name.unique()
reviews.taster_name.value_counts()

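map and apply: map() on a Series returns a new Series, while apply() on a DataFrame, used as below, returns a new DataFrame. Neither modifies the original data.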

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

def remean_points(row):
    # row is a Series holding a single record
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
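A runnable sketch of both calls on a three-row frame with made-up values, showing what each one returns and that the original frame is left alone.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({'country': ['Italy', 'US', 'France'],
                        'points': [86, 90, 94]})
review_points_mean = reviews.points.mean()  # 90.0

# Series.map transforms one value at a time and returns a new Series.
centered = reviews.points.map(lambda p: p - review_points_mean)
print(centered.tolist())  # [-4.0, 0.0, 4.0]

def remean_points(row):
    # `row` is a Series holding one record.
    row['points'] = row['points'] - review_points_mean
    return row

# DataFrame.apply with axis='columns' calls the function once per row
# and assembles the returned rows into a new DataFrame.
remeaned = reviews.apply(remean_points, axis='columns')
print(type(remeaned).__name__)    # DataFrame
print(reviews.points.tolist())    # [86, 90, 94]
```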

Sometimes you can also operate on the columns directly with ordinary operators:

reviews.points - review_points_mean
reviews.country + " - " + reviews.region_1
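A short sketch with made-up rows, checking that the operator form gives the same values as the map() form and that string columns can be combined the same way.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({'country': ['Italy', 'US'],
                        'region_1': ['Etna', 'Napa Valley'],
                        'points': [87, 91]})
review_points_mean = reviews.points.mean()  # 89.0

# Element-wise arithmetic between a Series and a scalar.
via_operator = reviews.points - review_points_mean
via_map = reviews.points.map(lambda p: p - review_points_mean)
print(via_operator.tolist() == via_map.tolist())  # True

# Element-wise string concatenation between two Series.
labels = reviews.country + " - " + reviews.region_1
print(labels.tolist())  # ['Italy - Etna', 'US - Napa Valley']
```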

Grouping and Sorting

Groupwise analysis

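reviews.groupby('points').price.min()                        # cheapest wine in each points category
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])  # title of the first wine reviewed from each winery
reviews.groupby(['country']).price.agg([len, min, max])       # several summaries per country at once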
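A sketch of the same groupby pattern on made-up data. As an assumption for this example, the aggregations are given as the string names 'count', 'min' and 'max' instead of the built-in functions; note that 'count' skips missing values, whereas len counts every row in the group.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    'country': ['Italy', 'Italy', 'US', 'US', 'US'],
    'points': [87, 90, 87, 90, 90],
    'price': [20.0, 35.0, 15.0, 50.0, 40.0],
})

# One value per group: the minimum price for each points score.
cheapest = reviews.groupby('points').price.min()
print(cheapest.to_dict())  # {87: 15.0, 90: 35.0}

# Several aggregations at once: one column per function, one row per country.
summary = reviews.groupby('country').price.agg(['count', 'min', 'max'])
print(summary.columns.tolist())  # ['count', 'min', 'max']
print(summary.loc['US', 'max'])  # 50.0
```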

Multi-indexes

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index()  # convert back to a single-level index
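A sketch on made-up rows of how grouping by two columns produces a MultiIndex, how to look up one entry by its tuple of labels, and what reset_index() gives back.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    'country': ['Italy', 'Italy', 'US', 'US'],
    'province': ['Tuscany', 'Sicily', 'California', 'California'],
    'description': ['a', 'b', 'c', 'd'],
})

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
print(type(countries_reviewed.index).__name__)              # MultiIndex
print(countries_reviewed.loc[('US', 'California'), 'len'])  # 2

# reset_index moves the index levels back into ordinary columns.
flat = countries_reviewed.reset_index()
print(flat.columns.tolist())  # ['country', 'province', 'len']
```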

Sorting

sort_values() defaults to an ascending sort, where the lowest values go first.

countries_reviewed.sort_values(by='len', ascending=False)  # descending order
countries_reviewed.sort_values(by=['country', 'len'])      # sort by more than one column

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:

countries_reviewed.sort_index()
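The three sorting calls can be tried on a small, already flattened frame. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for countries_reviewed after reset_index().
countries_reviewed = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'Italy'],
    'province': ['Oregon', 'Tuscany', 'California', 'Sicily'],
    'len': [5, 8, 36, 3],
})

# Descending sort on one column.
desc = countries_reviewed.sort_values(by='len', ascending=False)
print(desc['len'].tolist())  # [36, 8, 5, 3]

# Sort by country first, then by len within each country.
multi = countries_reviewed.sort_values(by=['country', 'len'])
print(multi['province'].tolist())  # ['Sicily', 'Tuscany', 'Oregon', 'California']

# sort_index restores the original row order.
restored = desc.sort_index()
print(restored.index.tolist())  # [0, 1, 2, 3]
```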

Data Types and Missing Values

The data type for a column in a DataFrame or a Series is known as the dtype.

reviews.price.dtype  # dtype of a single column
reviews.dtypes       # dtype of every column in the DataFrame

Type conversion

reviews.points.astype('float64')
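A minimal sketch of astype on a made-up Series. It returns a converted copy, so the result must be assigned back if the new dtype should stick.

```python
import pandas as pd

# Hypothetical points column, created explicitly as int64.
points = pd.Series([87, 90, 95], name='points', dtype='int64')

as_float = points.astype('float64')
print(as_float.dtype)     # float64
print(as_float.tolist())  # [87.0, 90.0, 95.0]
print(points.dtype)       # int64
```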

Missing data

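reviews[pd.isnull(reviews.country)]                                # rows where country is missing
reviews.region_2.fillna("Unknown")                                 # replace missing values with "Unknown"
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")  # replace one value with another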
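The same three calls on a made-up three-row frame. Each returns a new object, so the missing values in the original column are still there afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the reviews data.
reviews = pd.DataFrame({
    'country': ['Italy', None, 'US'],
    'region_2': [None, None, 'Napa'],
    'taster_twitter_handle': ['@kerinokeefe', '@vossroger', '@kerinokeefe'],
})

# Select the rows whose country is missing.
missing_country = reviews[pd.isnull(reviews.country)]
print(missing_country.index.tolist())  # [1]

# Fill missing values with a placeholder.
filled = reviews.region_2.fillna("Unknown")
print(filled.tolist())  # ['Unknown', 'Unknown', 'Napa']

# Replace every occurrence of one exact value with another.
renamed = reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
print(renamed.tolist())  # ['@kerino', '@vossroger', '@kerino']

# The original column is unchanged.
print(reviews.region_2.isnull().sum())  # 2
```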
