There are two core objects in pandas: the DataFrame and the Series.
import pandas as pd
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Reading data
wine_reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
index_col=0: use the dataset's own first column as the index instead of letting pandas generate a new one.
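To see what index_col=0 changes, here is a small sketch that parses the same CSV text twice. The CSV content and the names `csv_text`, `with_default_index`, and `with_file_index` are made up for illustration; only `pd.read_csv` and `index_col` come from the notes.

```python
import io
import pandas as pd

# Hypothetical CSV whose first (unnamed) column is a saved index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,90\n"

# Without index_col, pandas creates a fresh index and keeps the saved
# one as an ordinary column named "Unnamed: 0".
with_default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index instead.
with_file_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_default_index.columns))  # ['Unnamed: 0', 'country', 'points']
print(list(with_file_index.columns))     # ['country', 'points']
print(with_file_index.shape)             # (2, 2)
```

The shape attribute reports (rows, columns), so the extra column shows up there as well.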
wine_reviews.shape

Index-based selection
reviews.iloc[:, 0]
reviews.iloc[[0, 1, 2], 0]
reviews.iloc[-5:]  # the last five rows

Label-based selection
reviews.loc[0, 'country']
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
df.loc['Apples':'Potatoes']
Note: iloc slices are half-open (the start is included, the end is excluded), while loc slices include both endpoints.
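The difference in slice endpoints is easy to check on a small frame. This is a minimal sketch; the frame `toy` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical frame with string labels, so positions and labels differ.
toy = pd.DataFrame(
    {"price": [1.0, 2.5, 0.5, 3.0]},
    index=["Apples", "Bananas", "Carrots", "Potatoes"],
)

by_position = toy.iloc[0:2]             # positions 0 and 1; position 2 is excluded
by_label = toy.loc["Apples":"Carrots"]  # "Apples" through "Carrots", both included

print(list(by_position.index))  # ['Apples', 'Bananas']
print(list(by_label.index))     # ['Apples', 'Bananas', 'Carrots']
```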
Manipulating the index
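A short sketch of what set_index does, using a made-up two-row frame (`toy` and `indexed` are illustrative names). It assumes the default behaviour, where set_index returns a new DataFrame rather than changing the original.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [87, 90]})

# The "title" column becomes the row labels of the returned frame.
indexed = toy.set_index("title")

print(indexed.loc["Wine B", "points"])  # 90
print("title" in toy.columns)           # True: toy itself is unchanged
```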
reviews.set_index("title")

Conditional selection
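As a hedged illustration of conditional selection, the sketch below builds boolean masks on a small invented frame (`toy` is not the wine dataset). Each comparison needs its own parentheses because & and | bind more tightly than == and >=.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [92, 88, 85, 95],
    "price": [30.0, None, 12.0, 50.0],
})

is_italian = toy.country == "Italy"  # boolean Series, one value per row

italian_and_good = toy.loc[is_italian & (toy.points >= 90)]
italian_or_good = toy.loc[is_italian | (toy.points >= 90)]
european = toy.loc[toy.country.isin(["Italy", "France"])]
priced = toy.loc[toy.price.notnull()]  # drops the row with a missing price

print(len(italian_and_good), len(italian_or_good), len(european), len(priced))
```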
reviews.country == 'Italy'  # produces a Series of True/False booleans, one per record
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]  # commonly used to filter out missing values; isnull() is the counterpart

Assigning data
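Assignment accepts either a single value or one value per row. A minimal sketch on an invented three-row frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({"points": [87, 90, 92]})

toy["critic"] = "everyone"                       # a scalar is repeated for every row
toy["index_backwards"] = range(len(toy), 0, -1)  # one value per row: 3, 2, 1

print(toy)
```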
reviews['critic'] = 'everyone'
reviews['index_backwards'] = range(len(reviews), 0, -1)
0         129971
1         129970
           ...
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

Summary functions
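The summary methods behave differently for numeric and text columns. The sketch below uses a small invented frame (`toy`, with made-up scores) to show both cases.

```python
import pandas as pd

toy = pd.DataFrame({
    "points": [87, 90, 90, 93],
    "taster_name": ["Roger Voss", "Roger Voss", None, "Kerin O'Keefe"],
})

numeric_summary = toy.points.describe()    # count, mean, std, min, quartiles, max
text_summary = toy.taster_name.describe()  # count, unique, top, freq

print(toy.points.mean())          # 90.0
print(toy.taster_name.unique())   # distinct values, including the missing one

counts = toy.taster_name.value_counts()  # missing values are left out by default
print(counts)
```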
reviews.points.describe()
count    129971.000000
mean         88.447138
             ...
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64
reviews.taster_name.describe()
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
reviews.points.mean()
reviews.taster_name.unique()
reviews.taster_name.value_counts()

map and apply
map() returns a new Series, while apply() used as below returns a new DataFrame. Neither modifies the original data.
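A minimal sketch of the two return types, on an invented frame (`toy`, `points_mean`, and `centered_frame` are illustrative names). The row.copy() call is a defensive assumption added here so the function clearly does not touch its input.

```python
import pandas as pd

toy = pd.DataFrame({
    "points": [87.0, 90.0, 93.0],
    "country": ["Italy", "US", "France"],
})
points_mean = toy.points.mean()  # 90.0

# Series.map transforms one value at a time and returns a Series.
centered = toy.points.map(lambda p: p - points_mean)

def remean_points(row):
    row = row.copy()  # work on a copy of the row
    row["points"] = row["points"] - points_mean
    return row

# DataFrame.apply with axis="columns" calls the function once per row;
# because each call returns a Series, the result is a DataFrame.
centered_frame = toy.apply(remean_points, axis="columns")

print(type(centered).__name__)        # Series
print(type(centered_frame).__name__)  # DataFrame
```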
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
def remean_points(row):
    row.points = row.points - review_points_mean
    return row
reviews.apply(remean_points, axis='columns')
Sometimes you can also operate on the columns directly:
reviews.points - review_points_mean
reviews.country + " - " + reviews.region_1

Groupwise analysis
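As a sketch of how groupby splits rows and then summarises each group, here is an invented five-row frame (`toy`). It passes "min" and "max" as strings instead of the Python builtins; that is a choice made for this example, since recent pandas versions may emit a FutureWarning for the builtin forms.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "France"],
    "price": [10.0, 30.0, 20.0, 15.0, 40.0],
})

# One result per group: the cheapest price in each country.
cheapest = toy.groupby("country").price.min()

# Several aggregations at once: one column per function.
summary = toy.groupby("country").price.agg([len, "min", "max"])

print(cheapest)
print(summary)
```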
reviews.groupby('points').price.min()
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
reviews.groupby(['country']).price.agg([len, min, max])

Multi-indexes
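Grouping by more than one column produces a multi-level index. The sketch below shows that on an invented frame (`toy`, `counts`, and `flat` are hypothetical names) and how reset_index turns the levels back into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France"],
    "province": ["Tuscany", "Tuscany", "Bordeaux"],
    "description": ["a", "b", "c"],
})

counts = toy.groupby(["country", "province"]).description.agg([len])
print(type(counts.index).__name__)           # MultiIndex
print(counts.loc[("Italy", "Tuscany"), "len"])  # 2

# reset_index returns a new frame with a plain integer index.
flat = counts.reset_index()
print(list(flat.columns))                    # ['country', 'province', 'len']
```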
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index()  # flatten back to a single-level index

Sorting
sort_values() defaults to an ascending sort, where the lowest values go first.
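A quick sketch of the sort orders on an invented frame (`toy` and its values are made up). With a list of columns, rows are ordered by the first column and ties are broken by the next.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", "France", "US", "France"],
    "len": [5, 3, 1, 4],
})

ascending = toy.sort_values(by="len")                    # lowest first (default)
descending = toy.sort_values(by="len", ascending=False)  # highest first
two_keys = toy.sort_values(by=["country", "len"])        # country, then len

print(list(ascending["len"]))   # [1, 3, 4, 5]
print(list(descending["len"]))  # [5, 4, 3, 1]
print(list(two_keys["len"]))    # [3, 4, 1, 5]
```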
countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])

To sort by index values, use the companion method sort_index(). This method has the same arguments and default order:
countries_reviewed.sort_index()

The data type for a column in a DataFrame or a Series is known as the dtype.
reviews.price.dtype
reviews.dtypes

Type conversion
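A small sketch of inspecting and converting dtypes, on an invented frame (`toy` and `as_float` are illustrative names). It assumes the usual behaviour that astype returns a converted copy and leaves the original column as it was.

```python
import pandas as pd

toy = pd.DataFrame({"points": [87, 90], "price": [15.0, None]})

print(toy.dtypes)       # one dtype per column
print(toy.price.dtype)  # float64: the missing value is stored as NaN

as_float = toy.points.astype("float64")
print(as_float.dtype)   # float64
```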
reviews.points.astype('float64')

Missing data
reviews[pd.isnull(reviews.country)]
reviews.region_2.fillna("Unknown")
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
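The three calls above can be tried on a small invented frame. In this sketch, `toy` and the `handle` column are hypothetical stand-ins for the wine data; fillna and replace both return new Series, so the original columns keep their values.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "region_2": [None, "Napa", None],
    "handle": ["@kerinokeefe", "@vossroger", "@kerinokeefe"],
})

missing_country = toy[pd.isnull(toy.country)]            # rows with no country
filled = toy.region_2.fillna("Unknown")                  # NaN -> "Unknown"
renamed = toy.handle.replace("@kerinokeefe", "@kerino")  # swap one value for another

print(len(missing_country))  # 1
print(list(filled))          # ['Unknown', 'Napa', 'Unknown']
print(list(renamed))         # ['@kerino', '@vossroger', '@kerino']
```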