【缺失值填补】面板数据的缺失值补全方法

it2025-04-08  20

对于两端缺失数据使用前后填充,对于中间数据使用线性插值(可替换)

import numpy as np import pandas as pd from scipy.interpolate import interp1d df = pd.read_excel('data.xlsx') df['城市'].fillna(method = 'ffill',inplace = True) df.drop('city',1,inplace=True) df[df['城市'] == '晋中市'] 城市yearpm2.5(达尔豪斯)lnpm2.5(达尔豪斯)第二产业增加值占GDP比重(%)人均地区生产总值(元)地区生产总值(万元)年末人口总数(万人)行政区域土地面积(全市)能源效率地方财政一般预算内支出科学支出年末实有城市道路面积(万平方米)行政区域土地面积(市辖区)210晋中市200430.1173603.40510250.408300.02539148.0306.1216404.08.3377563050003503701327.0211晋中市200538.5810123.65276040.4610772.03310171.0306.6316404.08.8766203944965323881311.0212晋中市200645.3245753.81384942.3112491.03856794.0310.0016404.09.2150316640544974211311.0213晋中市200736.5783863.59945841.9415157.04701976.0313.8816404.07.98984472106458665211311.0214晋中市200834.3744823.53731438.7118219.05678066.0317.9016404.06.54799985679575695851311.0215晋中市200938.0597343.63915733.4420335.06368106.0320.0016404.07.236680103087384955921311.0216晋中市201049.9638073.91129936.4624275.07638366.0320.9616392.07.2218491205394101196161318.0217晋中市201177.1098224.34523140.1927300.08902376.0323.3016392.08.4641271541989119396161318.0218晋中市201257.9390104.05939140.6430093.09865596.0325.0016392.07.1952881787371140587691318.0219晋中市201354.5319283.99878639.6930977.010222281.0330.5016392.06.76153221363331728710101318.0220晋中市201445.4842013.81736539.6931434.010413000.0330.5016392.06.7615322168543191768501318.0221晋中市201542.3429213.74580131.3931434.010461155.0329.6616392.0NaN2411647190369651318.0222晋中市201635.9989733.58349030.0932646.010911041.0332.0016444.0NaN24716991513410951311.0223晋中市201735.1702593.56020131.9538274.012849233.0332.0016392.07.34006227892312456901318.0224晋中市201834.4520573.53956932.1442910.014476039.0334.0016444.07.55080032263682287114511311.0 df.isna().sum() 城市 0 year 0 pm2.5(达尔豪斯) 0 lnpm2.5(达尔豪斯) 15 第二产业增加值占GDP比重(%) 12 人均地区生产总值(元) 59 地区生产总值(万元) 34 年末人口总数(万人) 0 行政区域土地面积(全市) 0 能源效率 106 地方财政一般预算内支出 0 科学支出 0 年末实有城市道路面积(万平方米) 0 行政区域土地面积(市辖区) 0 dtype: int64 def chazhi_(data,col,method = 'linear',thresdhold = 0.9): ''' 创建缺失值补全函数,针对每一个分组 data:传入的每一个分组 method:使用插值的方法, thresdhold:插值的门槛,即缺失值占总数据的比重,如果超过门槛者不进行补充 ''' chazhi_index = data[col][data[col].isna()].index if len(chazhi_index)/data.shape[0] >= 0.9: return data else: y = data[col][~(data[col].isna())].values x = data[col][~(data[col].isna())].index f = interp1d(x, y, kind = method) chazhi_index_2 = [c for c in chazhi_index if c<x.max() and c>x.min()] data.loc[chazhi_index_2,col] = f(chazhi_index_2) fill = data[col].fillna(method = 'ffill') bill = data[col].fillna(method = 'bfill') w = pd.DataFrame((fill,bill)).mean() data.iloc[:,0] = w return data def chazhi(data,group_id): ''' 针对整个dataframe, data:dataframe文件 group_id:样本的id ''' na_col = data.columns[data.isna().sum() != 0] for col in na_col[1:]: test = data[[col,group_id]] data[col] = test.groupby(group_id).apply(chazhi_,col= col,method = 'linear')[col] return data[data.isna().sum(1)==0] w = chazhi(df,'城市') w.isna().sum() 城市 0 year 0 pm2.5(达尔豪斯) 0 lnpm2.5(达尔豪斯) 0 第二产业增加值占GDP比重(%) 0 人均地区生产总值(元) 0 地区生产总值(万元) 0 年末人口总数(万人) 0 行政区域土地面积(全市) 0 能源效率 0 地方财政一般预算内支出 0 科学支出 0 年末实有城市道路面积(万平方米) 0 行政区域土地面积(市辖区) 0 dtype: int64 w[w['城市']== '晋中市'] 城市yearpm2.5(达尔豪斯)lnpm2.5(达尔豪斯)第二产业增加值占GDP比重(%)人均地区生产总值(元)地区生产总值(万元)年末人口总数(万人)行政区域土地面积(全市)能源效率地方财政一般预算内支出科学支出年末实有城市道路面积(万平方米)行政区域土地面积(市辖区)210晋中市200430.1173603.40510250.408300.02539148.0306.1216404.08.3377563050003503701327.0211晋中市200538.5810123.65276040.4610772.03310171.0306.6316404.08.8766203944965323881311.0212晋中市200645.3245753.81384942.3112491.03856794.0310.0016404.09.2150316640544974211311.0213晋中市200736.5783863.59945841.9415157.04701976.0313.8816404.07.98984472106458665211311.0214晋中市200834.3744823.53731438.7118219.05678066.0317.9016404.06.54799985679575695851311.0215晋中市200938.0597343.63915733.4420335.06368106.0320.0016404.07.236680103087384955921311.0216晋中市201049.9638073.91129936.4624275.07638366.0320.9616392.07.2218491205394101196161318.0217晋中市201177.1098224.34523140.1927300.08902376.0323.3016392.08.4641271541989119396161318.0218晋中市201257.9390104.05939140.6430093.09865596.0325.0016392.07.1952881787371140587691318.0219晋中市201354.5319283.99878639.6930977.010222281.0330.5016392.06.76153221363331728710101318.0220晋中市201445.4842013.81736539.6931434.010413000.0330.5016392.06.7615322168543191768501318.0221晋中市201542.3429213.74580131.3931434.010461155.0329.6616392.06.9543752411647190369651318.0222晋中市201635.9989733.58349030.0932646.010911041.0332.0016444.07.14721824716991513410951311.0223晋中市201735.1702593.56020131.9538274.012849233.0332.0016392.07.34006227892312456901318.0224晋中市201834.4520573.53956932.1442910.014476039.0334.0016444.07.55080032263682287114511311.0
最新回复(0)