Pandas高级教程之:处理缺失数据
简介
在数据处理中,Pandas会将无法解析的数据或者缺失的数据使用NaN来表示。虽然所有的数据都有了相应的表示,但是NaN很明显是无法进行数学运算的。
本文将会讲解Pandas对于NaN数据的处理方法。
NaN的例子
上面讲到了缺失的数据会被表现为NaN,我们来看一个具体的例子:
我们先来构建一个DF:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
...:columns=['one', 'two', 'three'])
...: In [2]: df['four'] = 'bar'In [3]: df['five'] = df['one'] > 0In [4]: df
Out[4]:
onetwothree fourfive
a0.469112 -0.282863 -1.509059barTrue
c -1.1356321.212112 -0.173215barFalse
e0.119209 -1.044236 -0.861849barTrue
f -2.104569 -0.4949291.071804barFalse
h0.721555 -0.706771 -1.039575barTrue
上面DF只有acefh这几个index,我们重新index一下数据:
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])In [6]: df2
Out[6]:
onetwothree fourfive
a0.469112 -0.282863 -1.509059barTrue
bNaNNaNNaNNaNNaN
c -1.1356321.212112 -0.173215barFalse
dNaNNaNNaNNaNNaN
e0.119209 -1.044236 -0.861849barTrue
f -2.104569 -0.4949291.071804barFalse
gNaNNaNNaNNaNNaN
h0.721555 -0.706771 -1.039575barTrue
数据缺失,就会产生很多NaN。
为了检测是否NaN,可以使用isna()或者notna() 方法。
In [7]: df2['one']
Out[7]:
a0.469112
bNaN
c-1.135632
dNaN
e0.119209
f-2.104569
gNaN
h0.721555
Name: one, dtype: float64In [8]: pd.isna(df2['one'])
Out[8]:
aFalse
bTrue
cFalse
dTrue
eFalse
fFalse
gTrue
hFalse
Name: one, dtype: boolIn [9]: df2['four'].notna()
Out[9]:
aTrue
bFalse
cTrue
dFalse
eTrue
fTrue
gFalse
hTrue
Name: four, dtype: bool
注意在Python中None是相等的:
In [11]: None == None# noqa: E711
Out[11]: True
但是np.nan是不等的:
In [12]: np.nan == np.nan
Out[12]: False
整数类型的缺失值 NaN默认是float类型的,如果是整数类型,我们可以强制进行转换:
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
01
12
2
34
dtype: Int64
Datetimes 类型的缺失值 时间类型的缺失值使用NaT来表示:
In [15]: df2 = df.copy()In [16]: df2['timestamp'] = pd.Timestamp('20120101')In [17]: df2
Out[17]:
onetwothree fourfivetimestamp
a0.469112 -0.282863 -1.509059barTrue 2012-01-01
c -1.1356321.212112 -0.173215barFalse 2012-01-01
e0.119209 -1.044236 -0.861849barTrue 2012-01-01
f -2.104569 -0.4949291.071804barFalse 2012-01-01
h0.721555 -0.706771 -1.039575barTrue 2012-01-01In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nanIn [19]: df2
Out[19]:
onetwothree fourfivetimestamp
aNaN -0.282863 -1.509059barTrueNaT
cNaN1.212112 -0.173215barFalseNaT
e0.119209 -1.044236 -0.861849barTrue 2012-01-01
f -2.104569 -0.4949291.071804barFalse 2012-01-01
hNaN -0.706771 -1.039575barTrueNaTIn [20]: df2.dtypes.value_counts()
Out[20]:
float643
datetime64[ns]1
bool1
object1
dtype: int64
None 和 np.nan 的转换 对于数字类型的,如果赋值为None,那么会转换为相应的NaN类型:
In [21]: s = pd.Series([1, 2, 3])In [22]: s.loc[0] = NoneIn [23]: s
Out[23]:
0NaN
12.0
23.0
dtype: float64
如果是对象类型,使用None赋值,会保持原样:
In [24]: s = pd.Series(["a", "b", "c"])In [25]: s.loc[0] = NoneIn [26]: s.loc[1] = np.nanIn [27]: s
Out[27]:
0None
1NaN
2c
dtype: object
缺失值的计算 缺失值的数学计算还是缺失值:
In [28]: a
Out[28]:
onetwo
aNaN -0.282863
cNaN1.212112
e0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771In [29]: b
Out[29]:
onetwothree
aNaN -0.282863 -1.509059
cNaN1.212112 -0.173215
e0.119209 -1.044236 -0.861849
f -2.104569 -0.4949291.071804
hNaN -0.706771 -1.039575In [30]: a + b
Out[30]:
onethreetwo
aNaNNaN -0.565727
cNaNNaN2.424224
e0.238417NaN -2.088472
f -4.209138NaN -0.989859
hNaNNaN -1.413542
但是在统计中会将NaN当成0来对待。
In [31]: df
Out[31]:
onetwothree
aNaN -0.282863 -1.509059
cNaN1.212112 -0.173215
e0.119209 -1.044236 -0.861849
f -2.104569 -0.4949291.071804
hNaN -0.706771 -1.039575In [32]: df['one'].sum()
Out[32]: -1.9853605075978744In [33]: df.mean(1)
Out[33]:
a-0.895961
c0.519449
e-0.595625
f-0.509232
h-0.873173
dtype: float64
如果是在cumsum或者cumprod中,默认是会跳过NaN,如果不想统计NaN,可以加上参数skipna=False
In [34]: df.cumsum()
Out[34]:
onetwothree
aNaN -0.282863 -1.509059
cNaN0.929249 -1.682273
e0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
hNaN -1.316688 -2.511893In [35]: df.cumsum(skipna=False)
Out[35]:
onetwothree
aNaN -0.282863 -1.509059
cNaN0.929249 -1.682273
eNaN -0.114987 -2.544122
fNaN -0.609917 -1.472318
hNaN -1.316688 -2.511893
使用fillna填充NaN数据 数据分析中,如果有NaN数据,那么需要对其进行处理,一种处理方法就是使用fillna来进行填充。
下面填充常量:
In [42]: df2
Out[42]:
onetwothree fourfivetimestamp
aNaN -0.282863 -1.509059barTrueNaT
cNaN1.212112 -0.173215barFalseNaT
e0.119209 -1.044236 -0.861849barTrue 2012-01-01
f -2.104569 -0.4949291.071804barFalse 2012-01-01
hNaN -0.706771 -1.039575barTrueNaTIn [43]: df2.fillna(0)
Out[43]:
onetwothree fourfivetimestamp
a0.000000 -0.282863 -1.509059barTrue0
c0.0000001.212112 -0.173215barFalse0
e0.119209 -1.044236 -0.861849barTrue2012-01-01 00:00:00
f -2.104569 -0.4949291.071804barFalse2012-01-01 00:00:00
h0.000000 -0.706771 -1.039575barTrue0
还可以指定填充方法,比如pad:
In [45]: df
Out[45]:
onetwothree
aNaN -0.282863 -1.509059
cNaN1.212112 -0.173215
e0.119209 -1.044236 -0.861849
f -2.104569 -0.4949291.071804
hNaN -0.706771 -1.039575In [46]: df.fillna(method='pad')
Out[46]:
onetwothree
aNaN -0.282863 -1.509059
cNaN1.212112 -0.173215
e0.119209 -1.044236 -0.861849
f -2.104569 -0.4949291.071804
h -2.104569 -0.706771 -1.039575
可以指定填充的行数:
In [48]: df.fillna(method='pad', limit=1)
fill方法统计:
方法名 | 描述 |
---|---|
pad / ffill | 向前填充 |
bfill / backfill | 向后填充 |
In [53]: dff
Out[53]:
ABC
00.271860 -0.4249720.567020
10.276232 -1.087401 -0.673690
20.113648 -1.4784270.524988
3NaN0.577046 -1.715002
4NaNNaN -1.157892
5 -1.344312NaNNaN
6 -0.1090501.643563NaN
70.357021 -0.674600NaN
8 -0.968914 -1.2945240.413738
90.276662 -0.472035 -0.013960In [54]: dff.fillna(dff.mean())
Out[54]:
ABC
00.271860 -0.4249720.567020
10.276232 -1.087401 -0.673690
20.113648 -1.4784270.524988
3 -0.1408570.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.1090501.643563 -0.293543
70.357021 -0.674600 -0.293543
8 -0.968914 -1.2945240.413738
90.276662 -0.472035 -0.013960In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
ABC
00.271860 -0.4249720.567020
10.276232 -1.087401 -0.673690
20.113648 -1.4784270.524988
3NaN0.577046 -1.715002
4NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.1090501.643563 -0.293543
70.357021 -0.674600 -0.293543
8 -0.968914 -1.2945240.413738
90.276662 -0.472035 -0.013960
上面操作等同于:
In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
使用dropna删除包含NA的数据 除了fillna来填充数据之外,还可以使用dropna删除包含na的数据。
In [57]: df
Out[57]:
onetwothree
aNaN -0.282863 -1.509059
cNaN1.212112 -0.173215
eNaN0.0000000.000000
fNaN0.0000000.000000
hNaN -0.706771 -1.039575In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []In [59]: df.dropna(axis=1)
Out[59]:
twothree
a -0.282863 -1.509059
c1.212112 -0.173215
e0.0000000.000000
f0.0000000.000000
h -0.706771 -1.039575In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
插值interpolation 数据分析时候,为了数据的平稳,我们需要一些插值运算interpolate() ,使用起来很简单:
In [61]: ts
Out[61]:
2000-01-310.469112
2000-02-29NaN
2000-03-31NaN
2000-04-28NaN
2000-05-31NaN
...
2007-12-31-6.950267
2008-01-31-7.904475
2008-02-29-6.441779
2008-03-31-8.184940
2008-04-30-9.011531
Freq: BM, Length: 100, dtype: float64
In [64]: ts.interpolate()
Out[64]:
2000-01-310.469112
2000-02-290.434469
2000-03-310.399826
2000-04-280.365184
2000-05-310.330541
...
2007-12-31-6.950267
2008-01-31-7.904475
2008-02-29-6.441779
2008-03-31-8.184940
2008-04-30-9.011531
Freq: BM, Length: 100, dtype: float64
插值函数还可以添加参数,指定插值的方法,比如按时间插值:
In [67]: ts2
Out[67]:
2000-01-310.469112
2000-02-29NaN
2002-07-31-5.785037
2005-01-31NaN
2008-04-30-9.011531
dtype: float64In [68]: ts2.interpolate()
Out[68]:
2000-01-310.469112
2000-02-29-2.657962
2002-07-31-5.785037
2005-01-31-7.398284
2008-04-30-9.011531
dtype: float64In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-310.469112
2000-02-290.270241
2002-07-31-5.785037
2005-01-31-7.190866
2008-04-30-9.011531
dtype: float64
按index的float value进行插值:
In [70]: ser
Out[70]:
0.00.0
1.0NaN
10.010.0
dtype: float64In [71]: ser.interpolate()
Out[71]:
0.00.0
1.05.0
10.010.0
dtype: float64In [72]: ser.interpolate(method='values')
Out[72]:
0.00.0
1.01.0
10.010.0
dtype: float64
除了插值Series,还可以插值DF:
In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
....:'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
....: In [74]: df
Out[74]:
AB
01.00.25
12.1NaN
2NaNNaN
34.74.00
45.612.20
56.814.40In [75]: df.interpolate()
Out[75]:
AB
01.00.25
12.11.50
23.42.75
34.74.00
45.612.20
56.814.40
interpolate还接收limit参数,可以指定插值的个数。
In [95]: ser.interpolate(limit=1)
Out[95]:
0NaN
1NaN
25.0
37.0
4NaN
5NaN
613.0
713.0
8NaN
dtype: float64
使用replace替换值 replace可以替换常量,也可以替换list:
In [102]: ser = pd.Series([0., 1., 2., 3., 4.])In [103]: ser.replace(0, 5)
Out[103]:
05.0
11.0
22.0
33.0
44.0
dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
04.0
13.0
22.0
31.0
40.0
dtype: float64
可以替换DF中特定的数值:
In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
ab
0100100
116
227
338
449
可以使用插值替换:
In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
00.0
10.0
20.0
30.0
44.0
dtype: float64
本文已收录于 http://www.flydean.com/07-python-pandas-missingdata/
最通俗的解读,最深刻的干货,最简洁的教程,众多你不知道的小技巧等你来发现!
【Pandas高级教程之:处理缺失数据】欢迎关注我的公众号:「程序那些事」,懂技术,更懂你!
推荐阅读
- 2.6|2.6 Photoshop操作步骤的撤消和重做 [Ps教程]
- 漫画初学者如何学习漫画背景的透视画法(这篇教程请收藏好了!)
- 唐嫣可真会穿,西装搭牛仔裤都能穿出高级感,一双大长腿太抢镜
- 鹿鸣高级营养老师徐老师分享应该注意的6种食物
- Java基础-高级特性-枚举实现状态机
- HTTP高级(Cookie,Session|HTTP高级(Cookie,Session ,LocalStorage )
- 用npm发布一个包的教程并编写一个vue的插件发布
- 程序员|【高级Java架构师系统学习】毕业一年萌新的Java大厂面经,最新整理
- 20180322【w4复盘日志】
- 狗狗定点大小便视频教程下载地址