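Missing data occurs when no information is provided for one or more items, or for an entire unit. It is a major problem in real-world datasets, because the lost values usually cannot be recovered.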
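In Pandas, missing data is represented by two values:
- None: a Python singleton object that is often used to mark missing data in Python code.
- NaN (short for Not a Number): a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.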
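The relationship between the two markers can be checked with a short sketch. The Series name and values below are made up for illustration; the point is only that Pandas treats None and NaN alike in numeric data.

```python
import numpy as np
import pandas as pd

# Illustrative values: one None and one np.nan among ordinary numbers.
scores = pd.Series([100, None, np.nan, 95])

# In a numeric Series, None is converted to NaN, so the dtype is float64.
print(scores.dtype)

# Both markers are reported as missing.
print(scores.isnull())

# NaN never compares equal to itself, which is why isnull() is used
# instead of an == comparison.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```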
Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame:
- isnull()
- notnull()
- dropna()
- fillna()
- replace()
- interpolate()
Checking for missing values using isnull()
To check for null values in a Pandas DataFrame, we use the isnull() function. It returns a DataFrame of boolean values that are True wherever the original value is NaN.
Code 1:
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# True marks a missing value
df.isnull()
Output:
[image: boolean DataFrame returned by df.isnull()]
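Because isnull() returns booleans, its result can be aggregated to summarize where values are missing. The following is a minimal sketch that reuses the score data from Code 1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Summing a boolean mask counts the True entries, i.e. missing values per column.
missing_per_column = df.isnull().sum()

# any(axis=1) flags every row that contains at least one missing value.
rows_with_missing = df.isnull().any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```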
Code 2:
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# boolean Series that is True where Gender is NaN
bool_series = pd.isnull(data["Gender"])

# display only the rows where Gender is NaN
data[bool_series]
Output:
As the output image shows, only the rows where Gender is null are displayed.
[image: rows of employees.csv with a missing Gender]
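The same filtering can be tried without the CSV file. This sketch assumes a small hand-made table; only the Gender column name comes from the example above, while the "First Name" column and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
employees = pd.DataFrame({"First Name": ["Al", "Bo", "Cy", "Dee"],
                          "Gender": ["Male", np.nan, "Female", np.nan]})

# Boolean mask: True where Gender is missing.
gender_missing = pd.isnull(employees["Gender"])

# Indexing with the mask keeps only the rows where it is True.
missing_gender = employees[gender_missing]
print(missing_gender)
```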
Checking for non-missing values using notnull()
To find the values in a Pandas DataFrame that are not null, we use the notnull() function. It returns a DataFrame of boolean values that are False wherever the original value is NaN.
Code 3:
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# False marks a missing value
df.notnull()
Output:
[image: boolean DataFrame returned by df.notnull()]
Code 4:
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# boolean Series that is True where Gender is not NaN
bool_series = pd.notnull(data["Gender"])

# display only the rows where Gender is not NaN
data[bool_series]
Output:
As the output image shows, only the rows where Gender is not null are displayed.
[image: rows of employees.csv with a Gender value]
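Since notnull() is the logical complement of isnull(), negating one mask should give the other. A small sketch on the score data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

present = df.notnull()

# The ~ operator inverts a boolean DataFrame element by element.
masks_agree = present.equals(~df.isnull())

# Summing the mask counts the non-missing values in each column.
present_per_column = present.sum()

print(masks_agree)
print(present_per_column)
```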
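Filling missing values using fillna(), replace() and interpolate()
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions, which replace NaN values with values of our choosing.
The interpolate() function fills NA values by applying one of several interpolation techniques instead of a hard-coded value.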
Code 1: Filling null values with a single value
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# replace every missing value with 0
df.fillna(0)
Output:
[image: DataFrame with NaN replaced by 0]
Code 2: Filling null values with the previous value
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# fill each missing value with the previous value in its column
# (the method argument is deprecated since pandas 2.1; df.ffill() is the newer spelling)
df.fillna(method='pad')
Output:
[image: DataFrame after forward filling]
Code 3: Filling null values with the next value
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# fill each missing value with the next value in its column
# (the method argument is deprecated since pandas 2.1; df.bfill() is the newer spelling)
df.fillna(method='bfill')
Output:
[image: DataFrame after backward filling]
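In recent pandas releases the method argument of fillna() is deprecated, and the dedicated ffill() and bfill() methods are the suggested replacements. The sketch below applies them to the same score data; note that a gap with no neighbor in the fill direction stays NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Forward fill: copy the previous value in the column downwards.
forward_filled = df.ffill()

# Backward fill: copy the next value in the column upwards.
backward_filled = df.bfill()

print(forward_filled)
print(backward_filled)
```

The first row of 'Third Score' has no previous value, so ffill() leaves it missing; likewise bfill() cannot fill the last row of 'Second Score'.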
Code 4: Filling null values in a CSV file
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# show rows 10 to 24 of the DataFrame for inspection
data[10:25]
[image: rows 10 to 24 of employees.csv]
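Now we fill all the null values in the Gender column with "No Gender".
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# fill the null values and assign the result back to the column
# (calling fillna(..., inplace=True) on a selected column may not update data in newer pandas)
data["Gender"] = data["Gender"].fillna("No Gender")

data
Output:
[image: employees.csv with missing Gender values filled]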
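Assigning the result of fillna() back to the column is one way to make the change persist regardless of pandas version. The sketch below uses a small hypothetical table in place of employees.csv; only the Gender column name is taken from the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
employees = pd.DataFrame({"First Name": ["Al", "Bo", "Cy", "Dee"],
                          "Gender": ["Male", np.nan, "Female", np.nan]})

# fillna() returns a new Series; assigning it back updates the DataFrame.
employees["Gender"] = employees["Gender"].fillna("No Gender")

remaining_nulls = int(employees["Gender"].isnull().sum())
print(employees)
```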
Code 5: Filling null values using the replace() method
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# show rows 10 to 24 of the DataFrame for inspection
data[10:25]
Output:
[image: rows 10 to 24 of employees.csv]
Now we replace all the NaN values in the DataFrame with -99.
# import pandas and numpy
import pandas as pd
import numpy as np

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# replace every NaN value in the DataFrame with -99
data.replace(to_replace=np.nan, value=-99)
Output:
[image: employees.csv with NaN replaced by -99]
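For a runnable version without the CSV file, the following sketch applies replace() to a small numeric table. The column names and values are invented for illustration. For this particular substitution the result matches fillna(-99).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with gaps.
pay = pd.DataFrame({"Salary": [100.0, np.nan, 300.0],
                    "Bonus": [np.nan, 5.0, 7.5]})

# replace() swaps every occurrence of to_replace for value.
replaced = pay.replace(to_replace=np.nan, value=-99)

# For NaN specifically, fillna() gives the same result.
same_as_fillna = replaced.equals(pay.fillna(-99))

print(replaced)
```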
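Code 6: Filling missing values with the interpolate() function, using the linear method.
# import pandas
import pandas as pd

# create the DataFrame (None is stored as NaN in numeric columns)
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# display the DataFrame
df
[image: DataFrame with missing values in columns A to D]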
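Let us interpolate the missing values with the linear method. Note that the linear method ignores the index and treats the values as equally spaced.
# interpolate the missing values
df.interpolate(method='linear', limit_direction='forward')
Output:
[image: DataFrame after linear interpolation]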
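As the output shows, the missing value in the first row (column B) could not be filled: the fill direction is forward, and there is no earlier value to interpolate from.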
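The effect of the fill direction can be verified directly. This sketch rebuilds the same DataFrame and compares limit_direction='forward' with limit_direction='both'; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, np.nan, 1],
                   "B": [np.nan, 2, 54, 3, np.nan],
                   "C": [20, 16, np.nan, 3, 8],
                   "D": [14, 3, np.nan, np.nan, 6]})

# Forward only: gaps are filled using values that come before them.
forward = df.interpolate(method="linear", limit_direction="forward")

# Both directions: a leading gap is also filled, from the first valid value.
both = df.interpolate(method="linear", limit_direction="both")

print(forward)
print(both)
```

With 'forward', the leading NaN in column B remains, while the trailing NaN in B takes the last valid value. With 'both', no missing values are left.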
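Dropping missing values using dropna()
To remove null values from a DataFrame, we use the dropna() function, which can drop the rows or columns that contain null values in several different ways.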
Code 1: Dropping rows with at least one null value.
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

df
[image: DataFrame before dropping rows]
Now we drop the rows that have at least one NaN (null) value.
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# drop every row that contains at least one NaN
df.dropna()
Output:
[image: DataFrame after dropping rows with any NaN]
Code 2: Dropping rows only if all values in the row are missing.
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

df
[image: DataFrame with one row that is entirely NaN]
Now we drop the rows in which every value is missing (NaN).
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# drop only the rows where all values are NaN
df.dropna(how='all')
Output:
[image: DataFrame after dropping the all-NaN row]
Code 3: Dropping columns with at least one null value.
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

df
[image: DataFrame before dropping columns]
Now we drop the columns that have at least one missing value.
# import pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
scores = {'First Score': [100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, np.nan, 80, 98],
          'Fourth Score': [60, 67, 68, 65]}

# create a DataFrame from the dictionary
df = pd.DataFrame(scores)

# axis=1 drops columns instead of rows
df.dropna(axis=1)
Output:
[image: DataFrame containing only the Fourth Score column]
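The three dropna() variants can be compared side by side on one DataFrame. The data below is a variation on the score tables above, chosen so that each call behaves differently; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, np.nan, 80, 98],
                   'Fourth Score': [60, np.nan, 68, 65]})

# Default (how='any'): drop rows that contain at least one NaN.
any_dropped = df.dropna()

# how='all': drop only rows in which every value is NaN.
all_dropped = df.dropna(how='all')

# axis=1: drop columns that contain at least one NaN.
cols_dropped = df.dropna(axis=1)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(cols_dropped.columns.tolist())
```

Here every column contains a NaN in row 1, so dropping by column leaves no columns at all. dropna() returns a new DataFrame each time, and df itself keeps its four rows.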
Code 4: Dropping rows with at least one null value in a CSV file
# import pandas
import pandas as pd

# read the data from a CSV file
data = pd.read_csv("employees.csv")

# new DataFrame without the rows that contain NA values
new_data = data.dropna(axis=0, how='any')

new_data
Output:
[image: employees.csv after dropping rows with NA values]
Now we compare the lengths of the two DataFrames to find out how many rows had at least one null value.
print("Old data frame length:", len(data))
print("New data frame length:", len(new_data))
print("Number of rows with at least 1 NA value: ", (len(data) - len(new_data)))
Output:
Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value:  236
Since the difference is 236, there were 236 rows with at least one null value in some column.
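The same count can be obtained without building a second DataFrame, by flagging the rows that contain any null value. The sketch below uses a tiny hypothetical table instead of employees.csv, so the numbers differ from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
data = pd.DataFrame({"Gender": ["Male", np.nan, "Female", "Male"],
                     "Salary": [100.0, 200.0, np.nan, 400.0]})

# Approach from the text: compare lengths before and after dropna().
new_data = data.dropna(axis=0, how='any')
dropped_count = len(data) - len(new_data)

# Alternative: count rows where any column is null.
flagged_count = int(data.isnull().any(axis=1).sum())

print(dropped_count, flagged_count)
```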