Python|Pandas合并数据集


文章目录

  • Concat与Append操作
    • 通过pd.concat实现简易合并
      • 索引重复
        • 捕获索引重复的错误(参数verify_interity)
        • 忽略索引(参数ignore_index)
        • 增加多级索引(参数keys)
      • 类似join的合并
        • 用join参数设置合并方式(所有列 or 列的交集)
      • append()方法
  • 合并数据集:合并(merge)与连接(join)
    • 数据连接类型(pd.merge)
      • 一对一
      • 多对一
      • 多对多
    • 设置数据合并的键
      • 参数on的用法
      • left_on与right_on参数
        • drop去除多余列
      • left_index与right_index参数
    • 设置数据连接的集合操作规则
      • how参数设置连接方式
    • 重复列名:suffixes参数

Concat与Append操作 通过pd.concat实现简易合并
# 定义一个创建DataFrame某种形式的函数 def make_df(cols, ind): data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols} return pd.DataFrame(data,ind)## 通过pd.concat实现简易合并 ser1 = pd.Series(['A','B','C'],index=[1,2,3]) ser2 = pd.Series(['D','E','F'],index=[4,5,6]) print(pd.concat([ser1,ser2])) ''' 1A 2B 3C 4D 5E 6F dtype: object ''' df1 = make_df('AB',[1,2]) df2 = make_df('AB',[3,4]) print(df1) ''' AB 1A1B1 2A2B2 ''' print(df2) ''' AB 3A3B3 4A4B4 ''' print(pd.concat([df1,df2])) ''' AB 1A1B1 2A2B2 3A3B3 4A4B4 ''' df3 = make_df('AB',[0,1]) df4 = make_df('CD',[0,1]) print(df3) ''' AB 0A0B0 1A1B1 ''' print(df4) ''' CD 0C0D0 1C1D1 ''' print(pd.concat([df3,df4],axis=1)) ''' ABCD 0A0B0C0D0 1A1B1C1D1 '''

索引重复
# 定义一个创建DataFrame某种形式的函数 def make_df(cols, ind): data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols} return pd.DataFrame(data,ind)# 索引重复 # np.concatrnate与pd.concat区别是Pandas会保留索引,即使索引是重复的 x = make_df('AB',[0,1]) y = make_df('AB',[2,3]) y.index = x.index # 复制索引 print(x) ''' AB 0A0B0 1A1B1 ''' print(y) ''' AB 0A2B2 1A3B3 ''' print(pd.concat([x,y])) ''' AB 0A0B0 1A1B1 0A2B2 1A3B3 '''

结果索引是重复的,pd.concat提供了解决该问题的方法
捕获索引重复的错误(参数verify_interity) 设置参数verify_interity为True,合并有重复时会触发异常
try: pd.concat([x,y], verify_integrity=True) except ValueError as e: print("ValueError:",e)# ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')

忽略索引(参数ignore_index) 设置参数ignore_index为True,合并会创建新的整数索引
print(pd.concat([x,y],ignore_index=True)) ''' AB 0A0B0 1A1B1 2A2B2 3A3B3 '''

增加多级索引(参数keys) 设置参数keys为数据源设置多级索引标签
print(pd.concat([x,y],keys=['x','y'])) ''' AB x 0A0B0 1A1B1 y 0A2B2 1A3B3 '''

类似join的合并
# 定义一个创建DataFrame某种形式的函数 def make_df(cols, ind): data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols} return pd.DataFrame(data,ind)x = make_df('ABC',[1,2]) y = make_df('BCD',[3,4]) y.index = x.index # 复制索引 print(x) ''' ABC 1A1B1C1 2A2B2C2 ''' print(y) ''' BCD 1B3C3D3 2B4C4D4 ''' print(pd.concat([x,y])) ''' ABCD 1A1B1C1NaN 2A2B2C2NaN 1NaNB3C3D3 2NaNB4C4D4 '''

用join参数设置合并方式(所有列 or 列的交集) 缺失的数据用NaN表示,如果不想这样,可以用join参数设置合并方式,默认方式是对所有列进行合并(join=‘outer’),join='inner’是对列的交集合并
print(pd.concat([x,y],join='inner')) ''' BC 1B1C1 2B2C2 1B3C3 2B4C4 '''

append()方法
x.append(y)效果与pd.concat([x,y])一样
# 定义一个创建DataFrame某种形式的函数 def make_df(cols, ind): data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols} return pd.DataFrame(data,ind)x = make_df('AB',[1,2]) y = make_df('AB',[3,4]) print(x) ''' AB 1A1B1 2A2B2 ''' print(y) ''' AB 3A3B3 4A4B4 ''' print(pd.concat([x,y])) ''' AB 1A1B1 2A2B2 3A3B3 4A4B4 ''' print(x.append(y)) ''' AB 1A1B1 2A2B2 3A3B3 4A4B4 '''

合并数据集:合并(merge)与连接(join) 数据连接类型(pd.merge) 一对一
共同列的位置可以是不一样,merge会正确处理该问题,另外,merge会自动丢弃原来的行索引,可以自定义
df1 = pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'], 'group':['Accounting','Engineering','Engineering','HR']}) df2 = pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'], 'hirer_data':[2004,2008,2012,2014]}) print(df1) ''' employeegroup 0BobAccounting 1JakeEngineering 2LisaEngineering 3SueHR ''' print(df2) ''' employeehirer_data 0Lisa2004 1Bob2008 2Jake2012 3Sue2014 ''' df3 = pd.merge(df1, df2) print(df3) ''' employeegrouphirer_data 0BobAccounting2008 1JakeEngineering2012 2LisaEngineering2004 3SueHR2014 '''

多对一
指连接的两个列中有一列的值有重复,结果中的supervisor会因为group有重复而有重复
df4 = pd.DataFrame({'group':['Accounting','Engineering','HR'], 'supervisor':['Carly','Guido','Steve']}) print(df3) ''' employeegrouphirer_data 0BobAccounting2008 1JakeEngineering2012 2LisaEngineering2004 3SueHR2014 ''' print(df4) ''' group supervisor 0AccountingCarly 1EngineeringGuido 2HRSteve ''' print(pd.merge(df3,df4)) ''' employeegrouphirer_data supervisor 0BobAccounting2008Carly 1JakeEngineering2012Guido 2LisaEngineering2004Guido 3SueHR2014Steve '''

多对多
两个输入的共同列都包含重复值,则为多对多
df5 = pd.DataFrame({'group':['Accounting','Accounting','Engineering','Engineering','HR','HR'], 'skills':['math','spreadsheets','coding','linux','spreadsheets','organization']}) print(df1) ''' employeegroup 0BobAccounting 1JakeEngineering 2LisaEngineering 3SueHR ''' print(df5) ''' groupskills 0Accountingmath 1Accountingspreadsheets 2Engineeringcoding 3Engineeringlinux 4HRspreadsheets 5HRorganization ''' print(pd.merge(df1,df5)) ''' employeegroupskills 0BobAccountingmath 1BobAccountingspreadsheets 2JakeEngineeringcoding 3JakeEngineeringlinux 4LisaEngineeringcoding 5LisaEngineeringlinux 6SueHRspreadsheets 7SueHRorganization '''

设置数据合并的键 merge默认将两个输入的一个或多个共同列作为键进行合并。但由于两个输入要合并的列通常不是同名的,merge提供了一些参数处理
参数on的用法
将参数on设置为一个列名字符串或者一个包含多个列名称的列表,这个参数只能在两个DataFrame有共同列名的时候才可以使用
df1 = pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'], 'group':['Accounting','Engineering','Engineering','HR']}) df2 = pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'], 'hirer_data':[2004,2008,2012,2014]}) print(df1) ''' employeegroup 0BobAccounting 1JakeEngineering 2LisaEngineering 3SueHR ''' print(df2) ''' employeehirer_data 0Lisa2004 1Bob2008 2Jake2012 3Sue2014 ''' print(pd.merge(df1,df2,on='employee')) ''' employeegrouphirer_data 0BobAccounting2008 1JakeEngineering2012 2LisaEngineering2004 3SueHR2014 '''

left_on与right_on参数
合并两个列名不同的数据集
df3 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'], 'salary':[70000,80000,120000,90000]}) print(df1) ''' employeegroup 0BobAccounting 1JakeEngineering 2LisaEngineering 3SueHR ''' print(df3) ''' namesalary 0Bob70000 1Jake80000 2Lisa120000 3Sue90000 ''' print(pd.merge(df1,df3,left_on="employee",right_on="name")) ''' employeegroupnamesalary 0BobAccountingBob70000 1JakeEngineeringJake80000 2LisaEngineeringLisa120000 3SueHRSue90000 '''

drop去除多余列
print(pd.merge(df1,df3,left_on="employee",right_on="name").drop('name',axis=1)) ''' employeegroupsalary 0BobAccounting70000 1JakeEngineering80000 2LisaEngineering120000 3SueHR90000 '''

left_index与right_index参数
【Python|Pandas合并数据集】除了合并列之外,可能还需要合并索引
df1a = df1.set_index('employee') df2a = df2.set_index('employee') print(df1a) ''' group employee BobAccounting JakeEngineering LisaEngineering SueHR ''' print(df2a) ''' hirer_data employee Lisa2004 Bob2008 Jake2012 Sue2014 ''' print(pd.merge(df1a,df2a,left_index=True,right_index=True)) ''' grouphirer_data employee BobAccounting2008 JakeEngineering2012 LisaEngineering2004 SueHR2014 '''

为了方便考虑,DataFrame实现了join方法,可以按照索引进行数据合并
print(df1a.join(df2a)) ''' grouphirer_data employee BobAccounting2008 JakeEngineering2012 LisaEngineering2004 SueHR2014 '''

如果想要将索引与列混合使用,可以通过结合left_index和right_on,或者结合left_on与right_index来实现
print(df1a) ''' group employee BobAccounting JakeEngineering LisaEngineering SueHR ''' print(df3) ''' namesalary 0Bob70000 1Jake80000 2Lisa120000 3Sue90000 ''' print(pd.merge(df1a,df3,left_index=True,right_on='name')) ''' groupnamesalary 0AccountingBob70000 1EngineeringJake80000 2EngineeringLisa120000 3HRSue90000 '''

设置数据连接的集合操作规则 当一个值出现在一列,却没有出现在另一列时,需要考虑集合操作规则
df6 = pd.DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']}, columns=['name','food']) df7 = pd.DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}, columns=['name','drink']) print(df6) ''' namefood 0Peterfish 1Paulbeans 2Marybread ''' print(df7) ''' name drink 0Marywine 1Josephbeer ''' print(pd.merge(df6,df7)) ''' namefood drink 0Marybreadwine '''

合并两个数据集,在name列中只有一个共同值Mary。默认结果只输出两个集合的交集,成为内连接
how参数设置连接方式
#how的值有inner(内连接),outer(外连接),left(左连接),right(有连接)
print(pd.merge(df6,df7,how='inner')) ''' namefood drink 0Marybreadwine ''' print(pd.merge(df6,df7,how='outer')) ''' namefood drink 0PeterfishNaN 1PaulbeansNaN 2Marybreadwine 3JosephNaNbeer ''' print(pd.merge(df6,df7,how='left')) ''' namefood drink 0PeterfishNaN 1PaulbeansNaN 2Marybreadwine ''' print(pd.merge(df6,df7,how='right')) ''' namefood drink 0Marybreadwine 1JosephNaNbeer '''

重复列名:suffixes参数 可能两个输入DataFrame有重名列的情况
df8 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'], 'rank':[1,2,3,4]}) df9 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'], 'rank':[3,1,4,2]}) print(df8) ''' namerank 0Bob1 1Jake2 2Lisa3 3Sue4 ''' print(df9) ''' namerank 0Bob3 1Jake1 2Lisa4 3Sue2 ''' print(pd.merge(df8,df9,on="name"))# 列名一样,会自动添加后缀rank_xrank_y ''' namerank_xrank_y 0Bob13 1Jake21 2Lisa34 3Sue42 ''' print(pd.merge(df8,df9,on="name",suffixes=["_L","_R"])) # suffixes定义后缀 ''' namerank_Lrank_R 0Bob13 1Jake21 2Lisa34 3Sue42 '''

    推荐阅读