文章目录
- Concat与Append操作
-
- 通过pd.concat实现简易合并
-
- 索引重复
-
- 捕获索引重复的错误(参数verify_interity)
- 忽略索引(参数ignore_index)
- 增加多级索引(参数keys)
- 类似join的合并
-
- 用join参数设置合并方式(所有列 or 列的交集)
- append()方法
- 合并数据集:合并(merge)与连接(join)
-
- 数据连接类型(pd.merge)
-
- 一对一
- 多对一
- 多对多
- 设置数据合并的键
-
- 参数on的用法
- left_on与right_on参数
-
- drop去除多余列
- left_index与right_index参数
- 设置数据连接的集合操作规则
-
- how参数设置连接方式
- 重复列名:suffixes参数
Concat与Append操作 通过pd.concat实现简易合并
# 定义一个创建DataFrame某种形式的函数
def make_df(cols, ind):
data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data,ind)## 通过pd.concat实现简易合并
ser1 = pd.Series(['A','B','C'],index=[1,2,3])
ser2 = pd.Series(['D','E','F'],index=[4,5,6])
print(pd.concat([ser1,ser2]))
'''
1A
2B
3C
4D
5E
6F
dtype: object
'''
df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
print(df1)
'''
AB
1A1B1
2A2B2
'''
print(df2)
'''
AB
3A3B3
4A4B4
'''
print(pd.concat([df1,df2]))
'''
AB
1A1B1
2A2B2
3A3B3
4A4B4
'''
df3 = make_df('AB',[0,1])
df4 = make_df('CD',[0,1])
print(df3)
'''
AB
0A0B0
1A1B1
'''
print(df4)
'''
CD
0C0D0
1C1D1
'''
print(pd.concat([df3,df4],axis=1))
'''
ABCD
0A0B0C0D0
1A1B1C1D1
'''
索引重复
# 定义一个创建DataFrame某种形式的函数
def make_df(cols, ind):
data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data,ind)# 索引重复
# np.concatrnate与pd.concat区别是Pandas会保留索引,即使索引是重复的
x = make_df('AB',[0,1])
y = make_df('AB',[2,3])
y.index = x.index # 复制索引
print(x)
'''
AB
0A0B0
1A1B1
'''
print(y)
'''
AB
0A2B2
1A3B3
'''
print(pd.concat([x,y]))
'''
AB
0A0B0
1A1B1
0A2B2
1A3B3
'''
结果索引是重复的,pd.concat提供了解决该问题的方法
捕获索引重复的错误(参数verify_interity) 设置参数verify_interity为True,合并有重复时会触发异常
try:
pd.concat([x,y], verify_integrity=True)
except ValueError as e:
print("ValueError:",e)# ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
忽略索引(参数ignore_index) 设置参数ignore_index为True,合并会创建新的整数索引
print(pd.concat([x,y],ignore_index=True))
'''
AB
0A0B0
1A1B1
2A2B2
3A3B3
'''
增加多级索引(参数keys) 设置参数keys为数据源设置多级索引标签
print(pd.concat([x,y],keys=['x','y']))
'''
AB
x 0A0B0
1A1B1
y 0A2B2
1A3B3
'''
类似join的合并
# 定义一个创建DataFrame某种形式的函数
def make_df(cols, ind):
data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data,ind)x = make_df('ABC',[1,2])
y = make_df('BCD',[3,4])
y.index = x.index # 复制索引
print(x)
'''
ABC
1A1B1C1
2A2B2C2
'''
print(y)
'''
BCD
1B3C3D3
2B4C4D4
'''
print(pd.concat([x,y]))
'''
ABCD
1A1B1C1NaN
2A2B2C2NaN
1NaNB3C3D3
2NaNB4C4D4
'''
用join参数设置合并方式(所有列 or 列的交集) 缺失的数据用NaN表示,如果不想这样,可以用join参数设置合并方式,默认方式是对所有列进行合并(join=‘outer’),join='inner’是对列的交集合并
print(pd.concat([x,y],join='inner'))
'''
BC
1B1C1
2B2C2
1B3C3
2B4C4
'''
append()方法
x.append(y)效果与pd.concat([x,y])一样
# 定义一个创建DataFrame某种形式的函数
def make_df(cols, ind):
data = https://www.it610.com/article/{c:[str(c)+str(i) for i in ind] for c in cols}
return pd.DataFrame(data,ind)x = make_df('AB',[1,2])
y = make_df('AB',[3,4])
print(x)
'''
AB
1A1B1
2A2B2
'''
print(y)
'''
AB
3A3B3
4A4B4
'''
print(pd.concat([x,y]))
'''
AB
1A1B1
2A2B2
3A3B3
4A4B4
'''
print(x.append(y))
'''
AB
1A1B1
2A2B2
3A3B3
4A4B4
'''
合并数据集:合并(merge)与连接(join) 数据连接类型(pd.merge) 一对一
共同列的位置可以是不一样,merge会正确处理该问题,另外,merge会自动丢弃原来的行索引,可以自定义
df1 = pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'],
'group':['Accounting','Engineering','Engineering','HR']})
df2 = pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'],
'hirer_data':[2004,2008,2012,2014]})
print(df1)
'''
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
'''
print(df2)
'''
employeehirer_data
0Lisa2004
1Bob2008
2Jake2012
3Sue2014
'''
df3 = pd.merge(df1, df2)
print(df3)
'''
employeegrouphirer_data
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
'''
多对一
指连接的两个列中有一列的值有重复,结果中的supervisor会因为group有重复而有重复
df4 = pd.DataFrame({'group':['Accounting','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']})
print(df3)
'''
employeegrouphirer_data
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
'''
print(df4)
'''
group supervisor
0AccountingCarly
1EngineeringGuido
2HRSteve
'''
print(pd.merge(df3,df4))
'''
employeegrouphirer_data supervisor
0BobAccounting2008Carly
1JakeEngineering2012Guido
2LisaEngineering2004Guido
3SueHR2014Steve
'''
多对多
两个输入的共同列都包含重复值,则为多对多
df5 = pd.DataFrame({'group':['Accounting','Accounting','Engineering','Engineering','HR','HR'],
'skills':['math','spreadsheets','coding','linux','spreadsheets','organization']})
print(df1)
'''
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
'''
print(df5)
'''
groupskills
0Accountingmath
1Accountingspreadsheets
2Engineeringcoding
3Engineeringlinux
4HRspreadsheets
5HRorganization
'''
print(pd.merge(df1,df5))
'''
employeegroupskills
0BobAccountingmath
1BobAccountingspreadsheets
2JakeEngineeringcoding
3JakeEngineeringlinux
4LisaEngineeringcoding
5LisaEngineeringlinux
6SueHRspreadsheets
7SueHRorganization
'''
设置数据合并的键 merge默认将两个输入的一个或多个共同列作为键进行合并。但由于两个输入要合并的列通常不是同名的,merge提供了一些参数处理
参数on的用法
将参数on设置为一个列名字符串或者一个包含多个列名称的列表,这个参数只能在两个DataFrame有共同列名的时候才可以使用
df1 = pd.DataFrame({'employee':['Bob','Jake','Lisa','Sue'],
'group':['Accounting','Engineering','Engineering','HR']})
df2 = pd.DataFrame({'employee':['Lisa','Bob','Jake','Sue'],
'hirer_data':[2004,2008,2012,2014]})
print(df1)
'''
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
'''
print(df2)
'''
employeehirer_data
0Lisa2004
1Bob2008
2Jake2012
3Sue2014
'''
print(pd.merge(df1,df2,on='employee'))
'''
employeegrouphirer_data
0BobAccounting2008
1JakeEngineering2012
2LisaEngineering2004
3SueHR2014
'''
left_on与right_on参数
合并两个列名不同的数据集
df3 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'salary':[70000,80000,120000,90000]})
print(df1)
'''
employeegroup
0BobAccounting
1JakeEngineering
2LisaEngineering
3SueHR
'''
print(df3)
'''
namesalary
0Bob70000
1Jake80000
2Lisa120000
3Sue90000
'''
print(pd.merge(df1,df3,left_on="employee",right_on="name"))
'''
employeegroupnamesalary
0BobAccountingBob70000
1JakeEngineeringJake80000
2LisaEngineeringLisa120000
3SueHRSue90000
'''
drop去除多余列
print(pd.merge(df1,df3,left_on="employee",right_on="name").drop('name',axis=1))
'''
employeegroupsalary
0BobAccounting70000
1JakeEngineering80000
2LisaEngineering120000
3SueHR90000
'''
left_index与right_index参数
【Python|Pandas合并数据集】除了合并列之外,可能还需要合并索引
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a)
'''
group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR
'''
print(df2a)
'''
hirer_data
employee
Lisa2004
Bob2008
Jake2012
Sue2014
'''
print(pd.merge(df1a,df2a,left_index=True,right_index=True))
'''
grouphirer_data
employee
BobAccounting2008
JakeEngineering2012
LisaEngineering2004
SueHR2014
'''
为了方便考虑,DataFrame实现了join方法,可以按照索引进行数据合并
print(df1a.join(df2a))
'''
grouphirer_data
employee
BobAccounting2008
JakeEngineering2012
LisaEngineering2004
SueHR2014
'''
如果想要将索引与列混合使用,可以通过结合left_index和right_on,或者结合left_on与right_index来实现
print(df1a)
'''
group
employee
BobAccounting
JakeEngineering
LisaEngineering
SueHR
'''
print(df3)
'''
namesalary
0Bob70000
1Jake80000
2Lisa120000
3Sue90000
'''
print(pd.merge(df1a,df3,left_index=True,right_on='name'))
'''
groupnamesalary
0AccountingBob70000
1EngineeringJake80000
2EngineeringLisa120000
3HRSue90000
'''
设置数据连接的集合操作规则 当一个值出现在一列,却没有出现在另一列时,需要考虑集合操作规则
df6 = pd.DataFrame({'name':['Peter','Paul','Mary'],
'food':['fish','beans','bread']},
columns=['name','food'])
df7 = pd.DataFrame({'name':['Mary','Joseph'],
'drink':['wine','beer']},
columns=['name','drink'])
print(df6)
'''
namefood
0Peterfish
1Paulbeans
2Marybread
'''
print(df7)
'''
name drink
0Marywine
1Josephbeer
'''
print(pd.merge(df6,df7))
'''
namefood drink
0Marybreadwine
'''
合并两个数据集,在name列中只有一个共同值Mary。默认结果只输出两个集合的交集,成为内连接
how参数设置连接方式
#how的值有inner(内连接),outer(外连接),left(左连接),right(有连接)
print(pd.merge(df6,df7,how='inner'))
'''
namefood drink
0Marybreadwine
'''
print(pd.merge(df6,df7,how='outer'))
'''
namefood drink
0PeterfishNaN
1PaulbeansNaN
2Marybreadwine
3JosephNaNbeer
'''
print(pd.merge(df6,df7,how='left'))
'''
namefood drink
0PeterfishNaN
1PaulbeansNaN
2Marybreadwine
'''
print(pd.merge(df6,df7,how='right'))
'''
namefood drink
0Marybreadwine
1JosephNaNbeer
'''
重复列名:suffixes参数 可能两个输入DataFrame有重名列的情况
df8 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[1,2,3,4]})
df9 = pd.DataFrame({'name':['Bob','Jake','Lisa','Sue'],
'rank':[3,1,4,2]})
print(df8)
'''
namerank
0Bob1
1Jake2
2Lisa3
3Sue4
'''
print(df9)
'''
namerank
0Bob3
1Jake1
2Lisa4
3Sue2
'''
print(pd.merge(df8,df9,on="name"))# 列名一样,会自动添加后缀rank_xrank_y
'''
namerank_xrank_y
0Bob13
1Jake21
2Lisa34
3Sue42
'''
print(pd.merge(df8,df9,on="name",suffixes=["_L","_R"])) # suffixes定义后缀
'''
namerank_Lrank_R
0Bob13
1Jake21
2Lisa34
3Sue42
'''
推荐阅读
- python|python编写冒泡算法
- VUE|Flask+VUE 实现页面增删改查显示开发+测试(图文教程附源码)
- #|Python学习笔记 - Python编程规范
- python练习题|python基础20道小练习
- 【原创】Selenium获取请求头、响应头
- 编程语言|TIOBE 5 月编程语言排行榜(Python、C++ 竞争白热化,Objective-C 已沦为小众语言)
- 大数据|从CVPR 2021看计算机视觉的现状
- 算法|机器学习 python 库_Python机器学习库
- 算法|2022数维杯全网思路+数据+代码