A Detailed Guide to GroupBy in Pandas
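Contents
- Introduction
- Splitting the data
- MultiIndex
- get_group
- dropna
- The groups attribute
- Index levels
- Iterating over groups
- Aggregation
- Common aggregation methods
- Applying multiple aggregations at once
- NamedAgg
- Different aggregations for different columns
- Transformation
- Filtering
- Apply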
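Introduction
A pandas DataFrame supports groupby operations in much the same way a database table does. A groupby operation generally has three parts: splitting the data into groups, applying a function to each group, and combining the results.
This article walks through the groupby operations in pandas in detail.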
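The three steps can be spelled out by hand and compared with a single groupby call. This is only a rough sketch: the column names key and val and the values are made up for illustration.

```python
import pandas as pd

# Illustrative data; the column names "key" and "val" are made up for this sketch.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Split: one Series of values per distinct key.
pieces = {k: df.loc[df["key"] == k, "val"] for k in df["key"].unique()}

# Apply: reduce each piece to a single number.
sums = {k: int(piece.sum()) for k, piece in pieces.items()}

# Combine: assemble the per-group results into one object.
manual = pd.Series(sums).sort_index()

# groupby performs the same three steps in one call.
auto = df.groupby("key")["val"].sum()

print(manual.to_dict())
print(auto.to_dict())
```

Both results should hold the same per-key totals.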
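Splitting the data
The purpose of splitting is to partition the DataFrame into groups. The grouping keys are usually column labels, so the DataFrame is created with named columns. The examples assume numpy is imported as np and pandas as pd. Columns C and D hold random values, so the numbers differ from run to run and between the outputs shown in this article: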
In [61]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    ...:         "C": np.random.randn(8),
    ...:         "D": np.random.randn(8),
    ...:     }
    ...: )
    ...: df
Out[61]:
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744
By default, groupby works along axis 0, that is, it splits the rows into groups. You can group by a single column or by several columns:
In [8]: grouped = df.groupby("A")
In [9]: grouped = df.groupby(["A", "B"])
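As a minimal sketch of what the two calls produce, the snippet below uses a small frame with fixed values (illustrative, not the random data above) and counts the groups through the ngroups attribute.

```python
import pandas as pd

# Illustrative data with fixed values so the result is reproducible.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

by_a = df.groupby("A")          # one key: the groups are "bar" and "foo"
by_ab = df.groupby(["A", "B"])  # two keys: the groups are distinct (A, B) pairs

n_single = by_a.ngroups
n_multi = by_ab.ngroups
print(n_single, n_multi)
```

Grouping by more keys can only produce the same number of groups or more, since each extra key subdivides the existing groups.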
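MultiIndex
Starting with pandas 0.24, when the DataFrame has a MultiIndex you can group by a subset of its levels. The example below groups by every level except B: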
In [10]: df2 = df.set_index(["A", "B"])
In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
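With only two index levels, excluding B leaves just A, so the call above should match grouping by the level name directly. The sketch below checks this on fixed, illustrative values.

```python
import pandas as pd

# Illustrative data with fixed values (not the random frame used above).
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)
df2 = df.set_index(["A", "B"])

# Group by the level named "A" directly...
by_name = df2.groupby(level="A").sum()

# ...or by every level except "B", as in the example above.
by_difference = df2.groupby(level=df2.index.names.difference(["B"])).sum()

print(by_name)
print(by_difference)
```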
get_group
get_group returns the rows that belong to a single group:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3
In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2
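When grouping by more than one key, the group name passed to get_group is a tuple. The sketch below adds a hypothetical column Z to the frame above to show both forms.

```python
import pandas as pd

# Same X and Y as above; the column "Z" is added only for this illustration.
df3 = pd.DataFrame(
    {
        "X": ["A", "B", "A", "B"],
        "Y": [1, 4, 3, 2],
        "Z": ["u", "u", "v", "v"],
    }
)

# One key passed as a plain string: the group name is a scalar.
group_a = df3.groupby("X").get_group("A")

# Two keys: the group name is a tuple with one entry per key.
group_bu = df3.groupby(["X", "Z"]).get_group(("B", "u"))

print(group_a)
print(group_bu)
```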
dropna
By default, rows whose group key is NaN are excluded from the groupby. Setting dropna=False keeps NaN as a group key:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2
# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5
# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
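The groups attribute
A GroupBy object has a groups attribute. It is a dictionary whose keys are the group keys and whose values are the index labels of the rows in each group.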
In [34]: grouped = df.groupby(["A", "B"])
In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
In [36]: len(grouped)
Out[36]: 6
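Because the values in groups are index labels, they can be passed to loc to pull out the rows of a group. A small sketch with illustrative data:

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]})
grouped = df.groupby("A")

# groups maps each group key to the index labels of its rows.
labels = {key: list(idx) for key, idx in grouped.groups.items()}

# Those labels select the rows of one group from the original frame.
foo_rows = df.loc[grouped.groups["foo"]]

print(labels)
print(foo_rows)
```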
Index levels
For objects with a MultiIndex, groupby can group by a specific level of the index:
In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....:
In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
In [42]: s = pd.Series(np.random.randn(8), index=index)
In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64
Group by the first level:
In [44]: grouped = s.groupby(level=0)
In [45]: grouped.sum()
Out[45]:
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64
Group by the second level:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64
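Iterating over groups
Once you have a GroupBy object, you can iterate over its groups with a for loop. Each iteration yields the group name and the rows of that group: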
In [62]: grouped = df.groupby('A')
In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....:
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580
When grouping by multiple keys, the group name is a tuple:
In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....:
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
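Iteration is handy when each group needs custom handling. The sketch below collects a per-group total into a plain dictionary and lists the tuple names produced by a two-key grouping. The data is illustrative.

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)

# One key given as a string: each name is a scalar, each group a DataFrame.
totals = {}
for name, group in df.groupby("A"):
    totals[name] = int(group["C"].sum())

# Two keys: each name is an (A, B) tuple.
pair_names = [name for name, _ in df.groupby(["A", "B"])]

print(totals)
print(pair_names)
```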
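Aggregation
After grouping, you can aggregate each group. The outputs below come from an older pandas release that silently dropped the non-numeric column B when summing. In recent releases, select the numeric columns first (for example grouped[["C", "D"]]) to reproduce them: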
In [67]: grouped = df.groupby("A")
In [68]: grouped.aggregate(np.sum)
Out[68]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590
In [69]: grouped = df.groupby(["A", "B"])
In [70]: grouped.aggregate(np.sum)
Out[70]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
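When grouping by multiple keys, the result has a MultiIndex by default. To keep the group keys as ordinary columns with a fresh integer index, pass as_index=False: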
In [71]: grouped = df.groupby(["A", "B"], as_index=False)
In [72]: grouped.aggregate(np.sum)
Out[72]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
In [73]: df.groupby("A", as_index=False).sum()
Out[73]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590
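This has the same effect as calling reset_index on the aggregated result:
In [74]: df.groupby(["A", "B"]).sum().reset_index()
grouped.size() returns the number of rows in each group:
In [75]: grouped.size()
Out[75]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2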
grouped.describe() returns summary statistics for each group:
In [76]: grouped.describe()
Out[76]:
       C                                                         ...          D
   count       mean        std        min        25%        50%  ...        std        min        25%        50%        75%        max
0    1.0   0.254161        NaN   0.254161   0.254161   0.254161  ...        NaN   1.511763   1.511763   1.511763   1.511763   1.511763
1    1.0   0.215897        NaN   0.215897   0.215897   0.215897  ...        NaN  -0.990582  -0.990582  -0.990582  -0.990582  -0.990582
2    1.0  -0.077118        NaN  -0.077118  -0.077118  -0.077118  ...        NaN   1.211526   1.211526   1.211526   1.211526   1.211526
3    2.0  -0.491888   0.117887  -0.575247  -0.533567  -0.491888  ...   0.761937   0.268520   0.537905   0.807291   1.076676   1.346061
4    1.0  -0.862495        NaN  -0.862495  -0.862495  -0.862495  ...        NaN   0.024580   0.024580   0.024580   0.024580   0.024580
5    2.0   0.024925   1.652692  -1.143704  -0.559389   0.024925  ...   1.462816  -0.441652   0.075531   0.592714   1.109898   1.627081
[6 rows x 16 columns]
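The equivalence between as_index=False and reset_index noted in this section can be checked directly. The sketch below uses a small frame with fixed, illustrative values and no non-numeric columns other than the keys.

```python
import pandas as pd

# Illustrative data: two key columns and one numeric column.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)

# Keys kept as ordinary columns by the groupby itself...
flat = df.groupby(["A", "B"], as_index=False).sum()

# ...or moved out of the MultiIndex afterwards.
reset = df.groupby(["A", "B"]).sum().reset_index()

# size() counts the rows in each group.
sizes = df.groupby("A").size().to_dict()

print(flat.equals(reset))
print(sizes)
```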
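Common aggregation methods
The following aggregation methods are available on GroupBy objects:
Function | Description |
---|---|
mean() | Mean of each group |
sum() | Sum of the values in each group |
size() | Number of rows in each group |
count() | Number of non-null values in each group |
std() | Standard deviation of each group |
var() | Variance of each group |
sem() | Standard error of the mean of each group |
describe() | Descriptive statistics |
first() | First value in each group |
last() | Last value in each group |
nth() | The nth value in each group |
min() | Minimum of each group |
max() | Maximum of each group |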
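The difference between size() and count() only shows up when a group contains missing values. A minimal sketch, with made-up column names key and val:

```python
import numpy as np
import pandas as pd

# Illustrative data: group "a" contains one missing value.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "b"],
        "val": [1.0, np.nan, 2.0, 4.0, 6.0],
    }
)
g = df.groupby("key")["val"]

sizes = g.size()    # counts rows, including the NaN row
counts = g.count()  # counts non-null values only
means = g.mean()    # NaN is skipped when averaging

print(sizes.to_dict())
print(counts.to_dict())
print(means.to_dict())
```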
Applying multiple aggregations at once
You can pass a list of functions to apply several aggregations in one call:
In [81]: grouped = df.groupby("A")
In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
The resulting columns can be renamed:
In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....:
Out[84]:
          foo       bar       baz
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
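The same aggregations can be requested by name as strings, which avoids the warning that recent pandas releases emit when NumPy functions such as np.sum are passed to agg. A sketch with illustrative data and hypothetical new column names:

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]})

# String names select the built-in groupby implementations.
stats = df.groupby("A")["C"].agg(["sum", "mean", "std"])

# The new column names here are arbitrary examples.
renamed = stats.rename(columns={"sum": "total", "mean": "average", "std": "spread"})

print(stats)
print(renamed)
```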
NamedAgg
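NamedAgg lets you specify an aggregation more precisely. It has two fields, column and aggfunc: the column to aggregate and the function to apply. The keyword used for each NamedAgg becomes the name of the output column.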
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:
In [89]: animals
Out[89]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....:
Out[90]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Alternatively, pass a plain (column, aggfunc) tuple:
In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....:
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
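The two spellings should give identical results. The sketch below checks this on the animals data, using the string "mean" in place of np.mean (a substitution made here, not in the original examples).

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# Explicit NamedAgg objects.
with_namedagg = animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)

# Plain (column, aggfunc) tuples.
with_tuples = animals.groupby("kind").agg(
    min_height=("height", "min"),
    average_weight=("weight", "mean"),
)

print(with_namedagg.equals(with_tuples))
```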
Different aggregations for different columns
Passing a dictionary to agg applies a different aggregation to each column:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785
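A dictionary value can also be a list of functions, in which case the result has two-level column labels. A sketch with illustrative data:

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 6.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# One aggregation per column.
result = df.groupby("A").agg({"C": "sum", "D": "max"})

# A list as the value applies several aggregations to that column.
multi = df.groupby("A").agg({"C": ["min", "max"], "D": "mean"})

print(result)
print(multi)
```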
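Transformation
A transformation returns an object with the same size and index as the one being grouped. Transformations come up frequently in data analysis.
transform accepts a lambda. Here ts is a Series indexed by dates (its definition is not shown), grouped by year: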
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
Filling NaN values with the group mean:
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
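Since ts and grouped are not defined in the two snippets above, here is a self-contained sketch of both transformations. The series ts and the frame data below are hypothetical stand-ins, not the article's original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ts: two observations in each of two years.
ts = pd.Series(
    [1.0, 3.0, 10.0, 14.0],
    index=pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01"]),
)

# Each value is replaced by the max-min range of its year.
yearly_range = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

# Hypothetical frame with one missing value in group "a".
data = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, np.nan, 4.0, 8.0]})

# Missing values are replaced by the mean of their own group.
filled = data.groupby("key")["val"].transform(lambda x: x.fillna(x.mean()))

print(yearly_range)
print(filled)
```

In both cases the result keeps the index of the input, which is what distinguishes a transformation from an aggregation.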
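Filtering
The filter method takes a function (here a lambda) that returns True or False for each group, and drops the groups for which it returns False: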
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3    3
4    3
5    3
dtype: int64
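filter works the same way on a DataFrame: the function receives each group and the rows of the rejected groups are removed. The sketch below keeps only groups with at least two rows, on illustrative data.

```python
import pandas as pd

# Illustrative data: group "b" has a single row.
df = pd.DataFrame(
    {
        "A": ["a", "a", "b", "c", "c", "c"],
        "B": [1, 2, 3, 4, 5, 6],
    }
)

# Keep only the groups that contain at least two rows.
kept = df.groupby("A").filter(lambda g: len(g) >= 2)

print(kept)
```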
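Apply
Some operations fit neither aggregation nor transformation. For these, pandas provides the apply method, which runs an arbitrary function on each group and combines the results.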
In [156]: df
Out[156]:
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580
In [157]: grouped = df.groupby("A")
# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64
You can also pass a named function:
In [159]: grouped = df.groupby('A')['C']
In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:
In [161]: grouped.apply(f)
Out[161]:
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
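The shape of what apply returns depends on what the function returns. The sketch below shows a function that reduces each group to one number and one that returns a value per row. The data and the helper name value_range are made up for illustration, and group_keys=False is passed so that the row-wise result keeps the original index labels.

```python
import pandas as pd

# Illustrative data with fixed values.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]})


def value_range(group):
    """Hypothetical helper: spread between the largest and smallest value."""
    return group.max() - group.min()


# A scalar per group gives one row per group key.
ranges = df.groupby("A")["C"].apply(value_range)

# A Series per group gives one value per original row;
# sort_index puts the rows back in their original order.
demeaned = (
    df.groupby("A", group_keys=False)["C"]
    .apply(lambda g: g - g.mean())
    .sort_index()
)

print(ranges)
print(demeaned)
```

For a per-row result like the second one, transform is usually the more direct choice. apply is the fallback when the function does not fit either pattern.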