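This groupby post is not well written; it is more complicated than it needs to be. In practice only a handful of groupby patterns are used regularly, and listing those with one example each would make them easy to understand and reuse. I will write a follow-up later.
By default, groupby splits the data vertically, that is, by rows (axis=0).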
import pandas as pd
import numpy as np
# data1 and data2 are random, so the numbers differ on every run
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
      data1     data2 key1 key2
0 -0.410122  0.247895    a  one
1 -0.627470 -0.989268    a  two
2  0.179488 -0.054570    b  one
3 -0.299878 -1.640494    b  two
4 -0.297191  0.954447    a  one
list(df.groupby(['key1']))  # list() yields [(name1, group1), (name2, group2), ...]
[('a',       data1     data2 key1 key2
0 -0.410122  0.247895    a  one
1 -0.627470 -0.989268    a  two
4 -0.297191  0.954447    a  one),
 ('b',       data1     data2 key1 key2
2  0.179488 -0.054570    b  one
3 -0.299878 -1.640494    b  two)]
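Calling list() on the groupby object gives [(name1, group1), (name2, group2), ...].
Each element is a (name, group) tuple, where group is the slice of data belonging to that key.
A groupby object supports iteration and yields a sequence of 2-tuples: (group name, data chunk), (group name, data chunk), ...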
for name, group in df.groupby(['key1']):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0 -0.410122  0.247895    a  one
1 -0.627470 -0.989268    a  two
4 -0.297191  0.954447    a  one
b
      data1     data2 key1 key2
2  0.179488 -0.054570    b  one
3 -0.299878 -1.640494    b  two
With multiple keys, iteration yields 2-tuples of the form ((k1, k2), data chunk), ((k1, k2), data chunk), ...
The first element of each pair is a tuple made up of the key values.
for name, group in df.groupby(['key1', 'key2']):
    print(name)  # name is the tuple (k1, k2)
    print(group)
('a', 'one')
      data1     data2 key1 key2
0 -0.410122  0.247895    a  one
4 -0.297191  0.954447    a  one
('a', 'two')
     data1     data2 key1 key2
1 -0.62747 -0.989268    a  two
('b', 'one')
      data1    data2 key1 key2
2  0.179488 -0.05457    b  one
('b', 'two')
      data1     data2 key1 key2
3 -0.299878 -1.640494    b  two
dict(list(df.groupby(['key1'])))  # dict(list(...)) maps each group name to its data chunk
{'a':       data1     data2 key1 key2
0 -0.410122  0.247895    a  one
1 -0.627470 -0.989268    a  two
4 -0.297191  0.954447    a  one,
 'b':       data1     data2 key1 key2
2  0.179488 -0.054570    b  one
3 -0.299878 -1.640494    b  two}
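When only one group is needed, building the whole dictionary is not required. The sketch below uses fixed values instead of random ones (an assumption made so the result is reproducible) and compares the dictionary lookup with `get_group`, which fetches a single group by name.

```python
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

grouped = df.groupby('key1')

# Materialize every group into a dict: {group name: data chunk}
pieces = dict(list(grouped))

# Fetch a single group by name without building the dict
a_rows = grouped.get_group('a')

print(sorted(pieces))
print(a_rows)
print(pieces['a'].equals(a_rows))
```

Note that a plain string key (`'key1'`) is used here rather than a one-element list, so the group names are plain scalars.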
df.groupby(['key1']).size()
key1
a    3
b    2
dtype: int64
dict(['a1', 'x2', 'e3'])
{'a': '1', 'e': '3', 'x': '2'}
df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
df['data1'].groupby(df['key1']).mean()  # groupby itself computes nothing; it only groups, and mean() does the work
key1
a   -0.444928
b   -0.060195
Name: data1, dtype: float64
df.groupby(['key1'])['data1'].mean()
# Read as: group df by key1, then take the mean of data1 within each group
# Equivalent to: df.groupby(['key1']).data1.mean()
key1
a   -0.444928
b   -0.060195
Name: data1, dtype: float64
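Notes:
groupby performs no computation by itself. It only sets up the grouping.
The data (a Series) is aggregated according to the group key, producing a new Series whose index holds the unique values of the key1 column.
Indexing a groupby object in this way returns a grouped DataFrame (if a list or array of column names is passed) or a grouped Series (if a single column name is passed).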
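A minimal sketch of the distinction described above, using fixed values (an assumption, for reproducibility). Selecting with a single column name gives a grouped Series, while selecting with a list gives a grouped DataFrame, and nothing is computed until an aggregation such as `mean()` is called.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

grouped = df.groupby('key1')       # no computation happens here

as_series = grouped['data1']       # single name -> grouped Series
as_frame = grouped[['data1']]      # list of names -> grouped DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)

series_mean = as_series.mean()     # aggregation returns a Series
frame_mean = as_frame.mean()       # aggregation returns a DataFrame
print(series_mean)
print(frame_mean)
```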
df.groupby(['key1'])['data1'].size()
key1
a    3
b    2
Name: data1, dtype: int64
df['data1'].groupby([df['key1'], df['key2']]).mean()
key1  key2
a     one    -0.353657
      two    -0.627470
b     one     0.179488
      two    -0.299878
Name: data1, dtype: float64
df.groupby(['key1', 'key2'])['data1'].mean()
# Equivalent to: df.groupby(['key1', 'key2']).data1.mean()
key1  key2
a     one    -0.353657
      two    -0.627470
b     one     0.179488
      two    -0.299878
Name: data1, dtype: float64
Grouping by two keys produces a Series with a hierarchical index made up of the unique key pairs. unstack() pivots the inner level (key2) into columns:
df.groupby(['key1', 'key2'])['data1'].mean().unstack()
key1 \ key2 | one | two |
---|---|---|
a | -0.353657 | -0.627470 |
b | 0.179488 | -0.299878 |
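The same pivot can be checked with fixed values (an assumption, since the original data is random). This sketch groups by two keys, confirms that the result has a two-level index, and then unstacks the inner level into columns.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Series indexed by the unique (key1, key2) pairs
means = df.groupby(['key1', 'key2'])['data1'].mean()
print(means.index.nlevels)

# Move the inner index level (key2) into the columns
wide = means.unstack()
print(wide)
```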
In all the examples above the group keys are Series. In fact, a group key can be any array of the appropriate length (one entry per row), which makes groupby very flexible.
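As an illustration of keys that are not columns of the DataFrame, the sketch below groups by two NumPy arrays. The `states` and `years` arrays and the fixed `data1` values are made up for this example.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Hypothetical external keys: one entry per row of df
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# The arrays are matched to rows by position, not by label
result = df['data1'].groupby([states, years]).mean()
print(result)
```

The only requirement is that each key array has the same length as the data being grouped.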
df has two dtypes, float64 and object, so grouping its columns by dtype (axis=1) gives two groups: (dtype('float64'), data chunk) and (dtype('O'), data chunk).
list(df.groupby(df.dtypes, axis=1))
[(dtype('float64'),
      data1     data2
0 -0.410122  0.247895
1 -0.627470 -0.989268
2  0.179488 -0.054570
3 -0.299878 -1.640494
4 -0.297191  0.954447),
 (dtype('O'),
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one)]
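Recent pandas releases deprecate `groupby(..., axis=1)`, so the call above may warn or fail depending on the installed version. A minimal sketch of one way to get the same split without `axis=1` is shown below. The helper dictionary names are made up for this example, and the name of the string dtype may differ between pandas versions.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Collect column names per dtype name, e.g. {'float64': ['data1', 'data2'], ...}
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Build one sub-DataFrame per dtype
by_dtype = {name: df[columns] for name, columns in columns_by_dtype.items()}

for name, chunk in by_dtype.items():
    print(name)
    print(chunk)
```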
Parameters of the agg() method on a SeriesGroupBy:
aggregate(self, func_or_funcs, *args, **kwargs)
func: function, string, dictionary, or list of strings/functions
Returns: the aggregated Series (a DataFrame when a list of functions is passed)
s = pd.Series([10, 20, 30, 40])
s
0    10
1    20
2    30
3    40
dtype: int64
for n, g in s.groupby([1, 1, 2, 2]):
    print(n)
    print(g)
1
0    10
1    20
dtype: int64
2
2    30
3    40
dtype: int64
s.groupby([1, 1, 2, 2]).min()
1    10
2    30
dtype: int64
# Equivalent to the call above:
s.groupby([1, 1, 2, 2]).agg('min')
1    10
2    30
dtype: int64
s.groupby([1, 1, 2, 2]).agg(['min', 'max'])  # wrap the names in [] because func is a single argument
group | min | max |
---|---|---|
1 | 10 | 20 |
2 | 30 | 40 |
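The three most common forms of the `func` argument can be compared side by side. This is a small sketch on the same Series. The lambda computing the range of each group is an illustrative choice, not something from the original post.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])
grouped = s.groupby([1, 1, 2, 2])

# A string names a built-in aggregation
by_name = grouped.agg('min')

# A callable receives each group's Series and returns one value
by_func = grouped.agg(lambda x: x.max() - x.min())

# A list applies several aggregations and returns a DataFrame
by_list = grouped.agg(['min', 'max'])

print(by_name)
print(by_func)
print(by_list)
```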
Typical usage looks like this:
df
index | data1 | data2 | key1 | key2 |
---|---|---|---|---|
0 | -0.410122 | 0.247895 | a | one |
1 | -0.627470 | -0.989268 | a | two |
2 | 0.179488 | -0.054570 | b | one |
3 | -0.299878 | -1.640494 | b | two |
4 | -0.297191 | 0.954447 | a | one |
df.groupby(['key1'])['data1'].min()
key1
a   -0.627470
b   -0.299878
Name: data1, dtype: float64
df.groupby(['key1'])['data1'].agg({'min'})
key1 | min |
---|---|
a | -0.627470 |
b | -0.299878 |
# Recommended:
df.groupby(['key1']).agg({'data1': 'min'})  # for column data1, take the minimum of each group; the result column is still named data1
key1 | data1 |
---|---|
a | -0.627470 |
b | -0.299878 |
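# After grouping by key1, aggregate the minimum and maximum of data1 in each group.
# {'min', 'max'} is a set, so the column order of the result is not guaranteed.
df.groupby(['key1'])['data1'].agg({'min', 'max'})
key1 | max | min |
---|---|---|
a | -0.297191 | -0.627470 |
b | 0.179488 | -0.299878 |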
# Recommended:
df.groupby(['key1']).agg({'data1': ['min', 'max']})
key1 | data1: min | data1: max |
---|---|---|
a | -0.627470 | -0.297191 |
b | -0.299878 | 0.179488 |
# For data1, rename min to a and max to b
df.groupby(['key1'])['data1'].agg({'a': 'min', 'b': 'max'})  # 'min' and 'max' here are function names
d:\python27\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
key1 | a | b |
---|---|---|
a | -0.627470 | -0.297191 |
b | -0.299878 | 0.179488 |
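The FutureWarning above indicates that renaming through a dict on a grouped Series was being phased out, and newer pandas releases are expected to reject it. A sketch of the replacement, assuming pandas 0.25 or later, is named aggregation: the keyword is the output column name and the value is the function. Fixed values are used here so the result is reproducible.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# On a grouped Series: output_name=function
renamed = df.groupby('key1')['data1'].agg(a='min', b='max')

# On a grouped DataFrame: output_name=(column, function)
renamed_from_frame = df.groupby('key1').agg(a=('data1', 'min'),
                                            b=('data1', 'max'))

print(renamed)
print(renamed_from_frame)
```

Both forms return a DataFrame indexed by key1 with flat column names `a` and `b`, which avoids the two-level column header produced by a dict of lists.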