Select multiple groups from pandas groupby object
You can do something like
new_gb = pandas.concat([gb.get_group(group) for i, group in enumerate(gb.groups) if i < 5]).groupby('model')
new_gb.hist()
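For context, here is a self-contained sketch of that snippet; the 'model' and 'param1' column names are assumptions, and the sample frame is made up for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'model': np.random.randint(0, 10, 100),
                   'param1': np.random.random(100)})
gb = df.groupby('model')

# Keep the first 5 groups (in gb.groups iteration order) and re-group.
new_gb = pd.concat(
    [gb.get_group(group) for i, group in enumerate(gb.groups) if i < 5]
).groupby('model')
new_gb.hist()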
That said, I would approach it differently. You can use a collections.Counter object to tally the group sizes quickly:
import collections
import numpy
import pandas

df = pandas.DataFrame({'model': numpy.random.randint(0, 3, 10),
                       'param1': numpy.random.random(10),
                       'param2': numpy.random.random(10)})
#    model    param1    param2
# 0      2  0.252379  0.985290
# 1      1  0.059338  0.225166
# 2      0  0.187259  0.808899
# 3      2  0.773946  0.696001
# 4      1  0.680231  0.271874
# 5      2  0.054969  0.328743
# 6      0  0.734828  0.273234
# 7      0  0.776684  0.661741
# 8      2  0.098836  0.013047
# 9      1  0.228801  0.827378
model_groups = collections.Counter(df.model)
print(model_groups) #Counter({2: 4, 0: 3, 1: 3})
Now you can iterate over the Counter object like a dictionary and query just the groups you want:
# Keep models that appear fewer than 4 times; select however you like.
new_df = pandas.concat([df.query('model==%d' % key)
                        for key, val in model_groups.items() if val < 4])
#    model    param1    param2
# 2      0  0.187259  0.808899
# 6      0  0.734828  0.273234
# 7      0  0.776684  0.661741
# 1      1  0.059338  0.225166
# 4      1  0.680231  0.271874
# 9      1  0.228801  0.827378
Now you can use the built-in pandas.DataFrame.groupby function:
gb = new_df.groupby('model')
gb.hist()
Since model_groups contains all of the groups, you can just pick from it as you wish.
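Counter also makes size-based picks easy; for example, a sketch that keeps the two largest groups via Counter.most_common, reusing df and model_groups from above:
top_two = [key for key, _ in model_groups.most_common(2)]
new_df = pandas.concat([df.query('model==%d' % key) for key in top_two])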
Note: if your model column contains string values (names or the like) instead of integers, this all works the same; just change the query argument from 'model==%d' % key to 'model=="%s"' % key.
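For instance, a quick sketch of the string-valued case (the model names here are made up for illustration):
df_str = pandas.DataFrame({'model': ['a', 'b', 'a', 'c'],
                           'param1': numpy.random.random(4)})
groups = collections.Counter(df_str.model)  # Counter({'a': 2, 'b': 1, 'c': 1})
new_df = pandas.concat([df_str.query('model=="%s"' % key)
                        for key, val in groups.items() if val < 2])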
import numpy as np

gbidx = list(gb.indices.keys())[:4]
# gb.indices maps each group key to an array of positional row indices,
# so index positionally with iloc (.loc only works for a default RangeIndex).
dfidx = np.sort(np.concatenate([gb.indices[x] for x in gbidx]))
df.iloc[dfidx].groupby('model').hist()
gb.indices is faster than gb.groups or list(gb), and I believe concatenating index arrays is faster than concatenating DataFrames. I tried this on a large CSV of mine (~416M rows, 13 columns including strings, 720 MB on disk), grouping by more than one column, and then changed the column names to match those in the question.
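For reference, a small sketch of the difference between the two mappings, assuming the df/gb from the question:
# gb.groups maps each key to an Index of row labels,
# while gb.indices maps each key to a NumPy array of row positions.
print(gb.groups)   # e.g. {0: Int64Index([2, 6, 7], dtype='int64'), ...}
print(gb.indices)  # e.g. {0: array([2, 6, 7]), ...}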
I don't know of a way to use the .get_group() method with more than one group at a time. You can, however, iterate through the groups. It is still a bit ugly, but here is one solution using iteration:
limit = 5
i = 0
for key, group in gd:
    print(key, group)
    i += 1
    if i >= limit:
        break
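The same early exit can be written more compactly with itertools.islice, which stops pulling groups once the limit is reached (a sketch, assuming gd is the groupby object):
import itertools

for key, group in itertools.islice(gd, 5):
    print(key, group)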
You could also do a loop with .get_group(), which, IMHO, is a little prettier, but still quite ugly:
for key in list(gd.groups.keys())[:2]:
    print(gd.get_group(key))
It'd be easier to just filter your df first and then perform the groupby:
In [155]:
df = pd.DataFrame({'model': np.random.randint(1, 10, 100), 'value': np.random.randn(100)})
first_five = df['model'].sort_values().unique()[:5]
gp = df[df['model'].isin(first_five)].groupby('model')
gp.first()
Out[155]:
          value
model
1     -0.505677
2      1.217027
3     -0.641583
4      0.778104
5     -1.037858
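The same isin filter works for any hand-picked set of keys, not just the first five sorted values; for example (reusing df from above, with hypothetical keys):
picked = [2, 5, 7]  # hypothetical group keys of interest
gp = df[df['model'].isin(picked)].groupby('model')
gp.first()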