Iterate over rows and expand pandas dataframe
Not as much interesting/fancy pandas usage, but this works:
import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})
It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all location
entries are already iterables, you can remove the atleast_1d
call, which gives about another 20% speedup.
If you return a series whose index
is a list of locations, then dfIn.apply
will collate those series into a table:
import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'],
['Antwerp','Barcelona','Pisa'] ]})
def expand(row):
locations = row['location'] if isinstance(row['location'], list) else [row['location']]
s = pd.Series(row['name'], index=list(set(locations)))
return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
Amsterdam Antwerp Barcelona Berlin Paris Pisa
0 Tom NaN NaN NaN NaN NaN
1 NaN NaN NaN Jim Jim NaN
2 NaN Claus Claus NaN NaN Claus
You can then stack this DataFrame to obtain:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0 Amsterdam Tom
1 Berlin Jim
Paris Jim
2 Antwerp Claus
Barcelona Claus
Pisa Claus
dtype: object
This is a Series, while you want a DataFrame. A little massaging with reset_index
gives you the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
yields
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Amsterdam Claus
4 Antwerp Claus
5 Barcelona Claus