How to determine the length of lists in a pandas dataframe column?
pandas.Series.map(len)
andpandas.Series.apply(len)
are equivalent in execution time, and slightly faster thanpandas.Series.str.len()
.pandas.Series.map
pandas.Series.apply
pandas.Series.str.len
Difference between map, applymap and apply methods in Pandas
import pandas as pd
data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']
df = pd.DataFrame(data, index)
# create Length column
df['Length'] = df.os.map(len)
# display(df)
os Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
%timeit
import pandas as pd
import random
import string
random.seed(365)
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use the str
accessor for some list operations as well. In this example,
df['CreationDate'].str.len()
returns the length of each list. See the docs for str.len
.
df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
For these operations, vanilla Python is generally faster. pandas handles NaNs though. Here are timings:
ser = pd.Series([random.sample(string.ascii_letters,
random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop