how to split column of tuples in pandas dataframe?
You can do this by doing pd.DataFrame(col.tolist())
on that column:
In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
In [3]: df
Out[3]:
a b
0 1 (1, 2)
1 2 (3, 4)
In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]
In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
0 1
0 1 2
1 3 4
In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)
In [7]: df
Out[7]:
a b b1 b2
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
Note: in an earlier version, this answer recommended to use df['b'].apply(pd.Series)
instead of pd.DataFrame(df['b'].tolist(), index=df.index)
. That works as well (because it makes of each tuple a Series, which is then seen as a row of a dataframe), but is slower / uses more memory than the tolist
version, as noted by the other answers here (thanks to @denfromufa).
I updated this answer to make sure the most visible answer has the best solution.
On much larger datasets, I found that .apply()
is few orders slower than pd.DataFrame(df['b'].values.tolist(), index=df.index)
This performance issue was closed in GitHub, although I do not agree with this decision:
https://github.com/pandas-dev/pandas/issues/11615
EDIT: based on this answer: https://stackoverflow.com/a/44196843/2230844
The str
accessor that is available to pandas.Series
objects of dtype == object
is actually an iterable.
Assume a pandas.DataFrame
df
:
df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))
df
col
0 (a, 10)
1 (b, 20)
2 (c, 30)
3 (d, 40)
4 (e, 50)
5 (f, 60)
6 (g, 70)
7 (h, 80)
8 (i, 90)
9 (j, 100)
We can test if it is an iterable
from collections import Iterable
isinstance(df.col.str, Iterable)
True
We can then assign from it like we do other iterables:
var0, var1 = 'xy'
print(var0, var1)
x y
Simplest solution
So in one line we can assign both columns
df['a'], df['b'] = df.col.str
df
col a b
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
Faster solution
Only slightly more complicate, we can use zip
to create a similar iterable
df['c'], df['d'] = zip(*df.col)
df
col a b c d
0 (a, 10) a 10 a 10
1 (b, 20) b 20 b 20
2 (c, 30) c 30 c 30
3 (d, 40) d 40 d 40
4 (e, 50) e 50 e 50
5 (f, 60) f 60 f 60
6 (g, 70) g 70 g 70
7 (h, 80) h 80 h 80
8 (i, 90) i 90 i 90
9 (j, 100) j 100 j 100
Inline
Meaning, don't mutate existing df
This works because assign
takes keyword arguments where the keywords are the new(or existing) column names and the values will be the values of the new column. You can use a dictionary and unpack it with **
and have it act as the keyword arguments. So this is a clever way of assigning a new column named 'g'
that is the first item in the df.col.str
iterable and 'h'
that is the second item in the df.col.str
iterable.
df.assign(**dict(zip('gh', df.col.str)))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
My version of the list
approach
With modern list comprehension and variable unpacking.
Note: also inline using join
df.join(pd.DataFrame([*df.col], df.index, [*'ef']))
col g h
0 (a, 10) a 10
1 (b, 20) b 20
2 (c, 30) c 30
3 (d, 40) d 40
4 (e, 50) e 50
5 (f, 60) f 60
6 (g, 70) g 70
7 (h, 80) h 80
8 (i, 90) i 90
9 (j, 100) j 100
The mutating version would be
df[['e', 'f']] = pd.DataFrame([*df.col], df.index)
Naive Time Test
Short DataFrameUse one defined above
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
1.16 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
635 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
795 µs ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Long DataFrame
10^3 times bigger
df = pd.concat([df] * 1000, ignore_index=True)
%timeit df.assign(**dict(zip('gh', df.col.str)))
%timeit df.assign(**dict(zip('gh', zip(*df.col))))
%timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))
11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.1 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.33 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)