Concatenate all columns in a pandas dataframe
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Gives us:
df
Out[6]:
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
To select a given set of columns:
df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')
df
Out[8]:
A B C concat
0 1 4 7 14
1 2 5 8 25
2 3 6 9 36
However, I've noticed that approach can sometimes result in NaN
s being populated where they shouldn't, so here's another way:
>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Although it should be noted that this approach is a lot slower:
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop
Versus
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop
I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.
For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Here is my modification for Python 3.6.3:
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
Solution with sum
, but output is float
, so convert to int
and str
is necessary:
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution with apply
function join
, but it the slowiest:
df['new'] = df.apply(''.join, axis=1)
Last very fast numpy solution
- convert to numpy array
and then 'sum':
df['new'] = df.values.sum(axis=1)
Timings:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
EDIT If dtypes of some columns are not object
(obviously string
s) cast by DataFrame.astype
:
df['new'] = df.astype(str).values.sum(axis=1)