Concatenate all columns in a pandas dataframe

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Gives us:

df
Out[6]: 
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

To select a given set of columns:

df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')

df
Out[8]: 
   A  B  C concat
0  1  4  7     14
1  2  5  8     25
2  3  6  9     36

However, I've noticed that approach can sometimes result in NaNs being populated where they shouldn't, so here's another way:

>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
   A  B  C concat
0  1  4  7    147
1  2  5  8    258
2  3  6  9    369

Although it should be noted that this approach is a lot slower:

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop

Versus

$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop

I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.

For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:

df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')

Here is my modification for Python 3.6.3:

df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))

Solution with sum, but output is float, so convert to int and str is necessary:

df['new'] = df.sum(axis=1).astype(int).astype(str)

Another solution with apply function join, but it the slowiest:

df['new'] = df.apply(''.join, axis=1)

Last very fast numpy solution - convert to numpy array and then 'sum':

df['new'] = df.values.sum(axis=1)

Timings:

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)

cols = list('ABC')

#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop

In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop

In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop

In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop

EDIT If dtypes of some columns are not object (obviously strings) cast by DataFrame.astype:

df['new'] = df.astype(str).values.sum(axis=1)

Concatenate all columns in a pandas dataframe

Tags:

Python

Pandas

Related

Recent Posts