Splitting a string into a list and converting the items to int
No native "vectorised" solution is possible
I'm highlighting this because it's a common mistake to assume pd.Series.str methods are vectorised. They aren't. They offer convenience and error-handling at the cost of efficiency. For clean data (e.g. no NaN values), a list comprehension is likely your best option:
import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})
df['B'] = [list(map(int, i.split())) for i in df['A']]
print(df)
         A             B
0     16 0       [16, 0]
1  7 1 2 0  [7, 1, 2, 0]
2        5           [5]
3        1           [1]
4       18          [18]
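If your data is not clean, a variant of the same comprehension (my sketch, not part of the original answer) can guard against missing values, mapping them to an empty list:

import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', None, '1', '18']})

# pd.notna filters out NaN/None before splitting; missing rows
# become an empty list here -- adjust to taste.
df['B'] = [list(map(int, i.split())) if pd.notna(i) else [] for i in df['A']]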
Performance benchmarking
To illustrate the performance issues with pd.Series.str, you can see that for larger dataframes, the more operations you pass to Pandas, the more performance deteriorates:
df = pd.concat([df]*10000)
%timeit [list(map(int, i.split())) for i in df['A']] # 55.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 80.2 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 93.6 ms
Storing lists as elements in pd.Series is also anti-Pandas
As described here, holding lists in a series gives two layers of pointers and is not recommended:

Don't do this. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is that you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods. See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.
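If you need the numbers in a form Pandas can work with efficiently, one common alternative (a sketch, not part of the quoted advice) is to expand the split values into separate numeric columns instead of storing lists:

import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})

# str.split(expand=True) pads ragged rows with None; astype(float)
# turns those into NaN, leaving a regular numeric frame that
# supports vectorised operations.
wide = df['A'].str.split(expand=True).astype(float)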
The double for comprehension is 33% faster than the map comprehension from jpp's answer. The Numba trick is 250 times faster than the map comprehension from jpp's answer, but you get a pandas DataFrame of floats and NaNs rather than a series of lists. Numba is included in Anaconda.
Benchmarks:
%timeit pd.DataFrame(nb_calc(df.A)) # numba trick 0.144 ms
%timeit [int(x) for i in df['A'] for x in i.split()] # 23.6 ms
%timeit [list(map(int, i.split())) for i in df['A']] # 35.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 50.9 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 56.6 ms
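Note that the double for comprehension produces one flat list for the whole column rather than a list per row, so the outputs being compared are not identical:

# The double comprehension flattens everything into a single list.
flat = [int(x) for i in df['A'] for x in i.split()]
# e.g. for the original 5-row frame: [16, 0, 7, 1, 2, 0, 5, 1, 18]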
Code for Numba function:
import numba
import numpy as np

@numba.jit(nopython=True, nogil=True)
def str2int_nb(nb_a):
    n1 = nb_a.shape[0]
    n2 = nb_a.shape[1]
    res = np.empty(nb_a.shape)
    res[:] = np.nan
    j_res_max = 0
    for i in range(n1):
        j_res = 0
        s = 0
        for j in range(n2):
            x = nb_a[i, j]
            if x == 32:        # ASCII space: current number is complete
                res[i, j_res] = np.float64(s)
                s = 0
                j_res += 1
            elif x == 0:       # null padding: end of this row's string
                break
            else:              # ASCII digit: accumulate its value
                s = s * 10 + x - 48
        res[i, j_res] = np.float64(s)
        if j_res > j_res_max:
            j_res_max = j_res
    return res[:, :j_res_max + 1]

def nb_calc(s):
    # Numba has no string support: view the fixed-width unicode buffer
    # as code points, then truncate to int8 (digits and spaces fit).
    a_temp = s.values.astype("U")
    nb_a = a_temp.view("uint32").reshape(len(s), -1).astype(np.int8)
    return str2int_nb(nb_a)
Numba does not support strings, so I first convert the column to an array of int8 and only then work with it. This conversion to int8 actually takes 3/4 of the execution time.
The output of my numba function looks like this:
      0    1    2    3
0  16.0  0.0  NaN  NaN
1   7.0  1.0  2.0  0.0
2   5.0  NaN  NaN  NaN
3   1.0  NaN  NaN  NaN
4  18.0  NaN  NaN  NaN
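If you do need a series of lists rather than a wide float frame, a possible post-processing step (my sketch, with the performance cost that implies) is:

# Drop the NaN padding row-wise and cast back to int.
result = pd.DataFrame(nb_calc(df.A))
lists = result.apply(lambda row: [int(v) for v in row.dropna()], axis=1)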