Replace NaN with empty list in a pandas dataframe
You can also use a list comprehension for this:
d['x'] = [[] if x is np.nan else x for x in d['x']]
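The identity test only matches when each missing value is the np.nan object itself, which is not guaranteed for every way a frame can be built. A minimal sketch of a more defensive variant, assuming the column holds only lists or missing values (the sample frame is reconstructed from the output shown in this thread):

```python
import numpy as np
import pandas as pd

# Sample frame: two list entries and two missing values in column 'x'.
d = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan],
                  'y': [1, 2, 3, 4]})

# Keep anything that is already a list; replace everything else with a new empty list.
d['x'] = [x if isinstance(x, list) else [] for x in d['x']]

lengths = d['x'].apply(len).tolist()
print(lengths)  # [3, 2, 0, 0]
```

Because the comprehension evaluates [] once per row, every replaced cell gets its own list.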
This works using isnull and loc to mask the series:
In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d
Out[90]:
0 [1, 2, 3]
1 [1, 2]
2 []
3 []
dtype: object
In [91]:
d.apply(len)
Out[91]:
0 3
1 2
2 0
3 0
dtype: int64
You have to do this using apply so that the list object is not interpreted as an array to assign back to the df, which would try to align its shape with the original series.
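For reference, here is a self-contained sketch of the session above. The construction of d is an assumption inferred from the printed output, not something shown in the original session:

```python
import numpy as np
import pandas as pd

# Assumed input: an object Series holding two lists and two NaN values.
d = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan])

mask = d.isnull()
# apply returns a Series indexed like the masked rows, so the assignment
# aligns on the index and each row receives its own empty list.
d.loc[mask] = d.loc[mask].apply(lambda _: [])

lengths = d.apply(len).tolist()
print(lengths)  # [3, 2, 0, 0]
```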
EDIT
Using your updated sample the following works:
In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d
Out[100]:
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [] 3
3 [] 4
In [102]:
d['x'].apply(len)
Out[102]:
0 3
1 2
2 0
3 0
Name: x, dtype: int64
To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.
isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
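One caveat with [[]] * n: list multiplication repeats references to a single list, so every filled cell would share the same object and an in-place append to one would show up in all of them. A small sketch of the difference, with a comprehension as one possible alternative (the frame and variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [[1, 2, 3], [1, 2], np.nan, np.nan]})
n = int(df['x'].isna().sum())

# List multiplication: n references to the same empty list.
shared = [[]] * n
shared[0].append('a')
print(shared)  # [['a'], ['a']]

# Comprehension: n distinct empty lists.
independent = [[] for _ in range(n)]
independent[0].append('a')
print(independent)  # [['a'], []]

# The same comprehension can feed the replacement array used above.
values = pd.Series([[] for _ in range(n)]).values
```

If the lists are never mutated in place, sharing is harmless; the comprehension costs more to build but avoids surprises later.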
A quick timing comparison:
def empty_assign_1(s):
    s[s.isna()].apply(lambda x: [])

def empty_assign_2(s):
    [[]] * s.isna().sum()

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))
%timeit empty_assign_1(series)
>>> 61 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit empty_assign_2(series)
>>> 2.17 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
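By these timings that is roughly 28 times faster, although this comparison only builds the replacement values and does not assign them.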
EDIT: Fixed a bug pointed out by @valentin
You have to be somewhat careful with data types when performing assignment in this case. In the example above the test series is float; however, adding [] elements coerces the entire series to object. Pandas will handle that for you if you do something like:
idx = series.isna()
series[idx] = series[idx].apply(lambda x: [])
This works because the output of apply is itself a series. You can test live performance with assignment overhead like so (I've added a string value so the series will be an object; you could instead use a number as the replacement value rather than an empty list to avoid coercion).
def empty_assign_1(s):
    idx = s.isna()
    s[idx] = s[idx].apply(lambda x: [])

def empty_assign_2(s):
    idx = s.isna()
    s.loc[idx] = [[]] * idx.sum()

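# Caution: np.random.choice converts this mixed list to a string array, so np.nan
# becomes the string 'nan' and isna() matches nothing. Passing
# np.array([1, 2, np.nan, '2'], dtype=object) instead keeps real NaN values.
series = pd.Series(np.random.choice([1, 2, np.nan, '2'], 1000000))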
%timeit empty_assign_1(series.copy())
>>> 45.1 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit empty_assign_2(series.copy())
>>> 24 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
About 4 ms of each timing is the copy. With assignment included, the speedup drops to roughly 2x, which is still worthwhile.