Replace NaN with empty list in a pandas dataframe

You can also use a list comprehension for this:

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
d['x'] = [[] if x is np.nan else x for x in d['x']]
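The identity test only matches the np.nan object itself, so a NaN created some other way (for example float('nan')) would slip through. Below is a minimal sketch of a more defensive variant, assuming every real value in the column is a list. The sample frame d is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: lists mixed with NaN; the last NaN is not the np.nan object.
d = pd.DataFrame({"x": [[1, 2, 3], [1, 2], np.nan, float("nan")], "y": [1, 2, 3, 4]})

# Keep existing lists; replace anything else (here: NaN) with a new empty list.
d["x"] = [x if isinstance(x, list) else [] for x in d["x"]]

print(d["x"].tolist())
```

Because the comprehension evaluates [] once per row, each replaced cell gets its own list object.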

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to use apply here so that the list object is not interpreted as an array of values to assign back to the df; otherwise pandas will try to align its shape with the original series.
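A small sketch of what apply hands back, using a made-up series s: the result is a Series that keeps the index labels of the masked rows and holds one list per row, rather than a bare [] that could be read as an empty array of values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample series: lists mixed with NaN.
s = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan])
mask = s.isnull()

# apply calls the lambda once per masked row, so each row gets its own list.
replacement = s[mask].apply(lambda _: [])

print(replacement.index.tolist())
print(replacement.tolist())
```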

EDIT

Using your updated sample the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64

To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
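One caveat, shown here in plain Python: [[]] * n repeats a reference to a single list, so cells filled this way can end up sharing one list object, and an in-place change to one of them shows up in all of them. A list comprehension is one possible alternative that creates separate lists, at some extra cost.

```python
n = 3

# Three references to the same list object.
shared = [[]] * n
shared[0].append("a")
print(shared)   # [['a'], ['a'], ['a']]

# Three independent list objects.
fresh = [[] for _ in range(n)]
fresh[0].append("a")
print(fresh)    # [['a'], [], []]
```

If the empty lists are only ever replaced and never mutated in place, the shared version is harmless.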

A quick timing comparison:

def empty_assign_1(s):
    s[s.isna()].apply(lambda x: [])

def empty_assign_2(s):
    [[]] * s.isna().sum()

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 61 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 2.17 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 28 times faster in this measurement (61 ms vs. 2.17 ms)!
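%timeit is an IPython magic. Outside IPython, a rough equivalent can be put together with the standard timeit module. The sketch below uses a smaller series and a hypothetical helper named best_time, and the numbers it prints will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def empty_assign_1(s):
    # One lambda call per NaN row.
    return s[s.isna()].apply(lambda x: [])


def empty_assign_2(s):
    # A single list multiplication; int() makes the repeat count a plain Python int.
    return [[]] * int(s.isna().sum())


def best_time(func, arg, repeat=3, number=5):
    # Hypothetical helper: best average seconds per call over several repeats.
    return min(timeit.repeat(lambda: func(arg), repeat=repeat, number=number)) / number


rng = np.random.default_rng(0)
series = pd.Series(rng.choice([1, 2, np.nan], 100_000))

t1 = best_time(empty_assign_1, series)
t2 = best_time(empty_assign_2, series)
print(f"apply: {t1 * 1e3:.2f} ms, list multiplication: {t2 * 1e3:.2f} ms")
```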

EDIT: Fixed a bug pointed out by @valentin

You have to be somewhat careful with data types when performing assignment in this case. In the example above, the test series is float, but adding [] elements coerces the entire series to object. Pandas will handle that for you if you do something like:

idx = series.isna()
series[idx] = series[idx].apply(lambda x: [])

This works because the output of apply is itself a Series. You can test live performance with assignment overhead like so (I've added a string value so the series will be object dtype; you could instead use a number as the replacement value rather than an empty list to avoid coercion):

def empty_assign_1(s):
    idx = s.isna()
    s[idx] = s[idx].apply(lambda x: [])

def empty_assign_2(s):
    idx = s.isna()
    s.loc[idx] = [[]] * idx.sum()

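# Caveat: np.random.choice first converts this mixed list to a NumPy string array,
# so np.nan becomes the string 'nan'; pass an array built with dtype=object to keep real NaN.
series = pd.Series(np.random.choice([1, 2, np.nan, '2'], 1000000))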

%timeit empty_assign_1(series.copy())
>>> 45.1 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series.copy())
>>> 24 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

About 4 ms of each timing is due to the copy. With assignment included, the speedup drops to roughly 2x, which is still pretty great.
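The dtype coercion mentioned in this answer can be seen directly. This is a minimal sketch with a made-up four-element series; it builds the result as a new Series instead of assigning in place, which is an assumption made to keep the example simple, not the approach timed above.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, 2.0, np.nan])
print(series.dtype)   # float64

# Build a new Series with an empty list wherever the value is NaN.
filled = pd.Series([[] if pd.isna(v) else v for v in series], index=series.index)
print(filled.dtype)   # object
print(filled.tolist())
```

Once a single list is stored, the whole series holds Python objects, so numeric operations on it lose the speed of a float series.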
