Fill in missing pandas data with previous non-missing value, grouped by key

Solution for multi-key problem:

In this example, the data has the key [date, region, type]. Date is the index on the original dataframe.

import os
import pandas as pd

#sort to make indexing faster
df.sort_values(by=['date','region','type'], inplace=True)

#collect all possible regions and types
regions = list(set(df['region']))
types = list(set(df['type']))

#record column names
df_cols = df.columns

#delete ffill_df.csv so we can begin anew
try:
    os.remove('ffill_df.csv')
except FileNotFoundError:
    pass

# steps:
# 1) grab rows with a particular region and type
# 2) use forwardfill to fill nulls
# 3) use backwardfill to fill remaining nulls
# 4) append to file
for r in regions:
    for t in types:
        group_df = df[(df.region == r) & (df.type == t)].copy()
        group_df.ffill(inplace=True)
        group_df.bfill(inplace=True)
        group_df.to_csv('ffill_df.csv', mode='a', header=False, index=True)

Checking the result:

#load in the ffill_df; the first CSV column is the date index written by to_csv
ffill_df = pd.read_csv('ffill_df.csv', header=None, index_col=None)
ffill_df.columns = ['date'] + list(df_cols)
ffill_df.set_index('date', inplace=True)
ffill_df.head()

#compare new and old dataframe
print(df.shape)
print(ffill_df.shape)
print()
print(pd.isnull(ffill_df).sum())
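
The same per-group fill can also be sketched without the loop or the temporary CSV file, by letting groupby do the splitting. The small frame below is made up for illustration (its value column and its dates are assumptions, not the original data), with date as the index and region and type as the grouping keys.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: date is the index, region/type are the group keys.
df = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west", "east", "west"],
        "type": ["a", "b", "a", "b", "a", "b"],
        "value": [np.nan, 5.0, 1.0, np.nan, np.nan, np.nan],
    },
    index=pd.DatetimeIndex(
        ["2020-01-01", "2020-01-01", "2020-01-02",
         "2020-01-02", "2020-01-03", "2020-01-03"],
        name="date",
    ),
)

keys = ["region", "type"]
filled = df.copy()
# Forward-fill within each (region, type) group, then back-fill what is left
# (the leading nulls of a group), again without crossing group boundaries.
filled["value"] = filled.groupby(keys)["value"].ffill()
filled["value"] = filled.groupby(keys)["value"].bfill()
print(filled)
```

Rows keep their original order and index here, so no reload step is needed afterwards.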

You could perform a groupby/forward-fill operation on each group:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})
df['x'] = df.groupby(['id'])['x'].ffill()
print(df)

yields

   id      x
0   1   10.0
1   1   20.0
2   2  100.0
3   2  200.0
4   1   20.0
5   2  200.0
6   1  300.0
7   1  300.0
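
To see why the groupby matters, a quick comparison with a plain forward fill on the same sample frame may help. This is only an illustrative sketch; the names plain, grouped and comparison are introduced here and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,1,2,2,1,2,1,1], 'x':[10,20,100,200,np.nan,np.nan,300,np.nan]})

# A plain ffill ignores id and copies the value from the row directly above.
plain = df['x'].ffill()
# A grouped ffill only looks at earlier rows that share the same id.
grouped = df.groupby('id')['x'].ffill()

comparison = pd.DataFrame({'id': df['id'], 'plain': plain, 'grouped': grouped})
# Show only the rows where the two approaches disagree.
print(comparison[plain != grouped])
```

Only row 4 differs: the plain fill takes 200.0 from the id 2 row above it, while the grouped fill takes 20.0, the last value seen for id 1.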

df
   id   val
0   1   23.0
1   1   NaN
2   1   NaN
3   2   NaN
4   2   34.0
5   2   NaN
6   3   2.0
7   3   NaN
8   3   NaN

df.sort_values(['id','val']).assign(val=lambda d: d.groupby('id')['val'].ffill())

    id  val
0   1   23.0
1   1   23.0
2   1   23.0
4   2   34.0
3   2   34.0
5   2   34.0
6   3   2.0
7   3   2.0
8   3   2.0

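Sort with sort_values before applying groupby and ffill: sorting on val places NaN last within each id, so a group whose first value (or first several values) is NaN still gets filled. The rows come back in sorted order, and each NaN takes the largest non-missing value in its group.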
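
If the original row order matters, one alternative sketch (not the approach used in the answer above) is to forward-fill and then back-fill inside each group, which also covers leading NaN values without sorting. The frame below reproduces the sample data shown earlier; the name out is introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'val': [23.0, np.nan, np.nan, np.nan, 34.0, np.nan, 2.0, np.nan, np.nan],
})

out = df.copy()
# Within each id: carry the last seen value forward, then fill any
# leading NaN from the first non-missing value that follows it.
out['val'] = df.groupby('id')['val'].transform(lambda s: s.ffill().bfill())
print(out)
```

Unlike the sorted version, a group with several distinct values keeps the nearest neighbour for each gap (previous value first, next value only for leading gaps).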
