What is the most efficient way to fill missing values in this data frame?
t = df.groupby('ID')['year'].agg(['min','max']).reset_index()
t['missing'] = t.transform(lambda x: [y for y in range(x['min'], x['max']+1) if y not in x.values], axis=1)
t = t[['ID','missing']].explode('missing').dropna()
t['number'] = 0
t.columns = ['ID','year','number']
pd.concat([df,t]).sort_values(by=['ID','year'])
Output
ID year number
0 A 2017 1
0 A 2018 0
1 A 2019 1
2 B 2017 1
3 B 2018 1
4 C 2016 1
2 C 2017 0
2 C 2018 0
5 C 2019 1
A slightly faster approach rather than using explode
is to use pd.Series constructor. And you can use .iloc if years are already sorted from earliest to latest.
idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
Output:
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1
Here is another approach with reindex
u = df.groupby('ID')['year'].apply(lambda x: range(x.min(),x.max()+1)).explode()
out = (df.set_index(['ID','year']).reindex(u.reset_index().to_numpy(),fill_value=0)
.reset_index())
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1