Slow pandas DataFrame MultiIndex reindex

Another solution with reindex without using explode:

result = (df.groupby(["id","start_time"])
          .apply(lambda d: d.set_index("sequence_no")
          .reindex(range(min(d["sequence_no"]),max(d["sequence_no"])+1)))
          .drop(["id","start_time"],axis=1).reset_index()
          .interpolate())

print (result)

#
    id                 start_time  sequence_no  value
0   71  2018-10-17 20:12:43+00:00       114428    3.0
1   71  2018-10-17 20:12:43+00:00       114429    3.0
2   71  2018-10-17 20:12:43+00:00       114430   41.0
3   71  2018-10-17 20:12:43+00:00       114431   79.0
4   71  2019-11-06 00:51:14+00:00       216009  100.0
5   71  2019-11-06 00:51:14+00:00       216010  125.0
6   71  2019-11-06 00:51:14+00:00       216011  150.0
7   71  2019-11-06 00:51:14+00:00       216012  165.0
8   71  2019-11-06 00:51:14+00:00       216013  180.0
9   92  2019-12-01 00:51:14+00:00       114430   19.0
10  92  2019-12-01 00:51:14+00:00       114431   39.0
11  92  2019-12-01 00:51:14+00:00       114432   59.0
12  92  2019-12-01 00:51:14+00:00       114433   79.0
13  92  2019-12-01 00:51:14+00:00       114434  100.0

using merge instead of reindex may speed things up. Also, using map instead of the list comprehension may as well.

Click to copy

# Generate dummy data
df = pd.DataFrame([
    (71, '2018-10-17 20:12:43+00:00', 114428, 3),
    (71, '2018-10-17 20:12:43+00:00', 114429, 3),
    (71, '2018-10-17 20:12:43+00:00', 114431, 79),
    (71, '2019-11-06 00:51:14+00:00', 216009, 100),
    (71, '2019-11-06 00:51:14+00:00', 216011, 150),
    (71, '2019-11-06 00:51:14+00:00', 216013, 180),
    (92, '2019-12-01 00:51:14+00:00', 114430, 19),
    (92, '2019-12-01 00:51:14+00:00', 114433, 79),
    (92, '2019-12-01 00:51:14+00:00', 114434, 100),   
], columns=['id', 'start_time', 'sequence_no', 'value'])

# create a ranges df with groupby and agg
ranges = df.groupby(['start_time', 'id'])['sequence_no'].agg([('sequence_min', np.min), ('sequence_max', np.max)])
# map with range to create the sequence number rnage
ranges['sequence_no'] = list(map(lambda x,y: range(x,y), ranges.pop('sequence_min'), ranges.pop('sequence_max')+1))
# explode you DataFrame
new_df = ranges.explode('sequence_no')
# merge new_df and df
merge = new_df.reset_index().merge(df, on=['start_time', 'id', 'sequence_no'], how='left')
# interpolate and assign values 
merge['value'] = merge['value'].interpolate()

                   start_time  id sequence_no  value
0   2018-10-17 20:12:43+00:00  71      114428    3.0
1   2018-10-17 20:12:43+00:00  71      114429    3.0
2   2018-10-17 20:12:43+00:00  71      114430   41.0
3   2018-10-17 20:12:43+00:00  71      114431   79.0
4   2019-11-06 00:51:14+00:00  71      216009  100.0
5   2019-11-06 00:51:14+00:00  71      216010  125.0
6   2019-11-06 00:51:14+00:00  71      216011  150.0
7   2019-11-06 00:51:14+00:00  71      216012  165.0
8   2019-11-06 00:51:14+00:00  71      216013  180.0
9   2019-12-01 00:51:14+00:00  92      114430   19.0
10  2019-12-01 00:51:14+00:00  92      114431   39.0
11  2019-12-01 00:51:14+00:00  92      114432   59.0
12  2019-12-01 00:51:14+00:00  92      114433   79.0
13  2019-12-01 00:51:14+00:00  92      114434  100.0

A shorter version of the merge solution:

Click to copy

df.groupby(['start_time', 'id'])['sequence_no']\
.apply(lambda x: np.arange(x.min(), x.max() + 1))\
.explode().reset_index()\
.merge(df, on=['start_time', 'id', 'sequence_no'], how='left')\
.interpolate()

Output:

Click to copy

                   start_time  id sequence_no  value
0   2018-10-17 20:12:43+00:00  71      114428    3.0
1   2018-10-17 20:12:43+00:00  71      114429    3.0
2   2018-10-17 20:12:43+00:00  71      114430   41.0
3   2018-10-17 20:12:43+00:00  71      114431   79.0
4   2019-11-06 00:51:14+00:00  71      216009  100.0
5   2019-11-06 00:51:14+00:00  71      216010  125.0
6   2019-11-06 00:51:14+00:00  71      216011  150.0
7   2019-11-06 00:51:14+00:00  71      216012  165.0
8   2019-11-06 00:51:14+00:00  71      216013  180.0
9   2019-12-01 00:51:14+00:00  92      114430   19.0
10  2019-12-01 00:51:14+00:00  92      114431   39.0
11  2019-12-01 00:51:14+00:00  92      114432   59.0
12  2019-12-01 00:51:14+00:00  92      114433   79.0
13  2019-12-01 00:51:14+00:00  92      114434  100.0

Slow pandas DataFrame MultiIndex reindex

Tags:

Python

Pandas

Numpy

Dataframe

Related

Recent Posts