Pandas groupby apply performing slow

The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will be magnified. You could probably use a vectorized operation rather than a for loop in your function to save time, but a much easier way to shave off a few seconds is to return 0 rather than return group. When you return group, pandas will actually create a new data object combining your sorted groups, which you don't appear to use. When you return 0, pandas will combine 5300 zeros instead, which is much faster.

For example:

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
groups = df.groupby(cols)
print(len(groups))
# 5353

%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop

%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop

Just combining the results you don't use is taking about 2.4 seconds; the rest of the time is actual computation in your loop which you should attempt to vectorize.

Edit:

With a quick additional vectorized check before the for loop and returning 0 instead of group, I got the time down to about ~2sec, which is basically the cost of sorting each group. Try this function:

def Full_coverage(group):
    if len(group) > 1:
        group = group.sort('SectionStart', ascending=True)

        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0,4]
            end_km = group.iloc[0,5]
            end_km_index = group.index[0]

            for index, (i, j) in group.iloc[1:,[4,5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))                
                start_km = i
                end_km = j
                end_km_index = index

    return 0

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop

Edit 2: here's an example which incorporates my suggestion to move the sort outside the groupby and to remove the unnecessary loops. Removing the loops is not much faster for the given example, but will be faster if there are a lot of incompletes:

def Full_coverage_new(group):
    if len(group) > 1:
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
df_s = df.sort_values(['SectionStart','SectionStop'])
df_s.groupby(cols).apply(Full_coverage_nosort)

Pandas groupby apply performing slow

Tags:

Python

Pandas

Python 3.X

Related

Recent Posts