Find and sum date ranges with overlapping records in postgresql
demo:db<>fiddle (uses the old data set with the overlapping A-B-part)
Disclaimer: This works for day intervals not for timestamps. The requirement for ts came later.
SELECT
s.acts,
s.sum,
MIN(a.start) as start,
MAX(a.end) as end
FROM (
SELECT DISTINCT ON (acts)
array_agg(name) as acts,
SUM(count)
FROM
activities, generate_series(start, "end", interval '1 day') gs
GROUP BY gs
HAVING cardinality(array_agg(name)) > 1
) s
JOIN activities a
ON a.name = ANY(s.acts)
GROUP BY s.acts, s.sum
generate_series
generates all dates between start and end. So every date an activity exists gets one row with the specificcount
- Grouping all dates, aggregating all existing activities and sum of their counts
HAVING
filters out the dates where only one activity exist- Because there are different days with the same activities we only need one representant: Filter all duplicates with
DISTINCT ON
- Join this result against the original table to get the start and end. (note that "end" is a reserved word in Postgres, you should better find another column name!). It was more comfortable to lose them before but its possible to get these data within the subquery.
- Group this join to get the most early and latest date of each interval.
Here's a version for timestamps:
demo:db<>fiddle
WITH timeslots AS (
SELECT * FROM (
SELECT
tsrange(timepoint, lead(timepoint) OVER (ORDER BY timepoint)),
lead(timepoint) OVER (ORDER BY timepoint) -- 2
FROM (
SELECT
unnest(ARRAY[start, "end"]) as timepoint -- 1
FROM
activities
ORDER BY timepoint
) s
)s WHERE lead IS NOT NULL -- 3
)
SELECT
GREATEST(MAX(start), lower(tsrange)), -- 6
LEAST(MIN("end"), upper(tsrange)),
array_agg(name), -- 5
sum(count)
FROM
timeslots t
JOIN activities a
ON t.tsrange && tsrange(a.start, a.end) -- 4
GROUP BY tsrange
HAVING cardinality(array_agg(name)) > 1
The main idea is to identify possible time slots. So I take every known time (both start and end) and put them into a sorted list. So I can take the first tow known times (17:00 from start A and 18:00 from start B) and check which interval is in it. Then I check it for the 2nd and 3rd, then for 3rd an 4th and so on.
In the first timeslot only A fits. In the second from 18-19 also B is fitting. In the next slot 19-20 also C, from 20 to 20:30 A isn't fitting anymore, only B and C. The next one is 20:30-22 where only B fits, finally 22-23 D is added to B and last but not least only D fits into 23-23:30.
So I take this time list and join it agains the activities table where the intervals intersect. After that its only a grouping by time slot and sum up your count.
- this puts both ts of a row into one array whose elements are expanded into one row per element with
unnest
. So I get all times into one column which can be simply ordered - using the lead window function allows to take the value of the next row into the current one. So I can create a timestamp range out of these both values with
tsrange
- This filter is necessary because the last row has no "next value". This creates a
NULL
value which is interpreted bytsrange
as infinity. So this would create an incredible wrong time slot. So we need to filter this row out. - Join the time slots against the original table. The
&&
operator checks if two range types overlap. - Grouping by single time slots, aggregating the names and the count. Filter out the time slots with only one activity by using the
HAVING
clause - A little bit tricky to get the right start and end points. So the start points are either the maximum of the activity start or the beginning of a time slot (which can be get using
lower
). E.g. Take the 20-20:30 slot: It begins 20h but neither B nor C has its starting point there. Similar the end time.