Fastest way to count how many dates' ranges cover each date from series
And how to include pairs date-kind with 0 count?
Build a grid of all combinations then LATERAL
join to your table, like this:
SELECT k.kind, d.as_of_date, c.n
FROM (SELECT DISTINCT kind FROM dates_ranges) k
CROSS JOIN (
SELECT d::date AS as_of_date
FROM generate_series(timestamp '2018-01-01', timestamp '2018-01-03', interval '1 day') d
) d
CROSS JOIN LATERAL (
SELECT count(*)::int AS n
FROM dates_ranges
WHERE kind = k.kind
AND d.as_of_date BETWEEN start_date AND end_date
) c
ORDER BY k.kind, d.as_of_date;
Should also be as fast as possible.
I had LEFT JOIN LATERAL ... on true
at first, but there is an aggregate in the subquery c
, so we always get a row and can use CROSS JOIN
as well. No difference in performance.
If you have a table holding all relevant kinds, use that instead of generating the list with subquery k
.
The cast to integer
is optional. Else you get bigint
.
Indexes would help, especially a multicolumn index on (kind, start_date, end_date)
. Since you are building on a subquery, this may or may not be possible to achieve.
Using set-returning functions like generate_series()
in the SELECT
list is generally not advisable in Postgres versions before 10 (unless you know exactly what you are doing). See:
- What is the expected behaviour for multiple set-returning functions in select clause?
If you have lots of combinations with few or no rows, this equivalent form may be faster:
SELECT k.kind, d.as_of_date, count(dr.kind)::int AS n
FROM (SELECT DISTINCT kind FROM dates_ranges) k
CROSS JOIN (
SELECT d::date AS as_of_date
FROM generate_series(timestamp '2018-01-01', timestamp '2018-01-03', interval '1 day') d
) d
LEFT JOIN dates_ranges dr ON dr.kind = k.kind
AND d.as_of_date BETWEEN dr.start_date AND dr.end_date
GROUP BY 1, 2
ORDER BY 1, 2;
The following query also works if "missing zeros" are OK:
select *
from (
select
kind,
generate_series(start_date, end_date, interval '1 day')::date as d,
count(*)
from dates_ranges
group by 1, 2
) x
where d between date '2018-01-01' and date '2018-01-03'
order by 1, 2;
but it's not any faster than the lateral
version with the small dataset. It might scale better though, as no join is required, but the above version aggregates over all the rows, so it may lose out there again.
The following query tries to avoid unnecessary work by removing any series that don't overlap anyway:
select
kind,
generate_series(greatest(start_date, date '2018-01-01'), least(end_date, date '2018-01-03'), interval '1 day')::date as d,
count(*)
from dates_ranges
where (start_date, end_date + interval '1 day') overlaps (date '2018-01-01', date '2018-01-03' + interval '1 day')
group by 1, 2
order by 1, 2;
-- and I got to use the overlaps
operator! Note that you have to add interval '1 day'
to the right as the overlaps operator considers time periods to be open on the right (which is fairly logical because a date is often considered to be a timestamp with time component of midnight).