Combining separate ranges into largest possible contiguous ranges
Assumptions / Clarifications
- No need to differentiate between infinity and an open upper bound (upper(range) IS NULL). (You can have it either way, but it's simpler this way.) See:
  - NULL vs. infinity in PostgreSQL range types
- Since date is a discrete type, all ranges have default [) bounds. The manual:
  "The built-in range types int4range, int8range, and daterange all use a canonical form that includes the lower bound and excludes the upper bound; that is, [)."
For other types (like tsrange!) I would enforce the same if possible:
- Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL
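All queries below assume a table like the following. This is only a hypothetical setup for illustration (the actual table definition and sample data are not shown here); the table name test and column name range are what the queries reference:
CREATE TABLE test (range daterange);

INSERT INTO test (range) VALUES
  (daterange('2024-01-01', '2024-01-10'))  -- [) bounds, per the canonical form
, (daterange('2024-01-05', '2024-01-15'))  -- overlaps the first row
, (daterange('2024-02-01', NULL));         -- open upper bound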
Solution with pure SQL
With CTEs for clarity:
WITH a AS (
SELECT range
, COALESCE(lower(range),'-infinity') AS startdate
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
FROM test
)
, b AS (
SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM b
)
SELECT daterange(min(startdate), max(enddate)) AS range
FROM c
GROUP BY grp
ORDER BY 1;
Or, the same with subqueries, faster but less easy to read:
SELECT daterange(min(startdate), max(enddate)) AS range
FROM (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM (
SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
FROM (
SELECT range
, COALESCE(lower(range),'-infinity') AS startdate
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
FROM test
) a
) b
) c
GROUP BY grp
ORDER BY 1;
How?
a: While ordering by range, compute the running maximum of the upper bound (enddate) with a window function. Replace NULL bounds (unbounded) with +/- infinity just to simplify (no special NULL cases).
b: In the same sort order, if the previous enddate is earlier than startdate, we have a gap and start a new range (step). Remember, the upper bound is always excluded.
c: Form groups (grp) by counting steps with another window function.
In the outer SELECT, build ranges from lower to upper bound in each group. Voilà.
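The step expression relies on three-valued logic: false OR NULL yields NULL, which count() ignores, while true OR NULL yields true and is counted. So the running count(step) only increments where a gap starts. A quick illustration (not part of the solution):
SELECT (true  OR NULL) AS starts_new_group  -- true: counted by count(step)
     , (false OR NULL) AS same_group;       -- NULL: ignored by count(step)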
Or with one less subquery level, but flipping sort order:
SELECT daterange(min(COALESCE(lower(range), '-infinity')), max(enddate)) AS range
FROM (
SELECT *, count(nextstart > enddate OR NULL) OVER (ORDER BY range DESC NULLS LAST) AS grp
FROM (
SELECT range
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
, lead(lower(range)) OVER (ORDER BY range) As nextstart
FROM test
) a
) b
GROUP BY grp
ORDER BY 1;
Sort the window in the second step with ORDER BY range DESC NULLS LAST (with NULLS LAST) to get a perfectly inverted sort order. This should be cheaper (easier to produce, and it matches the sort order of the suggested index perfectly) and accurate for corner cases with rank IS NULL. See:
- PostgreSQL sort by datetime asc, null first?
Related answer with more explanation:
- Compare multiple date ranges
Procedural solution with plpgsql
Works for any table / column name, but only for type daterange.
Procedural solutions with loops are typically slower, but in this special case I expect the function to be substantially faster since it only needs a single sequential scan:
CREATE OR REPLACE FUNCTION f_range_agg(_tbl text, _col text)
RETURNS SETOF daterange AS
$func$
DECLARE
_lower date;
_upper date;
_enddate date;
_startdate date;
BEGIN
FOR _lower, _upper IN EXECUTE
format(
$sql$
SELECT COALESCE(lower(t.%2$I),'-infinity') -- replace NULL with ...
, COALESCE(upper(t.%2$I), 'infinity') -- ... +/- infinity
FROM %1$I t
ORDER BY t.%2$I
$sql$, _tbl, _col)
LOOP
IF _lower > _enddate THEN -- return previous range
RETURN NEXT daterange(_startdate, _enddate);
SELECT _lower, _upper INTO _startdate, _enddate;
ELSIF _upper > _enddate THEN -- expand range
_enddate := _upper;
-- do nothing if _upper <= _enddate (range already included) ...
ELSIF _enddate IS NULL THEN -- init 1st round
SELECT _lower, _upper INTO _startdate, _enddate;
END IF;
END LOOP;
IF FOUND THEN -- return last row
RETURN NEXT daterange(_startdate, _enddate);
END IF;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_range_agg('test', 'range'); -- table and column name
The logic is similar to the SQL solutions, but we can make do with a single pass.
SQL Fiddle.
Related:
- GROUP BY and aggregate sequential numeric values
The usual drill for handling user input in dynamic SQL:
- SQL injection in Postgres functions vs prepared queries
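In the function above, format() with %I applies identifier quoting (like quote_ident()), which is what neutralizes malicious table or column names. A small illustration with made-up identifiers:
SELECT format('SELECT t.%2$I FROM %1$I t ORDER BY t.%2$I', 'My Table', 'my col');
-- SELECT t."my col" FROM "My Table" t ORDER BY t."my col"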
Index
For each of these solutions, a plain (default) btree index on range would be instrumental for performance in big tables:
CREATE INDEX foo ON test (range);
A btree index is of limited use for range types, but we can get pre-sorted data and maybe even an index-only scan.
I've come up with this:
DO $$
DECLARE
i date;
a daterange := 'empty';
day_as_range daterange;
extreme_value date := '2100-12-31';
BEGIN
FOR i IN
SELECT DISTINCT
generate_series(
lower(range),
COALESCE(upper(range) - interval '1 day', extreme_value),
interval '1 day'
)::date
FROM rangetest
ORDER BY 1
LOOP
day_as_range := daterange(i, i, '[]');
BEGIN
IF isempty(a)
THEN a := day_as_range;
ELSE a = a + day_as_range;
END IF;
EXCEPTION WHEN data_exception THEN
RAISE INFO '%', a;
a = day_as_range;
END;
END LOOP;
IF upper(a) = extreme_value + interval '1 day'
THEN a := daterange(lower(a), NULL);
END IF;
RAISE INFO '%', a;
END;
$$;
Still needs a bit of honing, but the idea is the following:
1. explode the ranges to individual dates
2. doing this, replace the infinite upper bound with some extreme value
3. based on the ordering from (1), start building the ranges
4. when the union (+) fails, return the already built range and reinitialize (see the illustration below)
5. finally, return the rest; if the predefined extreme value is reached, replace it with NULL to get an infinite upper bound
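For illustration: the + (range union) operator only works while the two ranges overlap or are adjacent; for disjoint ranges it raises an error (the result would not be contiguous), which falls into class data_exception and is what the EXCEPTION branch in the block above relies on to detect a gap. A quick demo with made-up dates:
SELECT daterange('2024-01-01', '2024-01-05') + daterange('2024-01-05', '2024-01-10');  -- adjacent: [2024-01-01,2024-01-10)
SELECT daterange('2024-01-01', '2024-01-05') + daterange('2024-02-01', '2024-02-10');  -- disjoint: raises the error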
Some years ago I tested different solutions (among others, some similar to those from @ErwinBrandstetter) for merging overlapping periods on a Teradata system, and I found the following to be the most efficient (using analytical functions; newer versions of Teradata have built-in functions for that task):
1. Sort the rows by start date.
2. Find the maximum end date of all previous rows: maxEnddate.
3. If this date is less than the current start date, you found a gap. Only keep those rows plus the first row within the PARTITION (which is indicated by a NULL) and filter all other rows. Now you get the start date of each range and the end date of the previous range.
4. Then you simply get the next row's maxEnddate using LEAD and you're almost done. Only for the last row does LEAD return a NULL; to solve this, calculate the maximum end date of all rows of the partition in step 2 and COALESCE it.
Why was it faster? Depending on the actual data, step #2 might greatly reduce the number of rows, so the next step only needs to operate on a small subset; additionally, it removes the aggregation.
fiddle
SELECT
daterange(startdate
,COALESCE(LEAD(maxPrevEnddate) -- next row's end date
OVER (ORDER BY startdate)
,maxEnddate) -- or maximum end date
) AS range
FROM
(
SELECT
range
,COALESCE(LOWER(range),'-infinity') AS startdate
-- find the maximum end date of all previous rows
-- i.e. the END of the previous range
,MAX(COALESCE(UPPER(range), 'infinity'))
OVER (ORDER BY range
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS maxPrevEnddate
-- maximum end date of this partition
-- only needed for the last range
,MAX(COALESCE(UPPER(range), 'infinity'))
OVER () AS maxEnddate
FROM test
) AS dt
WHERE maxPrevEnddate < startdate -- keep the rows where a new range starts
OR maxPrevEnddate IS NULL -- and keep the first row
ORDER BY 1;
As this was fastest on Teradata, I don't know if the same holds for PostgreSQL; it would be nice to get some actual performance numbers.