Combining separate ranges into largest possible contiguous ranges
Assumptions / Clarifications
- No need to differentiate between infinity and an open upper bound (upper(range) IS NULL). (You can have it either way, but it's simpler this way.) See:
  - NULL vs. infinity in PostgreSQL range types
- Since date is a discrete type, all ranges have default [) bounds. The manual:
  "The built-in range types int4range, int8range, and daterange all use a canonical form that includes the lower bound and excludes the upper bound; that is, [)."
For other types (like tsrange!) I would enforce the same if possible:
- Preventing adjacent/overlapping entries with EXCLUDE in PostgreSQL
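All queries below assume a table like the following. This is only a hypothetical setup for illustration (the actual table definition and sample data are not shown here); the table name test and column name range are what the queries reference:
CREATE TABLE test (range daterange);

INSERT INTO test (range) VALUES
  (daterange('2024-01-01', '2024-01-10'))  -- [) bounds, per the canonical form
, (daterange('2024-01-05', '2024-01-15'))  -- overlaps the first row
, (daterange('2024-02-01', NULL));         -- open upper bound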
Solution with pure SQL
With CTEs for clarity:
WITH a AS (
SELECT range
, COALESCE(lower(range),'-infinity') AS startdate
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
FROM test
)
, b AS (
SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM b
)
SELECT daterange(min(startdate), max(enddate)) AS range
FROM c
GROUP BY grp
ORDER BY 1;
Or, the same with subqueries, faster but less easy to read:
SELECT daterange(min(startdate), max(enddate)) AS range
FROM (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM (
SELECT *, lag(enddate) OVER (ORDER BY range) < startdate OR NULL AS step
FROM (
SELECT range
, COALESCE(lower(range),'-infinity') AS startdate
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
FROM test
) a
) b
) c
GROUP BY grp
ORDER BY 1;
How?
a: While ordering by range, compute the running maximum of the upper bound (enddate) with a window function. Replace NULL bounds (unbounded) with +/- infinity just to simplify (no special NULL cases).
b: In the same sort order, if the previous enddate is earlier than startdate, we have a gap and start a new range (step). Remember, the upper bound is always excluded.
c: Form groups (grp) by counting steps with another window function.
In the outer SELECT, build ranges from lower to upper bound in each group. Voilà.
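The step expression relies on three-valued logic: false OR NULL yields NULL, which count() ignores, while true OR NULL yields true and is counted. So the running count(step) only increments where a gap starts. A quick illustration (not part of the solution):
SELECT (true  OR NULL) AS starts_new_group  -- true: counted by count(step)
     , (false OR NULL) AS same_group;       -- NULL: ignored by count(step)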
Or with one less subquery level, but flipping sort order:
SELECT daterange(min(COALESCE(lower(range), '-infinity')), max(enddate)) AS range
FROM (
SELECT *, count(nextstart > enddate OR NULL) OVER (ORDER BY range DESC NULLS LAST) AS grp
FROM (
SELECT range
, max(COALESCE(upper(range), 'infinity')) OVER (ORDER BY range) AS enddate
, lead(lower(range)) OVER (ORDER BY range) As nextstart
FROM test
) a
) b
GROUP BY grp
ORDER BY 1;
Sort the window in the second step with ORDER BY range DESC NULLS LAST (with NULLS LAST) to get a perfectly inverted sort order. This should be cheaper (easier to produce, and it matches the sort order of the suggested index perfectly) and accurate for corner cases with rank IS NULL. See:
- PostgreSQL sort by datetime asc, null first?
Related answer with more explanation:
- Compare multiple date ranges
Procedural solution with plpgsql
Works for any table / column name, but only for type daterange.
Procedural solutions with loops are typically slower, but in this special case I expect the function to be substantially faster since it only needs a single sequential scan:
CREATE OR REPLACE FUNCTION f_range_agg(_tbl text, _col text)
RETURNS SETOF daterange AS
$func$
DECLARE
_lower date;
_upper date;
_enddate date;
_startdate date;
BEGIN
FOR _lower, _upper IN EXECUTE
format(
$sql$
SELECT COALESCE(lower(t.%2$I),'-infinity') -- replace NULL with ...
, COALESCE(upper(t.%2$I), 'infinity') -- ... +/- infinity
FROM %1$I t
ORDER BY t.%2$I
$sql$, _tbl, _col)
LOOP
IF _lower > _enddate THEN -- return previous range
RETURN NEXT daterange(_startdate, _enddate);
SELECT _lower, _upper INTO _startdate, _enddate;
ELSIF _upper > _enddate THEN -- expand range
_enddate := _upper;
-- do nothing if _upper <= _enddate (range already included) ...
ELSIF _enddate IS NULL THEN -- init 1st round
SELECT _lower, _upper INTO _startdate, _enddate;
END IF;
END LOOP;
IF FOUND THEN -- return last row
RETURN NEXT daterange(_startdate, _enddate);
END IF;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_range_agg('test', 'range'); -- table and column name
The logic is similar to the SQL solutions, but we can make do with a single pass.
SQL Fiddle.
Related:
- GROUP BY and aggregate sequential numeric values
The usual drill for handling user input in dynamic SQL:
- SQL injection in Postgres functions vs prepared queries
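In the function above, format() with %I applies identifier quoting (like quote_ident()), which is what neutralizes malicious table or column names. A small illustration with made-up identifiers:
SELECT format('SELECT t.%2$I FROM %1$I t ORDER BY t.%2$I', 'My Table', 'my col');
-- SELECT t."my col" FROM "My Table" t ORDER BY t."my col"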
Index
For each of these solutions, a plain (default) btree index on range would be instrumental for performance in big tables:
CREATE INDEX foo ON test (range);
A btree index is of limited use for range types, but we can get pre-sorted data and maybe even an index-only scan.
I've come up with this:
DO $$
DECLARE
i date;
a daterange := 'empty';
day_as_range daterange;
extreme_value date := '2100-12-31';
BEGIN
FOR i IN
SELECT DISTINCT
generate_series(
lower(range),
COALESCE(upper(range) - interval '1 day', extreme_value),
interval '1 day'
)::date
FROM rangetest
ORDER BY 1
LOOP
day_as_range := daterange(i, i, '[]');
BEGIN
IF isempty(a)
THEN a := day_as_range;
ELSE a = a + day_as_range;
END IF;
EXCEPTION WHEN data_exception THEN
RAISE INFO '%', a;
a = day_as_range;
END;
END LOOP;
IF upper(a) = extreme_value + interval '1 day'
THEN a := daterange(lower(a), NULL);
END IF;
RAISE INFO '%', a;
END;
$$;
Still needs a bit of honing, but the idea is the following:
1. explode the ranges to individual dates
2. doing this, replace the infinite upper bound with some extreme value
3. based on the ordering from (1), start building the ranges
4. when the union (+) fails, return the already built range and reinitialize (see the illustration below)
5. finally, return the rest; if the predefined extreme value is reached, replace it with NULL to get an infinite upper bound
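For illustration: the + (range union) operator only works while the two ranges overlap or are adjacent; for disjoint ranges it raises an error (the result would not be contiguous), which falls into class data_exception and is what the EXCEPTION branch in the block above relies on to detect a gap. A quick demo with made-up dates:
SELECT daterange('2024-01-01', '2024-01-05') + daterange('2024-01-05', '2024-01-10');  -- adjacent: [2024-01-01,2024-01-10)
SELECT daterange('2024-01-01', '2024-01-05') + daterange('2024-02-01', '2024-02-10');  -- disjoint: raises the error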
Some years ago I tested different solutions (among others, some similar to those from @ErwinBrandstetter) for merging overlapping periods on a Teradata system, and I found the following to be the most efficient (using analytical functions; newer versions of Teradata have built-in functions for that task):
1. Sort the rows by start date.
2. Find the maximum end date of all previous rows: maxEnddate.
3. If this date is less than the current start date, you found a gap. Only keep those rows plus the first row within the PARTITION (which is indicated by a NULL) and filter all other rows. Now you get the start date of each range and the end date of the previous range.
4. Then you simply get the next row's maxEnddate using LEAD and you're almost done. Only for the last row does LEAD return a NULL; to solve this, calculate the maximum end date of all rows of the partition in step 2 and COALESCE it.
Why was it faster? Depending on the actual data, step #2 might greatly reduce the number of rows, so the next step only needs to operate on a small subset; additionally, it removes the aggregation.
fiddle
SELECT
daterange(startdate
,COALESCE(LEAD(maxPrevEnddate) -- next row's end date
OVER (ORDER BY startdate)
,maxEnddate) -- or maximum end date
) AS range
FROM
(
SELECT
range
,COALESCE(LOWER(range),'-infinity') AS startdate
-- find the maximum end date of all previous rows
-- i.e. the END of the previous range
,MAX(COALESCE(UPPER(range), 'infinity'))
OVER (ORDER BY range
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS maxPrevEnddate
-- maximum end date of this partition
-- only needed for the last range
,MAX(COALESCE(UPPER(range), 'infinity'))
OVER () AS maxEnddate
FROM test
) AS dt
WHERE maxPrevEnddate < startdate -- keep the rows where a new range starts
OR maxPrevEnddate IS NULL -- and keep the first row
ORDER BY 1;
As this was fastest on Teradata, I don't know if the same holds for PostgreSQL; it would be nice to get some actual performance numbers.