Gaps And Islands: Splitting Islands Based On External Table

Here is one more variant that is likely to perform better than my first answer. I am posting it as a separate answer because the approach is rather different and a single answer would be too long. Compare the performance of all variants with your real data on your hardware, and don't forget about indexes.
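One common way to compare variants in SQL Server is to turn on the I/O and timing statistics before running each query. This is only a minimal sketch: the SELECT is a placeholder (not part of either variant) that you would replace with the query being measured.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Placeholder: replace with one of the variants from the answers.
SELECT COUNT(*) AS HistoryRows
FROM History;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Compare the logical reads per table and the CPU and elapsed times between variants, ideally over several runs each.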

In the first variant I used APPLY to pick the relevant price for each row in the History table. For each row from History, the engine searches for the matching row in the PriceChange table. Even with an appropriate index on PriceChange, where each lookup is a single seek, it still means 3.7 million seeks in a loop join.

Instead, we can simply join the History and PriceChange tables together; with appropriate indexes on both tables it will be an efficient merge join.

Here I'm also using an extended sample data set to illustrate the gaps. I added these rows to the sample data from the question.

INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
  (0, 1000, '20180601', 5),
  (0, 1000, '20180602', 10),
  (0, 1000, '20180603', 7),
  (3, 5000, '20180607', 15),
  (3, 5000, '20180608', 23),
  (3, 5000, '20180609', 52),
  (3, 5000, '20180610', 12),
  (3, 5000, '20180611', 14);

Intermediate query

We do a FULL JOIN here, not a LEFT JOIN, because it is possible that the date on which the price changed doesn't appear in the History table at all.

WITH
CTE_Join
AS
(
    SELECT
        ISNULL(History.ProductId, PriceChange.ProductID) AS ProductID
        ,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
        ,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
        ,History.Quantity
        ,PriceChange.Price
    FROM
        History
        FULL JOIN PriceChange
            ON  PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
    SELECT
        ProductID
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
            OVER (PARTITION BY ProductID, DestinationId ORDER BY ScheduledDate 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM CTE_Join
)
SELECT *
FROM CTE2
ORDER BY
    ProductID
    ,DestinationId
    ,ScheduledDate
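To check whether the FULL JOIN matters for your data, a query along these lines lists the price changes whose date has no matching History row. It is a sketch that assumes the same History and PriceChange tables and columns used above. These are exactly the rows a LEFT JOIN from History would lose.

```sql
-- Price changes that fall on a date with no History row
-- for the same product and destination.
SELECT
    PriceChange.ProductID
    ,PriceChange.DestinationId
    ,PriceChange.EffectiveDate
    ,PriceChange.Price
FROM
    PriceChange
WHERE
    NOT EXISTS
    (
        SELECT 1
        FROM History
        WHERE
            History.ProductID = PriceChange.ProductID
            AND History.DestinationId = PriceChange.DestinationId
            AND History.ScheduledDate = PriceChange.EffectiveDate
    )
ORDER BY
    PriceChange.ProductID
    ,PriceChange.DestinationId
    ,PriceChange.EffectiveDate;
```

With the sample data this should return the rows that appear with a NULL Quantity in the intermediate result.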

Create the following indexes

CREATE UNIQUE NONCLUSTERED INDEX [IX_History] ON [dbo].[History]
(
    [ProductId] ASC,
    [DestinationId] ASC,
    [ScheduledDate] ASC
)
INCLUDE ([Quantity])

CREATE UNIQUE NONCLUSTERED INDEX [IX_Price] ON [dbo].[PriceChange]
(
    [ProductId] ASC,
    [DestinationId] ASC,
    [EffectiveDate] ASC
)
INCLUDE ([Price])

and the join will be an efficient MERGE join in the execution plan (not a LOOP join)

Intermediate result

+-----------+---------------+---------------+----------+-------+------------+
| ProductID | DestinationId | ScheduledDate | Quantity | Price |    grp     |
+-----------+---------------+---------------+----------+-------+------------+
|         0 |          1000 | 2018-02-01    | NULL     | 1     | 2018-02-01 |
|         0 |          1000 | 2018-04-01    | 5        | NULL  | 2018-02-01 |
|         0 |          1000 | 2018-04-02    | 10       | 2     | 2018-04-02 |
|         0 |          1000 | 2018-04-03    | 7        | NULL  | 2018-04-02 |
|         0 |          1000 | 2018-06-01    | 5        | NULL  | 2018-04-02 |
|         0 |          1000 | 2018-06-02    | 10       | NULL  | 2018-04-02 |
|         0 |          1000 | 2018-06-03    | 7        | NULL  | 2018-04-02 |
|         3 |          5000 | 2018-01-01    | NULL     | 5     | 2018-01-01 |
|         3 |          5000 | 2018-05-07    | 15       | NULL  | 2018-01-01 |
|         3 |          5000 | 2018-05-08    | 23       | NULL  | 2018-01-01 |
|         3 |          5000 | 2018-05-09    | 52       | NULL  | 2018-01-01 |
|         3 |          5000 | 2018-05-10    | 12       | 20    | 2018-05-10 |
|         3 |          5000 | 2018-05-11    | 14       | NULL  | 2018-05-10 |
|         3 |          5000 | 2018-06-07    | 15       | NULL  | 2018-05-10 |
|         3 |          5000 | 2018-06-08    | 23       | NULL  | 2018-05-10 |
|         3 |          5000 | 2018-06-09    | 52       | NULL  | 2018-05-10 |
|         3 |          5000 | 2018-06-10    | 12       | NULL  | 2018-05-10 |
|         3 |          5000 | 2018-06-11    | 14       | NULL  | 2018-05-10 |
+-----------+---------------+---------------+----------+-------+------------+

You can see that the Price column has a lot of NULL values. We need to "fill" these NULL values with the preceding non-NULL value.

Itzik Ben-Gan wrote a nice article showing how to solve this efficiently: "The Last non NULL Puzzle". Also see "Best way to replace NULL with most recent non-null value".

This is done in CTE2 using the MAX window function, and you can see how it populates the grp column. This requires SQL Server 2012+. After the groups are determined we should remove rows where Quantity is NULL, because these rows are not from the History table.
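The running MAX trick is easier to see on a toy data set. The following sketch uses made-up inline values (the names T, d and v are hypothetical, not from the question): d is the ordering column and v is the sparse value.

```sql
-- For each row, grp is the most recent d at which v was not NULL.
SELECT
    d
    ,v
    ,MAX(CASE WHEN v IS NOT NULL THEN d END)
        OVER (ORDER BY d
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
FROM
    (VALUES (1, 10), (2, NULL), (3, NULL), (4, 20), (5, NULL)) AS T(d, v)
ORDER BY d;
-- grp is 1, 1, 1, 4, 4 for d = 1..5
```

Rows that share a grp value belong to the same non-NULL "anchor" row, which is what CTE2 does per ProductID and DestinationId with ScheduledDate as the ordering column.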

Now we can do the same gaps-and-islands step, using the grp column as an additional partitioning column.

The rest of the query is pretty much the same as in the first variant.

Final query

WITH
CTE_Join
AS
(
    SELECT
        ISNULL(History.ProductId, PriceChange.ProductID) AS ProductID
        ,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
        ,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
        ,History.Quantity
        ,PriceChange.Price
    FROM
        History
        FULL JOIN PriceChange
            ON  PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
    SELECT
        ProductID
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
            OVER (PARTITION BY ProductID, DestinationId ORDER BY ScheduledDate 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM CTE_Join
)
,CTE_RN
AS
(
    SELECT
        ProductID
        ,DestinationId
        ,ScheduledDate
        ,grp
        ,Quantity
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, grp ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM CTE2
    WHERE Quantity IS NOT NULL
)
SELECT
    ProductId
    ,DestinationId
    ,MIN(ScheduledDate) AS StartDate
    ,MAX(ScheduledDate) AS EndDate
    ,SUM(Quantity) AS TotalQuantity
FROM
    CTE_RN
GROUP BY
    ProductId
    ,DestinationId
    ,grp
    ,rn2-rn1
ORDER BY
    ProductID
    ,DestinationId
    ,StartDate
;

Final result

+-----------+---------------+------------+------------+---------------+
| ProductId | DestinationId | StartDate  |  EndDate   | TotalQuantity |
+-----------+---------------+------------+------------+---------------+
|         0 |          1000 | 2018-04-01 | 2018-04-01 |             5 |
|         0 |          1000 | 2018-04-02 | 2018-04-03 |            17 |
|         0 |          1000 | 2018-06-01 | 2018-06-03 |            22 |
|         3 |          5000 | 2018-05-07 | 2018-05-09 |            90 |
|         3 |          5000 | 2018-05-10 | 2018-05-11 |            26 |
|         3 |          5000 | 2018-06-07 | 2018-06-11 |           116 |
+-----------+---------------+------------+------------+---------------+

This variant doesn't output the relevant price (unlike the first variant), because I simplified the "last non-null" query. It wasn't required in the question. In any case, it is pretty easy to add the price if needed.
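One possible way to add the price is sketched here, assuming the same tables and the same CTE_Join and CTE2 as in the final query. Within each (ProductID, DestinationId, grp) partition only the row that came from PriceChange has a non-NULL Price, so MAX(Price) over that partition spreads it to every row. CTE_Filled and EffectivePrice are names introduced for this sketch.

```sql
WITH
CTE_Join
AS
(
    SELECT
        ISNULL(History.ProductId, PriceChange.ProductID) AS ProductID
        ,ISNULL(History.DestinationId, PriceChange.DestinationId) AS DestinationId
        ,ISNULL(History.ScheduledDate, PriceChange.EffectiveDate) AS ScheduledDate
        ,History.Quantity
        ,PriceChange.Price
    FROM
        History
        FULL JOIN PriceChange
            ON  PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate = History.ScheduledDate
)
,CTE2
AS
(
    SELECT
        ProductID, DestinationId, ScheduledDate, Quantity, Price
        ,MAX(CASE WHEN Price IS NOT NULL THEN ScheduledDate END)
            OVER (PARTITION BY ProductID, DestinationId ORDER BY ScheduledDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM CTE_Join
)
,CTE_Filled
AS
(
    -- Fill the price before filtering on Quantity: the priced row
    -- may exist only in PriceChange.
    SELECT
        ProductID, DestinationId, ScheduledDate, Quantity, grp
        ,MAX(Price) OVER (PARTITION BY ProductID, DestinationId, grp) AS EffectivePrice
    FROM CTE2
)
,CTE_RN
AS
(
    SELECT
        ProductID, DestinationId, ScheduledDate, Quantity, grp, EffectivePrice
        ,ROW_NUMBER() OVER (PARTITION BY ProductID, DestinationId, grp ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM CTE_Filled
    WHERE Quantity IS NOT NULL
)
SELECT
    ProductID
    ,DestinationId
    ,MIN(ScheduledDate) AS StartDate
    ,MAX(ScheduledDate) AS EndDate
    ,SUM(Quantity) AS TotalQuantity
    ,MIN(EffectivePrice) AS Price
FROM CTE_RN
GROUP BY ProductID, DestinationId, grp, rn2-rn1
ORDER BY ProductID, DestinationId, StartDate;
```

History rows dated before the first price change get a NULL grp and therefore a NULL Price, which mirrors what OUTER APPLY returns in the first variant.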

I'm not sure that I understand the question correctly, but this is my idea:

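SELECT
    CONCAT_WS(',', view2.StartDate,
        STRING_AGG(view1.SplitDate, ',') WITHIN GROUP (ORDER BY view1.SplitDate),
        view2.EndDate) AS ranges
    ,view2.ProductId
    ,view2.DestinationId
FROM
    (
        -- Distinct price-change dates, kept per product and destination
        -- so that one product's price change cannot split another's island.
        SELECT ProductId, DestinationId, EffectiveDate AS SplitDate
        FROM PriceChange
        GROUP BY ProductId, DestinationId, EffectiveDate
    ) AS view1
    JOIN
    (
        -- Islands of consecutive dates. The rank is partitioned so that
        -- dates of other products do not affect the grouping.
        SELECT MIN(ScheduledDate) AS StartDate, MAX(ScheduledDate) AS EndDate
            ,ProductId, DestinationId, SUM(Quantity) AS TotalQuantity
        FROM
            (
                SELECT ScheduledDate, DestinationId, ProductId, Quantity
                    ,PartitionGroup = DATEADD(DAY, -1 * DENSE_RANK() OVER
                        (PARTITION BY ProductId, DestinationId ORDER BY ScheduledDate),
                        ScheduledDate)
                FROM History
            ) AS tmp
        GROUP BY PartitionGroup, DestinationId, ProductId
    ) AS view2
        ON  view1.ProductId = view2.ProductId
        AND view1.DestinationId = view2.DestinationId
        AND view1.SplitDate >= view2.StartDate
        AND view1.SplitDate <= view2.EndDate
GROUP BY
    view2.StartDate, view2.EndDate, view2.ProductId, view2.DestinationId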

The result from this query will be:

| ranges                                      | productId | DestinationId |
|---------------------------------------------|-----------|---------------|
| 2018-04-01,2018-04-02,2018-04-03            | 0         | 1000          |
| 2018-05-07,2018-05-10,2018-05-11            | 3         | 5000          |

Then, in any procedural language, you can split the string in each row (with the appropriate inclusive or exclusive rule for each boundary) to produce a list of conditions (:from, :to, :productId, :destinationId).

Finally, you can loop through the list of conditions and use UNION ALL to build one query, with one SELECT per condition, that returns the final result. For example,

Select * from History where ScheduledDate >= '2018-04-01' and ScheduledDate <'2018-04-02' and productId = 0 and destinationId = 1000 
union all
Select * from History where ScheduledDate >= '2018-04-02' and ScheduledDate <'2018-04-03' and productId = 0 and destinationId = 1000

----Update--------

Based on the idea above, I made some quick changes to produce your result set. Maybe you can optimize it later.

WITH view3 AS
(
    SELECT
        CONCAT_WS(',', view2.StartDate, STRING_AGG(view1.SplitDate, ','),
            DATEADD(DAY, 1, view2.EndDate)) AS dateRange
        ,view2.ProductId
        ,view2.DestinationId
    FROM
        (
            -- Distinct price-change dates per product and destination
            SELECT ProductId, DestinationId, EffectiveDate AS SplitDate
            FROM PriceChange
            GROUP BY ProductId, DestinationId, EffectiveDate
        ) AS view1
        JOIN
        (
            -- Islands of consecutive dates per product and destination
            SELECT MIN(ScheduledDate) AS StartDate, MAX(ScheduledDate) AS EndDate
                ,ProductId, DestinationId, SUM(Quantity) AS TotalQuantity
            FROM
                (
                    SELECT ScheduledDate, DestinationId, ProductId, Quantity
                        ,PartitionGroup = DATEADD(DAY, -1 * DENSE_RANK() OVER
                            (PARTITION BY ProductId, DestinationId ORDER BY ScheduledDate),
                            ScheduledDate)
                    FROM History
                ) AS tmp
            GROUP BY PartitionGroup, DestinationId, ProductId
        ) AS view2
            ON  view1.ProductId = view2.ProductId
            AND view1.DestinationId = view2.DestinationId
            AND view1.SplitDate >= view2.StartDate
            AND view1.SplitDate <= view2.EndDate
    GROUP BY
        view2.StartDate, view2.EndDate, view2.ProductId, view2.DestinationId
)
,view4 AS
(
    -- One row per boundary date
    SELECT ProductId, DestinationId, value
    FROM view3
        CROSS APPLY STRING_SPLIT(dateRange, ',')
)
,view5 AS
(
    SELECT *
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId ORDER BY value) AS rn
    FROM view4
)
,view6 AS
(
    -- Pair each boundary with the next one to get half-open ranges [fr, t)
    SELECT v52.value AS fr, v51.value AS t, v51.ProductId, v51.DestinationId
    FROM view5 AS v51
        JOIN view5 AS v52
            ON  v51.ProductId = v52.ProductId
            AND v51.DestinationId = v52.DestinationId
            AND v51.rn = v52.rn + 1
)
SELECT
    MIN(h.ScheduledDate) AS StartDate
    ,MAX(h.ScheduledDate) AS EndDate
    ,v6.ProductId
    ,v6.DestinationId
    ,SUM(h.Quantity) AS TotalQuantity
FROM view6 AS v6
    JOIN History AS h
        ON  v6.DestinationId = h.DestinationId
        AND v6.ProductId = h.ProductId
        AND h.ScheduledDate >= v6.fr
        AND h.ScheduledDate < v6.t
GROUP BY v6.fr, v6.t, v6.ProductId, v6.DestinationId

And the result is exactly the same as the one you gave.

| StartDate  | EndDate    | productId | destinationId | TotalQuantity |
|------------|------------|-----------|---------------|---------------|
| 2018-04-01 | 2018-04-01 | 0         | 1000          | 5             |
| 2018-04-02 | 2018-04-03 | 0         | 1000          | 17            |
| 2018-05-07 | 2018-05-09 | 3         | 5000          | 90            |
| 2018-05-10 | 2018-05-11 | 3         | 5000          | 26            |

The straightforward method is to fetch the effective price for each row of History and then generate gaps and islands taking the price into account.

It is not clear from the question what role DestinationID plays, and the sample data is of no help here. I'll assume that we need to join and partition on both ProductID and DestinationID.

The following query returns the effective Price for each row from History. You need to add an index to the PriceChange table

CREATE NONCLUSTERED INDEX [IX] ON [dbo].[PriceChange]
(
    [ProductId] ASC,
    [DestinationId] ASC,
    [EffectiveDate] DESC
)
INCLUDE ([Price])

for this query to work efficiently.

Query for Prices

SELECT
    History.ProductId
    ,History.DestinationId
    ,History.ScheduledDate
    ,History.Quantity
    ,A.Price
FROM
    History
    OUTER APPLY
    (
        SELECT TOP(1)
            PriceChange.Price
        FROM
            PriceChange
        WHERE
            PriceChange.ProductID = History.ProductID
            AND PriceChange.DestinationId = History.DestinationId
            AND PriceChange.EffectiveDate <= History.ScheduledDate
        ORDER BY
            PriceChange.EffectiveDate DESC
    ) AS A
ORDER BY ProductID, ScheduledDate;

For each row from History there will be one seek in this index to pick the correct price.

This query returns:

Prices

+-----------+---------------+---------------+----------+-------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price |
+-----------+---------------+---------------+----------+-------+
|         0 |          1000 | 2018-04-01    |        5 |     1 |
|         0 |          1000 | 2018-04-02    |       10 |     2 |
|         0 |          1000 | 2018-04-03    |        7 |     2 |
|         3 |          5000 | 2018-05-07    |       15 |     5 |
|         3 |          5000 | 2018-05-08    |       23 |     5 |
|         3 |          5000 | 2018-05-09    |       52 |     5 |
|         3 |          5000 | 2018-05-10    |       12 |    20 |
|         3 |          5000 | 2018-05-11    |       14 |    20 |
+-----------+---------------+---------------+----------+-------+

Now a standard gaps-and-islands step collapses consecutive days with the same price together. I use a difference of two row number sequences here.

I've added some more rows to your sample data to see the gaps within the same ProductId.

INSERT INTO History (ProductId, DestinationId, ScheduledDate, Quantity)
VALUES
  (0, 1000, '20180601', 5),
  (0, 1000, '20180602', 10),
  (0, 1000, '20180603', 7),
  (3, 5000, '20180607', 15),
  (3, 5000, '20180608', 23),
  (3, 5000, '20180609', 52),
  (3, 5000, '20180610', 12),
  (3, 5000, '20180611', 14);

If you run this intermediate query you'll see how it works:

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT *
    ,rn2-rn1 AS Diff
FROM CTE_rn

Intermediate result

+-----------+---------------+---------------+----------+-------+-----+------+------+
| ProductId | DestinationId | ScheduledDate | Quantity | Price | rn1 | rn2  | Diff |
+-----------+---------------+---------------+----------+-------+-----+------+------+
|         0 |          1000 | 2018-04-01    |        5 |     1 |   1 | 6665 | 6664 |
|         0 |          1000 | 2018-04-02    |       10 |     2 |   1 | 6666 | 6665 |
|         0 |          1000 | 2018-04-03    |        7 |     2 |   2 | 6667 | 6665 |
|         0 |          1000 | 2018-06-01    |        5 |     2 |   3 | 6726 | 6723 |
|         0 |          1000 | 2018-06-02    |       10 |     2 |   4 | 6727 | 6723 |
|         0 |          1000 | 2018-06-03    |        7 |     2 |   5 | 6728 | 6723 |
|         3 |          5000 | 2018-05-07    |       15 |     5 |   1 | 6701 | 6700 |
|         3 |          5000 | 2018-05-08    |       23 |     5 |   2 | 6702 | 6700 |
|         3 |          5000 | 2018-05-09    |       52 |     5 |   3 | 6703 | 6700 |
|         3 |          5000 | 2018-05-10    |       12 |    20 |   1 | 6704 | 6703 |
|         3 |          5000 | 2018-05-11    |       14 |    20 |   2 | 6705 | 6703 |
|         3 |          5000 | 2018-06-07    |       15 |    20 |   3 | 6732 | 6729 |
|         3 |          5000 | 2018-06-08    |       23 |    20 |   4 | 6733 | 6729 |
|         3 |          5000 | 2018-06-09    |       52 |    20 |   5 | 6734 | 6729 |
|         3 |          5000 | 2018-06-10    |       12 |    20 |   6 | 6735 | 6729 |
|         3 |          5000 | 2018-06-11    |       14 |    20 |   7 | 6736 | 6729 |
+-----------+---------------+---------------+----------+-------+-----+------+------+

Now simply group by the Diff to get one row per interval.

Final query

WITH
CTE_Prices
AS
(
    SELECT
        History.ProductId
        ,History.DestinationId
        ,History.ScheduledDate
        ,History.Quantity
        ,A.Price
    FROM
        History
        OUTER APPLY
        (
            SELECT TOP(1)
                PriceChange.Price
            FROM
                PriceChange
            WHERE
                PriceChange.ProductID = History.ProductID
                AND PriceChange.DestinationId = History.DestinationId
                AND PriceChange.EffectiveDate <= History.ScheduledDate
            ORDER BY
                PriceChange.EffectiveDate DESC
        ) AS A
)
,CTE_rn
AS
(
    SELECT
        ProductId
        ,DestinationId
        ,ScheduledDate
        ,Quantity
        ,Price
        ,ROW_NUMBER() OVER (PARTITION BY ProductId, DestinationId, Price ORDER BY ScheduledDate) AS rn1
        ,DATEDIFF(day, '20000101', ScheduledDate) AS rn2
    FROM
        CTE_Prices
)
SELECT
    ProductId
    ,DestinationId
    ,MIN(ScheduledDate) AS StartDate
    ,MAX(ScheduledDate) AS EndDate
    ,SUM(Quantity) AS TotalQuantity
    ,Price
FROM
    CTE_rn
GROUP BY
    ProductId
    ,DestinationId
    ,Price
    ,rn2-rn1
ORDER BY
    ProductID
    ,DestinationId
    ,StartDate
;

Final result

+-----------+---------------+------------+------------+---------------+-------+
| ProductId | DestinationId | StartDate  |  EndDate   | TotalQuantity | Price |
+-----------+---------------+------------+------------+---------------+-------+
|         0 |          1000 | 2018-04-01 | 2018-04-01 |             5 |     1 |
|         0 |          1000 | 2018-04-02 | 2018-04-03 |            17 |     2 |
|         0 |          1000 | 2018-06-01 | 2018-06-03 |            22 |     2 |
|         3 |          5000 | 2018-05-07 | 2018-05-09 |            90 |     5 |
|         3 |          5000 | 2018-05-10 | 2018-05-11 |            26 |    20 |
|         3 |          5000 | 2018-06-07 | 2018-06-11 |           116 |    20 |
+-----------+---------------+------------+------------+---------------+-------+
