How to get the last not-null value in an ordered column of a huge table?
A common solution to this type of problem is given by Itzik Ben-Gan in his article The Last non NULL Puzzle:
DROP TABLE IF EXISTS dbo.Example;

CREATE TABLE dbo.Example
(
    id integer PRIMARY KEY,
    val integer NULL
);

INSERT dbo.Example
    (id, val)
VALUES
    (1, 136),
    (2, NULL),
    (3, 650),
    (4, NULL),
    (5, NULL),
    (6, NULL),
    (7, 954),
    (8, NULL),
    (9, 104),
    (10, NULL);

SELECT
    E.id,
    E.val,
    lastval =
        CAST(
            SUBSTRING(
                -- id and val are concatenated as fixed-width binary; a NULL val
                -- makes the whole concatenation NULL, so the running MAX keeps
                -- the pair from the latest row that had a non-NULL val
                MAX(CAST(E.id AS binary(4)) + CAST(E.val AS binary(4))) OVER (
                    ORDER BY E.id
                    ROWS UNBOUNDED PRECEDING),
                5, 4)   -- bytes 5-8 are that row's val
            AS integer)
FROM dbo.Example AS E
ORDER BY
    E.id;
Demo: db<>fiddle
The question says: "I expected T-SQL to optimize it out - on a block/record level, the task to do is very easy and linear, essentially a for loop (O(n))."
That's not the query that you wrote. It may not even be equivalent to it, depending on otherwise minor details of the table schema. You're expecting too much from the query optimizer.
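For reference, the linear carry-forward pass the question describes would look something like the following cursor sketch over the dbo.Example table. This is purely illustrative of the algorithm being asked for, not of how SQL Server evaluates the window query above:

DECLARE @id integer, @val integer, @lastval integer;

DECLARE c CURSOR FAST_FORWARD FOR
    SELECT id, val
    FROM dbo.Example
    ORDER BY id;

OPEN c;
FETCH NEXT FROM c INTO @id, @val;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One ordered pass: remember the last non-NULL value seen so far
    IF @val IS NOT NULL
        SET @lastval = @val;

    PRINT CONCAT(@id, ' -> ', @lastval);

    FETCH NEXT FROM c INTO @id, @val;
END;

CLOSE c;
DEALLOCATE c;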
With the right indexing you can get the algorithm that you seek through the following T-SQL:
SELECT t1.id, ca.[VALUE]
FROM dbo.[BIG_TABLE(FOR_U)] t1
CROSS APPLY (
    SELECT TOP (1) [VALUE]
    FROM dbo.[BIG_TABLE(FOR_U)] t2
    WHERE t2.ID <= t1.ID AND t2.[VALUE] IS NOT NULL
    ORDER BY t2.ID DESC
) ca; --ORDER BY t1.ID ASC
For each row, the query processor traverses the index backwards and stops when it finds a row with a non-NULL [VALUE]. On my machine this finishes in about 90 seconds for 100 million rows in the source table. The query runs longer than necessary because some time is wasted on the client discarding all of those rows.
It's not clear to me whether you need ordered results or what you plan to do with such a large result set. The query can be adjusted to meet the actual scenario. The biggest advantage of this approach is that it does not require a sort in the query plan, which helps for larger result sets. One disadvantage is that performance will not be optimal if there are a lot of NULLs in the table, because many rows will be read from the index and discarded. You should be able to improve performance with a filtered index that excludes NULLs for that case.
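For example, a filtered index along these lines (the index name is illustrative, not from the original answer) keeps only the non-NULL rows, so each backwards seek in the CROSS APPLY should stop after reading a single row:

CREATE NONCLUSTERED INDEX IX_VALUE_NOT_NULL   -- illustrative name
ON dbo.[BIG_TABLE(FOR_U)] (ID)
INCLUDE ([VALUE])
WHERE [VALUE] IS NOT NULL;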
Sample data for the test:
DROP TABLE IF EXISTS #t;

CREATE TABLE #t (
    ID BIGINT NOT NULL
);

-- 10,000 sequential integers (0-9999)
INSERT INTO #t WITH (TABLOCK)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

DROP TABLE IF EXISTS dbo.[BIG_TABLE(FOR_U)];

CREATE TABLE dbo.[BIG_TABLE(FOR_U)] (
    ID BIGINT NOT NULL,
    [VALUE] BIGINT NULL
);

-- 100 million rows; roughly one row in three gets a non-NULL [VALUE]
INSERT INTO dbo.[BIG_TABLE(FOR_U)] WITH (TABLOCK)
SELECT 10000 * t1.ID + t2.ID,
       CASE WHEN (t1.ID + t2.ID) % 3 = 1 THEN t2.ID ELSE NULL END
FROM #t t1
CROSS JOIN #t t2;

CREATE UNIQUE CLUSTERED INDEX ADD_ORDERING ON dbo.[BIG_TABLE(FOR_U)] (ID);
One method, using MAX() and COUNT() with OVER(), based on this source, could be:
SELECT ID, MAX(value) OVER (PARTITION BY Value2) AS value
FROM
(
    SELECT ID, value,
           COUNT(value) OVER (ORDER BY ID) AS Value2
    FROM dbo.HugeTable
) a
ORDER BY ID;
Result:

Id  UpdatedValue
 1  136
 2  136
 3  650
 4  650
 5  650
 6  650
 7  954
 8  954
 9  104
10  104
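To see why the running COUNT forms the right groups, here are the intermediate values the inner query would produce, assuming dbo.HugeTable holds the ten sample rows from dbo.Example above. COUNT(value) ignores NULLs, so Value2 only increments on a non-NULL row:

Id  value  Value2
 1  136    1
 2  NULL   1
 3  650    2
 4  NULL   2
 5  NULL   2
 6  NULL   2
 7  954    3
 8  NULL   3
 9  104    4
10  NULL   4

MAX(value) OVER (PARTITION BY Value2) then returns the single non-NULL value in each group, giving the result shown above.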
Another method, based on this source and closely related to the first example:
;WITH CTE AS
(
    SELECT value,
           Id,
           COUNT(value) OVER (ORDER BY Id) AS Value2
    FROM dbo.HugeTable
),
CTE2 AS
(
    SELECT Id,
           value,
           -- The first row of each Value2 group is the non-NULL row that
           -- incremented the count, so FIRST_VALUE carries it across the group
           FIRST_VALUE(value) OVER (PARTITION BY Value2 ORDER BY Id) AS UpdatedValue
    FROM CTE
)
SELECT Id, UpdatedValue
FROM CTE2;