Forcing Flow Distinct
The SQL Server optimizer cannot produce the execution plan you are after with the guarantee you need, because the Hash Match Flow Distinct operator is not order-preserving.
Though, I'm not sure (and suspect not) if this truly guarantees ordering.
You may observe order preservation in many cases, but this is an implementation detail; there is no guarantee, so you cannot rely on it. As always, presentation order can only be guaranteed by a top-level ORDER BY
clause.
Example
The script below shows that Hash Match Flow Distinct does not preserve order. It sets up the table in question with matching numbers 1-50,000 in both columns:
IF OBJECT_ID(N'dbo.Updates', N'U') IS NOT NULL
DROP TABLE dbo.Updates;
GO
CREATE TABLE Updates
(
UpdateId INT NOT NULL IDENTITY(1,1),
ObjectId INT NOT NULL,
CONSTRAINT PK_Updates_UpdateId PRIMARY KEY (UpdateId)
);
GO
INSERT dbo.Updates (ObjectId)
SELECT TOP (50000)
ObjectId =
ROW_NUMBER() OVER (
ORDER BY C1.[object_id])
FROM sys.columns AS C1
CROSS JOIN sys.columns AS C2
ORDER BY
ObjectId;
The test query is:
DECLARE @Rows bigint = 50000;
-- Optimized for 1 row, but will be 50,000 when executed
SELECT DISTINCT TOP (@Rows)
U.ObjectId
FROM dbo.Updates AS U
WHERE
U.UpdateId > 0
OPTION (OPTIMIZE FOR (@Rows = 1));
The estimated plan shows an index seek and flow distinct:
The output certainly seems ordered to start with:
...but further down values start to go 'missing':
...and eventually:
The explanation in this particular case, is that the hash operator spills:
Once a partition spills, all rows that hash to the same partition also spill. Spilled partitions are processed later, breaking the expectation that distinct values encountered will be emitted immediately in the sequence they are received.
There are many ways to write an efficient query to produce the ordered result you want, such as recursion or using a cursor. However, it cannot be done using Hash Match Flow Distinct.
I'm unsatisfied with this answer because I couldn't manage to get a flow distinct operator along with results that were guaranteed to be correct. However, I have an alternative which should get good performance along with correct results. Unfortunately it requires that a nonclustered index be created on the table.
I approached this problem by trying to think of a combination of columns that I could ORDER BY
and get the correct results after applying DISTINCT
to them. The minimum value of UpdateId
per ObjectId
along with ObjectId
is one such combination. However, directly asking for the minimum UpdateId
seems to result in reading all rows from the table. Instead, we can indirectly ask for the minimum value of UpdateId
with another join to the table. The idea is to scan the Updates
table in order, throw out any rows for which UpdateId
isn't the minimum value for that row's ObjectId
, and keep the first 100 rows. Based on your description of the data distribution we shouldn't need to throw out very many rows.
For data prep, I put 1 million rows into a table with 2 rows for each distinct ObjectId:
INSERT INTO Updates WITH (TABLOCK)
SELECT t.RN / 2
FROM
(
SELECT TOP 1000000 -1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) t;
CREATE INDEX IX On Updates (Objectid, UpdateId);
The nonclustered index on Objectid
and UpdateId
is important. It allows us to efficiently throw out rows that don't have the minimum UpdateId
per Objectid
. There are many ways to write a query that matches the description above. Here's one such way using NOT EXISTS
:
DECLARE @fromUpdateId INT = 9999;
SELECT ObjectId
FROM (
SELECT DISTINCT TOP 100 u1.UpdateId, u1.ObjectId
FROM Updates u1
WHERE UpdateId > @fromUpdateId
AND NOT EXISTS (
SELECT 1
FROM Updates u2
WHERE u2.UpdateId > @fromUpdateId
AND u1.ObjectId = u2.ObjectId
AND u2.UpdateId < u1.UpdateId
)
ORDER BY u1.UpdateId, u1.ObjectId
) t;
Here's a picture of the query plan:
In the best case SQL Server will only do 100 index seeks against the nonclustered index. To simulate getting very unlucky I changed the query to return the first 5000 rows to the client. That resulted in 9999 index seeks, so it's like getting an average of 100 rows per distinct ObjectId
. Here is the output from SET STATISTICS IO, TIME ON
:
Table 'Updates'. Scan count 10000, logical reads 31900, physical reads 0
SQL Server Execution Times: CPU time = 31 ms, elapsed time = 42 ms.
I love the question - Flow Distinct is one of my favourite operators.
Now, the guarantee is the problem. When you think about the FD operator pulling rows from the Seek operator in an ordered fashion, producing each row as it determines it to be unique, this will give you the rows in the right order. But it's hard to know whether there might be some scenarios where the FD doesn't handle a single row at a time.
Theoretically, the FD could request 100 rows from the Seek, and produce them in whatever order it needs them.
The query hints OPTION (FAST 1, MAXDOP 1)
could help, because it'll avoid getting more rows than it needs from the Seek operator. Is it a guarantee though? Not quite. It could still decide to pull a page of rows at a time, or something like that.
I think with OPTION (FAST 1, MAXDOP 1)
, your OFFSET
version would give you a lot of confidence about the order, but it's not a guarantee.