Are RANK() and DENSE_RANK() deterministic or non-deterministic?
According to official Microsoft BOL DENSE_RANK is nondeterministic (RANK()). But according to Ranking Functions by Itzik Ben-Gan "... the RANK() and DENSE_RANK() functions are always deterministic". Who is right?
They are both right, because they are using different senses of the word "deterministic".
From the SQL Server optimizer's point of view, "deterministic" has a very precise meaning; a meaning that existed before window and ranking functions were added to the product. To the optimizer, the "deterministic" property defines whether a function can be freely duplicated within its internal tree structures during optimization. This is not legal for a non-deterministic function.
Deterministic here means: the exact instance of the function always returns the same output for the same input, no matter how many times it is called. This is never true for windowing functions, by definition, because as a (single-row) scalar function, they do not return the same result within a row or across rows. To state it simply, using ROW_NUMBER
as an example:
The
ROW_NUMBER
function returns different values for different rows (by definition!), so for optimization purposes it is nondeterministic
This is the sense BOL is using.
Itzik is making a different point about the determinism of the result as a whole. Over an ordered input set (with suitable tie-breaking) the output is a "deterministic" sequence. That is a valid observation, but it is not the "deterministic" quality that is important during query optimization.
NTILE()
is an interesting case; it seems to apply after sorting (which, in the case of a tie, is left to SQL Server's own devices, and this is usually driven by the most efficient choice of index for sorting purposes). You can make this deterministic by not forcing SQL Server to make an arbitrary choice here - add one or more tie-breakers to the OVER()
clause:
OVER (ORDER BY Salary, Employee)
Essentially you need to make the sorting unique. If you have employees with the same name, you may have to choose a different tie-breaker column or keep adding columns until there really can be no ties.
For RANK()
and DENSE_RANK()
, ties are actually a crucial reason that you can't get different values. Try not to confuse determinism of the output of the function with determinism of the order of the results. If your queries don't have ORDER BY
, then what is not deterministic about this?
1 1 Sue Right
1 1 Robin Page
1 1 Phil Factor
1 1 Phil Factor
1 1 Sue Right
1 1 Robin Page
RANK()
and DENSE_RANK()
applied the same values in both cases, SQL Server just returned the results to you in a different order. This has nothing to do with expecting the same output from RANK()
or DENSE_RANK()
given the same input - this is just about assuming or expecting some deterministic order when you've told SQL Server (by omitting an ORDER BY
clause) that you don't care about the order of the results. See #3 here:
- T-SQL Tuesday #56 : SQL Server Assumptions
Syntax:
WindowFunction() OVER (PARTITION BY <some expressions> -- partition list
ORDER BY <some other expressions>) -- order list
Both functions, RANK()
and DENSE_RANK()
, by their definitions, are guaranteed to produce the same results as long as the expressions in the OVER
clause are deterministic themselves. And that is what Itzik Ben-Gun meant in his article. These lists are most often just columns of the tables involved.
So, while the functions is general are not deterministic, their implementation could have taken care to distinguish the two cases and consider them deterministic or not, upon examining the partition and order lists.
My wild guess, is that the SQL-Server developers decided it was easier to implement them as always "non-deterministic" despite that this contradicts in a way their definition of deterministic functions. So, they are stated as non-deterministic in MSDN because in the current implementation, the engine considers them always as non-deterministic.
One more argument is that the other two window functions, ROW_NUMBER()
and NTILE()
, are even more complicated because for them having identical output, the expression in the partition and order by lists not only have to be deterministic but unique as well. So, implementing all those details is far from trivial.
I will not comment on the order of the result sets, as this has nothing to do with the determinism, as Aaron Bertrand has clearly explained in his answer.