SARGable WHERE clause for two date columns
Just adding this quickly so it exists as an answer (though I know it's not the answer you want).
An indexed computed column is usually the right solution for this type of problem.
It:
- makes the predicate an indexable expression
- allows automatic statistics to be created for better cardinality estimation
- does not need to take any space in the base table
To be clear on that last point, the computed column is not required to be persisted in this case:
-- Note: not PERSISTED, metadata change only
ALTER TABLE #sargme
ADD DayDiff AS DATEDIFF(DAY, DateCol1, DateCol2);
-- Index the expression
CREATE NONCLUSTERED INDEX index_name
ON #sargme (DayDiff)
INCLUDE (DateCol1, DateCol2);
Now the query:
SELECT
S.ID,
S.DateCol1,
S.DateCol2,
DATEDIFF(DAY, S.DateCol1, S.DateCol2)
FROM
#sargme AS S
WHERE
DATEDIFF(DAY, S.DateCol1, S.DateCol2) >= 48;
...gives the following trivial plan:
As Martin Smith said, if you have connections using the wrong set options, you could create a regular column and maintain the computed value using triggers.
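For the trigger route, a minimal sketch could look like the following. It assumes a permanent table dbo.sargme with an ID primary key (triggers cannot be created on temp tables like #sargme); the trigger name and the regular DayDiff column are hypothetical additions, not part of the original setup:

```sql
-- Hypothetical sketch: maintain a regular DayDiff column with a trigger,
-- for sessions whose SET options prevent using an indexed computed column.
ALTER TABLE dbo.sargme ADD DayDiff integer NULL;
GO
CREATE TRIGGER trg_sargme_DayDiff
ON dbo.sargme
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recompute the stored difference for affected rows only
    UPDATE S
    SET DayDiff = DATEDIFF(DAY, S.DateCol1, S.DateCol2)
    FROM dbo.sargme AS S
    INNER JOIN inserted AS I
        ON I.ID = S.ID;
END;
```

The regular column can then be indexed exactly like the computed column above, at the cost of storing the value and maintaining it on every write.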
All this only really matters (code challenge aside) if there's a real problem to solve, of course, as Aaron says in his answer.
This is fun to think about, but I don't know any way to achieve what you want reasonably given the constraints in the question. It seems like any optimal solution would require a new data structure of some type; the closest we have being the 'function index' approximation provided by an index on a non-persisted computed column as above.
Risking ridicule from some of the biggest names in the SQL Server community, I'm going to stick my neck out and say, nope.
In order for your query to be SARGable, you'd basically have to construct a query that can pinpoint a starting row in a range of consecutive rows in an index. With the index ix_dates, the rows are not ordered by the date difference between DateCol1 and DateCol2, so your target rows could be spread out anywhere in the index.
Self-joins, multiple passes, and so on all have one thing in common: they include at least one Index Scan, although a (nested loop) join may well use an Index Seek on the inner side. But I can't see how it would be possible to eliminate the Scan.
As for getting more accurate row estimates, there are no statistics on the date difference.
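There is no supported way to create statistics directly on an expression, but with the computed column from the earlier example in place, statistics can be created on it explicitly. A sketch, assuming the DayDiff column defined above (the statistics name is arbitrary):

```sql
-- Assumes the non-persisted DayDiff computed column already exists.
CREATE STATISTICS stat_daydiff
ON #sargme (DayDiff)
WITH FULLSCAN;
```

Indexing the computed column, as shown earlier, creates equivalent statistics automatically.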
The following, fairly ugly recursive CTE construct does technically eliminate scanning the whole table, although it introduces a Nested Loop Join and a (potentially very large) number of Index Seeks.
DECLARE @from date, @count int;

SELECT TOP (1) @from = DateCol1
FROM #sargme
ORDER BY DateCol1;

SELECT TOP (1) @count = DATEDIFF(day, @from, DateCol1)
FROM #sargme
WHERE DateCol1 <= DATEADD(day, -48, {d '9999-12-31'})
ORDER BY DateCol1 DESC;

WITH cte AS (
    SELECT 0 AS i
    UNION ALL
    SELECT i + 1 FROM cte WHERE i < @count
)
SELECT b.*
FROM cte AS a
INNER JOIN #sargme AS b ON
    b.DateCol1 = DATEADD(day, a.i, @from) AND
    b.DateCol2 >= DATEADD(day, 48 + a.i, @from)
OPTION (MAXRECURSION 0);
It creates an Index Spool containing every DateCol1 in the table, then performs an Index Seek (range scan) for each of those DateCol1 values, looking for DateCol2 values that are at least 48 days forward.
More I/Os, slightly longer execution time, a row estimate that is still way off, and zero chance of parallelism because of the recursion: I'm guessing this query could possibly be useful if you have a very large number of rows within relatively few distinct, consecutive DateCol1 values (keeping the number of Seeks down).
I tried a bunch of wacky variations, but didn't find any version better than one of yours. The main problem is how date1 and date2 sort together in your index: the first column forms a neat, stepped line, while the gap between the two columns is very jagged. You want this to look more like a funnel than the way it really will:
Date1    Date2
-----    -------
*             *
*          *
*                *
*           *
*                  *
*            *
*              *
*        *
There's not really any way I can think of to make that seekable for a certain delta (or range of deltas) between the two points. And I mean a single seek that's executed once plus a range scan, not a seek that's executed for every row. That will involve a scan and/or a sort at some point, and these are obviously things you want to avoid. It's too bad you can't use expressions like DATEADD/DATEDIFF in filtered indexes, or perform any possible schema modifications that would allow a sort on the product of the date diff (like calculating the delta at insert/update time). As it is, this seems to be one of those cases where a scan is actually the optimal retrieval method.
You said that this query was no fun, but if you look closer, this is by far the best one (and would be even better if you left out the compute scalar output):
SELECT
* ,
DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
FROM
[#sargme] AS [s]
WHERE
[s].[DateCol2] >= DATEADD(DAY, 48, [s].[DateCol1])
The reason is that avoiding the DATEDIFF potentially shaves some CPU compared to a calculation against only the non-leading key column in the index, and also avoids some nasty implicit conversions to datetimeoffset(7) (don't ask me why those are there, but they are). Here is the DATEDIFF version:
<Predicate>
<ScalarOperator ScalarString="datediff(day,CONVERT_IMPLICIT(datetimeoffset(7),[splunge].[dbo].[sargme].[DateCol1] as [s].[DateCol1],0),CONVERT_IMPLICIT(datetimeoffset(7),[splunge].[dbo].[sargme].[DateCol2] as [s].[DateCol2],0))>=(48)">
And here's the one without DATEDIFF:
<Predicate>
<ScalarOperator ScalarString="[splunge].[dbo].[sargme].[DateCol2] as [s].[DateCol2]>=dateadd(day,(48),[splunge].[dbo].[sargme].[DateCol1] as [s].[DateCol1])">
Also, I found slightly better results in terms of duration when I changed the index to have DateCol2 only as an included column (and when both indexes were present, SQL Server always chose the one with one key and one include column over the multi-key one). For this query, since we have to scan all rows to find the range anyway, there is no benefit to having the second date column as part of the key and sorted in any way. And while I know we can't get a seek here, there is something inherently good-feeling about not hindering the ability to get one by forcing calculations against the leading key column, and only performing them against secondary or included columns.
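The narrower index described above might look like this (the index name is assumed; the original ix_dates presumably carried both date columns in the key):

```sql
-- One key column, one INCLUDE column, as described above.
CREATE NONCLUSTERED INDEX ix_datecol1
ON #sargme (DateCol1)
INCLUDE (DateCol2);
```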
If it were me, and I gave up on finding the sargable solution, I know which one I would choose - the one that makes SQL Server do the least amount of work (even if the delta is almost nonexistent). Or better yet I would relax my restrictions about schema change and the like.
And how much does all of that matter? I don't know. I made the table 10 million rows, and all of the above query variations still completed in under a second. And this is on a VM on a laptop (granted, with an SSD).
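For anyone wanting to reproduce the timings, a sketch of one way to build such a 10-million-row test table follows. The table and column names match the examples above, but the date distribution here is arbitrary and not the one used for the measurements in this answer:

```sql
-- Hypothetical data generator: 10 million rows of paired dates.
-- Cross-joining sys.all_columns is a common trick for a large row source.
WITH N AS (
    SELECT TOP (10000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
    CROSS JOIN sys.all_columns AS c
)
SELECT
    n AS ID,
    DATEADD(DAY, n % 1000, CONVERT(date, '2000-01-01')) AS DateCol1,
    -- DateCol2 trails DateCol1 by an irregular 0-99 day gap
    DATEADD(DAY, (n % 1000) + (n % 100), CONVERT(date, '2000-01-01')) AS DateCol2
INTO #sargme
FROM N;
```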