Most Efficient Way to Retrieve Date Ranges
This is a hard problem to solve in general, but there are a couple of things we can do to help the optimizer choose a plan. This script creates a table with 10,000 rows with a known pseudo-random distribution of rows to illustrate:
CREATE TABLE dbo.SomeDateTable
(
Id INTEGER IDENTITY(1, 1) PRIMARY KEY NOT NULL,
StartDate DATETIME NOT NULL,
EndDate DATETIME NOT NULL
);
GO
SET STATISTICS XML OFF
SET NOCOUNT ON;
DECLARE
@i INTEGER = 1,
@s FLOAT = RAND(20120104),
@e FLOAT = RAND();
WHILE @i <= 10000
BEGIN
INSERT dbo.SomeDateTable
(
StartDate,
EndDate
)
VALUES
(
DATEADD(DAY, @s * 365, {d '2009-01-01'}),
DATEADD(DAY, @s * 365 + @e * 14, {d '2009-01-01'})
)
SELECT
@s = RAND(),
@e = RAND(),
@i += 1
END
The first question is how to index this table. One option is to provide two indexes on the DATETIME
columns, so the optimizer can at least choose whether to seek on StartDate
or EndDate
.
CREATE INDEX nc1 ON dbo.SomeDateTable (StartDate, EndDate)
CREATE INDEX nc2 ON dbo.SomeDateTable (EndDate, StartDate)
Naturally, the inequalities on both StartDate
and EndDate
mean that only one column in each index can support a seek in the example query, but this is about the best we can do. We might consider making the second column in each index an INCLUDE
rather than a key, but we might have other queries that can perform an equality seek on the leading column and an inequality seek on the second column. Also, we may get better statistics this way. Anyway...
DECLARE
@StartDateBegin DATETIME = {d '2009-08-01'},
@StartDateEnd DATETIME = {d '2009-10-15'},
@EndDateBegin DATETIME = {d '2009-08-05'},
@EndDateEnd DATETIME = {d '2009-10-22'}
SELECT
COUNT_BIG(*)
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.StartDate BETWEEN @StartDateBegin AND @StartDateEnd
AND sdt.EndDate BETWEEN @EndDateBegin AND @EndDateEnd
This query uses variables, so in general the optimizer will guess at selectivity and distribution, resulting in a guessed cardinality estimate of 81 rows. In fact, the query produces 2076 rows, a discrepancy that might be important in a more complex example.
On SQL Server 2008 SP1 CU5 or later (or R2 RTM CU1) we can take advantage of the Parameter Embedding Optimization to get better estimates, simply by adding OPTION (RECOMPILE)
to the SELECT
query above. This causes a compilation just before the batch executes, allowing SQL Server to 'see' the real parameter values and optimize for those. With this change, the estimate improves to 468 rows (though you do need to check the runtime plan to see this). This estimate is better than 81 rows, but still not all that close. The modelling extensions enabled by trace flag 2301 may help in some cases, but not with this query.
The problem is where the rows qualified by the two range searches overlap. One of the simplifying assumptions made in the optimizer's costing and cardinality estimation component is that predicates are independent (so if both have a selectivity of 50%, the result of applying both is assumed to qualify 50% of 50% = 25% of the rows). Where this sort of correlation is a problem, we can often work around it with multi-column and/or filtered statistics. With two ranges with unknown start and end points, this becomes impractical. This is where we sometimes have to resort to rewriting the query to a form that happens to produce a better estimate:
SELECT COUNT(*) FROM
(
SELECT
sdt.Id
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.StartDate BETWEEN @StartDateBegin AND @StartDateEnd
INTERSECT
SELECT
sdt.Id
FROM dbo.SomeDateTable AS sdt
WHERE
sdt.EndDate BETWEEN @EndDateBegin AND @EndDateEnd
) AS intersected (id)
OPTION (RECOMPILE)
This form happens to produce a runtime estimate of 2110 rows (versus 2076 actual). Unless you have TF 2301 on, in which case the more advanced modelling techniques see through the trick and produce exactly the same estimate as before: 468 rows.
One day SQL Server might gain native support for intervals. If that comes with good statistical support, developers might dread tuning query plans like this a little less.
I do not know a solution that is fast for all data distributions, but if all your ranges are short, we can usually speed it up. If, for instance, ranges are shorter than one day, instead of this query:
SELECT TaskId ,
TaskDescription ,
StartedAt ,
FinishedAt
FROM dbo.Tasks
WHERE '20101203' BETWEEN StartedAt AND FinishedAt
we can add one more condition:
SELECT TaskId ,
TaskDescription ,
StartedAt ,
FinishedAt
FROM dbo.Tasks
WHERE '20101203' BETWEEN StartedAt AND FinishedAt
AND StartedAt >= '20101202'
AND FinishedAt <= '20101204' ;
As a result, instead of scanning the whole table the query will scan only two days' range, which is faster. If ranges may be longer, we can store them as sequences of shorter ones. Details here: Tuning SQL Queries with the Help of Constraints