Improve performance of query using IN()
Given tables of the following general form:
CREATE TABLE Device
(
ID integer PRIMARY KEY
);
CREATE TABLE EventType
(
ID integer PRIMARY KEY,
Name nvarchar(50) NOT NULL
);
CREATE TABLE [Event]
(
ID integer PRIMARY KEY,
[TimeStamp] datetime NOT NULL,
EventTypeID integer NOT NULL REFERENCES EventType,
DeviceID integer NOT NULL REFERENCES Device
);
The following index is useful:
CREATE INDEX f1
ON [Event] ([TimeStamp], EventTypeID)
INCLUDE (DeviceID)
WHERE EventTypeID IN (2, 5, 7, 8, 9, 14);
For the query:
SELECT
[Event].ID,
[Event].[TimeStamp],
EventType.Name,
Device.ID
FROM
[Event]
INNER JOIN EventType ON EventType.ID = [Event].EventTypeID
INNER JOIN Device ON Device.ID = [Event].DeviceID
WHERE
[Event].[TimeStamp] BETWEEN '2011-01-28' AND '2011-01-29'
    AND [Event].EventTypeID IN (2, 5, 7, 8, 9, 14);
The index's filter matches the query's AND EventTypeID IN (...) predicate; the first key of the index allows a seek on [TimeStamp] for the filtered EventTypeIDs; and including the DeviceID column makes the index covering (DeviceID is required for the join to the Device table).
The second key of the index, EventTypeID, is not strictly required (it could also be an INCLUDEd column); I have included it in the key for the reasons stated here. In general, I advise people to at least INCLUDE the columns from a filtered index's WHERE clause.
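For illustration, the alternative shape mentioned above, with EventTypeID moved out of the key and into the INCLUDE list, would look like this (the index name f2 is just a placeholder):

CREATE INDEX f2
ON [Event] ([TimeStamp])
INCLUDE (EventTypeID, DeviceID)
WHERE EventTypeID IN (2, 5, 7, 8, 9, 14);

Both shapes satisfy the same query; keeping EventTypeID in the key simply gives the optimizer more options.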
Based on the updated query and execution plan in the question, I agree that the more general index suggested by SSMS is likely the better choice here, unless the list of filtered EventTypeIDs is static, as Aaron also mentions in his answer:
CREATE TABLE Device
(
ID integer PRIMARY KEY,
Name nvarchar(50) NOT NULL UNIQUE
);
CREATE TABLE EventType
(
ID integer PRIMARY KEY,
Name nvarchar(20) NOT NULL UNIQUE,
[Description] nvarchar(100) NOT NULL
);
CREATE TABLE [Event]
(
ID integer PRIMARY KEY,
PLCTimeStamp datetime NOT NULL,
EventTypeID integer NOT NULL REFERENCES EventType,
DeviceID integer NOT NULL REFERENCES Device,
IATA varchar(50) NOT NULL,
Data1 integer NULL,
Data2 integer NULL
);
Suggested index (declare it unique if that is appropriate):
CREATE UNIQUE INDEX uq1
ON [Event]
(EventTypeID, PLCTimeStamp)
INCLUDE
(DeviceID, IATA, Data1, Data2, ID);
Cardinality information from the execution plan (undocumented syntax, do not use in production systems):
UPDATE STATISTICS dbo.Event WITH ROWCOUNT = 4042700, PAGECOUNT = 400000;
UPDATE STATISTICS dbo.EventType WITH ROWCOUNT = 22, PAGECOUNT = 1;
UPDATE STATISTICS dbo.Device WITH ROWCOUNT = 2806, PAGECOUNT = 28;
Updated query (repeating the IN list for the EventType table helps the optimizer in this specific case):
SELECT
Event.ID,
Event.IATA,
Device.Name,
EventType.Description,
Event.Data1,
Event.Data2,
Event.PLCTimeStamp,
Event.EventTypeID
FROM
Event
INNER JOIN EventType ON EventType.ID = Event.EventTypeID
INNER JOIN Device ON Device.ID = Event.DeviceID
WHERE
Event.EventTypeID IN (3, 30, 40, 41, 42, 46, 49, 50)
AND EventType.ID IN (3, 30, 40, 41, 42, 46, 49, 50)
AND Event.PLCTimeStamp BETWEEN '2011-01-28' AND '2011-01-29'
AND Event.IATA LIKE '%0005836217%'
ORDER BY Event.ID;
Estimated execution plan:
The plan you get will likely be different, because I am using guessed statistics. The general point is to give the optimizer as much information as you can, and to provide an efficient access method (an index) on the 4-million-row [Event] table.
The majority of the cost is the clustered index scan. Unless this table is really wide, or you don't actually need all of those columns in the output, I believe SQL Server's estimate that this is the optimal path in the current scenario, with nothing else changed. The plan does use a range scan (labeled as a clustered index seek) to narrow down the range of rows it is interested in, but because of the output columns it will still require either a lookup or a clustered index scan, even with the filtered index you created that targets this range; and even in that case the clustered index scan is probably still cheapest (or at least SQL Server estimates it as such).
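One way to check this for yourself is to force the filtered index with a table hint and compare the actual plans; the extra Key Lookup cost should then be visible. A sketch (the index name f1 here is a placeholder for whatever you called your filtered index):

SELECT E.ID, E.IATA, E.Data1, E.Data2, E.PLCTimeStamp, E.EventTypeID
FROM dbo.[Event] AS E WITH (INDEX(f1)) -- f1 = placeholder for your filtered index name
WHERE E.EventTypeID IN (3, 30, 40, 41, 42, 46, 49, 50)
  AND E.PLCTimeStamp BETWEEN '2011-01-28' AND '2011-01-29';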
The execution plan does tell you that this index would be useful:
CREATE NONCLUSTERED INDEX ix_EventTypeID_PLCTimeStamp_WithIncludes
ON [dbo].[Event] ([EventTypeID],[PLCTimeStamp])
INCLUDE ([ID],[DeviceID],[Data1],[Data2],[IATA]);
Though depending on your data skew it might be better the other way around, e.g.:
CREATE NONCLUSTERED INDEX ix_PLCTimeStamp_EventTypeID_WithIncludes
ON [dbo].[Event] ([PLCTimeStamp],[EventTypeID])
INCLUDE ([ID],[DeviceID],[Data1],[Data2],[IATA]);
But I would test both to be sure which is better, if either; the difference between these indexes and what you have now may be only marginal (too many variables for us to know). You also have to take into account that an additional index requires extra maintenance, and this can noticeably affect your DML operations (insert/update/delete). You may also consider including the filter criteria in this index, as suggested by @SQLKiwi, but only if that is the set of EventTypeID values you search for frequently. If that set changes over time, the filtered index will only be useful for this specific query.
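If that set of EventTypeIDs is stable, the filtered variant of the suggested index would look something like this (the index name is illustrative):

CREATE NONCLUSTERED INDEX ix_Event_PLCTimeStamp_Filtered
ON [dbo].[Event] ([PLCTimeStamp], [EventTypeID])
INCLUDE ([ID], [DeviceID], [Data1], [Data2], [IATA])
WHERE [EventTypeID] IN (3, 30, 40, 41, 42, 46, 49, 50);

A filtered index like this is smaller and cheaper to maintain than the unfiltered version, but as noted it only helps queries whose predicates are covered by the filter.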
With such a low row count, I have to wonder how bad the performance can currently be. This query returns 3 rows (but there isn't any indication of how many rows it rejected). How many rows are in the table?
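To answer that, something like the following would show the table size and how selective the date range is (illustrative queries, not figures from the question):

SELECT COUNT(*) AS TotalRows
FROM dbo.[Event];

SELECT COUNT(*) AS RowsInRange
FROM dbo.[Event]
WHERE PLCTimeStamp BETWEEN '2011-01-28' AND '2011-01-29';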
I just discovered that SQL Server 2008 R2 actually made an index suggestion when I generated the execution plan. The suggested index makes the query run about 90% faster.
The index it suggested was the following:
CREATE NONCLUSTERED INDEX [INDEX_spBagSearch] ON [dbo].[Event]
(
[EventTypeID] ASC,
[PLCTimeStamp] ASC
)
INCLUDE ( [ID],
[DeviceID],
[Data1],
[Data2],
[IATA]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO