Huge slowdown to SQL Server query on adding wildcard (or top)

This is a decision of the cost-based optimiser.

The estimated costs used in making this choice are incorrect, as the optimiser assumes statistical independence between values in different columns.

It is similar to the issue described in Row Goals Gone Rogue where the even and odd numbers are negatively correlated.

It is easy to reproduce.

CREATE TABLE dbo.animal(
    id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    colour varchar(50) NOT NULL,
    species varchar(50) NOT NULL,
    Filler char(10) NULL
);

/*Insert 20 million rows with 1% black and 1% swan but no black swans*/
WITH T
     AS (SELECT TOP 20000000 ROW_NUMBER() OVER (ORDER BY @@SPID) AS RN
         FROM   master..spt_values v1,
                master..spt_values v2,
                master..spt_values v3)
INSERT INTO dbo.animal
            (colour,
             species)
SELECT CASE
         WHEN RN % 100 = 1 THEN 'black'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END,
       CASE
         WHEN RN % 100 = 2 THEN 'swan'
         ELSE CAST(RN % 100 AS VARCHAR(3))
       END
FROM   T 

/*Create some indexes*/
CREATE NONCLUSTERED INDEX ix_species ON  dbo.animal(species);
CREATE NONCLUSTERED INDEX ix_colour ON  dbo.animal(colour);

Now try

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black'
       AND species LIKE 'swan' 

This gives the plan below which is costed at 0.0563167.

(execution plan image)

The plan is able to perform a merge join between the results of the two indexes on the id column. (More details of merge join algorithm here).

Merge join requires both inputs to be ordered by the joining key.

The nonclustered indexes are ordered by (species, id) and (colour, id) respectively (non-unique nonclustered indexes always have the row locator implicitly added to the end of the key if it is not added explicitly). The query without any wildcards performs an equality seek on species = 'swan' and colour = 'black'. As each seek retrieves only one exact value from the leading column, the matching rows will come back ordered by id, so this plan is possible.
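
To illustrate the point, the two nonclustered indexes above behave, for ordering purposes, as if the clustering key had been appended to them explicitly. A rough equivalent (hypothetical index names, not needed for the repro) would be:

/*Illustration only: ordering-equivalent explicit declarations (hypothetical names)*/
CREATE NONCLUSTERED INDEX ix_species_id ON dbo.animal(species, id);
CREATE NONCLUSTERED INDEX ix_colour_id ON dbo.animal(colour, id);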

Query plan operators execute from left to right, with the left operator requesting rows from its children, which in turn request rows from their children (and so on until the leaf nodes are reached). The TOP iterator will stop requesting rows from its child once 10 have been received.

SQL Server has statistics on the indexes that tell it that 1% of the rows match each predicate. It assumes the predicates are independent (i.e. not correlated either positively or negatively), so on average, once it has processed 1,000 rows matching the first predicate, it expects to find 10 matching the second and can exit. (The plan above actually shows 987 rather than 1,000, but close enough.)
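
If you want to see where the 1% figure comes from, the index statistics can be inspected directly (just a sanity check, not needed for the repro):

/*'black' and 'swan' should each show around 200,000 of the 20 million rows*/
DBCC SHOW_STATISTICS ('dbo.animal', ix_colour) WITH HISTOGRAM;
DBCC SHOW_STATISTICS ('dbo.animal', ix_species) WITH HISTOGRAM;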

In fact, as the predicates are negatively correlated, the actual plan shows that all 200,000 matching rows needed to be processed from each index, but this is mitigated to some extent because zero joined rows also meant zero lookups were actually needed.
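
The negative correlation itself is easy to confirm directly:

/*1% of rows match each predicate individually, but no rows match both*/
SELECT COUNT(*) FROM dbo.animal WHERE colour = 'black';                       /*200,000*/
SELECT COUNT(*) FROM dbo.animal WHERE species = 'swan';                       /*200,000*/
SELECT COUNT(*) FROM dbo.animal WHERE colour = 'black' AND species = 'swan';  /*0*/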

Compare with

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan' 

This gives the plan below, which is costed at 0.567943.

(execution plan image)

The addition of the trailing wildcard has now caused an index scan. The cost of the plan is still quite low though for a scan on a 20 million row table.

Adding QUERYTRACEON 9130 shows some more information:

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'       
OPTION (QUERYTRACEON 9130)  

(execution plan image)

It can be seen that SQL Server reckons it will only need to scan around 100,000 rows before it finds 10 matching the predicate and the TOP can stop requesting rows.

Again this makes sense under the independence assumption, as 10 * 100 * 100 = 100,000.
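
Spelled out as a query (illustration only):

/*Combined selectivity under independence = 1% * 1% = 0.0001,
  so rows expected to be scanned for 10 matches = 10 / 0.0001*/
SELECT 10 / (0.01 * 0.01) AS estimated_rows_to_scan; /*100,000*/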

Finally, let's try to force an index intersection plan:

SELECT TOP 10 *
FROM   animal WITH (INDEX(ix_species), INDEX(ix_colour))
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan' 

This gives a parallel plan for me, with an estimated cost of 3.4625.

(execution plan image)

The main difference here is that the colour like 'black%' predicate can now match multiple different colours. This means the matching index rows for that predicate are no longer guaranteed to be sorted in order of id.

For example, the index seek on LIKE 'black%' might return the following rows:

+------------+----+
|   Colour   | id |
+------------+----+
| black      | 12 |
| black      | 20 |
| black      | 23 |
| black      | 25 |
| blackberry |  1 |
| blackberry | 50 |
+------------+----+

Within each colour the ids are ordered but the ids across different colours may well not be.

As a result, SQL Server can no longer perform a merge join index intersection (without adding a blocking sort operator), and it opts to perform a hash join instead. Hash join is blocking on its build input, so the cost now reflects the fact that all matching rows will need to be processed from the build input, rather than assuming it will only have to scan 1,000 as in the first plan.

The probe input is non-blocking, however, and it still incorrectly estimates that it will be able to stop probing after processing 987 rows from that input.

(Further info on Non-blocking vs. blocking iterators here)
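
As an aside, the sort that a merge join would now require can be seen by forcing the join type (a sketch; I haven't verified the exact plan shape for this variant):

/*Forcing a merge join should introduce an explicit (blocking) Sort on id
  below the seek on colour LIKE 'black%'*/
SELECT TOP 10 *
FROM   animal WITH (INDEX(ix_species), INDEX(ix_colour))
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'
OPTION (MERGE JOIN);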

Given the increased costs of the extra estimated rows and the hash join, the partial clustered index scan looks cheaper.

In practice of course the "partial" clustered index scan is not partial at all and it needs to chug through the whole 20 million rows rather than the 100 thousand assumed when comparing the plans.

Increasing the value of the TOP (or removing it entirely) eventually encounters a tipping point where the number of rows it estimates the CI scan will need to cover makes that plan look more expensive and it reverts to the index intersection plan. For me the cut off point between the two plans is TOP (89) vs TOP (90).

For you it may well differ as it depends how wide the clustered index is.
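
In other words, on my setup these two queries get different plans (the exact cut off will vary for you):

/*TOP (89): partial clustered index scan plan*/
SELECT TOP (89) * FROM dbo.animal WHERE colour LIKE 'black%' AND species LIKE 'swan';

/*TOP (90): index intersection (hash join) plan*/
SELECT TOP (90) * FROM dbo.animal WHERE colour LIKE 'black%' AND species LIKE 'swan';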

Removing the TOP and forcing the CI scan

SELECT *
FROM   animal WITH (INDEX = 1)
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan' 

Is costed at 88.0586 on my machine for my example table.

If SQL Server were aware that the zoo had no black swans, and that it would therefore need to do a full scan rather than just reading 100,000 rows, this plan would not be chosen.

I've tried multi-column stats on animal (species, colour) and animal (colour, species), and filtered stats on animal (colour) WHERE species = 'swan', but none of these convince it that black swans don't exist and that the TOP 10 scan will need to process more than 100,000 rows.
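
For reference, the statistics I tried were along these lines (statistic names are arbitrary):

/*Statistic names below are my own; any names work*/
CREATE STATISTICS st_species_colour ON dbo.animal (species, colour);
CREATE STATISTICS st_colour_species ON dbo.animal (colour, species);
CREATE STATISTICS st_colour_swan    ON dbo.animal (colour) WHERE species = 'swan';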

This is due to the "inclusion assumption" where SQL Server essentially assumes that if you are searching for something it probably exists.

On 2008+ there is a documented trace flag 4138 that turns off row goals. The effect of this is that the plan is costed without the assumption that the TOP will allow the child operators to terminate early without reading all matching rows. With this trace flag in place I naturally get the more optimal index intersection plan.

SELECT TOP 10 *
FROM   animal
WHERE  colour LIKE 'black%'
       AND species LIKE 'swan'
OPTION (QUERYTRACEON 4138)       

(execution plan image)

This plan now correctly costs for reading the full 200 thousand rows in both index seeks, but over-costs the key lookups (estimated 2 thousand vs actual 0; the TOP 10 would constrain this to a maximum of 10, but the trace flag prevents that from being taken into account). Still, the plan is costed significantly cheaper than the full CI scan, so it is selected.

Of course this plan might not be optimal for combinations which are common, such as white swans.

A composite index on animal (colour, species) or ideally animal (species, colour) would allow the query to be much more efficient for both scenarios.

To make the most efficient use of the composite index, the LIKE 'swan' would also need to be changed to = 'swan'.
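
The indexes referenced in the table below, and the rewritten query, would look like this:

CREATE NONCLUSTERED INDEX ix_colour_species ON dbo.animal (colour, species);
CREATE NONCLUSTERED INDEX ix_species_colour ON dbo.animal (species, colour);

SELECT TOP 10 *
FROM   dbo.animal
WHERE  colour LIKE 'black%'
       AND species = 'swan';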

The table below shows the seek predicates and residual predicates shown in the execution plans for all four permutations.

+----------------------------------------------+-------------------+----------------------------------------------------------------+----------------------------------------------+
|                 WHERE clause                 |       Index       |                         Seek Predicate                         |              Residual Predicate              |
+----------------------------------------------+-------------------+----------------------------------------------------------------+----------------------------------------------+
| colour LIKE 'black%' AND species LIKE 'swan' | ix_colour_species | colour >= 'black' AND colour < 'blacL'                         | colour like 'black%' AND species like 'swan' |
| colour LIKE 'black%' AND species LIKE 'swan' | ix_species_colour | species >= 'swan' AND species <= 'swan'                        | colour like 'black%' AND species like 'swan' |
| colour LIKE 'black%' AND species = 'swan'    | ix_colour_species | (colour,species) >= ('black', 'swan') AND colour < 'blacL'     | colour LIKE 'black%' AND species = 'swan'    |
| colour LIKE 'black%' AND species = 'swan'    | ix_species_colour | species = 'swan' AND (colour >= 'black' and colour <  'blacL') | colour like 'black%'                         |
+----------------------------------------------+-------------------+----------------------------------------------------------------+----------------------------------------------+

Finding this intriguing, I did some searching and stumbled upon this Q&A: How (and why) does TOP impact an execution plan?

Basically, using TOP changes the cost of the operators that follow it (in a nontrivial manner), which causes the overall plan to change too, and with it the overall execution of the query (it would be great if you included execution plans with and without the TOP 10).

Hope this helps.

For instance, I tried it on a database and found that with no TOP, parallelism is used, while with TOP, parallelism is not used.

So, again, showing your execution plans would provide more info.

Have a nice day