Cardinality Estimate for LIKE operator (Local Variables)
The guess for LIKE
in your case is based on:
G
: The standard 9% guess (sqllang!x_Selectivity_Like
)M
: A factor of 6 (magic number)D
: Average data length in bytes (from statistics), rounded down to integer
Specifically, sqllang!CCardUtilSQL7::ProbLikeGuess
uses:
Selectivity (S) = G / M * LOG(D)
Notes:
- The
LOG(D)
term is omitted ifD
is between 1 and 2. - If
D
is less than 1 (including for missing orNULL
statistics):
D = FLOOR(0.5 * maximum column byte length)
This sort of quirkiness and complexity is quite typical of the original CE.
In the question example, the average length is 5 (5.6154 from DBCC SHOW_STATISTICS
rounded down):
Estimate = 10,000 * (0.09 / 6 * LOG(5)) = 241.416
Other example values:
D = Estimate using formula for S 15 = 406.208 14 = 395.859 13 = 384.742 12 = 372.736 11 = 359.684 10 = 345.388 09 = 329.584 08 = 311.916 07 = 291.887 06 = 268.764 05 = 241.416 04 = 207.944 03 = 164.792 02 = 150.000 (LOG not used) 01 = 150.000 (LOG not used) 00 = 291.887 (LOG 7) /* FLOOR(0.5 * 15) [15 since lastname is varchar(15)] */
Test rig
DECLARE
@CharLength integer = 5, -- Set length here
@Counter integer = 1;
CREATE TABLE #T (c1 varchar(15) NULL);
-- Add 10,000 rows
SET NOCOUNT ON;
SET STATISTICS XML OFF;
BEGIN TRANSACTION;
WHILE @Counter <= 10000
BEGIN
INSERT #T (c1) VALUES (REPLICATE('X', @CharLength));
SET @Counter = @Counter + 1;
END;
COMMIT TRANSACTION;
SET NOCOUNT OFF;
SET STATISTICS XML ON;
-- Test query
DECLARE @Like varchar(15);
SELECT * FROM #T AS T
WHERE T.c1 LIKE @Like;
DROP TABLE #T;
I tested on SQL Server 2014 with the legacy CE and did not get 9% as a cardinality estimate either. I couldn't find anything accurate online so I did some testing and I found a model that fits all of the test cases that I tried, but I can't be sure that it's complete.
In the model that I found, the estimate is derived from the number of rows in the table, the average key length of the statistics for the filtered column, and sometimes the datatype length of the filtered column. There are two different formulas used for the estimation.
If FLOOR(average key length) = 0 then the estimation formula ignores the column statistics and creates an estimate based on the datatype length. I only tested with VARCHAR(N) so it's possible that there's a different formula for NVARCHAR(N). Here is the formula for VARCHAR(N):
(row estimate) = (rows in table) * (-0.004869 + 0.032649 * log10(length of data type))
This has a very nice fit, but it's not perfectly accurate:
The x-axis is the length of the data type and the y axis is the number of estimated rows for a table with 1 million rows.
The query optimizer would use this formula if you did not have statistics on the column or if the column has enough NULL values to drive the average key length to below 1.
For example, suppose that you had a table with 150k rows with filtering on a VARCHAR(50) and no column statistics. The row estimate prediction is:
150000 * (-0.004869 + 0.032649 * log10(50)) = 7590.1 rows
SQL to test it:
CREATE TABLE X_CE_LIKE_TEST_1 (
STRING VARCHAR(50)
);
CREATE STATISTICS X_STAT_CE_LIKE_TEST_1 ON X_CE_LIKE_TEST_1 (STRING) WITH NORECOMPUTE;
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B CROSS JOIN L2 C),
NUMS AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NUM FROM L4)
INSERT INTO X_CE_LIKE_TEST_1 WITH (TABLOCK) (STRING)
SELECT TOP (150000) 'ZZZZZ'
FROM NUMS
ORDER BY NUM;
DECLARE @LastName VARCHAR(15) = 'BA%'
SELECT * FROM X_CE_LIKE_TEST_1
WHERE STRING LIKE @LastName;
SQL Server gives an estimate row count of 7242.47 which is kind of close.
If FLOOR(average key length) >= 1 then a different formula is used that is based on the value of FLOOR(average key length). Here is a table of some of the values that I tried:
1 1.5%
2 1.5%
3 1.64792%
4 2.07944%
5 2.41416%
6 2.68744%
7 2.91887%
8 3.11916%
9 3.29584%
10 3.45388%
If FLOOR(average key length) < 6 then use the table above. Otherwise use the following equation:
(row estimate) = (rows in table) * (-0.003381 + 0.034539 * log10(FLOOR(average key length)))
This one has a better fit than the other one, but it's still not perfectly accurate.
The x-axis is the average key length and the y axis is the number of estimated rows for a table with 1 million rows.
To give another example, suppose that you had a table with 10k rows with an average key length of 5.5 for the statistics on the filtered column. The row estimate would be:
10000 * 0.241416 = 241.416 rows.
SQL to test it:
CREATE TABLE X_CE_LIKE_TEST_2 (
STRING VARCHAR(50)
);
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B CROSS JOIN L2 C),
NUMS AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NUM FROM L4)
INSERT INTO X_CE_LIKE_TEST_2 WITH (TABLOCK) (STRING)
SELECT TOP (10000)
CASE
WHEN NUM % 2 = 1 THEN REPLICATE('Z', 5)
ELSE REPLICATE('Z', 6)
END
FROM NUMS
ORDER BY NUM;
CREATE STATISTICS X_STAT_CE_LIKE_TEST_2 ON X_CE_LIKE_TEST_2 (STRING)
WITH NORECOMPUTE, FULLSCAN;
DECLARE @LastName VARCHAR(15) = 'BA%'
SELECT * FROM X_CE_LIKE_TEST_2
WHERE STRING LIKE @LastName;
The row estimate is 241.416 which matches what you have in the question. There would be some error if I used a value not in the table.
The models here aren't perfect but I think that they illustrate the general behavior pretty well.