Strange behaviour with sample sizes for statistics updates
Background
Data for the statistics object are gathered using a statement of the form:
SELECT
StatMan([SC0], [SC1], [SB0000])
FROM
(
SELECT TOP 100 PERCENT
[SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000]
FROM
(
SELECT
[TextValue] AS [SC0],
[Id] AS [SC1]
FROM [dbo].[Test]
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT)
WITH (READUNCOMMITTED)
) AS _MS_UPDSTATS_TBL_HELPER
ORDER BY
[SC0],
[SC1],
[SB0000]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)
You can collect this statement with Extended Events or Profiler (SP:StmtCompleted
).
Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages.
The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics.
Random numbers
SQL Server uses a random number generator to decide if a page qualifies or not. The generator used in this instance is the Lehmer random number generator with parameter values as shown below:
Xnext = Xseed * 75 mod (231 - 1)
The value of Xseed
is computed as the sum of:
The low integer part of the (
bigint
) base table'spartition_id
e.g.SELECT P.[partition_id] & 0xFFFFFFFF FROM sys.partitions AS P WHERE P.[object_id] = OBJECT_ID(N'dbo.Test', N'U') AND P.index_id = 1;
The value specified in the
REPEATABLE
clause- For sampled
UPDATE STATISTICS
, theREPEATABLE
value is 1. - This value is exposed in the
m_randomSeed
element of the access method's internal debugging information shown in execution plans when trace flag 8666 is enabled, for example<Field FieldName="m_randomSeed" FieldValue="1" />
- For sampled
For SQL Server 2012, this calculation occurs in sqlmin!UnOrderPageScanner::StartScan
:
mov edx,dword ptr [rcx+30h]
add edx,dword ptr [rcx+2Ch]
where memory at [rcx+30h]
contains the low 32 bits of the partition id and memory at [rcx+2Ch]
contains the REPEATABLE
value in use.
The random number generator is initialized later in the same method, calling sqlmin!RandomNumGenerator::Init
, where the instruction:
imul r9d,r9d,41A7h
...multiplies the seed by 41A7
hex (16807 decimal = 75) as shown in the equation above.
Later random numbers (for individual pages) are generated using the same basic code inlined into sqlmin!UnOrderPageScanner::SetupSubScanner
.
StatMan
For the example StatMan
query shown above, the same pages will be collected as for the T-SQL statement:
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) -- Same sample %
REPEATABLE (1) -- Always 1 for statman
WITH (INDEX(0)); -- Scan base object
This will match the output of:
SELECT
DDSP.rows_sampled
FROM sys.stats AS S
CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP
WHERE
S.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND S.[name] = N'IX_Test_TextValue';
Edge case
One consequence of using the MINSTD Lehmer random number generator is that seed values zero and int.max should not be used as this will result in the algorithm producing a sequence of zeroes (selecting every page).
The code detects zero, and uses a value from the system 'clock' as the seed in that case. It does not do the same if the seed is int.max (0x7FFFFFFF
= 231 - 1).
We can engineer this scenario since the initial seed is calculated as the sum of the low 32 bits of the partition id and the REPEATABLE
value. The REPEATABLE
value that will result in the seed being int.max and therefore every page being selected for sample is:
SELECT
0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1;
Working that into a complete example:
DECLARE @SQL nvarchar(4000) =
N'
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE (0 PERCENT)
REPEATABLE (' +
(
SELECT TOP (1)
CONVERT(nvarchar(11), 0x7FFFFFFF - P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1
) + ')
WITH (INDEX(0));';
PRINT @SQL;
--EXECUTE (@SQL);
That will select every row on every page whatever the TABLESAMPLE
clause says (even zero percent).
This is an excellent question! I'll start with what I know for sure and then move onto speculation. Lots of details about this in my blog post here.
Sampled stats updates use TABLESAMPLE
behind the scenes. It's pretty easy to find documentation about that online. However, I believe it's not well known that the rows returned by TABLESAMPLE
partially depend on the hobt_id
of the object. When you drop and recreate the object you get a new hobt_id
so the rows returned by random sampling are different.
If you delete and reinsert the data the hobt_id
stays the same. As long as the data is laid out in the same way on disk (an allocation order scan returns the same results in the same order) then the sampled data should not change.
You can also change the number of rows sampled by rebuilding the clustered index on the table. For example:
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER; -- 273862 rows
ALTER INDEX PK_Test on Test REBUILD;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER; -- 273320 rows
As to why that happens, I believe it's because SQL Server scans the clustered index instead of the nonclustered index when gathering sampled stats on an index. I also think that there's a hidden (to those of us tracing the hidden stats update queries) value for REPEATABLE
used with TABLESAMPLE
. I haven't proven any of that, but it explains why your histogram and the rows sampled changes with a rebuild of the clustered index.
I forgot how TABLESAMPLE worked in terms of assigning a random probability per page. - Martin Smith
I saw this in Inside Microsoft SQL Server 2008: T-SQL Querying by Itzik Ben-Gan and I cannot add it as a comment so I post it here, I think it's interesting to others too:
See also Sampling Using TABLESAMPLE by Roji. P. Thomas.