Statistics update with automatic sampling messes up density vector and histogram
Why this weird behaviour?
Suppose someone gave you a book of a million integers and told you to make a guess about the overall distribution based on ten sampled numbers. The ten sampled numbers that you receive are 1, 1, 125000, 125000, 250000, 250000, 375000, 375000, 500000, and 500000. One perfectly reasonable guess is that the book contains 500000 unique integers in pairs ranging from 1 to 500000. Another equally valid guess is that the book contains five unique integers, each present 200000 times. For your data distribution, the inference that SQL Server draws from the sample is closer to the latter interpretation. The algorithms that build histograms don't work well for all possible sampled data distributions, including one in which each unique integer shows up exactly twice. I was able to reproduce your described issue easily in SQL Server 2017.
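For what it's worth, here is a minimal sketch of the kind of repro I mean, assuming a hypothetical table dbo.PairedInts in which every integer from 1 to 500000 appears exactly twice:

```sql
DROP TABLE IF EXISTS dbo.PairedInts;
CREATE TABLE dbo.PairedInts (ID int NOT NULL);

-- Insert every integer from 1 to 500000 exactly twice (1,000,000 rows).
INSERT INTO dbo.PairedInts WITH (TABLOCK) (ID)
SELECT n.n
FROM (
    SELECT TOP (500000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
) AS n
CROSS JOIN (VALUES (1), (2)) AS copies (copy_no);

CREATE STATISTICS PairedInts_ID ON dbo.PairedInts (ID);

-- A low sampled rate can produce a density vector that suggests far
-- fewer distinct values than actually exist.
UPDATE STATISTICS dbo.PairedInts (PairedInts_ID) WITH SAMPLE 1 PERCENT;
DBCC SHOW_STATISTICS ('dbo.PairedInts', PairedInts_ID) WITH DENSITY_VECTOR;

-- FULLSCAN recovers the true all density of 1 / 500000 = 0.000002.
UPDATE STATISTICS dbo.PairedInts (PairedInts_ID) WITH FULLSCAN;
DBCC SHOW_STATISTICS ('dbo.PairedInts', PairedInts_ID) WITH DENSITY_VECTOR;
```

Depending on which pages happen to be sampled, the first density vector can come out badly skewed, while the FULLSCAN version reports the true density.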
Is there some rule of thumb to determine a correct sampling rate or is the fullscan necessary?
The only rule of thumb that I'm aware of is that the correct sampling rate is at least the sampling rate that gives you acceptable query or system performance. I've seen DBAs use a 100% sampling rate to avoid that analysis altogether, and that's a perfectly reasonable shortcut for tables that aren't huge.
You could consider replacing your table with an updatable view built on two underlying tables. Each underlying table could contain one half of the data. SQL Server will build separate statistics objects for each table and you may get better performance without having to tweak statistics. I cannot say if this is a good approach for your specific scenario.
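If you go that route, SQL Server's updatable UNION ALL views are local partitioned views, which have requirements of their own: the partitioning column must be part of the primary key and must be covered by non-overlapping CHECK constraints. A minimal sketch, assuming a hypothetical table with an integer key split at 500000:

```sql
CREATE TABLE dbo.MyTable_Low (
    ID  int         NOT NULL,
    Val varchar(10) NOT NULL,
    CONSTRAINT PK_MyTable_Low PRIMARY KEY (ID),
    CONSTRAINT CK_MyTable_Low CHECK (ID <= 500000)
);

CREATE TABLE dbo.MyTable_High (
    ID  int         NOT NULL,
    Val varchar(10) NOT NULL,
    CONSTRAINT PK_MyTable_High PRIMARY KEY (ID),
    CONSTRAINT CK_MyTable_High CHECK (ID > 500000)
);
GO

-- The CHECK constraints on the partitioning column make the view
-- updatable and let the optimizer skip the irrelevant branch.
CREATE VIEW dbo.MyTable
AS
SELECT ID, Val FROM dbo.MyTable_Low
UNION ALL
SELECT ID, Val FROM dbo.MyTable_High;
```

Each underlying table then gets its own statistics objects over a smaller, narrower range of values.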
Should I create a statistics update job with fullscan for some statistics?
That is a common solution to the problem that you're describing. For another option, SQL Server 2016 SP1 CU4 introduced the PERSIST_SAMPLE_PERCENT option for the CREATE STATISTICS and UPDATE STATISTICS commands. You can use it to instruct SQL Server to always sample statistics for that column at a certain rate whenever an auto-stats update is triggered. If asynchronous statistics updates are disabled, keep in mind that query compilation times may occasionally run long for queries that use that column.
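A sketch of the syntax, with hypothetical object names and an arbitrary sample rate:

```sql
-- Hypothetical table and statistics names; SAMPLE 50 PERCENT is just an
-- illustrative rate. PERSIST_SAMPLE_PERCENT = ON makes later automatic
-- statistics updates reuse this sampling rate instead of the default.
UPDATE STATISTICS dbo.MyTable MyStatistic
WITH SAMPLE 50 PERCENT, PERSIST_SAMPLE_PERCENT = ON;
```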
And if that's the case, how do I know which are the statistics that need this kind of treatment?
The usual answer is to monitor your workload for unacceptable query performance, determine which queries have statistics issues, and fix those issues with statistics changes. If you want to be proactive, you could consider analyzing statistics on a copy of production. First, gather sampled statistics for all relevant columns in the database and save all of the density vectors to a table; you can use DBCC SHOW_STATISTICS WITH DENSITY_VECTOR to do that. Then update all of the same statistics with FULLSCAN and save those density vectors as well. You can then compare the two sets of density vectors to find the statistics objects with the largest differences. This won't catch every statistics issue, but it will find ones similar to what you observed in this question.
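Here's a minimal sketch of that comparison for a single statistic, assuming a hypothetical dbo.MyTable with a statistic named MyStatistic; in practice you would loop over sys.stats to cover every relevant statistics object:

```sql
-- Work table matching the column layout of the DENSITY_VECTOR result set.
CREATE TABLE #density (
    scan_type      varchar(10) NULL,
    all_density    float,
    average_length float,
    columns        nvarchar(4000)
);

-- 1. Capture the density vector from a sampled update.
UPDATE STATISTICS dbo.MyTable MyStatistic;  -- default sampling rate
INSERT INTO #density (all_density, average_length, columns)
EXEC ('DBCC SHOW_STATISTICS (''dbo.MyTable'', MyStatistic) WITH DENSITY_VECTOR');
UPDATE #density SET scan_type = 'sampled' WHERE scan_type IS NULL;

-- 2. Capture the density vector from a FULLSCAN update.
UPDATE STATISTICS dbo.MyTable MyStatistic WITH FULLSCAN;
INSERT INTO #density (all_density, average_length, columns)
EXEC ('DBCC SHOW_STATISTICS (''dbo.MyTable'', MyStatistic) WITH DENSITY_VECTOR');
UPDATE #density SET scan_type = 'fullscan' WHERE scan_type IS NULL;

-- 3. A large ratio flags a statistic that may deserve a FULLSCAN job.
SELECT s.columns,
       s.all_density AS sampled_density,
       f.all_density AS fullscan_density,
       s.all_density / NULLIF(f.all_density, 0) AS density_ratio
FROM #density AS s
JOIN #density AS f
  ON f.columns = s.columns
WHERE s.scan_type = 'sampled'
  AND f.scan_type = 'fullscan';
```

The statistics with the largest ratios between the sampled and FULLSCAN densities are the best candidates for a scheduled FULLSCAN update job or a persisted sample rate.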