Cardinality estimation problem on inner join
Based on your histograms I was able to repro the issue in 2017 CU6. I wouldn't say that you're doing something wrong. Rather, something is going wrong with cardinality estimation. Here's what I get before inserting a row:
The final cardinality estimate falls quite a bit after inserting a row:
You have a pretty simple repro here so my advice is to file product feedback or to open a support ticket with Microsoft. I was able to find a few workarounds that worked on your sample data and one of the might be acceptable for you.
- Drop the unique index on
CurrencyShareds.Id
. I can't get the repro to work without a unique index. The table is small, so maybe you can get by without the index. Of course, you might have very good reasons for keeping it. - Materialize the results of the join into a temp table. Based on your question it's important to get a reasonable estimate at this step so the larger query performs well. A temp table is one way to make that happen.
- Use the legacy CE. I can't get the issue to reproduce with it. Of course, this might have negative consequences on the rest of your query.
- Trick the query optimizer with silly code. For example, in my testing the following rewrite works great:
.
select Amount_TransactionCurrency_id, CurrencyShareds.id
from CurrencyShareds
INNER JOIN annexes
ON Amount_TransactionCurrency_id % 9223372036854775809 = CurrencyShareds.Id % 9223372036854775809
I suspect that this works because the CE appears to use the density instead of the histogram. Other similar rewrites may have the same effect. There's no guarantee that type of query will continue to work well in the future. That's why you should contact Microsoft to improve the odds that one day a fix for your issue will make it into the released product.
Ok, I hope I understand it now - so this our case
Given
- A reference table (CurrencyShareds) with ~100 rows, but ids are large, and min, max values differ very much - min: 119,762,190,797,406,464 vs max: 977,955,748,426,318,592
- A table (Annexes) that has simple FK to CurrencyShared, but only few Currencies are used - you can see that histogram for IX_FK_Amount_TransactionCurrency lists 5 ids - and what is important only those "low" ids, as others are not used.
When all stats are up to date then
CSelCalcExpressionComparedToExpression( QCOL: [test.MasterData].[dbo].[Annexes].Amount_TransactionCurrency_id x_cmpEq QCOL: [test.MasterData].[dbo].[CurrencyShareds].Id )
Loaded histogram for column QCOL: [test.MasterData].[dbo].[Annexes].Amount_TransactionCurrency_id from stats with id 7
Loaded histogram for column QCOL: [test.MasterData].[dbo].[CurrencyShareds].Id from stats with id 1
Selectivity: 0.01
Then selectivity calculated for the join is fine, as 100 * 107,131 * 0.01 = 107,131
When stats for currencyshareds are not up to date, then
CSelCalcExpressionComparedToExpression( QCOL: [test.MasterData].[dbo].[Annexes].Amount_TransactionCurrency_id x_cmpEq QCOL: [test.MasterData].[dbo].[CurrencyShareds].Id )
Loaded histogram for column QCOL: [test.MasterData].[dbo].[Annexes].Amount_TransactionCurrency_id from stats with id 7
Loaded histogram for column QCOL: [test.MasterData].[dbo].[CurrencyShareds].Id from stats with id 1 *** WARNING: badly-formed histogram ***
Selectivity: 4.59503e-018
Selectivity drops dramatically, and hence the estimated row number of the join is 1.
When histogram changes
After I add a single row to annexes that refrences CurrencyShared with high id, then in result the histogram for IX_FK_Amount_TransactionCurrency changes to
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
-------------------- ------------- ------------- -------------------- --------------
119762190797406475 0 173 0 1
119762190797406478 0 868 0 1
119762190797406481 0 107 0 1
119762190797406494 0 105745 0 1
119762190797406496 0 330 0 1
119762190797406618 0 1 0 1
119762190797406628 0 1 0 1
977955748426318623 0 1 0 1
With this histogram the problem disappears, now adding a new row to currencyshareds does not cause dramatic drop in cardinality estimation.
Why is that?
I suspect this is how the coarse histogram estimation algorithm works in sql2014+, and I am basing my guess on this great post https://www.sqlshack.com/join-estimation-internals/
Coarse Histogram Estimation is a new algorithm and less documented, even in terms of general concepts. It is known that instead of aligning histograms step by step, it aligns them with only minimum and maximum histogram boundaries. This method potentially introduces less CE mistakes (not always however, because we remember that this is just a model).
Just to make everything clear - why do we have such strange ids in currencyshareds?
It's quite simple - our ids are globally unique and are based in part on timestamp (implementation based on snowflake). The most common currencies were added at the start of the application several years ago, and only those few are really used in production, that is why in histogram there are only those with "low" id.
The problem surfaced on our test environments, where some automated tests started adding test currencies, causing some queries to execute longer or to timeout...
How to fix the problem?
We'll update statistics for those reference tables (we might have a similar problem with other similar reference data tables) more often - those tables are small so updating stats is not a problem
Lessons Learned
- Up to date stats are important!!!
- plain old identity column would not cause these problems :)