Poor cardinality estimate disqualifies INSERT from minimal logging?
Why is it that the second query does not qualify for minimal logging?
Minimal logging is available for the second query, but the engine chooses not to use it at runtime.
There is a minimum threshold for INSERT...SELECT below which it chooses not to use the bulk load optimizations. There is a cost involved in setting up a bulk rowset operation, and bulk-inserting only a few rows would not result in efficient space utilization.
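If you are curious whether the bulk load path was actually taken, one rough way to check, besides the transaction-log DMV used in the test rig below, is to count per-row insert log records with the undocumented fn_dblog function. This is only a sketch, assuming a quiet test database in SIMPLE recovery; dbo.TargetTable is a placeholder name.
-- A fully logged insert writes roughly one LOP_INSERT_ROWS record per row;
-- a minimally logged bulk load writes far fewer (mostly allocation records).
CHECKPOINT; -- in SIMPLE recovery this clears older log records, so the next count is meaningful

-- ...run the INSERT...SELECT under test here...

SELECT COUNT(*) AS insert_row_log_records
FROM sys.fn_dblog(NULL, NULL)
WHERE [Operation] = N'LOP_INSERT_ROWS'
AND [AllocUnitName] LIKE N'dbo.TargetTable%';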
What can be done to improve the situation?
Use one of the many other methods (e.g. SELECT INTO) that does not have this threshold. Alternatively, you might be able to rewrite the source query in some way to boost the estimated number of rows/pages over the threshold for INSERT...SELECT.
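For example, a minimal sketch of the SELECT INTO alternative, borrowing the SourceBad table from the test rig below (the destination name is made up and must not already exist):
-- SELECT...INTO creates the destination table itself and is eligible for
-- minimal logging regardless of the source row estimate (subject to the
-- database recovery model).
SELECT SourceBadID, SomeData
INTO dbo.InsertTest_Copy
FROM dbo.SourceBad;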
See also Geoff's self-answer for more useful information.
Possibly interesting trivia: SET STATISTICS IO reports logical reads for the target table only when bulk loading optimizations are not used.
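As a quick illustration, using the tables from the test rig below, something along these lines makes the difference visible (whether dbo.InsertTest shows up depends on whether the optimization was actually chosen):
SET STATISTICS IO ON;

-- With the bulk load optimizations in play, dbo.InsertTest does not appear
-- in the STATISTICS IO output; with a fully logged insert, it does.
INSERT INTO dbo.InsertTest WITH (TABLOCK)
SELECT * FROM dbo.SourceGood;

SET STATISTICS IO OFF;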
I was able to recreate the problem with my own test rig:
USE test;
CREATE TABLE dbo.SourceGood
(
    SourceGoodID INT NOT NULL
        CONSTRAINT PK_SourceGood
        PRIMARY KEY CLUSTERED
        IDENTITY(1,1)
    , SomeData VARCHAR(384) NOT NULL
);

CREATE TABLE dbo.SourceBad
(
    SourceBadID INT NOT NULL
        CONSTRAINT PK_SourceBad
        PRIMARY KEY CLUSTERED
        IDENTITY(-2147483647,1)
    , SomeData VARCHAR(384) NOT NULL
);

CREATE TABLE dbo.InsertTest
(
    SourceBadID INT NOT NULL
        CONSTRAINT PK_InsertTest
        PRIMARY KEY CLUSTERED
    , SomeData VARCHAR(384) NOT NULL
);
GO
INSERT INTO dbo.SourceGood WITH (TABLOCK) (SomeData)
SELECT TOP(5000000) o.name + o1.name + o2.name
FROM syscolumns o
, syscolumns o1
, syscolumns o2;
GO
ALTER DATABASE test SET AUTO_UPDATE_STATISTICS OFF;
GO
INSERT INTO dbo.SourceBad WITH (TABLOCK) (SomeData)
SELECT TOP(5000000) o.name + o1.name + o2.name
FROM syscolumns o
, syscolumns o1
, syscolumns o2;
GO
ALTER DATABASE test SET AUTO_UPDATE_STATISTICS ON;
GO
BEGIN TRANSACTION;
INSERT INTO dbo.InsertTest WITH (TABLOCK)
SELECT *
FROM dbo.SourceGood;
SELECT * FROM sys.dm_tran_database_transactions;
/*
database_transaction_log_record_count: 472
database_transaction_log_bytes_used:   692136
*/
COMMIT TRANSACTION;
BEGIN TRANSACTION;
INSERT INTO dbo.InsertTest WITH (TABLOCK)
SELECT *
FROM dbo.SourceBad;
SELECT * FROM sys.dm_tran_database_transactions;
/*
database_transaction_log_record_count: 5000003
database_transaction_log_bytes_used:   642699256
*/
COMMIT TRANSACTION;
This raises the question: why not "fix" the problem by updating statistics on the source tables prior to running the minimally logged operation?
TRUNCATE TABLE dbo.InsertTest;
UPDATE STATISTICS dbo.SourceBad;
BEGIN TRANSACTION;
INSERT INTO dbo.InsertTest WITH (TABLOCK)
SELECT *
FROM dbo.SourceBad;
SELECT * FROM sys.dm_tran_database_transactions;
/*
database_transaction_log_record_count: 472
database_transaction_log_bytes_used:   692136
*/
COMMIT TRANSACTION;
Rewrite the source query in some way to boost the estimated number of rows
Expanding on Paul's idea, a workaround if you are truly desperate is to add a dummy table that guarantees the estimated number of rows for the insert will be high enough to qualify for the bulk loading optimizations. I confirmed that this gets minimal logging and improves query performance.
-- Create a dummy table that SQL Server thinks has a million rows
CREATE TABLE dbo.emptyTableWithMillionRowEstimate (
    n INT PRIMARY KEY
)
GO
UPDATE STATISTICS dbo.emptyTableWithMillionRowEstimate
WITH ROWCOUNT = 1000000
GO
-- Concatenate this table into the final rowset:
INSERT INTO dbo.minimalLoggingTest WITH (TABLOCK) (n)
SELECT n
-- Any table/view/sub-query that actually generates 5MM rows but carries a too-low row estimate
FROM dbo.fiveMillionNumbersBadEstimate
-- Add in dummy rowset to ensure row estimate is high enough for bulk load optimization
UNION ALL
SELECT NULL FROM dbo.emptyTableWithMillionRowEstimate
OPTION (MAXDOP 1)
Final takeaways
- Use SELECT...INTO for one-time insert operations if minimal logging is required. As Paul points out, this will ensure minimal logging regardless of the row estimate.
- Wherever possible, write queries in a simple manner that the query optimizer can reason about effectively. It may be possible to break up a query into multiple pieces, for example, in order to allow statistics to be built on an intermediate table (see the sketch after this list).
- If you have access to SQL Server 2014, try it on your query; in my actual production case, the new Cardinality Estimator yielded a much higher (and better) estimate, and the query was then minimally logged. But this may not be helpful if you need to support SQL Server 2012 and earlier.
- If you're desperate, hacky solutions like this one may apply!
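To make the "break it into pieces" idea concrete, here is a minimal sketch (all object names are hypothetical): materialize the hard-to-estimate part of the query first, then run the final INSERT...SELECT from the staged table, which now carries accurate statistics.
-- Step 1: materialize the hard-to-estimate source with SELECT...INTO
-- (itself eligible for minimal logging); the staging table ends up
-- with an accurate row count.
SELECT s.SomeKey, s.SomeData
INTO dbo.StagedRows
FROM dbo.HardToEstimateSource AS s
WHERE s.SomeKey > 0;

-- Step 2: the INSERT...SELECT from the staged table now has a good
-- estimate, so it can qualify for the bulk load optimizations.
INSERT INTO dbo.FinalTarget WITH (TABLOCK)
SELECT SomeKey, SomeData
FROM dbo.StagedRows;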
A related article
Paul White's May 2019 blog post Minimal Logging with INSERT…SELECT into Heap Tables covers some of this information in more detail.