How does one investigate the performance of a BULK INSERT statement?
As far as I can tell, you can optimize a bulk insert in much the same way that you'd optimize a regular insert. Typically, a query plan for a simple insert isn't very informative, so don't worry about not having the plan. I'll go over a few ways of optimizing an insert, but most of them probably don't apply to the insert you specified in the question. However, they could be helpful if in the future you need to load larger amounts of data.
1. Insert data in clustering key order
SQL Server will often sort data before inserting it into a table with a clustered index. For some tables and applications you can improve performance by sorting the data in the flat file and letting SQL Server know that the data is sorted through the ORDER argument of BULK INSERT:
ORDER ( { column [ ASC | DESC ] } [ ,... n ] )
Specifies how the data in the data file is sorted. Bulk import performance is improved if the data being imported is sorted according to the clustered index on the table, if any.
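For reference, the hint just goes in the WITH clause. This is only a sketch: the file path, column name, and terminators are placeholders, not details from your question.
-- Sketch: tell SQL Server the file is already sorted by the clustering key.
-- The path, column name and terminators below are placeholders.
BULK INSERT dbo.BulkLoadTable
FROM 'C:\load\data.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    ORDER (OrderDate ASC)   -- must match the table's clustered index order
);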
Since you are using an IDENTITY column as the clustered key you don't need to worry about this.
2. Use TABLOCK if possible
If you are guaranteed to have only one session inserting data into your table you can specify the TABLOCK argument for BULK INSERT. This can reduce lock contention and can lead to minimal logging in some scenarios. However, you are inserting into a table with a clustered index that already contains data so you will not get minimal logging without trace flag 610, which is mentioned later in this answer.
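For illustration, a minimal sketch with placeholder names; the only change from a plain BULK INSERT is the TABLOCK argument:
-- Sketch: single-session load requesting a bulk update table lock.
BULK INSERT dbo.BulkLoadTable
FROM 'C:\load\data.csv'
WITH (TABLOCK);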
If TABLOCK is not possible, because you can't change the code, not all hope is lost. Consider using sp_tableoption:
EXEC [sys].[sp_tableoption]
@TableNamePattern = N'dbo.BulkLoadTable' ,
@OptionName = 'table lock on bulk load' ,
@OptionValue = 'ON'
Another option is to enable trace flag 715.
3. Use an appropriate batch size
Sometimes you will be able to tune inserts by changing the batch size.
ROWS_PER_BATCH = rows_per_batch
Indicates the approximate number of rows of data in the data file.
By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer. If you specify ROWS_PER_BATCH (with a value > 0), the server uses this value to optimize the bulk-import operation. The value specified for ROWS_PER_BATCH should be approximately the same as the actual number of rows. For information about performance considerations, see "Remarks," later in this topic.
Here is the quote from later in the article:
If the number of pages to be flushed in a single batch exceeds an internal threshold, a full scan of the buffer pool might occur to identify which pages to flush when the batch commits. This full scan can hurt bulk-import performance. A likely case of exceeding the internal threshold occurs when a large buffer pool is combined with a slow I/O subsystem. To avoid buffer overflows on large machines, either do not use the TABLOCK hint (which will remove the bulk optimizations) or use a smaller batch size (which preserves the bulk optimizations).
Because computers vary, we recommend that you test various batch sizes with your data load to find out what works best for you.
Personally I would just insert all 695 rows in a single batch. Tuning the batch size can make a big difference when inserting lots of data though.
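If you do end up tuning this on a bigger load, the knob lives in the same WITH clause. The names and values below are placeholders to experiment with, not recommendations:
-- Sketch: give the optimizer an estimate of the total rows in the file.
BULK INSERT dbo.BulkLoadTable
FROM 'C:\load\data.csv'
WITH (ROWS_PER_BATCH = 695);

-- Alternatively, BATCHSIZE splits the load into separate transactions,
-- e.g. WITH (BATCHSIZE = 50000) commits every 50,000 rows.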
4. Make sure that you need the IDENTITY column
I don't know anything about your data model or requirements, but don't fall into the trap of adding an IDENTITY column to every table. Aaron Bertrand has an article about this called Bad habits to kick: putting an IDENTITY column on every table. To be clear, I'm not saying that you should remove the IDENTITY column from this table. However, if you determine that the IDENTITY column is not necessary and remove it, that could improve insert performance.
5. Disable indexes or constraints
If you're loading a large amount of data into a table compared to what you already have, it may be faster to disable indexes or constraints before the load and to enable them after the load. For large amounts of data it is usually more efficient for SQL Server to build an index all at once instead of as data is loaded into the table. It looks like you inserted 695 rows into a table with 11500 rows, so I would not recommend this technique.
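For completeness, here is roughly what that looks like with made-up index and constraint names. Disabling a clustered index makes the table inaccessible, so this only applies to nonclustered indexes and to constraints:
-- Sketch with placeholder names: disable before the load, rebuild/re-check after.
ALTER INDEX IX_BulkLoadTable_SomeColumn ON dbo.BulkLoadTable DISABLE;
ALTER TABLE dbo.BulkLoadTable NOCHECK CONSTRAINT FK_BulkLoadTable_Parent;

-- ... run the BULK INSERT here ...

ALTER INDEX IX_BulkLoadTable_SomeColumn ON dbo.BulkLoadTable REBUILD;
ALTER TABLE dbo.BulkLoadTable WITH CHECK CHECK CONSTRAINT FK_BulkLoadTable_Parent;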
6. Consider TF 610
Trace Flag 610 allows minimal logging in some additional scenarios. For your table with an IDENTITY clustered key, you would get minimal logging for any new data pages as long as your recovery model is simple or bulk-logged. I believe this feature is not on by default because it may degrade performance on some systems. You would need to test carefully before enabling this trace flag. The recommended Microsoft reference still appears to be The Data Loading Performance Guide:
I/O Impact of Minimal Logging Under Trace Flag 610
When you commit a bulk load transaction that was minimally logged, all of the loaded pages must be flushed to disk before the commit completes. Any flushed pages not caught by an earlier checkpoint operation can create a great deal of random I/O. Contrast this with a fully logged operation, which creates sequential I/O on the log writes instead and does not require loaded pages to be flushed to disk at commit time.
If your load scenario is small insert operations on btrees that do not cross checkpoint boundaries, and you have a slow I/O system, using minimal logging can actually slow down insert speeds.
As far as I can tell this doesn't have anything to do with trace flag 610, but rather with minimal logging itself. I believe the earlier quote about ROWS_PER_BATCH tuning was getting at this same concept.
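If you do decide to test it (on a version older than SQL Server 2016, per the note later on this page), the trace flag can be turned on globally for the duration of a test run:
-- Sketch: enable trace flag 610 instance-wide for a test, then turn it off.
DBCC TRACEON (610, -1);
-- ... run the load and compare duration and log generation ...
DBCC TRACEOFF (610, -1);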
In conclusion, there probably isn't much that you can do to tune your BULK INSERT. I wouldn't be concerned about the read count that you observed with your insert. SQL Server will report reads any time you insert data. Consider the following very simple INSERT:
DROP TABLE IF EXISTS X_TABLE;
CREATE TABLE X_TABLE (
VAL VARCHAR(1000) NOT NULL
);
SET STATISTICS IO, TIME ON;
INSERT INTO X_TABLE WITH (TABLOCK)
SELECT REPLICATE('Z', 1000)
FROM dbo.GetNums(10000); -- generate 10000 rows
Output from SET STATISTICS IO, TIME ON:
Table 'X_TABLE'. Scan count 0, logical reads 11428
I have 11428 reported reads but that is not actionable information. Sometimes the number of reported reads can be reduced by minimal logging, but of course the difference cannot be directly translated into a performance gain.
I'm going to start answering this question, with the intention of continuously updating this answer as I build a knowledge base of tricks. Hopefully others come across this and help me improve my own knowledge in the process.
Gut check: Is your firewall doing stateful, deep packet inspection? You won't find much on the Internet about this, but if your bulk inserts are about 10x slower than they should be, chances are you have a security appliance doing Layer 3-7 deep packet inspection and checking for "Generic SQL Injection Prevention".
Measure the size of the data you plan to bulk insert, in bytes, per batch. And check if you are storing any LOB data, as that is a separate page fetch and write operation.
Several reasons why you should do it this way:
a. In AWS, Elastic Block Storage IOPS get broken down into bytes, not rows.
- See Amazon EBS Volume Performance on Linux Instances » I/O Characteristics and Monitoring for an explanation of what an EBS IOPS unit is
- Specifically, General Purpose SSD (gp2) volumes have "I/O Credits and Burst Performance" concept and it is common for heavy ETL processing to deplete burst balance credits. Your burst duration is measured in bytes, not SQL Server rows :)
b. While most libraries or whitepapers test based on number of rows, it's really the number of pages that can be written to that matter, and, in order to calculate that, you need to know how many bytes per row and your page size (usually 8KB, but always double check if you inherited the system from someone else.)
SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(),OBJECT_ID(N'YourTable'), NULL, NULL, 'DETAILED')
Pay attention to avg_record_size_in_bytes and page_count.
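From those two columns you can sketch a rough bytes-per-batch and pages-per-batch estimate. The table name and batch size below are placeholders:
-- Sketch: rough size estimate for a planned load, using placeholder names/values.
DECLARE @RowsPerBatch bigint = 100000;

SELECT  ips.index_id,
        ips.avg_record_size_in_bytes,
        ips.page_count,
        @RowsPerBatch * ips.avg_record_size_in_bytes                   AS approx_bytes_per_batch,
        CEILING(@RowsPerBatch * ips.avg_record_size_in_bytes / 8060.0) AS approx_pages_per_batch -- ~8 KB usable per page
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.YourTable'), NULL, NULL, 'DETAILED') AS ips
WHERE ips.index_level = 0;  -- leaf level only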
c. As Paul White explains in https://sqlperformance.com/2019/05/sql-performance/minimal-logging-insert-select-heap, "To enable minimal logging with INSERT...SELECT, SQL Server must expect more than 250 rows with a total size of at least one extent (8 pages)."
If you have any indexes with check constraints or unique constraints, use SET STATISTICS IO ON and SET STATISTICS TIME ON (or SQL Server Profiler or SQL Server Extended Events) to capture information such as whether your bulk insert has any read operations. Read operations are due to the SQL Server database engine making sure the integrity constraints pass.
Try creating a test database where the PRIMARY FILEGROUP is mounted on a RAM drive. This should be slightly faster than SSD but also eliminate any questions as to whether your RAID controller might be adding overhead. In 2018, it shouldn't, but by creating multiple differential baselines like this one, you can get a general idea of how much overhead your hardware is adding.
Also put the source file on a RAM drive.
Putting the source file on a RAM drive will rule out any contention issues if you're reading the source file from the same drive your database server's FILEGROUP is on.
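A rough sketch of such a test database, assuming a RAM drive mounted as R:\ (the drive letter, sizes, and names are all placeholders):
-- Sketch: put both the data file and the log file on the RAM drive.
CREATE DATABASE BulkLoadTest
ON PRIMARY
    (NAME = BulkLoadTest_data, FILENAME = 'R:\BulkLoadTest.mdf', SIZE = 1GB)
LOG ON
    (NAME = BulkLoadTest_log, FILENAME = 'R:\BulkLoadTest.ldf', SIZE = 512MB);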
Verify that you've formatted your hard drive using a 64KB allocation unit size (the size of one extent).
Use UserBenchmark.com and benchmark your SSD. This will:
- Add more knowledge to other performance aficionados about what performance to expect from a device
- Help you figure out if your drive's performance is under-performing peers with the same exact drive
- Help you figure out if your drive's performance under-performs other drives in the same category (SSD, HDD, etc.)
If you're calling "INSERT BULK" from C# via Entity Framework Extensions, then make sure you "warm up" the JIT first and "throw away" the first few results.
Try creating Performance Counters for your program. With .NET, you can use benchmark.NET and it will automatically profile a bunch of basic metrics. You can then SHARE your profiler attempts with the open source community, and see if people running different hardware report the same metrics (viz. from my earlier point about using UserBenchmark.com to compare).
Try using named pipes and running it as localhost.
If you're targeting SQL Server and using .NET Core, consider spinning up a Linux with SQL Server Std Edition - this costs less than a dollar per hour even for serious hardware. The major advantage to trying the same code with the same hardware with a different OS is to see if the OS kernel's TCP/IP stack is causing problems.
Use Glenn Berry's SQL Server Diagnostic Queries to measure drive latency for the drive storing the FILEGROUP of your database table.
a. Make sure to measure before your test, and after your test. The "before your test" just tells you whether you've got horrible IO characteristics as a baseline.
b. For measuring "during your test", you really need to use PerfMon Performance Counters.
Why? Because most database servers use some sort of network-attached storage (NAS). In the cloud, in AWS, Elastic Block Storage is just that. You could be bound by the IOPS of your EBS volume/NAS solution.
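The file-latency query in those diagnostic sets is based on sys.dm_io_virtual_file_stats; here is a stripped-down sketch of the same idea:
-- Sketch: average read/write latency per database file since instance startup.
SELECT  DB_NAME(vfs.database_id)                              AS database_name,
        mf.physical_name,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_write_latency_ms DESC;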
Use some tool to measure wait statistics: Red Gate SQL Monitor, SolarWinds Database Performance Analyzer, or even Glenn Berry's SQL Server Diagnostic Queries or Paul Randal's wait statistics query.
a. The most common wait types will likely be Memory/CPU, WRITELOG, PAGEIOLATCH_EX, and ASYNC_NETWORK_IO.
b. You might incur additional wait types if you're running Availability Groups.
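If none of those tools are available, a bare-bones sketch against sys.dm_os_wait_stats gives the cumulative picture:
-- Sketch: top cumulative waits since the last restart; watch for WRITELOG,
-- PAGEIOLATCH_EX and ASYNC_NETWORK_IO per the list above.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;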
Measure the effects of multiple, simultaneous INSERT BULK commands with TABLOCK disabled (TABLOCK will likely force serialization of INSERT BULK commands). Your bottleneck could be waiting for an INSERT BULK to complete; you should try to queue up as many of these tasks as your database server's physical data model can handle.
Consider partitioning your table. As a particular example: if your database table is append-only, Andrew Novick suggested creating a "TODAY" FILEGROUP and partitioning into at least two filegroups, TODAY and BEFORE_TODAY. In this way, if your INSERT BULK data is only data for today, you can filter on a CreatedOn field to force all inserts to hit a single FILEGROUP, thereby reducing blocking when using TABLOCK. This technique is described in more detail in a Microsoft whitepaper: Partitioned Table and Index Strategies Using SQL Server 2008.
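A minimal sketch of that idea, with a made-up boundary date and placeholder filegroup and object names (the whitepaper covers the full sliding-window mechanics):
-- Sketch: two-filegroup layout split on a CreatedOn date (all names are placeholders).
ALTER DATABASE CURRENT ADD FILEGROUP BEFORE_TODAY;
ALTER DATABASE CURRENT ADD FILEGROUP TODAY;
-- ... ADD FILE statements for each filegroup omitted ...

CREATE PARTITION FUNCTION pf_CreatedOn (datetime2)
    AS RANGE RIGHT FOR VALUES ('2018-06-01');   -- boundary = start of "today"

CREATE PARTITION SCHEME ps_CreatedOn
    AS PARTITION pf_CreatedOn TO (BEFORE_TODAY, TODAY);

-- The table (or its clustered index) is then created ON ps_CreatedOn(CreatedOn).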
If you're using columnstore indexes, turn off TABLOCK and load data in batches of 102,400 rows. You can then load all your data in parallel directly into columnstore rowgroups. This suggestion (and documented rationale) comes from Microsoft's Columnstore indexes - Data loading guidance:
Bulk loading has these built-in performance optimizations:
Parallel loads: You can have multiple concurrent bulk loads (bcp or bulk insert) that are each loading a separate data file. Unlike rowstore bulk loads into SQL Server, you don't need to specify TABLOCK because each bulk import thread will load data exclusively into a separate rowgroup (compressed or delta rowgroups) with an exclusive lock on it. Using TABLOCK will force an exclusive lock on the table and you will not be able to import data in parallel.
Minimal logging: A bulk load uses minimal logging on data that goes directly to compressed rowgroups. Any data that goes to a delta rowgroup is fully logged. This includes any batch sizes that are less than 102,400 rows. However, with bulk loading the goal is for most of the data to bypass delta rowgroups.
Locking optimization: When loading into a compressed rowgroup, the X lock on the rowgroup is acquired. However, when bulk loading into a delta rowgroup, an X lock is acquired at the rowgroup, but SQL Server still takes PAGE/EXTENT locks because the X rowgroup lock is not part of the locking hierarchy.
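As a sketch, each concurrent loading session would then run something like this against its own file, with no TABLOCK (table and file names are placeholders):
-- Sketch: batch size >= 102,400 rows so rows go straight into compressed rowgroups.
BULK INSERT dbo.FactTable            -- table with a clustered columnstore index
FROM 'C:\load\part_01.csv'           -- each session loads a different file
WITH (BATCHSIZE = 102400);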
As of SQL Server 2016, you no longer need to enable trace flag 610 for minimal logging into an indexed table. Quoting Microsoft engineer Parikshit Savjani (emphasis mine):
One of the design goals of SQL Server 2016 was to improve performance and scalability of the engine out of the box to make it run faster without the need of any knobs or trace flags for customers. As a part of these improvements, one of the enhancements made in the SQL Server engine code was turning on bulk load context (also referred to as fast inserts or fast load context) and minimal logging by default when performing bulk load operations on databases with the simple or bulk logged recovery model. If you are not familiar with minimal logging, I would highly recommend reading this blog post from Sunil Agrawal where he explains how minimal logging works in SQL Server. For bulk inserts to be minimally logged, they still need to meet the pre-requisite conditions which are documented here.
As part of these enhancements in SQL Server 2016, you no longer need to enable trace flag 610 for minimal logging into an indexed table and it joins some of the other trace flags (1118, 1117, 1236, 8048) to become part of the history. In SQL Server 2016, when the bulk load operation causes a new page to be allocated, all of the rows sequentially filling that new page are minimally logged if all the other pre-requisites for minimal logging discussed earlier are met. Rows inserted into existing pages (no new page allocation) to maintain index order are still fully logged, as are rows that are moved as a result of page splits during the load. It is also important to have ALLOW_PAGE_LOCKS turned ON for indexes (which is ON by default) for the minimal logging operation to work, as page locks are acquired during allocation and thereby only page or extent allocations are logged.
If you're using SqlBulkCopy in C# or EntityFramework.Extensions (which uses SqlBulkCopy under the hood), then check your build configuration. Are you running your tests in Release mode? Is the Target Architecture set to Any CPU/x64/x86?
Consider using sp_who2 to see if the INSERT BULK transaction is SUSPENDED. It could be SUSPENDED because it is blocked by another spid. Consider reading How To Minimize SQL Server Blocking. You can also use Adam Machanic's sp_WhoIsActive, but sp_who2 will give you the basic information you need.
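A bare-bones sketch of that check; the DMV query below is just a narrower view of the same blocking information sp_who2 shows in its BlkBy column:
-- Sketch: BlkBy in the sp_who2 output shows which spid is blocking.
EXEC sp_who2;

-- Roughly the same information, filtered to blocked or suspended requests:
SELECT session_id, status, command, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0 OR status = N'suspended';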
You might just have bad disk I/O. If you're doing a bulk insert and your disk utilization is not hitting 100%, and is stuck at around 2%, then you probably have either bad firmware or a defective I/O device. (This happened to a coworker of mine.) Use SSD UserBenchmark to compare with others for hardware performance, especially if you can replicate the slowness on your local dev machine. (I put this last in the list because most companies do not allow developers to run databases on their local machine due to IP risk.)
If your table uses compression, you can try running multiple sessions, and in each session, start off with using an existing transaction and run this before the SqlBulkCopy command:
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU=AUTO;
For Continuous Loading, one stream of ideas, first outlined in a Microsoft whitepaper, Partitioned Table and Index Strategies Using SQL Server 2008:
Continuous Loading
In an OLTP scenario, new data may be coming in continuously. If users are querying the newest partition as well, inserting data continuously may lead to blocking: User queries may block the inserts, and similarly, inserts may block the user queries.
Contention on the loading table or partition can be reduced by using snapshot isolation—in particular, the READ COMMITTED SNAPSHOT isolation level. Under READ COMMITTED SNAPSHOT isolation, inserts into a table do not cause activity in the tempdb version store, so the tempdb overhead is minimal for inserts, but no shared locks will be taken by user queries on the same partition.
In other cases, when data is being inserted into a partitioned table continuously at a high rate, you may still be able to stage the data for short periods of time in staging tables and then insert that data into the newest partition repeatedly until the window for the current partition passes and data then is inserted into the next partition. For example, suppose you have two staging tables that receive 30 seconds worth of data each, on an alternate basis: one table for the first half of a minute, the second table for the second half of a minute. An insert stored procedure determines which half of the minute the current insert is in, and then it inserts into the first staging table. When 30 seconds is up, the insert procedure determines it must insert into the second staging table. Another stored procedure then loads the data from the first staging table into the newest partition of the table, and then it truncates the first staging table. After another 30 seconds, the same stored procedure inserts the data from the second staging table and puts it into the current partition, and then it truncates the second staging table.
Microsoft CAT Team's The Data Loading Performance Guide
Make sure your statistics are up to date. Use FULLSCAN if you can after each index build.
SAN Performance Tuning with SQLIO and also make sure if you are using mechanical disks that your disk partitions are aligned. See Microsoft's Disk Partition Alignment Best Practices.
COLUMNSTORE INSERT/UPDATE performance
The reads are likely to be the unique & FK constraints being checked during insert - you may get a speed improvement if you can disable/drop them during the insert & enable/recreate them afterwards. You'll need to test if this makes it slower overall compared to keeping them active. This also may not be a good idea if other processes are writing to the same table concurrently. - Gareth Lyons
According to the Q & A Foreign keys become untrusted after bulk insert, FK constraints become untrusted after a BULK INSERT with no CHECK_CONSTRAINTS option (my case, as I ended up with untrusted constraints). It is not clear, but it would not make sense to check them and still make them untrusted. However, PK and UNIQUE constraints will still be checked (see BULK INSERT (Transact-SQL)). - Alexei
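Following on from Alexei's comment, a sketch for spotting and re-validating untrusted foreign keys after a load (the constraint and table names are placeholders):
-- Sketch: list untrusted foreign keys, then re-validate one of them.
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE is_not_trusted = 1;

ALTER TABLE dbo.BulkLoadTable
    WITH CHECK CHECK CONSTRAINT FK_BulkLoadTable_Parent;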