For SQL Server, how to fix simultaneous parallel table updates?
Given the code in your answer, you would most likely improve performance by making the following two changes:
1. Start the query batch with BEGIN TRAN and end the batch with COMMIT TRAN.
2. Decrease the number of updates per batch to fewer than 5000 to avoid lock escalation (which generally occurs at 5000 locks). Try 4500.
Doing those two things should decrease the massive amount of tran log writes and lock / unlock operations that you are currently generating by doing individual DML statements.
Example:
conn.Open();
using (SqlCommand cmd = new SqlCommand(
@"BEGIN TRAN;
UPDATE [TestTable] SET Column5 = 'some unique value' WHERE ID = 1;
UPDATE [TestTable] SET Column5 = 'some unique value' WHERE ID = 2;
...
UPDATE [TestTable] SET Column5 = 'some unique value' WHERE ID = 4500;
COMMIT TRAN;
", conn))
{
    cmd.ExecuteNonQuery();
}
UPDATE:
The Question is a bit sparse on the details. The example code is only shown in an answer.
One area of confusion is that the description mentions updating two columns, yet the example code only shows a single column being updated. My answer above was based on the code, hence it only shows a single column. If there really are two columns to update, then both columns should be updated in the same UPDATE statement:
conn.Open();
using (SqlCommand cmd = new SqlCommand(
@"BEGIN TRAN;
UPDATE [TestTable]
SET Column5 = 'some unique value',
ColumnN = 'some unique value'
WHERE ID = 1;
UPDATE [TestTable]
SET Column5 = 'some unique value',
    ColumnN = 'some unique value'
WHERE ID = 2;
...
UPDATE [TestTable]
SET Column5 = 'some unique value',
    ColumnN = 'some unique value'
WHERE ID = 4500;
COMMIT TRAN;
", conn))
{
    cmd.ExecuteNonQuery();
}
Another issue that is unclear is where the "unique" data is coming from. The Question mentions that the unique values are GUIDs. Are these being generated in the app layer? Are they coming from another data source that the app layer knows about and the database does not? This is important because, depending on the answers to these questions, it might make sense to ask:
1. Can the GUIDs be generated in SQL Server instead?
2. If yes to #1, then is there any reason to generate this code from app code instead of doing a simple batch loop in T-SQL?
If "yes" to #1 but the code, for whatever reason, needs to be generated in .NET, then you can use NEWID() and generate UPDATE statements that work on ranges of rows, in which case you do not need the BEGIN TRAN / COMMIT TRAN since each statement can handle all 4500 rows in one shot:
conn.Open();
using (SqlCommand cmd = new SqlCommand(
@"UPDATE [TestTable]
SET Column5 = NEWID(),
ColumnN = NEWID()
WHERE ID BETWEEN 1 and 4500;
", conn))
{
    cmd.ExecuteNonQuery();
}
If "yes" to #1 and there is no real reason for these UPDATEs to be generated in .NET, then you can just do the following:
DECLARE @BatchSize INT = 4500, -- this could be an input param for a stored procedure
        @RowsAffected INT = 1, -- needed to enter loop
        @StartingID INT = 1; -- initial value

WHILE (@RowsAffected > 0)
BEGIN
    UPDATE TOP (@BatchSize) tbl
    SET tbl.Column5 = NEWID(),
        tbl.ColumnN = NEWID()
    FROM [TestTable] tbl
    WHERE tbl.ID BETWEEN @StartingID AND (@StartingID + @BatchSize - 1);

    SET @RowsAffected = @@ROWCOUNT;
    SET @StartingID += @BatchSize;
END;
The code above only works if the ID values are not sparse, or at least if the values do not have gaps larger than @BatchSize, such that there is at least 1 row updated in each iteration. This code also assumes that the ID field is the Clustered Index. These assumptions seem reasonable given the provided example code. However, if the ID values do have large gaps, or if the ID field is not the Clustered Index, then you can just test for rows that do not already have a value:
DECLARE @BatchSize INT = 4500, -- this could be an input param for a stored procedure
        @RowsAffected INT = 1; -- needed to enter loop

WHILE (@RowsAffected > 0)
BEGIN
    UPDATE TOP (@BatchSize) tbl
    SET tbl.Column5 = NEWID(),
        tbl.ColumnN = NEWID()
    FROM [TestTable] tbl
    WHERE tbl.Column5 IS NULL; -- only rows that have not been populated yet

    SET @RowsAffected = @@ROWCOUNT;
END;
BUT, if "no" to #1 and the values are coming from .NET for a good reason, such as the unique values for each ID already existing in another source, then you can still speed this up (beyond my initial suggestion) by supplying a derived table:
conn.Open();
using (SqlCommand cmd = new SqlCommand(
@"BEGIN TRAN;
UPDATE tbl
SET tbl.Column5 = tmp.Col1,
tbl.ColumnN = tmp.Col2
FROM [TestTable] tbl
INNER JOIN (VALUES
(1, 'some unique value A', 'some unique value B'),
(2, 'some unique value C', 'some unique value D'),
...
(1000, 'some unique value N1', 'some unique value N2')
) tmp (ID, Col1, Col2)
ON tmp.ID = tbl.ID;
UPDATE tbl
SET tbl.Column5 = tmp.Col1,
tbl.ColumnN = tmp.Col2
FROM [TestTable] tbl
INNER JOIN (VALUES
(1001, 'some unique value A2', 'some unique value B2'),
(1002, 'some unique value C2', 'some unique value D2'),
...
(2000, 'some unique value N3', 'some unique value N4')
) tmp (ID, Col1, Col2)
ON tmp.ID = tbl.ID;
COMMIT TRAN;
", conn))
{
    cmd.ExecuteNonQuery();
}
I believe the limit on the number of rows that can be joined via VALUES is 1000, so I grouped two sets together in an explicit transaction. You could test with up to 4 sets of these UPDATEs to do 4000 rows per transaction and keep below the limit of 5000 locks.
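For illustration only (this is not from the question or my code above), here is a rough sketch of how the app layer could build these batched statements. The (Id, Val1, Val2) data source, the hard-coded table and column names, and the use of .NET 6's Enumerable.Chunk are assumptions:

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Text;

// Sketch: group rows into 1000-row VALUES lists (the row-constructor limit)
// and 4 lists per transaction to stay under the ~5000-lock escalation threshold.
// Assumes 'conn' is already open and the values are plain strings.
static void UpdateInBatches(SqlConnection conn, IEnumerable<(int Id, string Val1, string Val2)> rows)
{
    foreach (var tranChunk in rows.Chunk(4000))
    {
        var sql = new StringBuilder("BEGIN TRAN;\n");

        foreach (var valuesChunk in tranChunk.Chunk(1000))
        {
            sql.AppendLine("UPDATE tbl SET tbl.Column5 = tmp.Col1, tbl.ColumnN = tmp.Col2");
            sql.AppendLine("FROM [TestTable] tbl");
            sql.AppendLine("INNER JOIN (VALUES");
            sql.AppendLine(string.Join(",\n", valuesChunk.Select(r =>
                $"({r.Id}, '{r.Val1.Replace("'", "''")}', '{r.Val2.Replace("'", "''")}')")));
            sql.AppendLine(") tmp (ID, Col1, Col2) ON tmp.ID = tbl.ID;");
        }

        sql.AppendLine("COMMIT TRAN;");

        using (var cmd = new SqlCommand(sql.ToString(), conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}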
Based on your own answer, it looks like:
- You're updating the first and second empty columns in separate update statements
- The empty columns are varchar datatype
I don't have enough rep yet on DBA to comment (I originally saw the version of this you cross-posted on Stack Overflow), so I will answer on that assumption.
If so, you're possibly making a mistake that's common among people who come to SQL from procedural languages: thinking about SQL tables procedurally, updating each row and column one at a time.
SQL wants you to do set-based operations, where you tell SQL what you want done to all rows in one query/statement. The SQL Server query engine can then work out internally the best way to make that change happen to all rows. By doing updates row-at-a-time, you're preventing SQL Server from doing the thing it does best.
It's possible you're well aware of this and the nature of the values you have to update makes row-by-row updates essential, but even then I think you could be updating both columns for a single table row in one go, halving the total number of updates you have to do.
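For example (a sketch only; the parameter values and the Column5/Column6 names are assumed from the examples below), both columns can be set in a single parameterized statement per row:

// Sketch: one round trip sets both columns for a row, instead of two separate
// single-column UPDATEs. Assumes 'conn' is an already-open SqlConnection.
using (var cmd = new SqlCommand(
    "UPDATE TestTable SET Column5 = @val5, Column6 = @val6 WHERE id = @id;", conn))
{
    cmd.Parameters.AddWithValue("@id", 1);
    cmd.Parameters.AddWithValue("@val5", "some unique value A");
    cmd.Parameters.AddWithValue("@val6", "some unique value B");
    cmd.ExecuteNonQuery();
}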
If you have a degree of flexibility over what the unique values in your columns are, you can probably update one entire table with a single SQL query. For true GUID values, a query along the lines of:
update TestTable
set Column5 = NEWID()
,Column6 = NEWID()
will give you unique uniqueidentifier values in each cell. NEWID() is documented in Microsoft's T-SQL reference if you've not seen it before. You would then just need to repeat this query 150 times for the separate tables, which could be parallelized easily; I'd bet on it being massively faster too.
Or, if you need something number-based, you can apply unique numbers like this:
with cteNumbers as (
    select Column5
          ,Column6
          ,Row_Number() over (order by id) as RowNo
    from TestTable
)
update cteNumbers
set Column5 = RowNo
   ,Column6 = RowNo
Though I suspect that's not what you're trying to do; and in any case, if your id column is an autoincrementing int, you could just use that directly rather than generating a Row_Number() over it.
If you want something based on incrementing numbers but not just consisting of the number, you can build an expression around the RowNo to achieve what you want, e.g. set Column5 = 'Val-' + cast(RowNo as varchar(20)).
Wherever possible, use set-based operations; they are essential for efficient SQL performance.
Found the solution. :)
Instead of running the 40k update queries at once (I create an update script of 40k update statements, as stated in the comment above), if I decrease that number to half, to 20k update queries at once, there is a huge improvement: with 10 tables in parallel it now takes a total of 1.3 minutes, so I can continue.
The code which does the update has now been changed to run 20k statements at a time.
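Roughly, the pattern looks like this (just a sketch, not the actual script; the pre-generated statement list is an assumption):

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

// Illustration only: send the pre-generated statements for one table
// to SQL Server in chunks of 20k instead of all 40k-50k at once.
static void RunInChunks(SqlConnection conn, List<string> updateStatements, int chunkSize = 20000)
{
    foreach (var chunk in updateStatements.Chunk(chunkSize))
    {
        using (var cmd = new SqlCommand(string.Join("\n", chunk), conn))
        {
            cmd.CommandTimeout = 0; // big batches can exceed the default 30-second timeout
            cmd.ExecuteNonQuery();
        }
    }
}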
So basically, previously it was running 10 (threads) x 40k update queries = 400k simultaneous update queries in the first run, and then the remaining 10 (threads) x 10k update queries, to update all 50k records in those 10 different tables.
And, now it does:
- 10 (threads) X 20k update queries = 200k simultaneous update queries
- 10 (threads) X 20k update queries = 200k simultaneous update queries
- 10 (threads) X 10k update queries = 100k update queries
Result: Before: 13 minutes, After: 1.8 minutes
I am now checking to find out the best (fastest!) combination to update those 150 tables using multiple threads at the same time. I can probably update a higher number of tables in parallel with a smaller batch per table, like 5k (down from 20k), but I will be busy testing that now.