Can I optimize this MERGE statement?
Personally, I don't like MERGE because there are many unresolved bugs:

- This blog post (scroll down to "Other MERGE issues")
- This answer by AlexKuznetsov
- These posts on Paul White's blog
- I published a cautionary tip about MERGE here, including references to several bugs that haven't been addressed.
- And this list may not even be exhaustive - given the number of bugs discovered by this small set of users, I have to wonder how many haven't been discovered yet?
MERGE also gives a false sense of security in terms of optimistic concurrency and race conditions. See Dan Guzman's blog post for more details.
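To make the race condition concrete: the canonical upsert pattern below only behaves atomically with an explicit HOLDLOCK (serializable) hint, which is the crux of Dan's point. This is a minimal sketch against a hypothetical dbo.Target table, not code from the question:

-- Hypothetical table for illustration:
--   CREATE TABLE dbo.Target (ID INT PRIMARY KEY, Val VARCHAR(32));
DECLARE @ID INT = 1, @Val VARCHAR(32) = 'example';

-- Without WITH (HOLDLOCK), two concurrent sessions can both decide the
-- row doesn't exist and both take the INSERT branch, producing a key
-- violation; MERGE by itself does not make check-then-act atomic.
MERGE dbo.Target WITH (HOLDLOCK) AS t
USING (SELECT @ID AS ID, @Val AS Val) AS s
    ON t.ID = s.ID
WHEN MATCHED THEN
    UPDATE SET t.Val = s.Val
WHEN NOT MATCHED THEN
    INSERT (ID, Val) VALUES (s.ID, s.Val);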
I'm not trying to be a fear-monger here, but I also find the syntax unintuitive and daunting, so I would only use it in cases where it's actually needed and I can prove that I'm not affected by any of the above issues. And I don't know what I would ever gain from using it for an operation that could only ever end in an UPDATE anyway.
So here is how I would do it instead, using syntax that is much more familiar to me:
DECLARE @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()); -- cutoff used throughout

;WITH s AS
(
    -- most recent session (by CreateDate) per visitor within the window
    SELECT VisitorID, UserIpAddress FROM
    (
        SELECT
            VisitorID,
            UserIpAddress,
            rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
        FROM dbo.VisitorSession
        WHERE UserIpAddress IS NOT NULL
            AND CreateDate > @dt
    ) AS x
    WHERE rn = 1
)
UPDATE c
    SET c.UserIpAddress = s.UserIpAddress
FROM dbo.ShoppingCart AS c
INNER JOIN s
    ON c.VisitorID = s.VisitorID;
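If you want to sanity-check the change before running it, the same CTE works with the UPDATE swapped for a SELECT, so you can eyeball which carts would change and to what (a quick sketch, reusing @dt from above):

-- Preview: same "latest session per visitor" logic, no data modified.
;WITH s AS
(
    SELECT VisitorID, UserIpAddress FROM
    (
        SELECT
            VisitorID,
            UserIpAddress,
            rn = ROW_NUMBER() OVER (PARTITION BY VisitorID ORDER BY CreateDate DESC)
        FROM dbo.VisitorSession
        WHERE UserIpAddress IS NOT NULL
            AND CreateDate > @dt
    ) AS x
    WHERE rn = 1
)
SELECT c.VisitorID,
       OldIp = c.UserIpAddress,
       NewIp = s.UserIpAddress
FROM dbo.ShoppingCart AS c
INNER JOIN s
    ON c.VisitorID = s.VisitorID;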
You can also break this operation into chunks to reduce the impact on the transaction log which may, in turn, reduce the overall duration. I blogged about this here.
Here is how I would handle that approach:
DECLARE
    @dt DATE = DATEADD(MONTH, -6, SYSDATETIME()),
    @rc INT  = 1;

WHILE @rc > 0
BEGIN
    BEGIN TRANSACTION;

    ;WITH s AS
    (
        -- next batch of up to 100,000 "latest session per visitor" rows
        -- that would actually change the cart's stored IP address
        SELECT TOP (100000) VisitorID, UserIpAddress FROM
        (
            SELECT
                vs.VisitorID,
                vs.UserIpAddress,
                rn = ROW_NUMBER() OVER (PARTITION BY vs.VisitorID ORDER BY vs.CreateDate DESC)
            FROM dbo.VisitorSession AS vs
            WHERE vs.UserIpAddress IS NOT NULL
                AND vs.CreateDate > @dt
                AND EXISTS
                (
                    -- only sessions whose cart still needs updating, so each
                    -- pass shrinks the remaining work and the loop terminates
                    SELECT 1 FROM dbo.ShoppingCart AS c
                    WHERE c.VisitorID = vs.VisitorID
                        AND (c.UserIpAddress <> vs.UserIpAddress
                             OR c.UserIpAddress IS NULL)
                )
        ) AS x
        WHERE rn = 1
    )
    UPDATE c
        SET c.UserIpAddress = s.UserIpAddress
    FROM dbo.ShoppingCart AS c
    INNER JOIN s
        ON c.VisitorID = s.VisitorID;

    SET @rc = @@ROWCOUNT;

    COMMIT TRANSACTION;
END
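One tweak worth considering with this loop, assuming the database is in the SIMPLE recovery model: give the log a chance to clear between batches so no single pass forces it to grow. Something like this right after the COMMIT:

-- After COMMIT TRANSACTION, inside the loop (SIMPLE recovery assumed):
CHECKPOINT; -- lets the log space used by the committed batch be reused

-- Under the FULL recovery model, a log backup serves the same purpose:
-- BACKUP LOG YourDatabase TO DISK = N'...'; (placeholder name/path)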
Of course, as the blog post illustrates, you can gain just about as much time by making sure your log is big enough to handle the entire transaction without having to grow - most of the delay likely comes from many, many autogrow operations accommodating your large transaction. Sadly, until you've done the operation once, it can be very difficult to guess how much transaction log you'll need...
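If you'd rather pre-size the log than discover the number by trial, you can check the current log file and grow it up front; a sketch with placeholder names ('YourDatabase' and 'YourDatabase_log' stand in for your actual database and logical log file name):

-- Find the logical name and current size of the log file:
SELECT name, size * 8 / 1024 AS size_mb
FROM sys.database_files
WHERE type_desc = 'LOG';

-- Grow it once, ahead of the big update, instead of paying for many
-- small autogrow events mid-transaction:
ALTER DATABASE YourDatabase
MODIFY FILE (NAME = YourDatabase_log, SIZE = 8192MB); -- size is a guess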
I ran a test against Aaron Bertrand's Common Table Expression (CTE). The performance was nearly identical (07:56 for the CTE versus 07:55 for the MERGE; only testing once each).
Next, I attempted Aaron's chunked statement (to prevent blocking). However, I ran into an issue with the records being regrouped on every pass; on such a large table, that constraint dropped performance to unacceptable levels. On the upside, his formulation gave me an idea for reducing the running time while avoiding MERGE.
Switching to a CTE was an easy choice. By reducing the result set from the smaller joining table first (the one with 9,000,000 records), the total number of records to join on, and update, dropped significantly.
/* results from shared development server:
run #1: 00:52 with 193,379 rows affected
run #2: 00:56 with 193,379 rows affected
run #3: 02:19 with 193,379 rows affected
run #4: 00:15 with 193,379 rows affected
run #5: 01:59 with 193,379 rows affected
run #6: 01:14 with 193,379 rows affected
*/
declare @dt datetime = cast(dateadd(month, -6, sysdatetime()) as date);

;with cte as (
    select distinct -- 00:26, with 179,160 rows
        visitorid
    from ShoppingCart as sc with (nolock)
    where CreateDate > @dt
)
update ShoppingCart
set UserIpAddress = desired.UserIPAddress
from ShoppingCart
join (
    select * -- 02:36 with 179,016 rows
    from (
        select
            vs.visitorid
            ,vs.useripaddress
            ,row_number() over
                (partition by vs.visitorid order by vs.createdate desc) as rownumber
        from visitorsession as vs with (nolock)
        join cte on vs.VisitorID = cte.VisitorID
        where vs.CreateDate > @dt
            and vs.UserIPAddress is not null
    ) as subTbl
    where rownumber = 1
) as desired on ShoppingCart.VisitorID = desired.VisitorID
where ShoppingCart.CreateDate > @dt;
Feeding a much smaller result set from the CTE into the sub-select radically improved the performance of the UPDATE. Preserving the six-month timeframe constraint at each step also kept the result set small, which helped SQL Server optimize the query. The final query typically runs for about 1 minute in the development environment. In production, it will most likely run in much less time, and I don't expect any requests to time out with our current configuration.
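For anyone trying to reproduce timings like the ones in the comment block above, one way to capture them without external tooling is SET STATISTICS TIME, which reports CPU and elapsed time per statement (a minimal sketch, assuming you run the update from the same session):

SET STATISTICS TIME ON; -- prints CPU and elapsed time per statement
SET STATISTICS IO ON;   -- prints logical/physical reads per table

-- ... run the UPDATE here ...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;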