Can I refactor this query to get it to run in parallel?
The UDF is preventing parallelism. It also is causing that spool.
You could use CLR and a compiled regex to do your search and replace. It doesn't block parallelism as long as the required attributes are present and will likely be significantly faster than performing 300 TSQL REPLACE
operations per function call.
Example code is below.
DECLARE @X XML =
(
SELECT Names AS [@find],
Replacement AS [@replace]
FROM dbo.NamesMultiWord
ORDER BY [WordLength] DESC
FOR XML PATH('x'), ROOT('spec')
);
UPDATE dbo.Deidentified WITH (TABLOCK)
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X),
DE461 = dbo.ReplaceMultiWord(DE461, @X),
DE87 = dbo.ReplaceMultiWord(DE87, @X),
DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1;
This depends on the existence of a CLR UDF as below (the DataAccessKind.None
should mean the spool disappears as well as that is there for Halloween protection and isn't needed as this doesn't access the target table).
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
using System.Collections.Generic;
using System.Xml;
public partial class UserDefinedFunctions
{
//TODO: Concurrency?
private static readonly Dictionary<string, ReplaceSpecification> cachedSpecs =
new Dictionary<string, ReplaceSpecification>();
[SqlFunction(IsDeterministic = true,
IsPrecise = true,
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None)]
public static SqlString ReplaceMultiWord(SqlString inputString, SqlXml replacementSpec)
{
//TODO: Implement something to drop things from the cache and use a shorter key.
string s = replacementSpec.Value;
ReplaceSpecification rs;
if (!cachedSpecs.TryGetValue(s, out rs))
{
var doc = new XmlDocument();
doc.LoadXml(s);
rs = new ReplaceSpecification(doc);
cachedSpecs[s] = rs;
}
string result = rs.GetResult(inputString.ToString());
return new SqlString(result);
}
internal class ReplaceSpecification
{
internal ReplaceSpecification(XmlDocument doc)
{
Replacements = new Dictionary<string, string>();
XmlElement root = doc.DocumentElement;
XmlNodeList nodes = root.SelectNodes("x");
string pattern = null;
foreach (XmlNode node in nodes)
{
if (pattern != null)
pattern = pattern + "|";
string find = node.Attributes["find"].Value.ToLowerInvariant();
string replace = node.Attributes["replace"].Value;
//TODO: Escape any special characters in the regex syntax
pattern = pattern + find;
Replacements[find] = replace;
}
if (pattern != null)
{
pattern = "(?:" + pattern + ")";
Regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
}
}
private Regex Regex { get; set; }
private Dictionary<string, string> Replacements { get; set; }
internal string GetResult(string inputString)
{
if (Regex == null)
return inputString;
return Regex.Replace(inputString,
(Match m) =>
{
string s;
if (Replacements.TryGetValue(m.Value.ToLowerInvariant(), out s))
{
return s;
}
else
{
throw new Exception("Missing replacement definition for " + m.Value);
}
});
}
}
}
Bottom line: Adding criteria to the WHERE
clause and splitting the query into four separate queries, one for each field allowed SQL server to provide a parallel plan and made the query run 4X as fast as it had without the extra test in the WHERE
clause. Splitting the queries into four without the test didn't do that. Neither did adding the test without splitting the queries. Optimizing the test reduced the total run time to 3 minutes (from the original 3 hours).
My original UDF took 3 hours 16 minutes to process 1,174,731 rows, with 1.216 GB of nvarchar data tested. Using the CLR provided by Martin Smith in his answer, the execution plan was still not parallel and the task took 3 hours and 5 minutes.
Having read that WHERE
criteria could help push an UPDATE
to parallel, I did the following. I added a function to the CLR module to see if the field had a match to the regex:
[SqlFunction(IsDeterministic = true,
IsPrecise = true,
DataAccess = DataAccessKind.None,
SystemDataAccess = SystemDataAccessKind.None)]
public static SqlBoolean CanReplaceMultiWord(SqlString inputString, SqlXml replacementSpec)
{
string s = replacementSpec.Value;
ReplaceSpecification rs;
if (!cachedSpecs.TryGetValue(s, out rs))
{
var doc = new XmlDocument();
doc.LoadXml(s);
rs = new ReplaceSpecification(doc);
cachedSpecs[s] = rs;
}
return rs.IsMatch(inputString.ToString());
}
and, in internal class ReplaceSpecification
, I added the code to execute the test against the regex
internal bool IsMatch(string inputString)
{
if (Regex == null)
return false;
return Regex.IsMatch(inputString);
}
If all the fields are tested in a single statement, SQL server does not parallelize the work
UPDATE dbo.DeidentifiedTest
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X),
DE461 = dbo.ReplaceMultiWord(DE461, @X),
DE87 = dbo.ReplaceMultiWord(DE87, @X),
DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1
AND (dbo.CanReplaceMultiWord(IndexedXml, @X) = 1
OR DE15 = dbo.ReplaceMultiWord(DE15, @X)
OR dbo.CanReplaceMultiWord(DE87, @X) = 1
OR dbo.CanReplaceMultiWord(DE15, @X) = 1);
Time to execute over 4 1/2 hours and still running. Execution plan:
However, if the fields are separated into separate statements a parallel work plan is used, and my CPU usage goes from 12% with the serial plans to 100% with the parallel plans (8 cores).
UPDATE dbo.DeidentifiedTest
SET IndexedXml = dbo.ReplaceMultiWord(IndexedXml, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(IndexedXml, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE461 = dbo.ReplaceMultiWord(DE461, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE461, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE87 = dbo.ReplaceMultiWord(DE87, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE87, @X) = 1;
UPDATE dbo.DeidentifiedTest
SET DE15 = dbo.ReplaceMultiWord(DE15, @X)
WHERE InProcess = 1
AND dbo.CanReplaceMultiWord(DE15, @X) = 1;
Time to execute 46 minutes. Row statistics showed that about 0.5% of the records had at least one regex match. Execution plan:
Now, the main drag on time was the WHERE
clause. I then replaced the regex test in the WHERE
clause with the Aho-Corasick algorithm implemented as a CLR. This reduced total time to 3 minutes 6 seconds.
This required the following changes. Load the assembly and functions for the Aho-Corasick algorithm. Change the WHERE
clause to
WHERE InProcess = 1 AND dbo.ContainsWordsByObject(ISNULL(FieldBeingTestedGoesHere,'x'), @ac) = 1;
And add the following before the first UPDATE
DECLARE @ac NVARCHAR(32);
SET @ac = dbo.CreateAhoCorasick(
(SELECT NAMES FROM dbo.NamesMultiWord FOR XML RAW, root('root')),
'en-us:i'
);