Replace special characters in a column with space
If you are guaranteed to only ever use the 26 letters of the US English alphabet (both upper-case and lower-case versions), then sure, you can get away with using `LIKE` and/or `PATINDEX` with the simple range notation of `[a-z]` (you wouldn't need to use an upper-case "Z" when using a case-insensitive Collation). But if you might get characters not found in the en-US alphabet yet available in various Code Pages / Collations for `VARCHAR` data (e.g. `Þ` = Latin capital "Thorn" = `SELECT CHAR(0xDE)`), then you might need to include those in the character class: `[a-z0-9, Þ]`. Of course, exactly which extra characters those are varies per Code Page.
Also, please be aware that both the Collation type (SQL Server vs. Windows) and the sensitivity settings (case, accent, etc., sensitive vs. insensitive) will affect which characters are included in a particular range. For example, the SQL Server Collations sort upper-case and lower-case letters in the opposite order from the Windows Collations. Meaning, assuming a case-sensitive Collation for both types, one will sort `AaBb...` and the other `aAbB...`. The effect is that `a` will be within the range `A-Z` for one of them, but not the other. And the range `a-Z` won't match any characters at all in a binary Collation (one ending in either `_BIN` or `_BIN2`, but don't use `_BIN`), given that the value of `A` is 65 and `a` is 97, making 97 through 65 an invalid range ;-). There are far too many variations to give examples for here, so I will try to post a detailed explanation on my blog sometime soon (and will then update this with the link to it). However, if you are going to be strict about only accepting US English characters (even if you might get valid letters from other languages), then your best option will probably be to use the following pattern and Collation:
LIKE '%[^A-Za-z0-9, ]%' COLLATE Latin1_General_100_BIN2
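As a quick illustration outside of SQL Server (Python's `re` bracket-class syntax is close enough to `LIKE`'s for this point; the sample strings are made up), you can see both the character class in action and why `a-Z` is rejected as a range when characters are compared by raw code point, as a binary collation does:

```python
import re

# The class [^A-Za-z0-9, ] matches any character outside the allowed
# set; replacing each match with a space mimics the cleanup.
print(re.sub(r'[^A-Za-z0-9, ]', ' ', 'abc$def%ghi, ok'))  # abc def ghi, ok

# Under strictly binary (code-point) ordering, 'a'-'Z' runs from 97
# down to 90, which is not a valid range -- the regex engine rejects it.
print(ord('Z'), ord('a'))  # 90 97
try:
    re.compile(r'[a-Z]')
except re.error as exc:
    print('invalid range:', exc)
```

The same 97-through-65 (or 97-through-90) backwards-range problem is exactly what makes `a-Z` match nothing under a `_BIN2` Collation.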
Now, if you are supporting `NVARCHAR` data and can get "word" characters from various languages, then T-SQL will not be of much help, as it has no real way to differentiate these things. In this case, you should use a Regular Expression (RegEx) -- specifically the `Replace` method / function -- and those are only available through SQLCLR. The following shows an example of replacing several "special" characters, yet leaving all that are valid letters in at least one language:
DECLARE @Test NVARCHAR(500);
SET @Test = N'this$is%a<>TEST,;to}⌡↕strip╞╟╚══¶out_ç_ƒ▀ special-ij-೫-chars-舛-დ-א-B';
SELECT SQL#.RegEx_Replace4k(@Test, N'[\W\p{Pc}-[,]]', N' ', -1, 1, NULL);
Returns:
this is a TEST, to strip out ç ƒ special ij ೫ chars 舛 დ א B
The RegEx expression means:

- `\W` = a RegEx "escape" meaning "any non-word character"
- `\p{Pc}` = a Unicode "category" of "Punctuation, Connector" (this is needed in the match only because this "category" is specifically excluded by the `\W` escape)
- `-[,]` = class subtraction (this is needed to exclude commas from matching as "special", since they are otherwise included by the `\W` escape)
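If you don't have SQLCLR handy and just want to see that matching rule in action, it can be approximated with Python's standard `unicodedata` module (this is a sketch of the rule using Unicode general categories, not the SQL# implementation; .NET's `\w` covers letters, decimal digits, combining marks, and connector punctuation):

```python
import unicodedata

def strip_special(text: str) -> str:
    r"""Approximate the .NET class [\W\p{Pc}-[,]]: replace anything that
    is not a word character, plus connector punctuation such as '_',
    with a space -- but never touch the comma."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        # .NET \w ~ letters (L*), decimal digits (Nd), combining marks
        # (Mn) and connector punctuation (Pc); \p{Pc} adds Pc back into
        # the match, so Pc is deliberately NOT treated as a word char here.
        is_word = cat.startswith('L') or cat in ('Nd', 'Mn')
        out.append(ch if (ch == ',' or is_word) else ' ')
    return ''.join(out)

test = 'this$is%a<>TEST,;to}⌡↕strip╞╟╚══¶out_ç_ƒ▀ special-ij-೫-chars-舛-დ-א-B'
# Collapse runs of spaces only for display, as the rendered answer does:
print(' '.join(strip_special(test).split()))
# this is a TEST, to strip out ç ƒ special ij ೫ chars 舛 დ א B
```

Note that each matched character becomes its own space, so adjacent "special" characters produce runs of spaces; the single-spaced result shown above (and in the answer) is just how the output displays once whitespace is collapsed.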
You can do an update of a table simply by issuing:
UPDATE tbl
SET tbl.field = SQL#.RegEx_Replace4k(tbl.field, N'[\W\p{Pc}-[,]]', N' ', -1, 1, NULL)
FROM tbl
WHERE SQL#.RegEx_IsMatch4k(tbl.field, N'[\W\p{Pc}-[,]]', 1, NULL) = 1;
Please note that for these examples, I used two functions available in the Free version of the SQL# library of SQLCLR functions, which I created (but again, they are free). Also note that I used the "4k" versions, which are faster due to using `NVARCHAR(4000)` instead of `NVARCHAR(MAX)` parameter types. If your data is `NVARCHAR(MAX)`, then just remove the "4k" from the function names.
Please also see:
- RegEx Character classes
I've got a post here that does something similar.
Basically, I'm using a recursive CTE to loop over the string again and again, replacing one "bad" character at a time. I'm using STUFF to overwrite one character (the code below replaces it with a space, though you could use it to strip the character instead) and PATINDEX to find the location of the character I want to remove. You could modify it slightly to do what you are looking for. Note, however, that it builds a "good" list; it doesn't actually update the existing list.
DECLARE @Pattern varchar(50) = '%[^A-Za-z0-9, ]%';
WITH FixBadChars AS
(
    SELECT StringToFix, StringToFix AS FixedString, 1 AS MyCounter, Id
    FROM   BadStringList
    UNION ALL
    SELECT StringToFix,
           Stuff(FixedString, PatIndex(@Pattern, FixedString COLLATE Latin1_General_BIN2), 1, ' ') AS FixedString,
           MyCounter + 1, Id
    FROM   FixBadChars
    WHERE  FixedString COLLATE Latin1_General_BIN2 LIKE @Pattern
)
SELECT StringToFix, FixedString, MyCounter, Id
FROM FixBadChars
WHERE MyCounter =
(SELECT MAX(MyCounter)
FROM FixBadChars Fixed
WHERE Fixed.Id = FixBadChars.Id)
OPTION (MAXRECURSION 1000);
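The one-character-at-a-time loop that the recursive CTE implements can be sketched procedurally (Python here, purely to illustrate the algorithm; the search plays the role of `PATINDEX` and the slice-and-rebuild plays the role of `STUFF`):

```python
import re

BAD_CHAR = re.compile(r'[^A-Za-z0-9, ]')  # same class as @Pattern

def fix_string(s: str) -> str:
    """Mimic the recursive CTE: each pass finds the first 'bad'
    character (PATINDEX) and overwrites it with a space (STUFF),
    repeating until the string no longer matches the pattern."""
    while True:
        m = BAD_CHAR.search(s)   # PATINDEX: first offending position
        if m is None:            # LIKE no longer matches -> recursion stops
            return s
        i = m.start()
        s = s[:i] + ' ' + s[i + 1:]  # STUFF(s, i + 1, 1, ' ')

print(fix_string('this$is%a*TEST, ok'))  # this is a TEST, ok
```

Each iteration here corresponds to one level of recursion in the CTE, which is why a long run of bad characters in a single value eats into the `MAXRECURSION` budget.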
You should be able to modify the bottom part to do an update rather than just a query, but I haven't actually tried it. Note that a recursive CTE containing UNION ALL isn't directly updatable, so the UPDATE needs to target the base table through a join to the CTE (and the CTE must be defined in the same statement). I'm fairly sure it would look something like this:
WITH FixBadChars AS
(
    ... -- same recursive CTE body as above
)
UPDATE b
SET    b.StringToFix = f.FixedString
FROM   BadStringList b
INNER JOIN FixBadChars f ON f.Id = b.Id
WHERE  f.MyCounter = (SELECT MAX(Fixed.MyCounter)
                      FROM   FixBadChars Fixed
                      WHERE  Fixed.Id = f.Id)
OPTION (MAXRECURSION 1000);
As far as scalability goes, I returned ~170k cleaned rows in under 30 seconds. Again, I'm not sure about the update, but this was on my laptop, which is fairly slow with only 6 GB of RAM.