SQL Server - Remove all non-printable ASCII characters
Here is a similar answer of the previous inline-table valued function answer (https://stackoverflow.com/a/43148897/2864740) for this question. The primary change is this uses a patindex
guard first and is much faster when only a small percentage of the rows contain characters that need to be replaced.
Thus the two huge improvements from the original scalar function:
Use an inline table-valued function. This is much faster as it allows SQL Server to directly in-line the code in the query plan. I try to avoid scalar functions in queries designed to scale, as a normal scalar function can be a huge performance sap (even with schemabinding) and prevent optimizations such as parallelism.
Use
patindex
for an initial guard check. This changes the number of characters SQL must exmine when there are no control characters to replace fromO(n * num_replace_calls)
to~O(n)
. Since most of the data data (in my case) contains no control characters, this can result in a huge performance increase.
-- Only accepts VARCHAR(8000) to avoid a conversion to VARCHAR(MAX);
-- use the suitable input type, which might even be NVARCHAR(MAX).
CREATE FUNCTION DropControlCharactersTv(@str VARCHAR(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT CleanedString = CASE
-- No-op.
WHEN @str IS NULL or @str = '' THEN @str
-- If any of the non-wanted characters are found then go through the string and replace each occurrence of every character.
-- This guard significantly improves the performance when very few strings need to be corrected.
WHEN PATINDEX (
'%[' + CHAR(0) + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + CHAR(5) + CHAR(6) + CHAR(7) + CHAR(8) + CHAR(11) + CHAR(12) + CHAR(14) + CHAR(15) + CHAR(16) + CHAR(17) + CHAR(18) + CHAR(19) + CHAR(20) + CHAR(21) + CHAR(22) + CHAR(23) + CHAR(24) + CHAR(25) + CHAR(26) + CHAR(27) + CHAR(28) + CHAR(29) + CHAR(30) + CHAR(31) + CHAR(127) + ']%',
@str COLLATE Latin1_General_BIN) <> 0 THEN
-- Replace, nested.
-- See https://www.sqlservercentral.com/forums/topic/how-to-remove-characters-char0-to-char31
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
@str
,CHAR(0),'') COLLATE Latin1_General_BIN
,CHAR(1),'') COLLATE Latin1_General_BIN
,CHAR(2),'') COLLATE Latin1_General_BIN
,CHAR(3),'') COLLATE Latin1_General_BIN
,CHAR(4),'') COLLATE Latin1_General_BIN
,CHAR(5),'') COLLATE Latin1_General_BIN
,CHAR(6),'') COLLATE Latin1_General_BIN
,CHAR(7),'') COLLATE Latin1_General_BIN
,CHAR(8),'') COLLATE Latin1_General_BIN
,CHAR(9),'') COLLATE Latin1_General_BIN
,CHAR(10),'') COLLATE Latin1_General_BIN
,CHAR(11),'') COLLATE Latin1_General_BIN
,CHAR(12),'') COLLATE Latin1_General_BIN
,CHAR(13),'') COLLATE Latin1_General_BIN
,CHAR(14),'') COLLATE Latin1_General_BIN
,CHAR(15),'') COLLATE Latin1_General_BIN
,CHAR(16),'') COLLATE Latin1_General_BIN
,CHAR(17),'') COLLATE Latin1_General_BIN
,CHAR(18),'') COLLATE Latin1_General_BIN
,CHAR(19),'') COLLATE Latin1_General_BIN
,CHAR(20),'') COLLATE Latin1_General_BIN
,CHAR(21),'') COLLATE Latin1_General_BIN
,CHAR(22),'') COLLATE Latin1_General_BIN
,CHAR(23),'') COLLATE Latin1_General_BIN
,CHAR(24),'') COLLATE Latin1_General_BIN
,CHAR(25),'') COLLATE Latin1_General_BIN
,CHAR(26),'') COLLATE Latin1_General_BIN
,CHAR(27),'') COLLATE Latin1_General_BIN
,CHAR(28),'') COLLATE Latin1_General_BIN
,CHAR(29),'') COLLATE Latin1_General_BIN
,CHAR(30),'') COLLATE Latin1_General_BIN
,CHAR(31),'') COLLATE Latin1_General_BIN
,CHAR(127),'') COLLATE Latin1_General_BIN
-- Did not match pattern: inherently valid
ELSE @str END
In a query:
select
Plucker.CleanedString
from Goose d
cross apply DropControlCharactersTv(d.Turkey) as Plucker
In-line version:
create function [dbo].[remove_non_printable_chars] (@input_string nvarchar(max))
returns table with schemabinding as return (
select
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(@input_string collate latin1_general_100_bin2,
char(1), ''),char(2), ''),char(3), ''),char(4), ''),char(5), ''),char(6), ''),char(7), ''),char(8), ''),char(9), ''),char(10), ''),
char(11), ''),char(12), ''),char(13), ''),char(14), ''),char(15), ''),char(16), ''),char(17), ''),char(18), ''),char(19), ''),char(20), ''),
char(21), ''),char(22), ''),char(23), ''),char(24), ''),char(25), ''),char(26), ''),char(27), ''),char(28), ''),char(29), ''),char(30), ''),
char(31), ''), char(0) , '')
as clean_string
);
go
And use it like so:
select c.clean_string
from dbo.remove_non_printable_chars(@dirtystring) c
or
select ...
, c.clean_string
from t
cross apply dbo.remove_non_printable_chars(t.dirty_string) c
Reference:
- When is a SQL function not a function? "If it’s not inline, it’s rubbish." - Rob Farley
- Inline Scalar Functions - Itzik Ben-Gan
- Scalar functions, inlining, and performance: An entertaining title for a boring post - Adam Machanic
- TSQL User-Defined Functions: Ten Questions You Were Too Shy To Ask - Robert Sheldon
Another Option.
This function will replace control characters and correct any residual repeating spaces. For example Jane Smith{13}was here
will not be returned as Jane Smithwas here
, but rather Jane Smith was here
CREATE FUNCTION [dbo].[udf-Str-Strip-Control](@S varchar(max))
Returns varchar(max)
Begin
;with cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
cte2(C) As (Select Top (32) Char(Row_Number() over (Order By (Select NULL))-1) From cte1 a,cte1 b)
Select @S = Replace(@S,C,' ')
From cte2
Return ltrim(rtrim(replace(replace(replace(@S,' ','†‡'),'‡†',''),'†‡',' ')))
End
--Select [dbo].[udf-Str-Strip-Control]('Michael '+char(13)+char(10)+'LastName') --Returns: Michael LastName