Is there a T-SQL equivalent for punctuation as [0-9] is for numbers and [a-z] is for letters?
The biggest difficulty in coming to a precise solution is in defining exactly what characters are to be included (or excluded, whichever direction makes more sense for the operation). Meaning:
- Are we talking about
VARCHAR
/ ASCII data orNVARCHAR
/ Unicode data? The list of punctuation characters for ASCII data depends on the Code Page which in turn depends on the Collation. (in this Question we are dealing with ASCII data). - Are we dealing with case-sensitive or case-insensitive searches?
- What Collation is the column set to? The Collation will tell us both the Code Page and case-sensitivity. (in this question we are dealing with
Latin1_General_CI_AS
) - is the term "punctuation" to mean just standard punctuation characters (e.g.
.
,,
,;
,:
, etc) or does it mean non-alphanumeric characters? - Are whitespace characters included?
- Are Control characters included?
- What about currency symbols such as
¢
,£
,¥
, etc? - What about symbols such as
©
and™
? - What characters are considered "alpha"? Are non-English characters such as
Â
,É
,Ñ
,ß
,Þ
included? - Since this Question deals with UK keyboards (see discussion for this Question), what about the
Æ
/æ
character?
In order to help facilitate clarity regarding expected behavior, the following query will show all 256 characters of the Latin1 character set (i.e. Code Page 1252) and how two variations of @Shaneis's proposed solution operate. The first field (labeled as Latin1_General_CI_AS
) shows the LIKE
clause as proposed by @Shaneis (as of this writing) and the second field (labeled as Latin1_General_100_BIN2
) shows a modification where I overrode the Collation to specify a binary one (i.e. a Collation ending in _BIN2
; the _BIN
Collations are deprecated so don't use them if you have access to the _BIN2
versions) which meant I also needed to add in the A-Z
range to filter out upper-case letters as the current Collation is case-insensitive:
;WITH nums AS
(
SELECT TOP (256) (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1) AS [Decimal]
FROM [master].[sys].[all_objects]
)
SELECT nm.[Decimal],
CHAR(nm.[Decimal]) AS [Character],
CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9]%'
THEN 'x' ELSE '' END AS [Latin1_General_CI_AS],
CASE WHEN CHAR(nm.[Decimal]) LIKE '%[^a-z0-9A-Z]%' COLLATE Latin1_General_100_BIN2
THEN 'x' ELSE '' END AS [Latin1_General_100_BIN2]
FROM nums nm;
UPDATE
It should be mentioned that IF one is truly seeking to find characters that are classified as being "punctuation" (and not "currency symbol", "mathematical symbol", etc), and IF one is not prohibited from using SQLCLR / loading a custom Assembly (SQLCLR was introduced with SQL Server 2005, and I have yet to come across a good reason for not allowing it, especially since Azure SQL Database V12 supports SAFE
Assemblies), then you can use Regular Expressions, but not for the reason that most people would guess.
Rather than using Regular Expressions to build a more functional character range, or even rather than using something like \w
(meaning any "word" character), you can specify the Unicode Category of the characters you want to filter on, and there are several defined categories:
https://www.regular-expressions.info/unicode.html#category
You can even specify the Unicode Block to filter on, such as "InBengali" or "InDingbats" or "InOptical_Character_Recognition", etc:
https://www.regular-expressions.info/unicode.html#block
There are numerous examples of creating RegEx functions for SQL Server (though most examples don't follow SQLCLR best practices), or you can download the Free version of the SQL# library (which I created), and use the scalar RegEx_IsMatch function as follows:
SQL#.RegEx_IsMatch(Unicode-String-Expression, N'\p{P}', 1, NULL)
The \p{P}
expression means \p
= Unicode Category, and {P}
= all punctuation (as opposed to a specific type of punctuation, such as "Connector Punctuation"). AND, the "Punctuation" category includes all punctuation across all languages! You can see the full list on the Unicode.org site via the following link (there are currently 717 Code Points in that category):
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AGeneral_Category%3DPunctuation%3A%5D
An updated version of the test query shown above, including another field that uses SQL#.RegEx_IsMatch with \p{P}
, and the results of all 3 tests across all 256 characters of Code Page 1252 (i.e. Latin1_General) has been posted on PasteBin.com at:
T-SQL query and results for filtering types of characters
UPDATE
The following was mentioned in the related discussion:
You've made a good point about accented characters, with them being hotel names from around the world there will be accented characters in the names, for my problem I would want to class these as valid alpha characters.
In this case:
There are 11 non-English characters that are included in the Latin1 character set / Code Page that are not matched by the
a-z
range. They are:ð Ð Þ þ œ Œ š Š ž Ž Ÿ
. These need to be added to the wildcard, and while not necessary at the moment, it wouldn't hurt to add inA-Z
so that the pattern works just as well on a case-sensitive collation. The end result is:
LIKE '%[^a-zA-Z0-9ðÐÞþœŒšŠžŽŸ]%'
Considering that this data can include "hotel names from around the world", I would highly recommend changing the datatype of the column to be
NVARCHAR
so that you can store all characters from all languages. Keeping this asVARCHAR
runs a very high risk of eventually having data loss since you can only represent the Latin-based languages, and not even fully for those given the six supplemental Unicode categories that provide additional Latin-related characters.
I may be over-simplifying this a bit but, if we say that punctuation is all that is left when alphanumeric values are removed, then the following will search for strings that have non-alphanumeric characters in them.
Create Table #Test
(
Value VarChar(10)
)
Insert Into #Test
Values ('123a'), ('456b'), ('12ABC'),('AB!23'),('C?D789')
-- Original
Select *
From #Test
Where Value like '[0-9][0-9][0-9][a-z]'
-- Non Alpha-numeric
SELECT * FROM #Test WHERE Value LIKE '%[^a-z0-9]%';
DROP TABLE #Test;