Error converting VARCHAR(MAX) to XML due to "UTF-8" attribute
Your XML stored in a `varchar(max)` column should look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<PARAM>
  <TAB DIM="30" ID="ZC3D2_1" SIZE="5">
    <LIN NUM="1">
      <FLD NAME="ZDOC" TYPE="Char">Ferran López</FLD>
    </LIN>
  </TAB>
</PARAM>
```

The `ó` should be represented with the double-byte UTF-8 value, which displays as `Ã³` when the `varchar` bytes are viewed through the Windows-1252 code page.
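A minimal Python sketch of that byte relationship (assuming the column's collation maps to Windows code page 1252):

```python
# "ó" encoded as UTF-8 is two bytes: 0xC3 0xB3
utf8_bytes = "ó".encode("utf-8")
assert utf8_bytes == b"\xc3\xb3"

# Viewed through Windows-1252 (a typical varchar code page),
# those same two bytes display as "Ã³"
assert utf8_bytes.decode("cp1252") == "Ã³"
```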
If the string stored in your column is not actually UTF-8 encoded, the right way to go about this is to remove the encoding declaration from the XML before you convert the value to the XML datatype.
I think you have a deeper problem. UTF-8 allows for more characters than the regular non-Unicode collations in SQL Server. So, to be safe, you should either use SQL Server 2019, which has UTF-8 collations (and I understand if that isn't doable/desirable for many reasons), or try using nvarchar instead of varchar.
If you are afraid of the storage increase going from varchar to nvarchar, you can possibly use row compression. But that requires Enterprise Edition prior to SQL Server 2016.
What's happening here is:

- The `XML` type stores data internally as UTF-16 Little Endian (most of the time, at least). It doesn't matter what the source encoding is; the end result will be UTF-16 LE (and no `<xml>` tag, hence no `encoding="..."`).
- When converting a string to `XML`:
  - It's the bytes of the string that get converted, not the characters (I will explain the difference in a moment).
  - `NVARCHAR` data is assumed to be UTF-16 LE. If there is an `<xml>` tag and it contains the `encoding` attribute, the only valid value is `"UTF-16"`.
  - `VARCHAR` data is assumed to be in the 8-bit code page associated with the collation of the data when there is no `<xml>` tag, or if one exists but there is no `encoding` attribute. Otherwise, the data will be interpreted as being encoded in the code page specified in the `encoding` attribute (even though it is actually encoded in the code page associated with the collation of the data).
- Your data is most likely encoded as Windows code page 1252 (this is determined by the collation of the column that the data resides in, not the collation of the instance or even the database; but since you mention that the instance is using `Latin1_General_BIN`, it's safe enough to assume for the moment that the column is using the same collation).
- The code point for the `ó` character in code page Windows-1252 is 0xF3.
- The `<xml>` tag, however, is declaring that the XML data is encoded as UTF-8.
- In UTF-8, 0xF3 must be followed by three bytes, each being between 0x80 and 0xBF, yet in your data it's followed by a `p`, which has a value of 0x70. Hence you get the "illegal xml character" error (because `encoding="UTF-8"` tells the conversion function that the bytes are valid UTF-8 bytes; the conversion never sees the `ó` character).
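The failure described in the last bullet can be reproduced outside of SQL Server. A minimal Python sketch (the variable name `data` is illustrative):

```python
# "Ferran López" as it would be stored in a VARCHAR column whose
# collation maps to Windows code page 1252: "ó" is the single byte 0xF3
data = "Ferran López".encode("cp1252")

# Interpreting the bytes in the code page they were written in works fine
assert data.decode("cp1252") == "Ferran López"

# Interpreting them as UTF-8 fails: 0xF3 starts a four-byte sequence,
# but the next byte is "p" (0x70), which is not a valid continuation byte
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # mentions "invalid continuation byte"
```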
Your options are:

- Ideally, the column would be converted to `XML`, and the `encoding` attribute of the `<xml>` tag, or the entire `<xml>` tag itself, would be removed on the way in. AND, the `XML` datatype can save space if there are repeating element and/or attribute names, as it creates a dictionary (lookup list) of names internally and records the structure using the ID values.
- Set the `[data]` column to use a UTF-8 collation (new in SQL Server 2019, so not an option for you).
- Set the `[data]` column to be `NVARCHAR`, and remove the `encoding` attribute of the `<xml>` tag, or the entire `<xml>` tag.
- Convert the incoming string into UTF-8 bytes. The `ó` character is two bytes in UTF-8, 0xC3B3, which appear as `Ã³` in Windows-1252:

  ```sql
  DECLARE @Good VARCHAR(MAX) = '<?xml version="1.0" encoding="UTF-8"?><a>hell'
                               + CONVERT(VARCHAR(MAX), 0xC3B3) + '</a>';

  SELECT @Good, CONVERT(XML, @Good);
  -- <?xml version="1.0" encoding="UTF-8"?><a>hellÃ³</a>
  -- <a>helló</a>
  ```
NOTES:
- Simply removing the `encoding` attribute of the `<xml>` tag, or the entire `<xml>` tag, is not an option. Sure, it will work in this particular case, but it won't work in all cases due to the column being `VARCHAR` and UTF-8 collations not being available in SQL Server 2014. Hence, any Unicode characters not available in Windows code page 1252 will be converted to `?` or `??` (depending on whether it is a BMP character or a Supplementary Character):

  ```sql
  DECLARE @Test VARCHAR(MAX) = '<test>ó - ☢ - </test>';

  SELECT @Test, CONVERT(XML, @Test);
  -- <test>ó - ? - ??</test>
  -- <test>ó - ? - ??</test>
  ```
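The `?` vs `??` distinction comes from the string being UTF-16 internally: a supplementary character occupies two UTF-16 code units, so a per-code-unit conversion emits two replacement characters. A rough Python illustration (Python's codecs replace per *character*, so the two-code-unit detail is shown via the UTF-16 byte length; U+1F600 is just an arbitrary example of a supplementary character):

```python
# A BMP character that is missing from Windows-1252 becomes one "?"
assert "☢".encode("cp1252", errors="replace") == b"?"

# A supplementary character is a single character to Python...
emoji = "\U0001F600"
assert len(emoji) == 1

# ...but it takes two UTF-16 code units (4 bytes), which is why a
# per-code-unit conversion to an 8-bit code page produces "??"
assert len(emoji.encode("utf-16-le")) == 4
```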
- Do NOT simply change the collation of the column to a different locale / culture. While that might get rid of the error, it would only accomplish that by silently getting rid of the data that was causing the error. For example, "Cyrillic" uses a different code page than "Latin1_General", and the `ó` character is not available on the Cyrillic code page. But there is a "Best Fit" mapping, which is why we end up with an `o` instead of a `?`:

  ```sql
  DECLARE @Data NVARCHAR(MAX) = N'ó';

  SELECT CONVERT(VARCHAR(MAX), @Data COLLATE Latin1_General_BIN) AS [Latin1_General],
         CONVERT(VARCHAR(MAX), @Data COLLATE Cyrillic_General_CI_AS) AS [Cyrillic];
  /*
  Latin1_General    Cyrillic
  ó                 o
  */
  ```

- You, and anyone working on SQL Server 2008 or newer, really should be using the `_100_` level collations. Additionally, anyone working on SQL Server 2012 or newer should ideally be using the `_100_` level collation that ends with `_SC` (for Supplementary Characters). Finally, when needing a binary collation on SQL Server 2005 or newer, use one ending in `_BIN2` (see my post here as to why).
- This issue has nothing to do with whether the query is ad hoc or in a stored procedure (T-SQL or SQLCLR).