Byte ordering for multibyte characters in SQL Server versus Oracle
What you are seeing is the Little-Endian encoding that SQL Server uses to store Unicode characters (more precisely, it uses UCS-2 LE).
More on Little-Endian here: Difference between Big Endian and little Endian Byte order
I don't know how it was possible that "When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character". All Unicode characters stored in SQL Server, when converted to binary, are "inverted": to see the real code points, divide the bytes into groups of 2 and reverse the byte order within every pair.
Example:
declare @str varchar(3) = 'abc';
declare @str_n nvarchar(3) = N'abc';
select cast(@str as varbinary(3));
select cast(@str_n as varbinary(6));
The result is
0x616263
0x610062006300
As you can see, in the case of Unicode characters the bytes are inverted: "a" is represented as 0x6100 and not as 0x0061.
The same applies to 0x25E6, which is the real Unicode code point, while in the binary representation in SQL Server you see it as 0xE625, i.e. inverted.
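For instance, a minimal check against the White Bullet character itself (built with NCHAR(0x25E6) so the script does not depend on the editor's encoding) shows the swapped byte order directly:
-- U+25E6 (White Bullet) stored as NVARCHAR: the two bytes come out
-- least-significant byte first, i.e. 0xE625 instead of 0x25E6
declare @bullet nchar(1) = NCHAR(0x25E6);
select cast(@bullet as varbinary(2));   -- returns 0xE625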
The collation of an NVARCHAR / NCHAR / NTEXT column has no bearing on the encoding used to store the data in that column. NVARCHAR data is always UTF-16 Little Endian (LE). The collation of NVARCHAR data only impacts sorting and comparison. Collation does impact the encoding of VARCHAR data since the collation determines the code page used to store the data in that column / variable / literal, but we aren't dealing with that here.
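A quick sketch of that difference (the specific collations below are just illustrative picks, one per code page): the NVARCHAR bytes stay the same under any collation, while the VARCHAR bytes depend on the collation's code page.
-- N'Ж' (U+0416) is always stored as 0x1604 (UTF-16 LE), regardless of collation:
SELECT CAST(N'Ж' COLLATE Cyrillic_General_CI_AS AS VARBINARY(2)) AS [Cyrillic],
       CAST(N'Ж' COLLATE Latin1_General_CI_AS   AS VARBINARY(2)) AS [Latin1];
-- both columns return 0x1604

-- Converting to VARCHAR uses the code page of the collation:
SELECT CAST(CAST(N'Ж' COLLATE Cyrillic_General_CI_AS AS VARCHAR(1)) AS VARBINARY(1)) AS [CP1251],
       CAST(CAST(N'Ж' COLLATE Latin1_General_CI_AS   AS VARCHAR(1)) AS VARBINARY(1)) AS [CP1252];
-- returns 0xC6 for Code Page 1251, and 0x3F ('?') for Code Page 1252, which has no mapping for "Ж"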
As sepupic mentioned, what you are seeing when you view the data in binary form is a difference in endianness (Oracle is using Big Endian while SQL Server is using Little Endian). HOWEVER, what you are seeing when you view the binary form of the string in Oracle is not how the data is actually being stored. You are using AL32UTF8, which is UTF-8 and encodes that character in 3 bytes, not 2, as: E2, 97, A6.
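As a side note, if you happen to be on SQL Server 2019 or newer, you can reproduce that 3-byte UTF-8 form in SQL Server itself by converting to VARCHAR with one of the UTF-8 collations (the collation below is just one such option):
-- Convert the NVARCHAR (UTF-16 LE) value to a UTF-8 VARCHAR and view the bytes:
DECLARE @Bullet NCHAR(1) = NCHAR(0x25E6);
SELECT CAST(CAST(@Bullet COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS VARCHAR(3)) AS VARBINARY(3)) AS [UTF8Bytes];
-- returns 0xE297A6, i.e. the same E2, 97, A6 that Oracle stores for AL32UTF8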
Also, it is not possible for the hashes to be the same for rows containing just "a" but not for rows that include "◦", unless the hashing in Oracle was done without any conversion (hence using the UTF-8 encoding) while the hashing in SQL Server accidentally converted the value to VARCHAR first. Otherwise there is no hash algorithm that will behave as you are describing, as you can verify by running the following in SQL Server:
DECLARE @Algorithm NVARCHAR(50) = N'MD4';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'MD5';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA1';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_256';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_512';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
In Oracle, you should use the CONVERT function to get the string into the AL16UTF16LE encoding, and then hash that value. That should match up to what SQL Server has. For example, you can see the different encoding forms of White Bullet (U+25E6), and how using CONVERT along with AL16UTF16LE corrects the byte order, on dbfiddle and below:
SELECT DUMP(CHR(14849958), 1016) AS "UTF8",
       DUMP(CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT(CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;

SELECT DUMP('a' || CHR(14849958), 1016) AS "UTF8",
       DUMP('a' || CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT('a' || CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;
That returns:
UTF8: Typ=1 Len=3 CharacterSet=AL32UTF8: e2,97,a6
UTF16BE: Typ=1 Len=2 CharacterSet=AL16UTF16: 25,e6
UTF16LE: Typ=1 Len=2 CharacterSet=AL16UTF16: e6,25
UTF8: Typ=1 Len=4 CharacterSet=AL32UTF8: 61,e2,97,a6
UTF16BE: Typ=1 Len=4 CharacterSet=AL16UTF16: 0,61,25,e6
UTF16LE: Typ=1 Len=4 CharacterSet=AL16UTF16: 61,0,e6,25
As you can see in the 3rd column, the character set is misreported as being Big Endian when it is clearly Little Endian based on the order of the two bytes. You can also see that both characters are two bytes in UTF-16, and the order of both of them is different between Big and Little Endian, not just the characters that are > 1 byte in UTF-8.
Given all of this, since the data is being stored as UTF-8 yet you are seeing it as UTF-16 Big Endian via the DUMP function, it seems like you are already converting it to UTF-16, but probably not realizing that the default UTF-16 in Oracle is Big Endian.
Looking at the "UTF-16" definition on the Glossary page of the Oracle documentation, it states (I broke the following sentences into two parts so it would be easier to distinguish between BE and LE):
AL16UTF16 implements the big-endian encoding scheme of the UTF-16 encoding form (more significant byte of each code unit comes first in memory). AL16UTF16 is a valid national character set.
and:
AL16UTF16LE implements the little-endian UTF-16 encoding scheme. It is a conversion-only character set, valid only in character set conversion functions such as SQL CONVERT or PL/SQL UTL_I18N.STRING_TO_RAW.
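Putting that together, here is a rough sketch of the Oracle-side hashing (assuming Oracle 12c or later for DBMS_CRYPTO.HASH_SH256, that you have EXECUTE on DBMS_CRYPTO, and with your_table / your_column as placeholders): convert the string to its UTF-16 LE bytes first, then hash those bytes, which should then line up with HASHBYTES over the NVARCHAR value in SQL Server.
-- Get the UTF-16 Little Endian byte form of the string, then hash those exact bytes:
SELECT DBMS_CRYPTO.HASH(
         UTL_I18N.STRING_TO_RAW(your_column, 'AL16UTF16LE'),  -- RAW in UTF-16 LE
         DBMS_CRYPTO.HASH_SH256                               -- compare to HASHBYTES(N'SHA2_256', ...)
       ) AS "UTF16LE_SHA256"
FROM   your_table;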
P.S. Since you are using AL32UTF8 in Oracle, you should be using the Latin1_General_100_CI_AS_SC collation in SQL Server, not Latin1_General_CI_AS. The one you are using is older and doesn't fully support Supplementary Characters (no data loss if they exist, but built-in functions handle them as 2 characters instead of a single entity).
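For example (a small illustration, using an emoji as a stand-in for any supplementary character), LEN only counts a UTF-16 surrogate pair as a single character under a supplementary-character-aware ("_SC") collation:
-- U+1F600 is a supplementary character, stored as a surrogate pair (4 bytes) in UTF-16:
SELECT LEN(N'😀' COLLATE Latin1_General_CI_AS)        AS [OldCollation],  -- 2
       LEN(N'😀' COLLATE Latin1_General_100_CI_AS_SC) AS [SC_Collation];  -- 1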