Byte ordering for multibyte characters in SQL Server versus Oracle
What you are seeing is the Little-Endian encoding that SQL Server uses to store Unicode characters (more precisely, it uses UCS-2 LE).
More on Little-Endian here: Difference between Big Endian and little Endian Byte order
I don't know how it was possible that "When I use DUMP in Oracle or convert to VARBINARY in SQL Server the data matches exactly except for the bytes for this character". All Unicode characters stored in SQL Server, when converted to binary, are "inverted": to see the real code points, divide the bytes into groups of 2 and reverse the byte order within every pair.
Example:
declare @str varchar(3) = 'abc';
declare @str_n nvarchar(3) = N'abc';
select cast(@str as varbinary(3));
select cast(@str_n as varbinary(6));
The result is
0x616263
0x610062006300
As you can see, in the case of Unicode characters the bytes are inverted: "a" is represented as 0x6100 and not as 0x0061.
The same applies to 0x25E6, which is the real Unicode code point, while in the binary representation in SQL Server you see it as 0xE625, i.e. inverted.
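For instance, a minimal check against the White Bullet character itself (built with NCHAR(0x25E6) so the script does not depend on the editor's encoding) shows the swapped byte order directly:
-- U+25E6 (White Bullet) stored as NVARCHAR: the two bytes come out
-- least-significant byte first, i.e. 0xE625 instead of 0x25E6
declare @bullet nchar(1) = NCHAR(0x25E6);
select cast(@bullet as varbinary(2));   -- returns 0xE625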
The collation of an NVARCHAR / NCHAR / NTEXT column has no bearing on the encoding used to store the data in that column. NVARCHAR data is always UTF-16 Little Endian (LE). The collation of NVARCHAR data only impacts sorting and comparison. Collation does impact the encoding of VARCHAR data since the collation determines the code page used to store the data in that column / variable / literal, but we aren't dealing with that here.
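A quick sketch of that difference (the specific collations below are just illustrative picks, one per code page): the NVARCHAR bytes stay the same under any collation, while the VARCHAR bytes depend on the collation's code page.
-- N'Ж' (U+0416) is always stored as 0x1604 (UTF-16 LE), regardless of collation:
SELECT CAST(N'Ж' COLLATE Cyrillic_General_CI_AS AS VARBINARY(2)) AS [Cyrillic],
       CAST(N'Ж' COLLATE Latin1_General_CI_AS   AS VARBINARY(2)) AS [Latin1];
-- both columns return 0x1604

-- Converting to VARCHAR uses the code page of the collation:
SELECT CAST(CAST(N'Ж' COLLATE Cyrillic_General_CI_AS AS VARCHAR(1)) AS VARBINARY(1)) AS [CP1251],
       CAST(CAST(N'Ж' COLLATE Latin1_General_CI_AS   AS VARCHAR(1)) AS VARBINARY(1)) AS [CP1252];
-- returns 0xC6 for Code Page 1251, and 0x3F ('?') for Code Page 1252, which has no mapping for "Ж"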
As sepupic mentioned, what you are seeing when you view the data in binary form is a difference in endianness (Oracle is using Big Endian while SQL Server is using Little Endian). HOWEVER, what you are seeing when you view the binary form of the string in Oracle is not how the data is actually being stored. You are using AL32UTF8, which is UTF-8 and encodes that character in 3 bytes, not 2, as: E2, 97, A6.
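As a side note, if you happen to be on SQL Server 2019 or newer, you can reproduce that 3-byte UTF-8 form in SQL Server itself by converting to VARCHAR with one of the UTF-8 collations (the collation below is just one such option):
-- Convert the NVARCHAR (UTF-16 LE) value to a UTF-8 VARCHAR and view the bytes:
DECLARE @Bullet NCHAR(1) = NCHAR(0x25E6);
SELECT CAST(CAST(@Bullet COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS VARCHAR(3)) AS VARBINARY(3)) AS [UTF8Bytes];
-- returns 0xE297A6, i.e. the same E2, 97, A6 that Oracle stores for AL32UTF8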
Also, it is not possible for the hashes to be the same for rows containing just "a" but not for rows that include "◦", unless the hashing in Oracle was done without any conversion (hence using the UTF-8 encoding) while the hashing in SQL Server accidentally converted the value to VARCHAR first. Otherwise there is no hash algorithm that will behave as you are describing, as you can verify by running the following in SQL Server:
DECLARE @Algorithm NVARCHAR(50) = N'MD4';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'MD5';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA1';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_256';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
SET @Algorithm = N'SHA2_512';
SELECT HASHBYTES(@Algorithm, 0x3100), HASHBYTES(@Algorithm, 0x0031);
In Oracle, you should use the CONVERT function to get the string into the AL16UTF16LE encoding, and then hash that value. That should match up to what SQL Server has. For example, you can see the different encoding forms of White Bullet (U+25E6), and how using CONVERT along with AL16UTF16LE corrects the byte order, on dbfiddle and below:
SELECT DUMP(CHR(14849958), 1016) AS "UTF8",
       DUMP(CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT(CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;

SELECT DUMP('a' || CHR(14849958), 1016) AS "UTF8",
       DUMP('a' || CHR(9702 USING NCHAR_CS), 1016) AS "UTF16BE",
       DUMP(CONVERT('a' || CHR(9702 USING NCHAR_CS), 'AL16UTF16LE'), 1016) AS "UTF16LE"
FROM DUAL;
That returns:
UTF8: Typ=1 Len=3 CharacterSet=AL32UTF8: e2,97,a6
UTF16BE: Typ=1 Len=2 CharacterSet=AL16UTF16: 25,e6
UTF16LE: Typ=1 Len=2 CharacterSet=AL16UTF16: e6,25
UTF8: Typ=1 Len=4 CharacterSet=AL32UTF8: 61,e2,97,a6
UTF16BE: Typ=1 Len=4 CharacterSet=AL16UTF16: 0,61,25,e6
UTF16LE: Typ=1 Len=4 CharacterSet=AL16UTF16: 61,0,e6,25
As you can see in the 3rd column, the character set is misreported as being Big Endian when it is clearly Little Endian based on the order of the two bytes. You can also see that both characters are two bytes in UTF-16, and the order of both of them is different between Big and Little Endian, not just the characters that are > 1 byte in UTF-8.
Given all of this, since the data is being stored as UTF-8 yet you are seeing it as UTF-16 Big Endian via the DUMP function, it seems like you are already converting it to UTF-16, but probably not realizing that the default UTF-16 in Oracle is Big Endian.
Looking at the "UTF-16" definition on the Glossary page of the Oracle documentation, it states (I broke the following sentences into two parts so it would be easier to distinguish between BE and LE):
AL16UTF16 implements the big-endian encoding scheme of the UTF-16 encoding form (more significant byte of each code unit comes first in memory). AL16UTF16 is a valid national character set.
and:
AL16UTF16LE implements the little-endian UTF-16 encoding scheme. It is a conversion-only character set, valid only in character set conversion functions such as SQL CONVERT or PL/SQL UTL_I18N.STRING_TO_RAW.
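Putting that together, here is a rough sketch of the Oracle-side hashing (assuming Oracle 12c or later for DBMS_CRYPTO.HASH_SH256, that you have EXECUTE on DBMS_CRYPTO, and with your_table / your_column as placeholders): convert the string to its UTF-16 LE bytes first, then hash those bytes, which should then line up with HASHBYTES over the NVARCHAR value in SQL Server.
-- Get the UTF-16 Little Endian byte form of the string, then hash those exact bytes:
SELECT DBMS_CRYPTO.HASH(
         UTL_I18N.STRING_TO_RAW(your_column, 'AL16UTF16LE'),  -- RAW in UTF-16 LE
         DBMS_CRYPTO.HASH_SH256                               -- compare to HASHBYTES(N'SHA2_256', ...)
       ) AS "UTF16LE_SHA256"
FROM   your_table;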
P.S. Since you are using AL32UTF8 in Oracle, you should be using the Latin1_General_100_CI_AS_SC collation in SQL Server, not Latin1_General_CI_AS. The one you are using is older and doesn't fully support Supplementary Characters (no data loss if they exist, but built-in functions handle them as 2 characters instead of a single entity).
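For example (a small illustration, using an emoji as a stand-in for any supplementary character), LEN only counts a UTF-16 surrogate pair as a single character under a supplementary-character-aware ("_SC") collation:
-- U+1F600 is a supplementary character, stored as a surrogate pair (4 bytes) in UTF-16:
SELECT LEN(N'😀' COLLATE Latin1_General_CI_AS)        AS [OldCollation],  -- 2
       LEN(N'😀' COLLATE Latin1_General_100_CI_AS_SC) AS [SC_Collation];  -- 1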