Accent Sensitive Sort
The behavior you are seeing here is due, in a general sense, to the fact that the Unicode Collation Algorithm (UCA) allows for complex, multi-level sorting. More specifically:
Sorting is not Comparison:
Determining whether two strings are the same or different is fairly straight forward (given a particular locale/language and set of sensitivities). But determining the order of 2 or more strings can be highly complex.
Sorting is done in a series of steps, with each step applied to the entire string, not character by character:
- Standard: sort base characters (regardless of accent and case differences)
- IF Accent-sensitive, apply accent / diacritic weights
- IF Case-sensitive, apply casing weights
When you sort by col1
(single character), it first determines that all characters have the same weight since they are all "e". Next, it applies the accent / diacritic weights. There are no casing differences so the third step wouldn't change anything. So the only differences are in step 2, which is why there is a preferred order for those rows based on col1
.
When you sort by col2
(two characters), it first determines that each row has a different weight since both characters are used to determine the sort weight (e.g. "ea", "eb", etc). Next, it applies the accent / diacritic weights. There are no casing differences so the third step wouldn't change anything. So there are differences in steps 1 and 2 this time. But since the differences in step 1 have already been applied to each string before the weights of step 2 are considered, the weights from step 2 don't have any effect on the ordering; they would only apply if weights from step 1 for two or more rows were the same.
The following adaptation of the sample code from the question hopefully illustrates the sorting behavior described above. I added some additional rows and an additional column to help show the impact of the Collation being case-sensitive (since the original sample data is all lower-case):
SETUP
USE [tempdb];
-- DROP TABLE dbo.OddSort;
CREATE TABLE dbo.OddSort
(
id INT IDENTITY(1,1) PRIMARY KEY,
col1 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
col2 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS,
col3 NVARCHAR(5) COLLATE Latin1_General_100_CS_AS
);
GO
INSERT dbo.OddSort (col1, col2, col3)
VALUES (N'e', N'eA', N'e A')
, (N'ê', N'êE', N'ê E')
, (N'é', N'éH', N'é H')
, (N'ë', N'ëC', N'ë C')
, (N'E', N'EG', N'E G')
, (N'Ë', N'ëh', N'ë h')
, (N'è', N'èD', N'è D')
, (N'é', N'éB', N'é B')
, (N'ë', N'ëH', N'ë H')
, (N'ē', N'ēF', N'ē F');
TEST 1
SELECT [id], [col1], UNICODE([col1]) AS [CodePoint]
FROM dbo.OddSort
ORDER BY col1;
Returns:
id col1 CodePoint
1 e 101
5 E 69
8 é 233
3 é 233
7 è 232
2 ê 234
4 ë 235
9 ë 235
6 Ë 203
10 ē 275
What we can see in the results above:
- The Code Point is not determining the sort order
- The non-accented characters are sorted before the accented characters (within the same letter: f would still come after all of these). Clearly, accent weights are applied before case weights.
- Lower-case sorts before upper-case within the same accented (or non-accented) character (i.e. the e then E, and the ë then Ë). This tailoring is used by most of the Windows Collations, while most of the SQL Server Collations sort upper-case first.
TEST 2
SELECT [id], [col2]
FROM dbo.OddSort
ORDER BY col2;
Returns:
id col2
1 eA
8 éB
4 ëC
7 èD
2 êE
10 ēF
5 EG
3 éH
6 ëh
9 ëH
What we can see in the results above:
- First-level sorting truly is the base characters. If it were accents / diacritics then the ëC (id = 4), ēF (id = 10), and EG (id = 5) rows would not be where they are. If it were casing, then the EG (id = 5) row would not be where it is.
- Second-level sorting truly is the accents / diacritics. This explains why the last three rows are éH -> ëh -> ëH instead of ëh -> éH -> ëH (i.e. IDs 3 -> 6 -> 9 instead of 6 -> 3 -> 9).
- Third-level sorting truly is the casing. This is why the last 2 rows are ëh -> ëH, since lower-case sorts first.
TEST 3
SELECT [id], [col3]
FROM dbo.OddSort
ORDER BY col3;
Returns:
id col3
1 e A
8 é B
4 ë C
7 è D
2 ê E
10 ē F
5 E G
3 é H
6 ë h
9 ë H
What we can see in the results above:
- The sort order is exactly the same as in Test 2. The only difference in the test values here is that there is a space between each character, removing the possibility of contextual rules. Hence, we know that the reason for the difference in sort order for
col2
in the question is again due to "Multi-Level Comparison", and not "Contextual Sensitivity".
Additional notes:
With regarding to getting the exact rules, that is not as easy as it should be. The problem with getting concrete explanations of these rules is that the Unicode sorting rules, while definitely documented, are a recommendation. It is up to vendors, such as Microsoft, to implement those recommendations. Microsoft did not implement the recommendations exactly as stated in the Unicode documentation so there is a disconnect there (similar to how neither the HTML or CSS specifications are implemented as completely, nor even in the same way, across vendors). Then, there are different versions of the Windows Collations (you are using version
100
which came out with SQL Server 2008) and that is tied to a Unicode version that is much older than the current version of Unicode or of the ICU Collation demo is using. For example, the What's New in SQL Server 2008 Collations section of the "Collation and Unicode Support" documentation for SQL Server 2008 makes 2 very interesting points about what is "new" to the_100_
series Collations:Unicode 5.0 case table.
Unicode 5.0 was published in July of 2006 (well, the character database was released then, and the full spec followed in late 2006). The current version is 10.0 which was published in June of 2017. And given the release pattern of the past 4 years, it is likely that version 11.0 will be out sometime mid-2018.
Weighting has been added to previously non-weighted characters that would have compared equally.
Those weights were more than likely defined in the Unicode Standard, just not in this implementation of it.
Still, the UCA documentation linked above is a good place to start.Sort Keys used by Windows / .NET / SQL Server are not exactly the same as shown in the Unicode Standard (See @Patrick's answer) or implemented in the ICU. To see what Windows / .NET / SQL Server use, you can try the CompareInfo.GetSortKey Method. I created a SQLCLR UDF to pass in these values and get the sort key. Please note that I am using SQL Server 2017 on Windows 10 with .NET Framework 4.5 - 4.6.1 installed, so .NET should be using Unicode 6.0.0. Also, Level4 is not being used for these strings.
CHAR L1 L2 L3 L4 e 0E21 E 0E21 12 ë 0E21 13 Ë 0E21 13 12
Looking at these sort keys for Test 1, and realizing that the levels are sorted like multiple columns within an
ORDER BY
clause (L3 is sorted within same values of L2, which is sorted within same values of L1), should illustrate that the reason for the behavior noted in the question is in fact the multi-level sorting capability of Unicode. Likewise:CHAR L1 L2 L3 L4 EG 0E210E25 1212 éH 0E210E2C 0E 0212 ëh 0E210E2C 13 ëH 0E210E2C 13 0212
Looking at some of the sort keys for Test 2, we can see that base characters are sorted first (L1), then accents are sorted (L2), and then casing is sorted (L3).
Since the datatype is
NVARCHAR
, we are only concerned with Unicode Code Points and sorting algorithms, hence the use of theUNICODE()
function in TEST 1. While Code Pages are specified by most Collations, they only pertain toVARCHAR
data. Meaning, while Code Page 1252 is specified by theLatin1_General*
series of Collations, that can be ignored here.The weights described in the Default Unicode Collation Element Table (DUCET) (Version 5.0.0 which should map to the
_100_
series Collations) are fine for US English, but not other locales / languages. Other languages need to start with the DUCET and then apply locale-specific override rules as defined by the Common Locale Data Repository (CLDR) project. From what I can tell, versions 1.4 / 1.4.1 were released in 2006. To get those overrides, download the CLDR 1.4 "core" file via http://unicode.org/Public/cldr/1.4.0/core.zip, then, in that zip file, go to the collation folder and find the XML file corresponding to the locale being used. Those files contain just the overrides, and are not complete sets of collation rules.
This question is not so related to databases but more on Unicode handling and rules.
Based on https://docs.microsoft.com/en-us/sql/t-sql/statements/windows-collation-name-transact-sql Latin1_General_100_CS_AS means: "Collation uses the Latin1 General dictionary sorting rules and maps to code page 1252" with the added CS = Case Sensitive and AS = Accent Sensitive.
The mapping between Windows code page 1252 and Unicode (http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT) show the same values for all characters we are dealing with (except e with macron that does not exist in Microsoft mapping, so no idea what it does with this case), so we can concentrate on Unicode tools and terminology for now.
First, let us know precisely what we are dealing with, for all your strings:
0065 LATIN SMALL LETTER E
0041 LATIN CAPITAL LETTER A
00E9 LATIN SMALL LETTER E WITH ACUTE
0042 LATIN CAPITAL LETTER B
00EB LATIN SMALL LETTER E WITH DIAERESIS
0043 LATIN CAPITAL LETTER C
00E8 LATIN SMALL LETTER E WITH GRAVE
0044 LATIN CAPITAL LETTER D
00EA LATIN SMALL LETTER E WITH CIRCUMFLEX
0045 LATIN CAPITAL LETTER E
0113 LATIN SMALL LETTER E WITH MACRON
0046 LATIN CAPITAL LETTER F
The Unicode Collation Algorithm is described here: https://www.unicode.org/reports/tr10/
Have a look at section 1.3 "Contextual Sensitivity" that explains that the sorting can not depend on just one character after the other as some rules are context sensitive.
Note also these points in 1.8:
Collation is not a property of strings. Collation order is not preserved under concatenation or substring operations, in general.
By default, the algorithm makes use of three fully-customizable levels. For the Latin script, these levels correspond roughly to:
alphabetic ordering
diacritic ordering
case ordering.
But the algorithm by itself is a little dense. The gist of it is: Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
You can view the specific Latin sorting rules here: http://developer.mimer.com/collations/charts/latin.htm or more directly and specifically for MS SQL: http://collation-charts.org/mssql/mssql.0409.1252.Latin1_General_CS_AS.html
For the e
character it shows:
e E é É è È ê Ê ë Ë
This explains your results when ordering on col1
except that ē does not exists in code page 1252, so I have absolutely no ideas what it does with it.
Or if we do the Unicode algorithm by hand, using the keys value of DUCET at http://www.unicode.org/Public/UCA/latest/allkeys.txt :
step 1: Normalization form D, so each case becomes:
e => U+0065
é => U+0065 U+0301
ë => U+0065 U+0308
è => U+0065 U+0300
ê => U+0065 U+0302
ē => U+0065 U+0304
step 2, Produce collation arrays (lookup in file allkeys.txt
)
e => [.1D10.0020.0002]
é => [.1D10.0020.0002] [.0000.0024.0002]
ë => [.1D10.0020.0002] [.0000.002B.0002]
è => [.1D10.0020.0002] [.0000.0025.0002]
ê => [.1D10.0020.0002] [.0000.0027.0002]
ē => [.1D10.0020.0002] [.0000.0032.0002]
step 3, Form sort keys (for each level, take each value inside each collation array, then put 0000 as delimitator and start again for next level)
e => 1D10 0000 0020 0000 0002
é => 1D10 0000 0020 0024 0000 0002 0002
ë => 1D10 0000 0020 002B 0000 0002 0002
è => 1D10 0000 0020 0025 0000 0002 0002
ê => 1D10 0000 0020 0027 0000 0002 0002
ē => 1D10 0000 0020 0032 0000 0002 0002
step 4, Compare sort keys (simple binary comparison of each value one by one): The fourth value is enough to sort them all, so the final order becomes:
e
é
è
ê
ë
ē
In the same way for ordering on col2
:
step 1 : NFD
eA => U+0065 U+0041
éB => U+0065 U+0301 U+0042
ëC => U+0065 U+0308 U+0043
èD => U+0065 U+0300 U+0044
êE => U+0065 U+0302 U+0045
ēF => U+0065 U+0304 U+0046
step 2 : Collation arrays
eA => [.1D10.0020.0002] [.1CAD.0020.0008]
éB => [.1D10.0020.0002] [.0000.0024.0002] [.1CC6.0020.0008]
ëC => [.1D10.0020.0002] [.0000.002B.0002] [.1CE0.0020.0008]
èD => [.1D10.0020.0002] [.0000.0025.0002] [.1CF5.0020.0008]
êE => [.1D10.0020.0002] [.0000.0027.0002] [.1D10.0020.0008]
ēF => [.1D10.0020.0002] [.0000.0032.0002] [.1D4B.0020.0008]
step 3 : Form sort keys
eA => 1D10 1CAD 0000 0020 0020 0000 0002 0008
éB => 1D10 1CC6 0000 0020 0024 0020 0000 0002 0002 0008
ëC => 1D10 1CE0 0000 0020 002B 0020 0000 0002 0002 0008
èD => 1D10 1CF5 0000 0020 0025 0020 0000 0002 0002 0008
êE => 1D10 1D10 0000 0020 0027 0020 0000 0002 0002 0008
ēF => 1D10 1D4B 0000 0020 0032 0020 0000 0002 0002 0008
step 4 : Compare sort keys: The second value is enough to sort them all, and it is in fact already in increasing order, so the final order is indeed:
eA
éB
ëC
èD
êE
ēF
Update: adding Solomon Rutzky third case, which is trickier because of the space that enables new rules (I chose the "non-ignorable case"):
step 1, NFD:
è 1 => U+0065 U+0300 U+0020 U+0031
ê 5 => U+0065 U+0302 U+0020 U+0035
e 2 => U+0065 U+0020 U+0032
é 4 => U+0065 U+0301 U+0020 U+0034
ē 3 => U+0065 U+0304 U+0020 U+0033
ë 6 => U+0065 U+0308 U+0020 U+0036
step 2, Produce collation arrays:
è 1 => [.1D10.0020.0002] [.0000.0025.0002] [*0209.0020.0002] [.1CA4.0020.0002]
ê 5 => [.1D10.0020.0002] [.0000.0027.0002] [*0209.0020.0002] [.1CA8.0020.0002]
e 2 => [.1D10.0020.0002] [*0209.0020.0002] [.1CA5.0020.0002]
é 4 => [.1D10.0020.0002] [.0000.0024.0002] [*0209.0020.0002] [.1CA7.0020.0002]
ē 3 => [.1D10.0020.0002] [.0000.0032.0002] [*0209.0020.0002] [.1CA6.0020.0002]
ë 6 => [.1D10.0020.0002] [.0000.002B.0002] [*0209.0020.0002] [.1CA9.0020.0002]
step 3, Form sort keys:
è 1 => 1D10 0209 1CA4 0000 0020 0025 0020 0020 0000 0002 0002 0002 0002
ê 5 => 1D10 0209 1CA8 0000 0020 0027 0020 0020 0000 0002 0002 0002 0002
e 2 => 1D10 0209 1CA5 0000 0020 0020 0020 0000 0002 0002 0002
é 4 => 1D10 0209 1CA7 0000 0020 0024 0020 0020 0000 0002 0002 0002 0002
ē 3 => 1D10 0209 1CA6 0000 0020 0032 0020 0020 0000 0002 0002 0002 0002
ë 6 => 1D10 0209 1CA9 0000 0020 002B 0020 0020 0000 0002 0002 0002 0002
step 4, Compare sort keys:
Basically the third value determines the order, and it is in fact only based on the last digit, so the order should be:
è 1
e 2
ē 3
é 4
ê 5
ë 6
Second update based on Solomon Rutzky's comment about Unicode versions.
I used the allkeys.txt
data about latest Unicode version at this time, that is version 10.0
If we need to take instead into account Unicode 5.1, this would be: http://www.unicode.org/Public/UCA/5.1.0/allkeys.txt
I just checked, for all the characters above, the collation arrays are the following instead:
e => [.119D.0020.0002.0065]
é => [.119D.0020.0002.0065] [.0000.0032.0002.0301]
ë => [.119D.0020.0002.0065] [.0000.0047.0002.0308]
è => [.119D.0020.0002.0065] [.0000.0035.0002.0300]
ê => [.119D.0020.0002.0065] [.0000.003C.0002.0302]
ē => [.119D.0020.0002.0065] [.0000.005B.0002.0304]
and:
eA => [.119D.0020.0002.0065] [.1141.0020.0008.0041]
éB => [.119D.0020.0002.0065] [.0000.0032.0002.0301] [.1157.0020.0008.0042]
ëC => [.119D.0020.0002.0065] [.0000.0047.0002.0308] [.116F.0020.0008.0043]
èD => [.119D.0020.0002.0065] [.0000.0035.0002.0300] [.1182.0020.0008.0044]
êE => [.119D.0020.0002.0065] [.0000.003C.0002.0302] [.119D.0020.0008.0045]
ēF => [.119D.0020.0002.0065] [.0000.005B.0002.0304] [.11D5.0020.0008.0046]
and:
è 1 => [.119D.0020.0002.0065] [.0000.0035.0002.0300] [*0209.0020.0002.0020] [.1138.0020.0002.0031]
ê 5 => [.119D.0020.0002.0065] [.0000.003C.0002.0302] [*0209.0020.0002.0020] [.113C.0020.0002.0035]
e 2 => [.119D.0020.0002.0065] [*0209.0020.0002.0020] [.1139.0020.0002.0032]
é 4 => [.119D.0020.0002.0065] [.0000.0032.0002.0301] [*0209.0020.0002.0020] [.113B.0020.0002.0034]
ē 3 => [.119D.0020.0002.0065] [.0000.005B.0002.0304] [*0209.0020.0002.0020] [.113A.0020.0002.0033]
ë 6 => [.119D.0020.0002.0065] [.0000.0047.0002.0308] [*0209.0020.0002.0020] [.113D.0020.0002.0036]
which then compute to the following sorting keys:
e => 119D 0000 0020 0000 0002 0000 0065
é => 119D 0000 0020 0032 0000 0002 0002 0000 0065 0301
ë => 119D 0000 0020 0047 0000 0002 0002 0000 0065 0308
è => 119D 0000 0020 0035 0000 0002 0002 0000 0065 0300
ê => 119D 0000 0020 003C 0000 0002 0002 0000 0065 0302
ē => 119D 0000 0020 005B 0000 0002 0002 0000 0065 0304
and:
eA => 119D 1141 0000 0020 0020 0000 0002 0008 0000 0065 0041
éB => 119D 1157 0000 0020 0032 0020 0000 0002 0002 0008 0000 0065 0301 0042
ëC => 119D 116F 0000 0020 0047 0020 0000 0002 0002 0008 0000 0065 0308 0043
èD => 119D 1182 0000 0020 0035 0020 0000 0002 0002 0008 0000 0065 0300 0044
êE => 119D 119D 0000 0020 003C 0020 0000 0002 0002 0008 0000 0065 0302 0045
ēF => 119D 11D5 0000 0020 005B 0020 0000 0002 0002 0008 0000 0065 0304 0046
and:
è 1 => 119D 0209 1138 0000 0020 0035 0020 0020 0000 0002 0002 0002 0002 0000 0065 0300 0020 0031
ê 5 => 119D 0209 113C 0000 0020 003C 0020 0020 0000 0002 0002 0002 0002 0000 0065 0302 0020 0035
e 2 => 119D 0209 1139 0000 0020 0020 0020 0000 0002 0002 0002 0000 0065 0020 0032
é 4 => 119D 0209 113B 0000 0020 0032 0020 0020 0000 0002 0002 0002 0002 0000 0065 0301 0020 0034
ē 3 => 119D 0209 113A 0000 0020 005B 0020 0020 0000 0002 0002 0002 0002 0000 0065 0304 0020 0033
ë 6 => 119D 0209 113D 0000 0020 0047 0020 0020 0000 0002 0002 0002 0002 0000 0065 0308 0020 0036
which again gives these three sorted results:
e
é
è
ê
ë
ē
and
eA
éB
ëC
èD
êE
ēF
and
è 1
e 2
ē 3
é 4
ê 5
ë 6