String.StartsWith not working with Asian languages?
The result returned by StartsWith
is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, not plain byte sequences. Although your line
starts with a byte sequence identical to sub
, the substring it represents is not equivalent under most (or all) cultures.
If you really want a comparison that treats strings as plain byte sequences, use the overload:
line.StartsWith(sub, StringComparison.Ordinal); // true
If you want the comparison to be case-insensitive:
line.StartsWith(sub, StringComparison.OrdinalIgnoreCase); // true
Here's a more familiar example:
var line1 = "café"; // 63 61 66 E9 – precomposed character 'é' (U+00E9)
var line2 = "café"; // 63 61 66 65 301 – base letter e (U+0065) and
// combining acute accent (U+0301)
var sub = "cafe"; // 63 61 66 65
Console.WriteLine(line1.StartsWith(sub)); // false
Console.WriteLine(line2.StartsWith(sub)); // false
Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal)); // false
Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal)); // true
In the above examples, line2
starts with the same byte sequence as sub
, followed by a combining acute accent (U+0301) to be applied to the final e
. line1
uses the precomposed character for é
(U+00E9), so its byte sequence does not match that of sub
.
In real-world semantics, one would typically not consider cafe
to be a substring of café
; the e
and é
are treated as distinct characters. That é
happens to be represented as a pair of characters starting with e
is an internal implementation detail of the encoding scheme (Unicode) that should not affect results. This is demonstrated by the above example contrasting café
and café
; one would not expect different results unless specifically intending an ordinal (byte-by-byte) comparison.
Adapting this explanation to your example:
string line = "Mìng-dĕ̤ng-ngṳ̄"; // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
string sub = "Mìng-dĕ̤ng-ngṳ"; // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73
Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal
). However, the 15th code unit in line
is the combining macron, ◌̄ (U+0304), which combines with its preceding ṳ
(U+1E73) to give ṳ̄
.
This is not a bug. The String.StartsWith
is in fact much smarter than just a character-by-character check of your two strings. It takes into account your current culture (language settings, etc.) and it takes into account contractions and special characters. (It does not care you need two characters to end up with ṳ̄
. It compares it as one).
So this means that if you don't want to take all those culture specific settings, and just want to check it using ordinal comparison, you have to tell the comparer that.
This is the correct way to do that (not ignoring the case, like Douglas did!):
line.StartsWith(sub, StringComparison.Ordinal);