String.StartsWith not working with Asian languages?

The result returned by StartsWith is correct. By default, most string comparison methods perform culture-sensitive comparisons using the current culture, not plain byte sequences. Although your line starts with a byte sequence identical to sub, the substring it represents is not equivalent under most (or all) cultures.

If you really want a comparison that treats strings as plain byte sequences, use the overload:

line.StartsWith(sub, StringComparison.Ordinal);                       // true

If you want the comparison to be case-insensitive:

line.StartsWith(sub, StringComparison.OrdinalIgnoreCase);             // true

Here's a more familiar example:

var line1 = "café";   // 63 61 66 E9     – precomposed character 'é' (U+00E9)
var line2 = "café";   // 63 61 66 65 301 – base letter e (U+0065) and
                      //                   combining acute accent (U+0301)
var sub   = "cafe";   // 63 61 66 65 
Console.WriteLine(line1.StartsWith(sub));                             // false
Console.WriteLine(line2.StartsWith(sub));                             // false
Console.WriteLine(line1.StartsWith(sub, StringComparison.Ordinal));   // false
Console.WriteLine(line2.StartsWith(sub, StringComparison.Ordinal));   // true

In the above examples, line2 starts with the same byte sequence as sub, followed by a combining acute accent (U+0301) to be applied to the final e. line1 uses the precomposed character for é (U+00E9), so its byte sequence does not match that of sub.

In real-world semantics, one would typically not consider cafe to be a substring of café; the e and é are treated as distinct characters. That é happens to be represented as a pair of characters starting with e is an internal implementation detail of the encoding scheme (Unicode) that should not affect results. This is demonstrated by the above example contrasting café and café; one would not expect different results unless specifically intending an ordinal (byte-by-byte) comparison.

Adapting this explanation to your example:

string line = "Mìng-dĕ̤ng-ngṳ̄";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73 304
string sub  = "Mìng-dĕ̤ng-ngṳ";   // 4D EC 6E 67 2D 64 115 324 6E 67 2D 6E 67 1E73

Each .NET character represents a UTF-16 code unit, whose values are shown in the comments above. The first 14 code units are identical, which is why your char-by-char comparison evaluates to true (just like StringComparison.Ordinal). However, the 15th code unit in line is the combining macron, ◌̄ (U+0304), which combines with its preceding ṳ (U+1E73) to give ṳ̄.

This is not a bug. The String.StartsWith is in fact much smarter than just a character-by-character check of your two strings. It takes into account your current culture (language settings, etc.) and it takes into account contractions and special characters. (It does not care you need two characters to end up with ṳ̄. It compares it as one).

So this means that if you don't want to take all those culture specific settings, and just want to check it using ordinal comparison, you have to tell the comparer that.

This is the correct way to do that (not ignoring the case, like Douglas did!):

line.StartsWith(sub, StringComparison.Ordinal);

String.StartsWith not working with Asian languages?

Tags:

C#

String

Related

Recent Posts