What .NET StringComparer is equivalent SQL's Latin1_General_CI_AS
I've recently faced with the same problem: I need an IEqualityComparer<string>
that behaves in SQL-like style. I've tried CollationInfo
and its EqualityComparer
. If your DB is always _AS (accent sensitive) then your solution will work, but in case if you change the collation that is AI or WI or whatever "insensitive" else the hashing will break.
Why? If you decompile Microsoft.SqlServer.Management.SqlParser.dll and look inside you'll find out that CollationInfo
internally uses CultureAwareComparer.GetHashCode
(it's internal class of mscorlib.dll) and finally it does the following:
public override int GetHashCode(string obj)
{
if (obj == null)
throw new ArgumentNullException("obj");
CompareOptions options = CompareOptions.None;
if (this._ignoreCase)
options |= CompareOptions.IgnoreCase;
return this._compareInfo.GetHashCodeOfString(obj, options);
}
As you can see it can produce the same hashcode for "aa" and "AA", but not for "äå" and "aa" (which are the same, if you ignore diacritics (AI) in majority of cultures, so they should have the same hashcode). I don't know why the .NET API is limited by this, but you should understand where the problem can come from.
To get the same hashcode for strings with diacritics you can do the following: create implementation of IEqualityComparer<T>
implementing the GetHashCode
that will call appropriate CompareInfo
's object's GetHashCodeOfString
via reflection because this method is internal and can't be used directly. But calling it directly with correct CompareOptions
will produce the desired result:
See this example:
static void Main(string[] args)
{
const string outputPath = "output.txt";
const string latin1GeneralCiAiKsWs = "Latin1_General_100_CI_AI_KS_WS";
using (FileStream fileStream = File.Open(outputPath, FileMode.Create, FileAccess.Write))
{
using (var streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
{
string[] strings = { "aa", "AA", "äå", "ÄÅ" };
CompareInfo compareInfo = CultureInfo.GetCultureInfo(1033).CompareInfo;
MethodInfo GetHashCodeOfString = compareInfo.GetType()
.GetMethod("GetHashCodeOfString",
BindingFlags.Instance | BindingFlags.NonPublic,
null,
new[] { typeof(string), typeof(CompareOptions), typeof(bool), typeof(long) },
null);
Func<string, int> correctHackGetHashCode = s => (int)GetHashCodeOfString.Invoke(compareInfo,
new object[] { s, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, false, 0L });
Func<string, int> incorrectCollationInfoGetHashCode =
s => CollationInfo.GetCollationInfo(latin1GeneralCiAiKsWs).EqualityComparer.GetHashCode(s);
PrintHashCodes(latin1GeneralCiAiKsWs, incorrectCollationInfoGetHashCode, streamWriter, strings);
PrintHashCodes("----", correctHackGetHashCode, streamWriter, strings);
}
}
Process.Start(outputPath);
}
private static void PrintHashCodes(string collation, Func<string, int> getHashCode, TextWriter writer, params string[] strings)
{
writer.WriteLine(Environment.NewLine + "Used collation: {0}", collation + Environment.NewLine);
foreach (string s in strings)
{
WriteStringHashcode(writer, s, getHashCode(s));
}
}
The output is:
Used collation: Latin1_General_100_CI_AI_KS_WS
aa, hashcode: 2053722942
AA, hashcode: 2053722942
äå, hashcode: -266555795
ÄÅ, hashcode: -266555795
Used collation: ----
aa, hashcode: 2053722942
AA, hashcode: 2053722942
äå, hashcode: 2053722942
ÄÅ, hashcode: 2053722942
I know it looks like the hack, but after inspecting decompiled .NET code I'm not sure if there any other option in case the generic functionality is needed.
So be sure that you'll not fall into trap using this not fully correct API.
UPDATE:
I've also created the gist with potential implementation of "SQL-like comparer" using CollationInfo
.
Also there should be paid enough attention where to search for "string pitfalls" in your code base, so if the string comparison, hashcode, equality should be changed to "SQL collation-like" those places are 100% will be broken, so you'll have to find out and inspect all the places that can be broken.
UPDATE #2:
There is better and cleaner way to make GetHashCode() treat CompareOptions. There is the class SortKey that works correctly with CompareOptions and it can be retrieved using
CompareInfo.GetSortKey(yourString, yourCompareOptions).GetHashCode()
Here is the link to .NET source code and implementation.
UPDATE #3:
If you're on .NET Framework 4.7.1+ you should use new GlobalizationExtensions
class as proposed by this recent answer.
Take a look at the CollationInfo
class. It is located in an assembly called Microsoft.SqlServer.Management.SqlParser.dll
although I am not totally sure where to get this. There is a static list of Collations
(names) and a static method GetCollationInfo
(by name).
Each CollationInfo
has a Comparer
. It is not exactly the same as a StringComparer
but has similar functionality.
EDIT: Microsoft.SqlServer.Management.SqlParser.dll is a part of the Shared Management Objects (SMO) package. This feature can be downloaded for SQL Server 2008 R2 here:
http://www.microsoft.com/download/en/details.aspx?id=16978#SMO
EDIT: CollationInfo
does have a property named EqualityComparer
which is an IEqualityComparer<string>
.
SQL Server's Server.GetStringComparer may be of some use.