Removing diacritics in Polish
Some time ago I came across this solution, which seems to work fine:
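using System.Text;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
    // Encodes the text with the "Cyrillic" encoding and reads the bytes back as ASCII.
    // This relies on the encoder replacing accented Latin letters with their plain forms.
    public static string RemoveDiacritics(this string s)
    {
        byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(s);
        return Encoding.ASCII.GetString(bytes);
    }
}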
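For comparison, here is a rough Java sketch of a different technique that does not depend on encoder behaviour: decompose the text to Unicode NFD, drop the combining marks, and map ł by hand, since it has no canonical decomposition. The names `PolishDiacritics` and `strip` are made up for this example and are not from the snippet above.

```java
import java.text.Normalizer;

final class PolishDiacritics {
    private PolishDiacritics() {}

    // Hypothetical helper: returns the input with diacritics removed.
    static String strip(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        // NFD splits e.g. a-ogonek (U+0105) into 'a' + combining ogonek (U+0328).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove every combining mark (Unicode general category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // l-with-stroke (U+0142 / U+0141) does not decompose, so replace it explicitly.
        return stripped.replace('\u0142', 'l').replace('\u0141', 'L');
    }
}
```

Calling `PolishDiacritics.strip("zażółć gęślą jaźń")` should give `zazolc gesla jazn`. Unlike a Polish-only lookup, this also strips accents from other languages, which may or may not be what you want.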
Here is my quick Java implementation of a Polish stoplist that normalizes Polish diacritics.
import java.util.HashSet;
import java.util.Set;

class StopList
{
    // Words are stored trimmed, lower-cased and with Polish diacritics removed.
    private final Set<String> set = new HashSet<String>();

    public void add(String word)
    {
        word = word.trim().toLowerCase();
        word = normalize(word);
        set.add(word);
    }

    // The lookup applies the same normalization as add(), so "Się" matches "sie".
    public boolean contains(final String string)
    {
        if (string == null)
        {
            return false;
        }
        return set.contains(normalize(string.trim().toLowerCase()));
    }

    // Maps a lower-case Polish letter to its ASCII counterpart; other characters pass through.
    private char normalizeChar(final char c)
    {
        switch (c)
        {
            case 'ą':
                return 'a';
            case 'ć':
                return 'c';
            case 'ę':
                return 'e';
            case 'ł':
                return 'l';
            case 'ń':
                return 'n';
            case 'ó':
                return 'o';
            case 'ś':
                return 's';
            case 'ż':
            case 'ź':
                return 'z';
            default:
                return c;
        }
    }

    private String normalize(final String word)
    {
        if (word == null || "".equals(word))
        {
            return word;
        }
        char[] charArray = word.toCharArray();
        char[] normalizedArray = new char[charArray.length];
        for (int i = 0; i < normalizedArray.length; i++)
        {
            normalizedArray[i] = normalizeChar(charArray[i]);
        }
        return new String(normalizedArray);
    }
}
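To show how a stoplist like this might be used, here is a small self-contained sketch that filters a token list. It is not the class above: `StopWordFilter`, `key` and `filter` are hypothetical names, and the switch is replaced by two parallel lookup strings only to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {
    // Parallel strings: FROM.charAt(i) is replaced by TO.charAt(i).
    // FROM holds the nine lower-case Polish letters with diacritics.
    private static final String FROM =
            "\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C";
    private static final String TO = "acelnoszz";

    private final Set<String> stopWords = new HashSet<>();

    StopWordFilter(List<String> words) {
        for (String w : words) {
            stopWords.add(key(w));
        }
    }

    // Trims, lower-cases and strips Polish diacritics to build the lookup key.
    static String key(String word) {
        String lower = word.trim().toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            int idx = FROM.indexOf(c);
            sb.append(idx >= 0 ? TO.charAt(idx) : c);
        }
        return sb.toString();
    }

    // Returns the tokens that are not stop words, in their original form and order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(key(t))) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

With the stop words "się" and "że", filtering the tokens "Kot", "SIĘ", "sie", "ze", "myje" should leave "Kot" and "myje", because both the accented and the unaccented spellings map to the same key.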
I couldn't find any other solution on the net, so maybe it will be helpful to someone.