Is there a way to get rid of accents and convert a whole string to regular letters?
Use java.text.Normalizer
to handle this for you.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatable" deconstruction
This will separate all of the accent marks from the characters. Then, you just need to compare each character against being a letter and throw out the ones that aren't.
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text is in unicode, you should use this instead:
string = string.replaceAll("\\p{M}", "");
For unicode, \\P{M}
matches the base glyph and \\p{M}
(lowercase) matches each accent.
Thanks to GarretWilson for the pointer and regular-expressions.info for the great unicode guide.
As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):
String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
System.out.println(input);
// Prints "This is a funky String"
Note:
The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:
import java.text.Normalizer;
public class Strip {
public static String flattenToAscii(String string) {
StringBuilder sb = new StringBuilder(string.length());
string = Normalizer.normalize(string, Normalizer.Form.NFD);
for (char c : string.toCharArray()) {
if (c <= '\u007F') sb.append(c);
}
return sb.toString();
}
}
Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
string = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = string.length(); i < n; ++i) {
char c = string.charAt(i);
if (c <= '\u007F') out[j++] = c;
}
return new String(out);
}
This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower that @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).