Removing accents and diacritics in Kotlin
TL;DR:
- Use Normalizer to canonically decompose the Unicode text.
- Remove non-spacing combining characters (\p{Mn}).
import java.text.Normalizer

fun String.removeNonSpacingMarks() =
    Normalizer.normalize(this, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")
Long answer:
Using Normalizer you can transform the original text into an equivalent composed or decomposed form.
- NFD: Canonical decomposition.
- NFC: Canonical decomposition, followed by canonical composition.
(more info about normalization can be found in the Unicode® Standard Annex #15)
In our case, we are interested in the NFD normalization form because it separates every combining character from its base character.
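To make the difference between the two forms concrete, here is a minimal sketch that lists the code points of a string before and after normalization. The helper names codePointsOf, decomposed and composed are hypothetical and exist only for this illustration; only Normalizer comes from the JDK.

```kotlin
import java.text.Normalizer

// Hypothetical helper: lists the code points of a string as U+XXXX labels.
fun codePointsOf(text: String): List<String> =
    text.codePoints().toArray().map { "U+%04X".format(it) }

// NFD: split precomposed characters into base character + combining marks.
fun decomposed(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)

// NFC: decompose, then recombine into precomposed characters where possible.
fun composed(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFC)
```

With these helpers, codePointsOf(decomposed("\u00E9")) yields [U+0065, U+0301] (the letter e followed by a combining acute accent), while codePointsOf(composed("e\u0301")) yields the single precomposed code point [U+00E9].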
After decomposing the text, we run a regex to remove the combining characters that the decomposition split off from their base characters.
Combining characters are special characters intended to be positioned relative to an associated base character. The Unicode Standard distinguishes two types of combining characters: spacing and nonspacing.
We are only interested in non-spacing combining characters. Diacritics are the principal class (but not the only one) of this group used with Latin, Greek, and Cyrillic scripts and their relatives.
To remove non-spacing characters with a regex, we have to use \p{Mn}. This group includes all 1,826 non-spacing characters.
Other answers use \p{InCombiningDiacriticalMarks}, but this block only includes the combining diacritical marks. It is a subset of \p{Mn} that contains only 112 characters.
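The practical difference shows up with marks that live outside the Combining Diacritical Marks block (U+0300 to U+036F). The sketch below compares both patterns; the function names stripWithMn and stripWithBlock are hypothetical, and the sample character U+20D7 (combining right arrow above) is just one example of a non-spacing mark from another block.

```kotlin
import java.text.Normalizer

// Matches every non-spacing mark (general category Mn).
val ALL_NON_SPACING_MARKS = "\\p{Mn}+".toRegex()

// Matches only the Combining Diacritical Marks block (U+0300..U+036F).
val DIACRITICAL_BLOCK_ONLY = "\\p{InCombiningDiacriticalMarks}+".toRegex()

fun stripWithMn(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace(ALL_NON_SPACING_MARKS, "")

fun stripWithBlock(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace(DIACRITICAL_BLOCK_ONLY, "")
```

Both functions turn "\u00E9" into "e". For "x\u20D7", stripWithMn returns "x", whereas stripWithBlock leaves the string unchanged because U+20D7 is outside the block.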
Normalizer only does half the work. Here's how you could use it:
import java.text.Normalizer

private val REGEX_UNACCENT = "\\p{InCombiningDiacriticalMarks}+".toRegex()

fun CharSequence.unaccent(): String {
    // NFD splits each accented letter into a base letter plus combining marks.
    val temp = Normalizer.normalize(this, Normalizer.Form.NFD)
    return REGEX_UNACCENT.replace(temp, "")
}

assert("áéíóů".unaccent() == "aeiou")
And here's how it works:
We call normalize() with the NFD form. If we pass à, the method returns a followed by the combining grave accent (U+0300). Then, using a regular expression, we remove the combining marks so that only the base characters remain.
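One caveat: removing combining marks only helps for letters that have a canonical decomposition. Letters such as Ł, ø or ß are single characters with no decomposition, so they pass through untouched and the result is not guaranteed to be pure ASCII. A minimal sketch to check this, where stripMarks and isAscii are hypothetical helper names:

```kotlin
import java.text.Normalizer

// Decompose, then drop every non-spacing mark.
fun stripMarks(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")

// True when every character is in the US-ASCII range.
fun isAscii(text: String): Boolean = text.all { it <= '\u007F' }
```

For example, stripMarks("\u0141\u00F3d\u017A") (the word Łódź) returns "\u0141odz": the accents on ó and ź are removed, but Ł remains, so isAscii is false for the result.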
Source: http://www.rgagnon.com/javadetails/java-0456.html
Note that Normalizer is a Java class (java.text.Normalizer); this is not pure Kotlin and it will only work on the JVM.