How to remove diacritics from text?
This helper should be useful; it handles almost all cases:
function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
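As a quick illustration of how the entity trick behaves, here is a minimal sketch. The function name unaccent_via_entities is hypothetical; the body is the same as Unaccent() above, repeated only so the snippet runs on its own.

```php
<?php
// Same body as Unaccent(), under a hypothetical name so the snippet is self-contained.
function unaccent_via_entities(string $string): string
{
    // htmlentities() turns "ä" into "&auml;"; the regex keeps only the base letter(s).
    return preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i',
        '$1',
        htmlentities($string, ENT_COMPAT, 'UTF-8')
    );
}

echo unaccent_via_entities('räksmörgås'), "\n"; // prints: raksmorgas
echo unaccent_via_entities('Æble'), "\n";       // prints: AEble
```

One caveat: characters that htmlentities() escapes but the pattern does not match, such as &, stay in the output as entities (for example &amp;), so a pass through html_entity_decode() afterwards may be needed.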
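// decompose accented characters (NFD) using PHP's *intl* extension, then strip the combining marks;
// normalizer_normalize() on its own defaults to NFC, which leaves the accents in place
$data = preg_replace('/\p{Mn}/u', '', normalizer_normalize($data, Normalizer::FORM_D));
// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z0-9]#", "_", $data);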
Use iconv to convert strings from a given encoding to ASCII, then replace non-alphanumeric characters using preg_replace:
$input = 'räksmörgås och köttbullar'; // UTF-8 encoded
$input = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
$input = preg_replace('/[^a-zA-Z0-9]/', '_', $input);
echo $input;
Result:
raksmorgas_och_kottbullar
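The exact output of //TRANSLIT can depend on the iconv implementation and, with glibc, on the current locale, so the result above is not guaranteed everywhere. A minimal sketch that at least guards against iconv() failing; the helper name ascii_underscore is hypothetical:

```php
<?php
// Hypothetical helper name; a sketch, not the original answer's code.
function ascii_underscore(string $input): string
{
    // //TRANSLIT asks iconv to approximate characters it cannot represent.
    // iconv() returns false on failure (and may emit a notice), hence the @ and the check.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);
    if ($ascii === false) {
        $ascii = $input; // fall back to the raw bytes
    }
    // Whatever iconv produced, keep only ASCII letters and digits.
    return (string) preg_replace('/[^a-zA-Z0-9]/', '_', $ascii);
}

echo ascii_underscore('räksmörgås och köttbullar'), "\n";
```

If the transliteration comes back with ? or other substitutes, setting a UTF-8 locale with setlocale() before the call is worth trying.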
All Swedish letters should be converted like this:
'å' to 'a', 'ä' to 'a', and 'ö' to 'o' (just remove the marks above the letters).
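Use normalizer_normalize() with Normalizer::FORM_D to decompose accented letters, then strip the combining marks (\p{Mn}) with preg_replace().
Called without a form, normalizer_normalize() returns NFC, which keeps the diacritical marks.
The rest should become underscores as I said.
Use preg_replace()
with a pattern of \W
(in other words, any character that is not a letter, digit, or underscore) to replace them with underscores.
The final result should look like:
$data = preg_replace('/\W/', '_', preg_replace('/\p{Mn}/u', '', normalizer_normalize($data, Normalizer::FORM_D)));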
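A minimal sketch of a helper that decomposes to NFD, strips the combining marks, and replaces the remaining non-word characters with underscores. It assumes the intl extension is loaded; the function name is hypothetical.

```php
<?php
// Hypothetical helper; assumes the intl extension is loaded.
function strip_diacritics_to_underscores(string $data): string
{
    // NFD splits "å" into "a" + U+030A (combining ring above).
    $decomposed = normalizer_normalize($data, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Invalid UTF-8: fall back to the input unchanged.
        $decomposed = $data;
    }
    // Remove combining marks (Unicode category Mn); /u enables UTF-8 mode.
    // preg_replace() returns null on malformed UTF-8, so keep the previous value then.
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed) ?? $decomposed;
    // Replace anything that is not an ASCII letter, digit, or underscore.
    return (string) preg_replace('/\W/', '_', $stripped);
}

echo strip_diacritics_to_underscores('räksmörgås och köttbullar'), "\n";
// prints: raksmorgas_och_kottbullar
```

Without the /u modifier, \W works on bytes, so any non-ASCII character that survives the first two steps becomes one underscore per byte.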