Ensuring valid UTF-8 in PHP
With the mbstring library, you have mb_check_encoding().
Example of use:
mb_check_encoding($string, 'UTF-8');
With PHP 7.1.9 on a recent Windows 10 system, the regex solution outperforms mb_check_encoding()
for any string length (still 20,000 iterations):
- 10 characters: regex => 4 ms,
mb_check_encoding()
=> 64 ms - 10000 chars: regex => 125 ms,
mb_check_encoding()
=> 2.4 s
UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.
Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would chose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
would this code ensure that a string is safe to insert into a UTF-8 encoded document
You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
If you want to be sure, do it yourself using the W3-recommended regex:
if (preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string))
return $string;
else
return iconv('CP1252', 'UTF-8', $string);
Just a note: Instead of using the often recommended (rather complex) regular expression by W3C, you can simply use the 'u' modifier to test a string for UTF-8 validity:
<?php
if (preg_match("//u", $string)) {
// $string is valid UTF-8
}
Answer to "iconv is idempotent":
Neither is iconv - iconv is not idempotent.
A big difference between utf8_encode()
and iconv()
is that iconv may raise errors like this "Detected an incomplete multibyte character in input string", even with:
iconv('ISO-8859-1', 'UTF-8'.'//IGNORE', $str)
in the above code:
$encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");
You have to know mb_detect_encoding
. It can answer about uft-8 even for invalid UTF-8 strings (badly formed UTF-8).