How to replace/remove 4(+)-byte characters from a UTF-8 string in PHP?
Since 4-byte UTF-8 sequences always start with a byte in the range 0xF0-0xF7 (only 0xF0-0xF4 occur in valid UTF-8), the following should work:
$str = preg_replace('/[\xF0-\xF7].../s', '', $str);
Alternatively, you could use preg_replace in UTF-8 mode, but this will probably be slower:
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
This works because 4-byte UTF-8 sequences encode the code points in the supplementary Unicode planes, U+10000 through U+10FFFF.
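As a rough sketch of how the two patterns compare, the snippet below wraps each one in a helper and runs both on the same input. The function names and the sample string are hypothetical and chosen only for illustration. It assumes the input is valid UTF-8: the byte-mode pattern blindly consumes three bytes after the lead byte, while the UTF-8 mode pattern makes preg_replace return null on invalid input.

```php
<?php
// Hypothetical helper names, not part of the original answer.

// Byte mode: a lead byte 0xF0-0xF7 followed by any three bytes.
// The /s flag lets "." match newline bytes as well.
function strip_4byte_bytes(string $str): ?string
{
    return preg_replace('/[\xF0-\xF7].../s', '', $str);
}

// UTF-8 mode: match code points U+10000 through U+10FFFF.
// preg_replace returns null if $str is not valid UTF-8.
function strip_4byte_unicode(string $str): ?string
{
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
}

// "a", U+1F600 (4 bytes), "b", U+20AC (3 bytes), "c"
$input = "a\xF0\x9F\x98\x80b\xE2\x82\xACc";

$viaBytes   = strip_4byte_bytes($input);
$viaUnicode = strip_4byte_unicode($input);

// Both results are "ab\xE2\x82\xACc": the 3-byte euro sign is kept.
var_dump($viaBytes === $viaUnicode);
```

On valid UTF-8 the two approaches should agree, so the choice comes down to speed versus the built-in validity check that UTF-8 mode provides.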
NOTE: rather than just stripping these characters, you should replace them with the replacement character U+FFFD to avoid Unicode attacks, mostly XSS:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
$value = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
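The replacement approach could be wrapped up roughly as follows. This is only a sketch: the function name replace_4byte_chars is hypothetical, and throwing on a null result is one possible policy, not something the linked report prescribes. The null check matters because preg_replace in UTF-8 mode returns null when the input is not valid UTF-8, and silently passing that on would discard the whole string.

```php
<?php
// Hypothetical helper, not part of the original answer.
function replace_4byte_chars(string $value): string
{
    // "\xEF\xBF\xBD" is the UTF-8 encoding of U+FFFD (replacement character).
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);

    if ($result === null) {
        // Assumption: treat a failed replace (e.g. invalid UTF-8) as an error.
        throw new InvalidArgumentException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }

    return $result;
}

// "a", U+1F600, "b" becomes "a", U+FFFD, "b".
$clean = replace_4byte_chars("a\xF0\x9F\x98\x80b");
```

Each 4-byte character becomes exactly one U+FFFD, so the character count of the string is preserved.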