PHP - length of string containing emojis/special chars
I know this is an older question, but I recently came across this problem and this might help someone else too.
The string in the question:
$str = "ðð¿✌ð¿️ @mention";
I would expect it to count 11: ðð¿✌ð¿️
{2}, space
{1}, @
{1}, mention
{7}
so 2 + 1 + 1 + 7 = 11
PHP's intl
extension has a grapheme_strlen()
function that gets string length in grapheme units instead of bytes or characters (documentation)
<?php
$str = "ðð¿✌ð¿️ @mention";
printf("strlen: %d" . PHP_EOL, strlen($str));
printf("mb_strlen UTF-8: %d" . PHP_EOL, mb_strlen($str, "UTF-8"));
printf("mb_strlen UTF-16: %d" . PHP_EOL, mb_strlen($str, "UTF-16"));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("UTF-8", "UTF-16", $str)));
printf("iconv UTF-16: %d" . PHP_EOL, iconv_strlen(iconv("ISO-8859-1", "UTF-16", $str)));
printf("grapheme_strlen: %d" . PHP_EOL, grapheme_strlen($str));
output on PHP 7.4:
strlen: 27
mb_strlen UTF-8: 14
mb_strlen UTF-16: 13
PHP Notice: iconv_strlen(): Detected an illegal character in input string in php shell code on line 1
iconv UTF-16: 0
PHP Notice: iconv_strlen(): Detected an illegal character in input string in php shell code on line 1
iconv UTF-16: 0
grapheme_strlen: 11
Or try it here: https://www.tehplayground.com/s8e40DAslYX4Mozo
Your functions are all counting different things.
Graphemes: ð ð¿ ✌ ð¿️ @ m e n t i o n 13
----------- ----------- -------- --------------------- ------ ------ ------ ------ ------ ------ ------ ------ ------
Code points: U+1F44D U+1F3FF U+270C U+1F3FF U+FE0F U+0020 U+0040 U+006D U+0065 U+006E U+0074 U+0069 U+006F U+006E 14
UTF-16 code units: D83D DC4D D83C DFFF 270C D83C DFFF FE0F 0020 0040 006D 0065 006E 0074 0069 006F 006E 17
UTF-16-encoded bytes: 3D D8 4D DC 3C D8 FF DF 0C 27 3C D8 FF DF 0F FE 20 00 40 00 6D 00 65 00 6E 00 74 00 69 00 6F 00 6E 00 34
UTF-8-encoded bytes: F0 9F 91 8D F0 9F 8F BF E2 9C 8C F0 9F 8F BF EF B8 8F 20 40 6D 65 6E 74 69 6F 6E 27
PHP strings are natively bytes.
strlen()
counts the number of bytes in a string: 27.
mb_strlen(..., 'utf-8')
counts the number of code points (Unicode characters) in a string when its bytes are decoded to characters using the UTF-8 encoding: 14.
(The other example counts are largely meaningless as they're based on treating the input string as one encoding when actually it contains data in a different encoding.)
NSStrings are natively counted as UTF-16 code units. There are 17, not 14, because the above string contains characters like ð
that don't fit in a single UTF-16 code unit, so have to be encoded as a surrogate pair. There aren't any functions that will count strings in UTF-16 code units in PHP, but because each code unit is encoded to two bytes, you can work it out easily enough by encoding to UTF-16 and dividing the number of bytes by two:
strlen(iconv('utf-8', 'utf-16le', $str)) / 2
(Note: the le
suffix is necessary to make iconv
encode to a particular endianness of UTF-16, and not foul up the count by choosing one and adding a BOM to the start of the string to say which one it chose.)
I have included a picture to help illustrate the answer that @bobince gave.
Essentially, all non-surrogate-pair code points end up as two bytes in UTF-16 while all surrogate-pair code points end up as four bytes. If we divide this by two we get the equivalent expected length value.
P.S. Please forgive the error in the image where it says "code points" and should say "code units"