Multibyte trim in PHP?
OK, so I took @edson-medina's solution, fixed a bug, and added some unit tests. Here are the three functions we use as multibyte counterparts to trim, rtrim, and ltrim.
////////////////////////////////////////////////////////////////////////////////////
// Add some multibyte core functions not in PHP
// (PHP 8.4+ ships native mb_trim/mb_ltrim/mb_rtrim; skip these definitions there)
////////////////////////////////////////////////////////////////////////////////////
function mb_trim($string, $charlist = null) {
    if (is_null($charlist)) {
        return trim($string);
    }
    if ($charlist === '') {
        return $string; // nothing to strip; "[]" would not be an empty class
    }
    // The D modifier keeps $ from also matching before a trailing newline,
    // which trim() with an explicit charlist would leave alone.
    $charlist = preg_quote($charlist, '/');
    return preg_replace("/^[$charlist]+|[$charlist]+$/uD", '', $string);
}
function mb_rtrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return rtrim($string);
    }
    if ($charlist === '') {
        return $string;
    }
    $charlist = preg_quote($charlist, '/');
    return preg_replace("/[$charlist]+$/uD", '', $string);
}
function mb_ltrim($string, $charlist = null) {
    if (is_null($charlist)) {
        return ltrim($string);
    }
    if ($charlist === '') {
        return $string;
    }
    $charlist = preg_quote($charlist, '/');
    return preg_replace("/^[$charlist]+/u", '', $string);
}
////////////////////////////////////////////////////////////////////////////////////
Here are the unit tests I wrote, for anyone interested:
public function test_trim() {
    $this->assertEquals(trim(' foo '), mb_trim(' foo '));
    $this->assertEquals(trim(' foo ', ' o'), mb_trim(' foo ', ' o'));
    $this->assertEquals('foo', mb_trim(' Åfooホ ', ' Åホ'));
}
public function test_rtrim() {
    $this->assertEquals(rtrim(' foo '), mb_rtrim(' foo '));
    $this->assertEquals(rtrim(' foo ', ' o'), mb_rtrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_rtrim('fooホ ', ' ホ'));
}
public function test_ltrim() {
    $this->assertEquals(ltrim(' foo '), mb_ltrim(' foo '));
    $this->assertEquals(ltrim(' foo ', ' o'), mb_ltrim(' foo ', ' o'));
    $this->assertEquals('foo', mb_ltrim(' Åfoo', ' Å'));
}
This version supports the second optional parameter $charlist:
function mb_trim($string, $charlist = null)
{
    if (is_null($charlist)) {
        return trim($string);
    }
    // Passing the delimiter to preg_quote() escapes "/" as well.
    $charlist = preg_quote($charlist, '/');
    // D stops $ from also matching just before a trailing newline.
    return preg_replace("/^[$charlist]+|[$charlist]+$/uD", '', $string);
}
It does not support the ".." range syntax of trim's character list, though.
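If ranges matter, one possible approach is to translate "X..Y" in the character list into a regex range "X-Y" before building the character class. The following is only a sketch: the names charlist_to_class and mb_trim_ranges are made up for illustration, the handling of edge cases (such as a list ending in "..") is an assumption and may not match trim() exactly, and a reversed range such as "z..a" will make the pattern fail to compile.

```php
<?php
// Hypothetical helper: turns a trim()-style character list such as "a..f"
// into the body of a PCRE character class, e.g. "a-f".
function charlist_to_class(string $charlist): string
{
    // Split into UTF-8 characters; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $charlist, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }
    $class = '';
    $n = count($chars);
    for ($i = 0; $i < $n; $i++) {
        // "X..Y": a character, two dots, then another character.
        if ($i + 3 < $n && $chars[$i + 1] === '.' && $chars[$i + 2] === '.') {
            $class .= preg_quote($chars[$i], '/') . '-' . preg_quote($chars[$i + 3], '/');
            $i += 3; // skip the two dots and the range end
        } else {
            $class .= preg_quote($chars[$i], '/');
        }
    }
    return $class;
}

// Hypothetical multibyte trim with range support, built on the helper above.
function mb_trim_ranges(string $string, string $charlist): string
{
    $class = charlist_to_class($charlist);
    if ($class === '') {
        return $string; // nothing usable to strip
    }
    $result = preg_replace('/^[' . $class . ']+|[' . $class . ']+\z/u', '', $string);
    // preg_replace() returns null on error; hand back the input unchanged then.
    return $result ?? $string;
}

echo mb_trim_ranges('abcホfooホcba', 'a..cホ'), "\n";
```

Here "a..c" becomes the class range a-c, so the leading and trailing a, b, c and ホ are stripped while the "o" characters in the middle are left alone.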
The standard trim function strips a handful of space and space-like characters. These are all ASCII characters, which means specific single bytes in the range 0000 0000 to 0010 0000 (0x00 to 0x20).
Proper UTF-8 input never contains multibyte characters made up of bytes of the form 0xxx xxxx. Every byte of a proper UTF-8 multibyte character has the form 1xxx xxxx.
This means that in a proper UTF-8 sequence, bytes of the form 0xxx xxxx can only be single-byte characters. PHP's trim function will therefore never trim away "half a character", assuming you have a proper UTF-8 sequence and use the default character list. (Be very careful about improper UTF-8 sequences.)
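The byte-level argument can be checked directly. The sketch below uses bin2hex() to show the bytes and an empty /u pattern as a UTF-8 validity check (an assumption of convenience; any validator would do). It also shows the contrasting case where a multibyte character is passed in the character list and trim treats it as three independent bytes.

```php
<?php
// "ホ" (U+30DB) encodes to the three bytes E3 83 9B, all >= 0x80.
$ho = "\u{30DB}";
echo bin2hex($ho), "\n"; // e3839b

// Default trim() only strips ASCII whitespace bytes,
// so the result is still valid UTF-8.
$trimmed = trim(" \t{$ho} foo {$ho}\n");
var_dump(preg_match('//u', $trimmed) === 1); // bool(true)

// With a multibyte character in the charlist, each of its bytes is stripped
// on its own. "ぃ" (U+3043) is E3 81 83, and its last byte 0x83 also occurs
// in "ホ", so trim() cuts the character in half.
$broken = trim("foo\u{3043}", $ho);
echo bin2hex($broken), "\n"; // 666f6fe381
var_dump(preg_match('//u', $broken) === 1); // bool(false)
```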
The \s in ASCII regular expressions will mostly match the same characters as trim.
The preg functions with the /u modifier only work on UTF-8 encoded patterns and subjects, and /\s/u also matches the UTF-8 NBSP (U+00A0). This behaviour with non-breaking spaces is the only advantage of using it.
If you want to replace space characters in other, non-ASCII-compatible encodings, neither method will work.
In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful about what NBSP means for your text.
Take care:
$s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
$s2 = " 𩸽 exotic test ホ 𩸽 ";
echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";
echo "\n!INCORRECT trim: [". trim($s2,'𩸽 ') ."]"; // DANGER! not UTF8 safe!
echo "\nSAFE ONLY WITH preg: [".
preg_replace('/^[𩸽\s]+|[𩸽\s]+$/u', '', $s2) ."]";
I don't know what you're trying to do with that endless recursive function you're defining, but if you just want a multibyte-safe trim, this will work.
function mb_trim($str) {
    return preg_replace("/^\s+|\s+$/u", "", $str);
}
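One caveat with any /u-based trim: preg_replace() returns null when the subject is not valid UTF-8, so the caller can end up with null instead of a string. A minimal sketch of a guard follows; the name mb_trim_or_fallback and the choice to fall back to the byte-wise trim() are assumptions for illustration, and throwing an exception may suit some code better.

```php
<?php
// Hypothetical wrapper name; not a PHP built-in.
function mb_trim_or_fallback(string $str): string
{
    $result = preg_replace('/^\s+|\s+$/u', '', $str);
    // null signals a PCRE error, most commonly invalid UTF-8 in $str.
    if ($result === null) {
        return trim($str); // byte-wise ASCII trim as a last resort
    }
    return $result;
}

// NBSP (U+00A0) is stripped by \s under /u.
var_dump(mb_trim_or_fallback("\u{00A0} foo \u{00A0}") === 'foo');

// A lone 0xFF byte is not valid UTF-8, so the fallback path is taken.
var_dump(mb_trim_or_fallback(" \xFF ") === "\xFF");
```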