How to replace decoded Non-breakable space (nbsp)
Sanitize every type of whitespace (this also collapses each run of whitespace into a single space):
$str = preg_replace("/\s+/u", " ", $str);
https://stackoverflow.com/a/40264711/635364
FYI, PHP's filter_var() sanitization filters do not include one for these whitespace characters.
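As a quick illustration of the one-liner above, here is a minimal sketch. The sample string is made up for the example, and the "\u{...}" escape assumes PHP 7.0 or later. With the u modifier, PHP's PCRE build should treat \s as Unicode-aware, so the non-breaking space is matched along with tabs and newlines.

```php
<?php
// Hypothetical sample input: two non-breaking spaces (U+00A0), a tab,
// a newline and two regular spaces between the words.
$str = "Hello\u{00A0}\u{00A0}world\t\n  again";

// /u treats pattern and subject as UTF-8; \s+ then matches each run of
// whitespace (including U+00A0) and replaces it with one regular space.
$clean = preg_replace('/\s+/u', ' ', $str);

var_dump($clean); // string(17) "Hello world again"
```

Note that with the u modifier preg_replace returns null if the subject is not valid UTF-8, so it may be worth checking the result before using it.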
Problem Explanation
The reason why it's not working is that you are specifying the non-breaking space incorrectly.
The proper code for the non-breaking space in the UTF-8 encoding is 0xC2A0; it consists of two bytes, 0xC2 (194) and 0xA0 (160), so technically you're specifying only half of the character's code.
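The two bytes can be checked directly. The sketch below assumes PHP 7.0+ for the "\u{00A0}" escape, and it assumes the failing attempt searched for the single byte 0xA0 (the original attempt is not shown here, so this is a guess for illustration only).

```php
<?php
// U+00A0 written as a Unicode escape; PHP stores it as UTF-8 bytes.
$nbsp = "\u{00A0}";

echo strlen($nbsp), "\n";                      // 2 (strlen counts bytes)
echo bin2hex($nbsp), "\n";                     // c2a0
echo ord($nbsp[0]), ' ', ord($nbsp[1]), "\n";  // 194 160

// Assumed failing attempt: replacing only the second byte leaves a
// stray 0xC2 behind, so the result is no longer valid UTF-8.
$half = str_replace("\xa0", ' ', "a{$nbsp}b");
echo bin2hex($half), "\n";                     // 61c22062
```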
A Bit of Theory
Legacy character encodings used a fixed number of bits to encode every character in their set. For example, the original ASCII encoding used 7 bits per character, and extended ASCII used 8 bits.
UTF-8 is a so-called variable-width character encoding, which means that the number of bits used to represent individual characters varies. In UTF-8, each character code consists of one to four 8-bit bytes (octets). The length is determined by the character's code point: ASCII characters take a single byte, while higher code points take two to four bytes. Loosely like Huffman coding, this gives the characters that dominate ASCII-heavy text the shortest codes, which helps reduce the data size of such text.
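To make the variable width concrete, the following sketch prints the byte length of a few characters. The sample characters are chosen only for illustration, and the "\u{...}" escapes assume PHP 7.0+.

```php
<?php
// Illustrative samples, keyed by code point.
$samples = [
    'U+0041'  => "A",          // Latin capital letter A
    'U+00E9'  => "\u{00E9}",   // e with acute accent
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // grinning face emoji
];

$lengths = [];
foreach ($samples as $codePoint => $char) {
    // strlen() counts bytes, not characters.
    $lengths[$codePoint] = strlen($char);
    printf("%s: %d byte(s), hex %s\n", $codePoint, strlen($char), bin2hex($char));
}
// U+0041: 1 byte(s), hex 41
// U+00E9: 2 byte(s), hex c3a9
// U+20AC: 3 byte(s), hex e282ac
// U+1F600: 4 byte(s), hex f09f9880
```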
Solution
You can replace all occurrences of the UTF-8 non-breaking space in text using a simple (and fast) str_replace or using a more flexible regular expression, depending on your needs:
// faster solution
$regular_spaces = str_replace("\xc2\xa0", ' ', $original_string);
// more flexible solution
$regular_spaces = preg_replace('/\xc2\xa0/', ' ', $original_string);
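One way the regular expression can be "more flexible" is by matching several space-like characters at once. The sketch below is an assumption about what that might look like, not part of the original answer: it uses the u modifier with code point escapes, and also replaces the narrow no-break space (U+202F) as an example of a second character. The sample string is made up.

```php
<?php
// Hypothetical input mixing U+00A0 and U+202F (PHP 7.0+ escapes).
$original_string = "10\u{00A0}000\u{202F}km";

// With /u, \x{....} refers to a Unicode code point rather than a byte,
// so a character class can list several space-like characters.
$regular_spaces = preg_replace('/[\x{00A0}\x{202F}]/u', ' ', $original_string);

var_dump($regular_spaces); // string(9) "10 000 km"
```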
Notes
Note that in the case of str_replace, you have to use double quotes (") to enclose the search string, because the function doesn't understand the textual representation of character codes and needs those codes to be converted into actual characters first. PHP does that automatically: strings enclosed in double quotes are processed, and special sequences (e.g. the newline escape \n, textual representations of character codes, etc.) are replaced by the actual characters (e.g. 0x0A for \n in UTF-8) before the string value is used.
In contrast, the preg_replace function itself understands the textual representation of character codes, so you don't need PHP to convert them into actual characters, and you can use apostrophes (single quotes, ') to enclose the search string in this case.
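The quoting difference described in these notes can be seen side by side. This is a small sketch with a made-up input string; the "\u{00A0}" escape assumes PHP 7.0+.

```php
<?php
$text = "a\u{00A0}b";

// Single quotes: str_replace searches for the eight literal characters
// \xc2\xa0, which do not occur in $text, so nothing is replaced.
$unchanged = str_replace('\xc2\xa0', ' ', $text);
var_dump($unchanged === $text); // bool(true)

// Double quotes: PHP converts the escapes into the two bytes first.
$viaStrReplace = str_replace("\xc2\xa0", ' ', $text);
var_dump($viaStrReplace); // string(3) "a b"

// preg_replace: the regex engine interprets \xc2\xa0 itself,
// so single quotes are enough here.
$viaPregReplace = preg_replace('/\xc2\xa0/', ' ', $text);
var_dump($viaPregReplace); // string(3) "a b"
```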