What is DOMDocument doing to my string?

You can add an XML encoding declaration as a hint (and take it out again later; see the sketch after the sample outputs below). This works for me on anything that is not stock CentOS 5.x (Ubuntu, cPanel's PHP):

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$str = '<p>Hello®</p>';
var_dump(mb_detect_encoding($str)); // check which encoding PHP detects for the literal
// the <?xml encoding="utf-8"> hint makes loadHTML() treat the string as UTF-8
// instead of its default assumption of ISO-8859-1
$dom->loadHTML('<?xml encoding="utf-8">'.$str);
var_dump($dom->saveHTML());

This is what you get:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&reg;</p></body></html>

Except on days when you get this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<?xml encoding="utf-8"><html><body><p>Hello&Acirc;&reg;</p></body></html>
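To "take it out later", one way (just a sketch, not part of the original answer) is to remove the processing-instruction node that the encoding hint leaves behind once the document has been parsed:

<?php
$str = '<p>Hello®</p>';
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<?xml encoding="utf-8">'.$str);

// strip the <?xml encoding="utf-8"> processing instruction again
foreach ($dom->childNodes as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        $dom->removeChild($node);
        break;
    }
}
$dom->encoding = 'UTF-8'; // keep the document marked as UTF-8 for later saveHTML() calls

var_dump($dom->saveHTML());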

I fixed the double-encoding by decoding the UTF-8 before passing it to loadHTML:

// utf8_decode() converts the UTF-8 input to ISO-8859-1, which is what loadHTML() assumes by default
$dom->loadHTML( utf8_decode( $html ) );

saveHTML() seems to convert special chars like German umlauts into their HTML entities (although I set $dom->substituteEntities = false;... o.O); see the sketch below for a possible workaround.

This is quite strange, though, as the documentation states:

The DOM extension uses UTF-8 encoding.

(http://www.php.net/manual/de/class.domdocument.php, search for utf8)

Oh dear, encoding in PHP poses problems again and again... a never-ending story.
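If you need literal UTF-8 characters rather than entities in the output, one possible workaround (just a sketch, not something the manual prescribes) is to decode the entities again after serializing, assuming the final output is UTF-8:

<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<?xml encoding="utf-8"><p>Schöne Grüße</p>');

// saveHTML() may emit &ouml;, &uuml;, &szlig; etc. for the non-ASCII characters
$html = $dom->saveHTML();

// turn the entities back into literal UTF-8 characters
echo html_entity_decode($html, ENT_QUOTES, 'UTF-8');

Keep in mind that html_entity_decode() also decodes markup entities like &lt; and &amp;, so only do this on output where that is acceptable.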


Your text editor says "®" in UTF-8, but the bytes in the file say "Â®" in Latin-1 (or a similar encoding), which is what PHP is using to read them. Using the character entity reference (&reg;) will remove this ambiguity.

>>> print u'®'.encode('utf-8').decode('latin-1')
Â®
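The same round trip in PHP (a sketch; it assumes the source file itself is saved as UTF-8), followed by the entity-reference version that sidesteps the problem entirely:

<?php
// '®' saved as UTF-8 is the byte sequence C2 AE; read as Latin-1,
// those two bytes come out as "Â®", just like the Python example above
echo mb_convert_encoding('®', 'UTF-8', 'ISO-8859-1'), "\n"; // Â®

// the character entity reference is plain ASCII, so it parses the same
// no matter which encoding loadHTML() assumes
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<p>Hello&reg;</p>');
echo $dom->saveHTML();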