Does PHP not have a function for XML-safe entity decoding? Is there no xml_entity_decode?

This question keeps attracting, from time to time, a "false answer" (see the answers). This is perhaps because people do not pay attention, and because there is NO ANSWER: PHP lacks a built-in solution.

... So, let's repeat my workaround (which is NOT an answer!) so as not to create more confusion:

The best workaround

Pay attention:

The function xml_entity_decode() below is the best workaround (better than any other).
The function below is not an answer to the present question; it is only a workaround.

  function xml_entity_decode($s) {
    // Illustrates how a (hypothetical) PHP built-in function MUST work:
    // protect the XML-reserved entities with placeholders, decode everything
    // else, then restore the reserved entities.
    static $XENTITIES = array('&amp;','&gt;','&lt;');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s);
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // ENT_HTML5 requires PHP 5.4+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
  }

To test and to demonstrate that you have a better solution, please test first with this simple benchmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002

    /* 0.0014
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCHMARKS WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX, 
     " seconds</h1>";

Use the DTD when loading the JATS XML document, as it will define any mapping from named entities to Unicode characters, then set the encoding to UTF-8 when saving:

$doc = new DOMDocument;
$doc->load($inputFile, LIBXML_DTDLOAD | LIBXML_NOENT);
$doc->encoding = 'UTF-8';
$doc->save($outputFile);
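
As a rough illustration of what the DTD-based approach does, here is a minimal sketch. It assumes a tiny internal DTD subset (declared inline, with only two entities) as a stand-in for the external JATS DTD, so that it runs without any files; with a real document the entity definitions would come from the DTD loaded via LIBXML_DTDLOAD.

```php
<?php
// Assumption: this inline DOCTYPE is a hypothetical stand-in for the JATS DTD.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE Foo [
  <!ENTITY euro "&#8364;">
  <!ENTITY Ccedil "&#199;">
]>
<Foo>&euro;&amp;foo &Ccedil;</Foo>
XML;

$doc = new DOMDocument;
// LIBXML_NOENT substitutes entity references with their declared content.
$doc->loadXML($xml, LIBXML_NOENT);
// Setting the encoding makes the serializer emit raw UTF-8 characters.
$doc->encoding = 'UTF-8';
$result = $doc->saveXML();

echo $result;
```

The serialized document should contain `<Foo>€&amp;foo Ç</Foo>`: the named entities are expanded, while `&amp;` is kept escaped because the serializer re-escapes reserved characters.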

I had the same problem because someone used HTML templates to create XML instead of using SimpleXML. Sigh... Anyway, I came up with the following. It's not as fast as yours, but it's not an order of magnitude slower, and it is less hacky. Yours will inadvertently convert #_x_amp#; to &amp;, however unlikely its presence in the source XML.

Note: I'm assuming default encoding is UTF-8

// Search for named entities (strings like "&abc1;").
echo preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

/* <Foo>€&amp;foo Ç</Foo> */
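
For reuse, the same idea can be wrapped in a function. This is only a sketch with some assumptions of mine: the name decode_named_entities() is hypothetical, the flags are passed explicitly (ENT_HTML5 and ENT_QUOTES, so the result does not depend on the PHP version's defaults), and entity names that html_entity_decode() does not recognise are returned untouched rather than re-escaped.

```php
<?php
// Hypothetical helper: decode named HTML entities but keep the result XML-safe.
function decode_named_entities(string $xml): string
{
    return preg_replace_callback('#&[A-Z0-9]+;#i', function (array $m): string {
        // Decode a single named entity using the HTML5 entity table.
        $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $m[0]) {
            // Unknown entity name: leave it alone instead of escaping its "&".
            return $m[0];
        }
        // Re-escape only the characters XML reserves (& < > " '),
        // so "&amp;" stays "&amp;" whereas "&euro;" becomes "€".
        return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
    }, $xml);
}

echo decode_named_entities("<Foo>&euro;&amp;foo &Ccedil;</Foo>"), "\n";
// <Foo>€&amp;foo Ç</Foo>
```

Using ENT_QUOTES on both calls means `&quot;` and `&apos;` survive the round trip, which matters if the entities appear inside attribute values.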

Also, if you want to replace special characters with numbered entities (in case you don't want a UTF-8 XML), you can easily add a function to the above code:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

echo mb_encode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>&#8364;&amp;foo &#199;</Foo> */

In your case you want it the other way around. Decode numbered entities to UTF-8:

// Search for named entities (strings like "&abc1;").
$xml_utf8 = preg_replace_callback('#&[A-Z0-9]+;#i', function ($matches) {
    // Decode the entity and re-encode as XML entities. This means "&amp;"
    // will remain "&amp;" whereas "&euro;" becomes "€".
    return htmlentities(html_entity_decode($matches[0]), ENT_XML1);
}, "<Foo>&euro;&amp;foo &Ccedil;</Foo>") . "\n";

// Decodes the remaining (uncaught) numbered entities to UTF-8.
echo mb_decode_numericentity($xml_utf8, [0x80, 0xffff, 0, 0xffff]);

/* <Foo>€&amp;foo Ç</Foo> */
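
To make the role of the conversion map clearer, here is a small round-trip sketch. It assumes the mbstring extension is loaded and passes 'UTF-8' explicitly instead of relying on the default encoding; the variable names are mine. The map [0x80, 0xffff, 0, 0xffff] reads as: code points from 0x80 to 0xffff, offset 0, mask 0xffff. ASCII characters, including the XML-reserved ones, fall outside that range.

```php
<?php
// Conversion map: [start, end, offset, mask] covering U+0080..U+FFFF.
$map = [0x80, 0xffff, 0, 0xffff];

// UTF-8 characters in range become decimal numeric entities.
$encoded = mb_encode_numericentity('€ & ∬', $map, 'UTF-8');
echo $encoded, "\n"; // &#8364; & &#8748;

// Decoding restores the original characters.
$decoded = mb_decode_numericentity($encoded, $map, 'UTF-8');
echo $decoded, "\n"; // € & ∬

// A numeric entity below 0x80 (here "&", code point 38) is outside the map
// and is therefore left as it is, which keeps the XML well-formed.
$ascii = mb_decode_numericentity('&#38;', $map, 'UTF-8');
echo $ascii, "\n";
```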

Benchmark

I've added a benchmark for good measure. This also demonstrates the flaw in your solution for clarity. Below is the input string I used.

<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>

Your method

php -r '$q=["&amp;","&gt;","&lt;"];$y=["#_x_amp#;","#_x_gt#;","#_x_lt#;"]; $s=microtime(1); for(;++$i<1000000;)$r=str_replace($y,$q,html_entity_decode(str_replace($q,$y,"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),ENT_HTML5|ENT_NOQUOTES)); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é &amp; ∬</Foo>
=====
Time taken: 2.0397531986237

My method

php -r '$s=microtime(1); for(;++$i<1000000;)$r=preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; &#8748;</Foo>
=====
Time taken: 4.045273065567

My method (with unicode to numbered entity):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_encode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>&#8364;&amp;foo &#199; &#233; #_x_amp#; &#8748;</Foo>
=====
Time taken: 5.4407880306244

My method (with numbered entity to unicode):

php -r '$s=microtime(1); for(;++$i<1000000;)$r=mb_decode_numericentity(preg_replace_callback("#&[A-Z0-9]+;#i",function($m){return htmlentities(html_entity_decode($m[0]),ENT_XML1);},"<Foo>&euro;&amp;foo &Ccedil; &eacute; #_x_amp#; &#8748;</Foo>"),[0x80,0xffff,0,0xffff]); $t=microtime(1)-$s; echo"$r\n=====\nTime taken: $t\n";'

<Foo>€&amp;foo Ç é #_x_amp#; ∬</Foo>
=====
Time taken: 5.5400078296661
