How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?

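Looks like when you map to text, the entities are decoded into plain characters, but when you work with the element nodes and use their content, you get markup with the entities in place. Strictly speaking, content on an element renders its children again and re-escapes them, so only the XML-special characters (&, <, >, ", ') come back as entities. This minimal example: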

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;
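
For comparison, here is a small sketch that calls both text and content on the same element. The variable names are only for illustration, and the output noted in the comments assumes a reasonably recent Mojolicious.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
my $p   = $dom->at('p');

# text gives the decoded characters of the element's own text nodes
my $text = $p->text;

# content renders the children back to markup, escaping & and " again
my $content = $p->content;

print "text:    $text\n";       # text:    this & "that"
print "content: $content\n";    # content: this &amp; &quot;that&quot;
```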

If you want to keep your loop and map, replace map('text') with map('content') like this:

for my $phrase ($dom->find('p')->map('content')->each) {

If you have nested tags and want only the text inside them (not the nested tags themselves), you'll need to walk the DOM tree:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

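# Recursively print the content of every text node below $node.
sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        # Text node: print its content
        print $node->content(), "\n";
    }
    elsif ($node->type eq "tag") {
        # Element: descend into its children instead of printing the tag
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}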

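which prints the following. Note that content on a text node returns the decoded text, so the entities are not kept in this case: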

this & 
"
that
"
done
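
If the entities should survive the tree walk as well, one option is to escape each text node again before using it. The following is a minimal sketch, assuming xml_escape from Mojo::Util is acceptable for this; escaped_texts is a hypothetical helper name, not part of Mojo::DOM. Like content on an element, it only produces entities for &, <, >, " and ', so something like &nbsp; in the source would come out as the literal character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(xml_escape);

# Hypothetical helper: return the text of every text node below $node,
# escaped again with xml_escape.
sub escaped_texts {
    my ($node) = @_;
    return xml_escape($node->content) if $node->type eq 'text';
    return map { escaped_texts($_) } $node->child_nodes->each
        if $node->type eq 'tag' || $node->type eq 'root';
    return;    # comments, doctype and other node types are skipped
}

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

print "$_\n" for map { escaped_texts($_) } $dom->find('p')->each;
```

With the same input as above, this should print "this &amp; ", "&quot;", "that", "&quot;" and "done", one per line.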
