How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?
It looks like mapping to text gives you the decoded text, with the entities replaced by the characters they stand for, whereas working with the element nodes and using their content gives you the serialized markup with the entities preserved. This minimal example:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}
prints:
this &amp; &quot;that&quot;
If you want to keep your loop and map, replace map('text') with map('content'), like this:
for my $phrase ($dom->find('p')->map('content')->each) {
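To see the two accessors side by side, here is a minimal sketch. It assumes a reasonably recent Mojolicious release, and the variable names @decoded and @escaped are just for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');

# text() returns the element's text with entities decoded
my @decoded = $dom->find('p')->map('text')->each;

# content() re-serializes the element's children, escaping special characters
my @escaped = $dom->find('p')->map('content')->each;

print "text:    $_\n" for @decoded;    # text:    this & "that"
print "content: $_\n" for @escaped;    # content: this &amp; &quot;that&quot;
```

Note that content renders the parsed tree again rather than echoing the original bytes, so what you get back is the escaped form of the text, which need not be byte-for-byte identical to the input.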
If you have nested tags and want only the text (not the nested tag names, just their contents), you'll need to walk the DOM tree:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new('<p><i>this & <b>"</b><b>that</b><b>"</b></i></p><p>done</p>');
for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

# Recursively print the content of every text node below $node
sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {
        # Descend into the children of element nodes
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}
which prints:
this &
"
that
"
done
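One caveat, as far as I can tell from the Mojo::DOM documentation and source: on a text node, content returns the raw string, which is already decoded, while to_string renders the node and escapes it again. If you need the entities on individual text nodes, a sketch along these lines may help. It assumes a recent Mojolicious release and uses descendant_nodes instead of hand-written recursion; @texts is just an illustrative name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b></i></p>');

# descendant_nodes walks the whole subtree; keep only the text nodes
my @texts = $dom->at('p')->descendant_nodes
                ->grep(sub { $_->type eq 'text' })->each;

for my $node (@texts) {
    # content: raw (decoded) text; to_string: rendered (escaped) text
    print "raw: [", $node->content, "]  rendered: [", $node->to_string, "]\n";
}
```

Here to_string is the escaped counterpart of content for text nodes, so you can pick whichever form you need per node.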