How to parse HTML in PHP?
I used DOMDocument and DOMXPath to solve this. Here is the code:
<?php
$dom = new DOMDocument();
$test='<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>';
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);
$heading=parseToArray($xpath,'Heading1-H');
$content=parseToArray($xpath,'Normal-H');
var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";
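// Return the text of every <span> whose class attribute equals $class.
function parseToArray($xpath, $class)
{
    $xpathquery = "//span[@class='" . $class . "']";
    $elements = $xpath->query($xpathquery);
    $resultarray = array();
    // query() returns false (not null) when the expression is invalid
    if ($elements !== false) {
        foreach ($elements as $element) {
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                $resultarray[] = $node->nodeValue;
            }
        }
    }
    // always return an array, even when nothing matched
    return $resultarray;
}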
Live result: http://saji89.codepad.org/2TyOAibZ
Here's an alternative way to parse the HTML using DiDOM, which offers significantly better performance in terms of speed and memory footprint.
Install it with Composer:
composer require imangazaliev/didom
<?php
use DiDom\Document;
require_once('vendor/autoload.php');
$html = <<<HTML
<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>
HTML;
$document = new Document($html);
// find chapter headings
$elements = $document->find('.Heading1-H');
$headings = [];
foreach ($elements as $element) {
    $headings[] = $element->text();
}
// find chapter texts
$elements = $document->find('.Normal-H');
$chapters = [];
foreach ($elements as $element) {
    $chapters[] = $element->text();
}
echo("Headings\n");
foreach ($headings as $heading) {
    echo("- {$heading}\n");
}
echo("Chapter texts\n");
foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}
One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it, you will be pretty happy with what you can achieve.
See the following pages on php.net:
http://php.net/manual/en/class.domdocument.php
http://php.net/manual/en/class.domxpath.php
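As a rough sketch of how the two classes work together, the snippet below loads a shortened version of the question's markup and pairs each heading with the text that follows it. The helper name textsByClass() and the two-chapter sample string are assumptions made for illustration. The sketch also assumes simple class names without quotes and an equal number of headings and paragraphs, since array_combine() requires both arrays to have the same length.

```php
<?php
// Illustrative helper (not part of the DOM extension): collect the text of
// every <span> that carries the given class.
function textsByClass(DOMXPath $xpath, string $class): array
{
    // Match the class as a whole token, so class="foo Heading1-H" also matches.
    // Assumes $class contains no quote characters.
    $query = sprintf(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return []; // the expression was invalid
    }
    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Shortened sample markup, same structure as in the question.
$html = '<p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>'
      . '<p class="Heading1-P"><span class="Heading1-H">Chapter 2</span></p>'
      . '<p class="Normal-P"><span class="Normal-H">This is chapter 2</span></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Use the headings as keys and the paragraph texts as values.
$chapters = array_combine(
    textsByClass($xpath, 'Heading1-H'),
    textsByClass($xpath, 'Normal-H')
);
print_r($chapters);
```

Matching on a class token rather than on the exact attribute value is a design choice here: it keeps working if the markup later gains a second class on the same span.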
Hope this helps.
Take a look at PHP Simple HTML DOM Parser.
It has a jQuery-like syntax, so you can easily select any element you want by ID or class.
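// include/require the Simple HTML DOM parser file first
// (the path below is an assumption; adjust it to where you saved the library)
require_once 'simple_html_dom.php';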
$html_string = '
<p class="Heading1-P">
<span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
<span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
<span class="Normal-H">This is chapter 3</span>
</p>';
$html = str_get_html($html_string);
// initialise the result arrays so they exist even when nothing matches
$heading = [];
$content = [];
foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}