PHP "pretty print" HTML (not Tidy)
you're right, formatOutput does not seem to indent HTML output (others are confused by this too). XML output is indented as expected, even for markup that was loaded rather than built node by node.
    <?php
    function tidyHTML($buffer) {
        // load our document into a DOM object
        $dom = new DOMDocument();
        // we want nice output
        $dom->preserveWhiteSpace = false;
        $dom->loadHTML($buffer);
        $dom->formatOutput = true;
        return $dom->saveHTML();
    }
    // start output buffering, using our nice
    // callback function to format the output.
    ob_start("tidyHTML");
    ?>
    <html><head>
    <title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
    <?php
    // this will be called implicitly, but we'll
    // call it manually to illustrate the point.
    ob_end_flush();
    ?>
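result with saveHTML() (one element per line, but no indentation):
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
    <html>
    <head>
    <title>foo bar</title>
    <meta name="bar" value="foo">
    </head>
    <body>
    <h1>bar foo</h1>
    <p>It's like comparing apples to oranges.</p>
    </body>
    </html>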
the same with saveXML() ...
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "">
    <html>
      <head>
        <title>foo bar</title>
        <meta name="bar" value="foo"/>
      </head>
      <body>
        <h1>bar foo</h1>
        <p>It's like comparing apples to oranges.</p>
      </body>
    </html>
probably forgot to set preserveWhiteSpace=false before loadHTML?
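to illustrate why the order matters, here is a minimal sketch. `formatXml()` and its `$stripBeforeLoad` flag are hypothetical names made up for this example, and it uses loadXML()/saveXML() rather than loadHTML() because, as noted above, saveHTML() does not indent either way. the assumption is that whitespace-only text nodes left in the tree are what stop formatOutput from indenting.

```php
<?php
// Hypothetical helper for illustration only, not part of the original answer.
function formatXml(string $xml, bool $stripBeforeLoad): string
{
    $dom = new DOMDocument();
    if ($stripBeforeLoad) {
        // must be set before loading so the parser drops whitespace-only text nodes
        $dom->preserveWhiteSpace = false;
    }
    $dom->loadXML($xml);
    // setting it here is too late: the whitespace nodes are already in the tree
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    // serialize only the root element, so no XML declaration is emitted
    return $dom->saveXML($dom->documentElement);
}

$xml = "<a>\n<b>x</b>\n</a>";
echo formatXml($xml, true), "\n";   // the nested <b> gets indented
echo formatXml($xml, false), "\n";  // original line breaks are kept, nothing is indented
```

so if the output of your own code is not indented, check where preserveWhiteSpace is set relative to the load call first.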
disclaimer: i stole most of the demo code from tyson clugg/php manual comments. lazy me.
UPDATE: i now remember that some years ago i tried the same thing and ran into the same problem. i fixed it with a dirty workaround (it wasn't performance critical): i converted back and forth between SimpleXML and DOM until the problem vanished. i suppose the conversion got rid of those nodes. maybe load with DOM, import with simplexml_import_dom(), then output the string, parse it with DOM again and then print it pretty. as far as i remember this worked (but it was really slow).
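a rough sketch of that round trip, reconstructed from the description above and not the code i actually used back then. `prettyPrintViaSimpleXml()` is a made-up name, and the exact sequence of steps is an assumption. note that the result is XML-style markup (e.g. `<meta .../>`), not HTML.

```php
<?php
// Hypothetical reconstruction of the DOM -> SimpleXML -> DOM workaround.
function prettyPrintViaSimpleXml(string $html): string
{
    // collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);

    // 1. load the HTML with DOM
    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($html);

    // 2. import into SimpleXML and dump the document as an XML string
    $simple = simplexml_import_dom($dom);
    $xmlString = $simple->asXML();

    // 3. parse that string with a fresh DOM document and pretty-print it
    $pretty = new DOMDocument();
    $pretty->preserveWhiteSpace = false;
    $pretty->formatOutput = true;
    $pretty->loadXML($xmlString);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // serialize only the root element to skip the XML declaration and DOCTYPE
    return $pretty->saveXML($pretty->documentElement);
}

echo prettyPrintViaSimpleXml(
    '<html><head><title>t</title></head><body><p>x</p></body></html>'
);
```

every step parses or serializes the whole document again, which would explain why it was so slow.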
The result:
<!DOCTYPE html>
<title>My website</title>
Please consider:
    function indentContent($content, $tab = "\t") {
        // add marker linefeeds to aid the pretty-tokeniser (a linefeed between all adjacent tag boundaries)
        $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content);
        $token = strtok($content, "\n"); // now indent the tags
        $result = ''; // holds the formatted version as it is built
        $pad = 0;     // current indent level
        $indent = 0;  // change of the indent level after the current line
        // scan each line and adjust the indent based on opening/closing tags
        while ($token !== false && strlen($token) > 0) {
            $token = trim($token);
            // test for the various tag states
            if (preg_match('/.+<\/\w[^>]*>$/', $token)) {
                // 1. opening and closing tags on the same line - no change
                $indent = 0;
            } elseif (preg_match('/^<\/\w/', $token)) {
                // 2. closing tag - outdent now (never below zero)
                $pad = max(0, $pad - 1);
                $indent = 0;
            } elseif (preg_match('/^<\w(?:[^>]*[^\/])?>/', $token)) {
                // 3. opening tag - don't pad this one, only subsequent lines (and only if it isn't a void element)
                $isVoid = preg_match('/^<(area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)[\s>]/i', $token);
                $indent = $isVoid ? 0 : 1;
            } else {
                // 4. no indentation needed
                $indent = 0;
            }
            if (preg_match('/^<textarea\b[^>]*>$/i', $token)) {
                // no linefeed here, so an empty <textarea></textarea> pair stays on one line
                $result .= str_repeat($tab, $pad) . $token;
            } elseif ($token == "</textarea>") {
                // no padding, it directly follows the opening tag
                $result .= $token . "\n";
            } else {
                // pad the line with the required number of tabs and add it with a linefeed
                $result .= str_repeat($tab, $pad) . $token . "\n";
            }
            $token = strtok("\n"); // get the next token
            $pad += $indent;       // update the pad size for subsequent lines
        }
        return $result;
    }
    // $htmldoc is a DOMDocument object
    $niceHTMLwithTABS = indentContent($htmldoc->saveHTML(), "\t");
    echo $niceHTMLwithTABS;
Will result in HTML that has:
- Indentation based on "levels"
- Line breaks after block level elements
- Inline and self-closing elements left unaffected
The function (which is a method of a class I use) is largely based on: