Minifying final HTML output using regular expressions with CodeIgniter

For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}
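
To see what the first (consuming) part of the pattern does on its own, here is a minimal sketch that applies only the whitespace alternation, with the blacklist lookahead left out. The sample string and variable names are made up for illustration.

```php
<?php
// Only the whitespace-matching half of the regex; the lookahead that
// protects <pre>/<textarea> is deliberately omitted in this sketch.
$ws = '/(?>[^\S ]\s*|\s{2,})/';

// Hypothetical sample: a double space, a newline followed by a space,
// and one single space that should be left alone.
$sample = "a  b\n c d";

$result = preg_replace($ws, ' ', $sample);

var_dump($result === 'a b c d'); // bool(true)
```

The single space between "c" and "d" is never matched: `[^\S ]` excludes the space character and `\s{2,}` needs at least two whitespace characters.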

I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.

There is an unnecessary (non-capturing) group which can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings, which can overflow the stack and cause the Apache/PHP executable to seg-fault and crash silently, with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in that document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and contains many non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit far too high, at 100000. (According to the PCRE docs, it should be set to the stack size divided by 500.)
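
The "stack size divided by 500" rule of thumb is easy to check by hand. The sketch below just does the arithmetic for the two stack sizes mentioned above; the helper name `pcreRecursionLimitForStack` is made up for this illustration and is not part of PHP or PCRE.

```php
<?php
// Hypothetical helper: applies the "stack size / 500" rule of thumb
// quoted above. Integer division, because the ini value is a whole number.
function pcreRecursionLimitForStack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

echo pcreRecursionLimitForStack(256 * 1024), "\n";      // 524   (256KB stack)
echo pcreRecursionLimitForStack(8 * 1024 * 1024), "\n"; // 16777 (8MB stack)
```

The resulting value would then be passed as a string to ini_set("pcre.recursion_limit", ...) before the regex is run.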

Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
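
If calling exit() is too drastic for your application, one alternative is to fall back to the unminified text when the regex engine gives up. The following is only a sketch under that assumption: the helper name `replaceOrKeep` and its by-reference `$errorCode` parameter are invented here, while preg_replace() and preg_last_error() are the standard PHP functions.

```php
<?php
// Hypothetical wrapper: run preg_replace(), but return the untouched
// subject instead of null when PCRE reports an error.
function replaceOrKeep(string $pattern, string $replacement, string $subject, ?int &$errorCode = null): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    // PREG_NO_ERROR on success; otherwise e.g. PREG_RECURSION_LIMIT_ERROR
    // or PREG_BACKTRACK_LIMIT_ERROR, depending on what went wrong.
    $errorCode = preg_last_error();

    return $result === null ? $subject : $result;
}

$err = null;
$out = replaceOrKeep('/\s{2,}/', ' ', "a   b", $err);
// Here $out is "a b" and $err is PREG_NO_ERROR.
```

The caller can log `$errorCode` and still serve the page, which trades a slightly larger response for not failing the whole request.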

p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved with helping the Drupal community while they were wrestling with this issue. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.

Hope this helps.

Sorry for not posting this as a comment; I don't have enough reputation ;)

I want to urge everybody not to implement such a regex without checking for performance penalties. Shopware implemented the first regex (from Alan/ridgerunner) for their HTML minifier, and it "blew up" every shop with larger pages.

For complex problems, a combined solution (regex plus some other logic) is usually faster and easier to maintain (unless you are Damian Conway).

I also want to mention that most minifiers can break your code (JavaScript and HTML) when a script block itself contains another script block, e.g. one written via document.write.
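
One way this can go wrong, sketched under the assumption that the minifier protects script blocks with a lazy pattern of the form `<(script)[^>]*?>.*?</\1>` (the shape used by the placeholder-based code in this thread): the match ends at the first closing tag, so the tail of the outer script is no longer protected. The sample markup is made up for illustration.

```php
<?php
// Hypothetical page fragment: a script that writes another script tag.
$html = "<script>document.write('<script>x()</script>');\n\n  var a = 1;</script>";

// Lazy "protect this block" pattern: stops at the FIRST </script>.
preg_match('!<(script)[^>]*?>.*?</\1>!is', $html, $m);

// The part that was recognised as the script block...
$protected = $m[0];
// ...and the remainder, which a minifier would treat as ordinary HTML.
$unprotected = substr($html, strlen($protected));

var_dump($protected === "<script>document.write('<script>x()</script>"); // bool(true)
var_dump($unprotected === "');\n\n  var a = 1;</script>");                // bool(true)
```

Everything in `$unprotected` is still JavaScript, but it would be whitespace-collapsed (and comment-stripped) like markup.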

Attached is my solution (an optimized version of user2677898's snippet). I simplified the code and ran some tests. Under PHP 7.2 my version was ~30% faster for my particular test case. Under PHP 7.3 and 7.4 the old variant gained a lot of speed and is only ~10% slower. My version is also easier to maintain because the code is less complex.

function filterHtml($content)
{
    // List of untouchable HTML-tags.
    $unchanged = 'script|pre|textarea';

    // It is assumed that this placeholder could not appear organically in your
    // output. If it can, you may have an XSS problem.
    $placeholder = "@@<'-pLaChLdR-'>@@";

    // Some helper variables.
    $unchangedBlocks  = [];
    $unchangedRegex   = "!<($unchanged)[^>]*?>.*?</\\1>!is";
    $placeholderRegex = "!$placeholder!";

    // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
    $content = preg_replace_callback(
        $unchangedRegex,
        function ($match) use (&$unchangedBlocks, $placeholder) {
            array_push($unchangedBlocks, $match[0]);
            return $placeholder;
        },
        $content
    );

    // Remove HTML comments, but not SSI
    $content = preg_replace('/<!--[^#](.*?)-->/s', '', $content);

    // Remove whitespace (spaces, newlines and tabs)
    $content = trim(preg_replace('/[ \n\t]{2,}|[\n\t]/m', ' ', $content));

    // Replace the placeholders with the original content.
    $content = preg_replace_callback(
        $placeholderRegex,
        function ($match) use (&$unchangedBlocks) {
            // Paranoia check: every placeholder must have a stored block.
            if (count($unchangedBlocks) == 0) {
                throw new \RuntimeException("Found too many placeholders in input string");
            }
            return array_shift($unchangedBlocks);
        },
        $content
    );

    return $content;
}

I implemented the answer from @ridgerunner in two projects, and ended up hitting some severe slowdowns (10-30 second request times) in staging for one of the projects. I found out that I had to set both pcre.recursion_limit and pcre.backtrack_limit quite low for it to even work, but even then it would give up after about 2 seconds of processing and return false.

Since then, I've replaced it with this solution (with an easier-to-grasp regex), which is inspired by the outputfilter.trimwhitespace function from Smarty 2. It does no backtracking or recursion, and works every time (instead of catastrophically failing once in a blue moon):

function filterHtml($input) {
    // Remove HTML comments, but not SSI
    $input = preg_replace('/<!--[^#](.*?)-->/s', '', $input);

    // The content inside these tags will be spared:
    $doNotCompressTags = ['script', 'pre', 'textarea'];
    $matches = [];

    foreach ($doNotCompressTags as $tag) {
        $regex = "!<{$tag}[^>]*?>.*?</{$tag}>!is";

        // It is assumed that this placeholder could not appear organically in your
        // output. If it can, you may have an XSS problem.
        $placeholder = "@@<'-placeholder-$tag'>@@";

        // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
        $input = preg_replace_callback(
            $regex,
            function ($match) use ($tag, &$matches, $placeholder) {
                $matches[$tag][] = $match[0];
                return $placeholder;
            },
            $input
        );
    }

    // Remove whitespace (spaces, newlines and tabs)
    $input = trim(preg_replace('/[ \n\t]+/m', ' ', $input));

    // Restore the blocks we replaced with placeholders beforehand. Tags are processed in
    // reverse order, so that a placeholder captured inside a later tag's block (e.g. a
    // <script> inside a <pre>) is back in the input before we look for it.
    foreach (array_reverse($matches, true) as $tag => $blocks) {
        $placeholder = "@@<'-placeholder-$tag'>@@";
        $placeholderLength = strlen($placeholder);
        $position = 0;

        foreach ($blocks as $block) {
            $position = strpos($input, $placeholder, $position);
            if ($position === false) {
                throw new \RuntimeException("Missing placeholder of type $tag in input string");
            }
            $input = substr_replace($input, $block, $position, $placeholderLength);
            // Continue searching after the block we just restored.
            $position += strlen($block);
        }
    }

    return $input;
}
