What is the correct way to detect whether string inputs contain HTML or not?
I don't think you need to implement a huge algorithm to check whether a string contains unsafe data; filters and regular expressions usually do the job. But if you need a more thorough check, maybe this will fit your needs:
<?php
$strings = array();
$strings[] = <<<EOD
';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
EOD;
$strings[] = <<<EOD
'';!--"<XSS>=&{()}
EOD;
$strings[] = <<<EOD
<SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
EOD;
$strings[] = <<<EOD
This is a safe text
EOD;
$strings[] = <<<EOD
<IMG SRC="javascript:alert('XSS');">
EOD;
$strings[] = <<<EOD
<IMG SRC=javascript:alert('XSS')>
EOD;
$strings[] = <<<EOD
<IMG SRC=javascript:alert('XSS')>
EOD;
$strings[] = <<<EOD
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
EOD;
$strings[] = <<<EOD
<SCRIPT/XSS SRC="http://ha.ckers.org/xss.js"></SCRIPT>
EOD;
$strings[] = <<<EOD
</TITLE><SCRIPT>alert("XSS");</SCRIPT>
EOD;
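libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings

// Baseline: with plain text inside <element>, <root> has exactly one child.
$sourceXML = '<root><element>value</element></root>';
$sourceXMLDocument = simplexml_load_string($sourceXML);
$sourceCount = $sourceXMLDocument->children()->count();

foreach ($strings as $string) {
    $unsafe = false;
    // Embed the input unescaped; markup in it may break the parse or change the child count.
    $XML = '<root><element>' . $string . '</element></root>';
    $XMLDocument = simplexml_load_string($XML);
    if ($XMLDocument === false) {
        // The document is no longer well-formed XML once the input is embedded.
        $unsafe = true;
    } else {
        // Only the direct children of <root> are counted here.
        $count = $XMLDocument->children()->count();
        if ($count != $sourceCount) {
            $unsafe = true;
        }
    }
    echo ($unsafe ? 'Unsafe' : 'Safe') . ': <pre>' . htmlspecialchars($string, ENT_QUOTES, 'utf-8') . '</pre><br />' . "\n";
}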
?>
In a comment above, you wrote:
Just stop the browser from treating the string as markup.
This is an entirely different problem to the one in the title. The approach in the title is usually wrong. Stripping out tags just mangles input and can lead to data loss. Ever tried to talk about HTML on a blog that strips tags? Frustrating.
The solution that is usually the correct one is to do as you said in your comment - to stop the browser from treating the string as markup. This - literally taken - is not possible. What you do instead is encode the content as HTML.
Consider the following data:
<strong>Test</strong>
Now, you can look at this in one of two ways. You can look at it as literal data: a sequence of characters. Or you can look at it as HTML: markup that includes strongly emphasised text.
If you just dump that out into an HTML document, you are treating it as HTML. You can't treat it as literal data in that context. What you need is HTML that will output the literal data. You need to encode it as HTML.
Your problem is not that you have too much HTML - it's that you have too little. When you output <, you are outputting raw data in an HTML context. You need to convert it to &lt;, which is the HTML representation of that data, before outputting it.
PHP offers a few different options for doing this. The most direct is to use htmlspecialchars() to convert it into HTML, and then nl2br() to convert the line breaks into <br> elements.
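As a rough illustration of that combination, here is a minimal sketch. The function name render_as_text is made up for this example and is not part of the answer above; it assumes the input is UTF-8 and will be placed in element content.

```php
<?php
// Hypothetical helper, named for illustration only.
function render_as_text(string $input): string
{
    // Encode &, <, >, " and ' so the browser displays them literally.
    $encoded = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    // Convert line breaks afterwards, so the inserted <br> is not itself encoded.
    return nl2br($encoded, false);
}

echo render_as_text("<strong>Test</strong>\nsecond line");
```

The order matters: calling nl2br() first would let htmlspecialchars() encode the <br> elements it had just added.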
HTML Purifier does a good job and is very easy to implement. You could also use a Zend Framework filter like Zend_Filter_StripTags.
HTML Purifier doesn't just fix HTML.
If you're just "looking for protection for print '<h3>' . $name . '</h3>'", then yes, at least the second approach is adequate, since it checks whether the value would be interpreted as markup if it weren't escaped. (In this case, the area where $name would appear is element content, and only the characters &, <, and > have special meaning when they appear in element content.) (For href and similar attributes, the check for "javascript:" may be necessary, but as you stated in a comment, that isn't a goal.)
For official sources, I can refer to the XML specification:
- Content production in section 3.1: Here, content consists of elements, CDATA sections, processing instructions, and comments (which must begin with <), references (which must begin with &), and character data (which contains any other legal character). (Although a leading > is treated as character data in element content, many people escape it along with <, and it's better safe than sorry to treat it as special.)
- Attribute value production in section 2.3: A valid attribute value consists of either references (which must begin with &) or character data (which contains any other legal character, but not < or the quote symbol used to wrap the attribute value). If you need to place string inputs in attributes in addition to element content, the characters " and ' need to be checked in addition to &, <, and possibly > (and other characters illegal in XML).
- Section 2.2: Defines what Unicode code points are legal in XML. In particular, null is illegal in an XML document and may not display properly in HTML.
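The character sets listed above can be turned into a simple detection check. This is only a sketch under the assumption that "contains markup" means "contains a character that is special in the given context"; both function names are invented for the example.

```php
<?php
// Hypothetical helpers, named for illustration only.

// True if $input contains a character that is special in element content.
function has_special_chars_for_content(string $input): bool
{
    // strpbrk() returns false when none of the listed characters occur.
    return strpbrk($input, '&<>') !== false;
}

// True if $input contains a character that is special in a quoted attribute value.
function has_special_chars_for_attribute(string $input): bool
{
    // Same set as element content, plus both quote characters.
    return strpbrk($input, '&<>"\'') !== false;
}

var_dump(has_special_chars_for_content('This is a safe text'));
var_dump(has_special_chars_for_attribute('say "hi"'));
```

A check like this only reports that escaping would change the string; it does not replace escaping the value on output.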
HTML5 (the latest working draft, which is a work in progress) describes a very elaborate parsing algorithm for HTML documents:
- Element content corresponds to the "data state" in the parsing algorithm. Here, the string input should not contain a null character, < (which begins a new tag), or & (which begins a character reference).
- Attribute values correspond to the "before attribute value state" in the parsing algorithm. For simplicity, we assume the attribute value is wrapped in double quotation marks. In that case, the parser moves to the "attribute value (double-quoted) state". In this case, the string input should not contain a null character, " (which ends the attribute value), or & (which begins a character reference).
If string inputs are to be placed in attribute values (unless placing them there is solely for display purposes), there are additional considerations to keep in mind. For example, HTML 4 specifies:
User agents should interpret attribute values as follows:
- Replace character entities with characters,
- Ignore line feeds,
- Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values[.]
Attribute value normalization is also specified in the XML specification, but apparently not in HTML5.
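To see why this matters for a check such as the "javascript:" one mentioned earlier, here is a simplified sketch of the three HTML 4 steps quoted above, applied one after another. The function name is invented for the example, and real browsers follow their own parsing rules, so treat this as an illustration rather than a faithful model of any user agent.

```php
<?php
// Hypothetical helper, named for illustration only.
function normalize_html4_attribute(string $value): string
{
    // 1. Replace character entities with characters.
    $value = html_entity_decode($value, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // 2. Ignore line feeds.
    $value = str_replace("\n", '', $value);
    // 3. Replace each carriage return or tab with a single space.
    return str_replace(["\r", "\t"], ' ', $value);
}

// A scheme split by a line feed reads as one word after normalization.
var_dump(normalize_html4_attribute("java\nscript:alert(1)"));
```

Any check on an attribute value is more meaningful when it runs on the normalized form rather than on the raw input.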
EDIT (Apr. 25, 2019): Also, be suspicious of inputs containing:
- the null code point (as it can cause parse errors in certain places, as specified in the HTML5 specification), or
- any code point illegal in XML (as it will cause parse errors upon reading the XML document),
...assuming htmlspecialchars doesn't escape those code points already.
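One way to flag such inputs is to test them against the Char production of XML 1.0 (section 2.2). The sketch below assumes the input is meant to be UTF-8; the function name is invented for the example, and treating invalid UTF-8 as suspicious is a choice made here, not something the specification requires.

```php
<?php
// Hypothetical helper, named for illustration only.
function has_illegal_xml_code_point(string $input): bool
{
    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $result = preg_match($pattern, $input);
    // 1 means an illegal code point was found; false means the input
    // was not valid UTF-8, which is also treated as suspicious here.
    return $result !== 0;
}

var_dump(has_illegal_xml_code_point("abc\0def"));
var_dump(has_illegal_xml_code_point("plain text\n"));
```

Because the null code point falls outside the Char production, this single check covers both items in the list above.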