XSS via JSON: Why does a web application not sanitize either its incoming params hash or its outgoing JSON values of malicious tags like Script?

This question comes up a lot, partly because everyone keeps repeating the mantra that input sanitation is the answer. It isn't. It's dangerous, bug-prone, and it needs to go away. You should still validate input, of course, for example by checking that each value fits the length of the column it will be stored in.

Sanitizing Input vs. Sanitizing Output


Sanitizing input gives people a false sense of security: there are many ways to get around it, and it is difficult to implement properly, which pushes developers to search Google for an outdated implementation that may or may not be secure. For this reason, it's best to sanitize output.

Part of the reason to do this late, at output, is to preserve the data exactly as it was entered. Input sanitation is also often justified as protection against SQL injection, but SQL injection is largely defeated by prepared statements, not by input sanitation. On output, replace the characters that could form script tags or other dangerous markup with HTML entities.

Here's an example of replacing potential script characters with HTML entities on output, not input:

  1. < becomes &lt;, still displays as < on the page, but without messing with the layout, or the database.

  2. > becomes &gt;, still displays as > on the page, but without messing with the layout, or the database.
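As a minimal sketch of the same idea in Python, the standard library's `html.escape` can do the replacement at render time. The `render_comment` helper and the `<p>` wrapper are hypothetical, made up for illustration; the point is only that the stored value is left untouched and the encoding happens when the HTML is built.

```python
import html


def render_comment(stored_text: str) -> str:
    """Encode a stored value at the point where it is written into HTML.

    Hypothetical helper: the value in the database stays exactly as typed.
    """
    # html.escape replaces &, < and >, and by default both quote characters.
    return "<p>" + html.escape(stored_text) + "</p>"


stored = "<script>alert(1)</script>"
print(render_comment(stored))
# prints: <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

The browser shows the text as typed, angle brackets and all, and nothing is executed.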


So Why Not Sanitize Input? What If We Correctly Implement It?

Most developers aren't IT security experts, and many wouldn't know where to start in this area. By teaching them these two common methods of protecting data, prepared statements and output sanitation, you save development time and significantly increase the overall security of your web application. Better yet, you help your developers understand why these are necessary, instead of why 2340939403424 different types of input sanitation are needed, and you prevent a lot of implementation issues that would otherwise pop up later on.

As I said before, searching Google for outdated input sanitation functions is not security; it's a false sense of security. You need to understand what is acceptable, what is not, and the process that data goes through.

With output sanitation, you don't have to worry about some weird bug happening later down the road that you forgot about. You don't have to mess with tons of different complicated functions that could be implemented incorrectly, and give a false sense of security. You don't have to worry about scripts being injected either.

But what if we replace those characters with HTML entities before they're inserted? If you do that on the client side, anyone can modify the request and bypass it. If you do it on the back end before putting the data in the database, that could work... but then there's a problem:

What if a field needs to hold one of those characters literally? Let's say a packaging description, an address, or something silly like that. Maybe it's a file name, or a value that other code already expects in one particular form.


Example of Why Input Sanitation Sucks

Maybe someone has a keyboard that produces a different apostrophe character for their name or address? A filter that only looks for the ASCII apostrophe will miss it, which could leave some databases open to Unicode-based smuggling.

Maybe you need a record that has < or > inside of it? What are you going to do, search by the HTML entity? That's pretty inefficient, and it requires a lot of hacks to get working right in many databases.

Maybe someone's name has an apostrophe in it - for example, Rory O'Cune. With input sanitation, you are destroying his name and requiring more code to deal with it. What if one of your employees is searching by last names and can't find him because it's been shortened to OCune? This is awful.

That's another reason why you use parameterized queries, and not input sanitation. With prepared statements and output sanitation, you can do this:

SELECT * FROM [table] WHERE [last_name] = @Lastname -- (or ?)

The @Lastname parameter (it would be a ? placeholder in Java) is bound as the literal value O'Cune, apostrophe included. No funny business required. No bugs to hunt down. It's far more secure, and you can just output the HTML entity if it's a page-breaking character.
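Here is a small runnable sketch of that query using Python's built-in `sqlite3` module, which uses `?` placeholders. The `people` table, its columns, and the `find_by_last_name` helper are assumptions made up for the example, not part of any real schema.

```python
import sqlite3

# In-memory database with a hypothetical "people" table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")

# The name is stored exactly as entered, apostrophe included.
conn.execute(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    ("Rory", "O'Cune"),
)


def find_by_last_name(last_name: str) -> list:
    # The ? placeholder binds the value as data; it is never spliced into the SQL text.
    cursor = conn.execute(
        "SELECT first_name, last_name FROM people WHERE last_name = ?",
        (last_name,),
    )
    return cursor.fetchall()


print(find_by_last_name("O'Cune"))       # [('Rory', "O'Cune")]
print(find_by_last_name("' OR '1'='1"))  # []
```

The second call shows a classic injection string being treated as an ordinary (non-matching) last name rather than as SQL.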


So Why Isn't Ruby Fixing This Automatically?

Why should they? I've yet to come across any implementations of JSON in any language that will automatically remove these script tags. What if you want to use JSON to display script tags for some weird reason on your website?

If the serializer stripped these tags automatically, nobody could output HTML through JSON, which I suspect would break a lot of things for a lot of people.

Thus, the answer is to do it yourself. Replace the characters that make up those script tags with HTML entities before the values get serialized and returned.
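One way this could look, sketched in Python rather than Ruby: walk the structure, HTML-escape every string value, then serialize. `escape_strings` is a hypothetical helper, and it assumes the client will drop these values straight into HTML; if the client already encodes on its side, doing this as well would double-encode the data.

```python
import html
import json


def escape_strings(value):
    """Return a copy of value with every string value HTML-escaped (hypothetical helper)."""
    if isinstance(value, str):
        return html.escape(value)
    if isinstance(value, dict):
        # Only values are escaped here; keys are assumed to be trusted field names.
        return {key: escape_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [escape_strings(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


payload = {"name": "<script>alert(1)</script>", "tags": ["a&b"], "age": 30}
print(json.dumps(escape_strings(payload)))
```

The stored data is never modified; only the copy that is about to be serialized is encoded.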


Making inputs safe is better done as late as possible, that is, when data is output to the page. Globally "sanitizing" input data is bad because it is impossible to distinguish between good data and bad data at this stage (as you say, if you're allowing HTML to be entered by your users, the framework can't tell the difference between HTML that's meant to be there and HTML that's not). Even script in input can sometimes be valid (think StackOverflow/StackExchange text boxes containing code).

As late as possible, then, your application should encode for whatever format the output is in, for example HTML or JSON. For the former, an & sign becomes &amp; so that it is properly rendered for display. For the latter, a JSON encoder should be used, and where the JSON may end up embedded in an HTML page the encoder can escape & as \u0026 inside a JSON string (\x26 is the JavaScript string-literal form and is not valid JSON).
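A rough Python sketch of the JSON side: `json.dumps` does not escape these characters by default, so the hypothetical `json_for_html` helper below adds the Unicode escapes itself after serializing. Some encoders in other languages have an option for this built in; the manual replace here is just an illustration.

```python
import json


def json_for_html(value) -> str:
    """Serialize to JSON, then replace &, < and > with JSON Unicode escapes."""
    text = json.dumps(value)
    # In serialized JSON these three characters can only occur inside strings,
    # so swapping them for Unicode escapes does not change the decoded data.
    return (
        text.replace("&", "\\u0026")
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
    )


original = {"note": "Tom & Jerry </script>"}
encoded = json_for_html(original)
print(encoded)
# prints: {"note": "Tom \u0026 Jerry \u003c/script\u003e"}

# A JSON parser turns the escapes back into the original characters.
print(json.loads(encoded) == original)  # True
```

The data round-trips unchanged, but the serialized text no longer contains a literal `</script>` that could end a surrounding script block.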

The JSON itself isn't an XSS risk, as script does not execute in a browser from a JSON response that is loaded as data (JSONP is another matter, as it is included with script src references rather than loaded as data). The XSS risk with JSON is when JavaScript on the page attempts to create or populate HTML elements with the retrieved JSON data. The JavaScript itself should either HTML encode the data or use safe object members to populate the DOM (e.g. textContent).

Sanitization can be done as an extra layer; however, proper encoding should be the focus. For example, you may wish to validate server side that a postcode or ZIP code contains only alphanumeric characters and the space character. For more complex fields, this isn't an option without severely limiting your input capabilities.
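The postcode check could be sketched like this in Python. The pattern and the 10-character limit are assumptions for the example, not a real postal standard, and `is_valid_postcode` is a made-up name.

```python
import re

# Allowlist: ASCII letters, digits and spaces only, 1 to 10 characters.
# The length limit is an assumption chosen for this example.
POSTCODE_PATTERN = re.compile(r"[A-Za-z0-9 ]{1,10}")


def is_valid_postcode(value: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return POSTCODE_PATTERN.fullmatch(value) is not None


print(is_valid_postcode("SW1A 1AA"))   # True
print(is_valid_postcode("<script>"))   # False
```

Note that this rejects bad input rather than rewriting it, so nothing a legitimate user typed is silently changed.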

Some frameworks attempt to globally sanitize data, such as .NET with its request validation feature. However, vulnerabilities in it are found all the time, including a very recent one at the time of writing. In short, it doesn't work, and it causes functional issues for the frameworks that attempt it. It also offers no protection from XSS in data that is retrieved from sources other than the application itself.