JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.

The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.

Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).

Some frameworks, including PHP's implementation of JSON, always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that JSON decoders have a problem with UTF-8.

So, I guess you just could decide which to use like this:

  • Just use UTF-8, unless your method of storage or transport between encoder and decoder is not binary-safe.

  • Otherwise, use the numeric escape sequences.


I had a problem there. When I JSON encode a string with a character like "é", every browsers will return the same "é", except IE which will return "\u00e9".

Then with PHP json_decode(), it will fail if it find "é", so for Firefox, Opera, Safari and Chrome, I've to call utf8_encode() before json_decode().

Note : with my tests, IE and Firefox are using their native JSON object, others browsers are using json2.js.


ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:

All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)