Regex to validate JSON
Yes, it's a common misconception that regular expressions can match only regular languages. In fact, the PCRE functions can match much more than regular languages; they can even match some non-context-free languages! Wikipedia's article on RegExps has a special section about it.
JSON can be recognized using PCRE in several ways! @mario showed one great solution using named subpatterns and back-references. He then noted that there should also be a solution using recursive patterns, (?R). Here is an example of such a regexp written in PHP:
$regexString = '"([^"\\\\]*|\\\\["\\\\bfnrt\/]|\\\\u[0-9a-f]{4})*"';
$regexNumber = '-?(?=[1-9]|0(?!\d))\d+(\.\d+)?([eE][+-]?\d+)?';
$regexBoolean= 'true|false|null'; // these are actually copied from Mario's answer
$regex = '/\A('.$regexString.'|'.$regexNumber.'|'.$regexBoolean.'|'; //string, number, boolean
$regex.= '\[(?:(?1)(?:,(?1))*)?\s*\]|'; //arrays
$regex.= '\{(?:\s*'.$regexString.'\s*:(?1)(?:,\s*'.$regexString.'\s*:(?1))*)?\s*\}'; //objects
$regex.= ')\Z/is';
I'm using (?1) instead of (?R) because the latter references the entire pattern, but we have \A and \Z sequences that should not be used inside subpatterns. (?1) references the regexp marked by the outermost parentheses (this is why the outermost ( ) does not start with ?:). So, the RegExp becomes 268 characters long :)
/\A("([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"|-?(?=[1-9]|0(?!\d))\d+(\.\d+)?([eE][+-]?\d+)?|true|false|null|\[(?:(?1)(?:,(?1))*)?\s*\]|\{(?:\s*"([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"\s*:(?1)(?:,\s*"([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"\s*:(?1))*)?\s*\})\Z/is
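The difference between (?1) and (?R) is easier to see on a much smaller pattern. The following is only an illustrative sketch: the balanced-parentheses patterns and the helper names are made up for this example and are not part of the answer above.

```php
<?php
// Hypothetical demo: match a string of balanced parentheses.
// Group 1 sits between \A and \z, and (?1) re-enters only that group,
// so the anchors are applied once, around the whole input.
function isBalancedWithGroup(string $text): bool
{
    return preg_match('/\A(\((?:[^()]|(?1))*\))\z/', $text) === 1;
}

// Same idea with (?R): the recursion re-enters the whole pattern,
// including \A, which cannot match in the middle of the subject.
function isBalancedWithWholePattern(string $text): bool
{
    return preg_match('/\A\((?:[^()]|(?R))*\)\z/', $text) === 1;
}

var_dump(isBalancedWithGroup('(a(b)c)'));        // bool(true)
var_dump(isBalancedWithGroup('(a(b)c'));         // bool(false), one ")" is missing
var_dump(isBalancedWithWholePattern('(a(b)c)')); // bool(false), the nested "(" hits \A
```

The last call fails even though the input is balanced, which is the reason the JSON pattern recurses into group 1 rather than into the whole expression.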
Anyway, this should be treated as a "technology demonstration", not as a practical solution. In PHP I'd validate the JSON string by calling the json_decode() function (just like @Epcylon noted). If I'm going to use that JSON once it's validated, then this is the best method, because decoding checks the string and gives me the data in one step.
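A minimal sketch of the json_decode() approach might look like this. The helper name isValidJson() is hypothetical; json_decode(), json_last_error() and JSON_ERROR_NONE are the standard PHP functions and constant.

```php
<?php
// Hypothetical helper: reports whether $text parses as JSON.
function isValidJson(string $text): bool
{
    // The decoded value is discarded here; only the error state matters.
    json_decode($text);
    return json_last_error() === JSON_ERROR_NONE;
}

var_dump(isValidJson('{"a":[1,2,{"b":null}]}')); // bool(true)
var_dump(isValidJson('{"a":[1,2,}'));            // bool(false)
```

If the data is needed afterwards, keeping the return value of json_decode() avoids parsing twice. On PHP 8.3 and later, the built-in json_validate() performs the same check without building the decoded value.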
Because of the recursive nature of JSON (nested {...} objects), regex is not well suited to validating it. Sure, some regex flavours can recursively match patterns* (and can therefore match JSON), but the resulting patterns are horrible to look at, and should never ever be used in production code IMO!
* Beware though: many regex implementations do not support recursive patterns. Of the popular programming languages, these support recursive patterns: Perl, .NET, PHP and Ruby 1.9.2.
Yes, a complete regex validation is possible.
Most modern regex implementations allow for recursive regular expressions, which can verify a complete JSON serialized structure. The json.org specification makes it quite straightforward.
$pcre_regex = '
/
(?(DEFINE)
(?<number> -? (?= [1-9]|0(?!\d) ) \d+ (\.\d+)? ([eE] [+-]? \d+)? )
(?<boolean> true | false | null )
(?<string> " ([^"\\\\]* | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9a-f]{4} )* " )
(?<array> \[ (?: (?&json) (?: , (?&json) )* )? \s* \] )
(?<pair> \s* (?&string) \s* : (?&json) )
(?<object> \{ (?: (?&pair) (?: , (?&pair) )* )? \s* \} )
(?<json> \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \s* )
)
\A (?&json) \Z
/six
';
It works quite well in PHP with the PCRE functions. It should work unmodified in Perl, and can certainly be adapted for other languages. It also succeeds with the JSON test cases.
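As a usage sketch, the pattern can be passed to preg_match(). The pattern below is copied from the answer; the wrapper function matchesJsonRegex() is a hypothetical name added for illustration. Because preg_match() returns false when the engine gives up (for example on a backtrack limit), the wrapper treats that case separately from "no match".

```php
<?php
// Pattern copied from the answer above; only the wrapper is new.
$pcre_regex = '
  /
  (?(DEFINE)
     (?<number>   -? (?= [1-9]|0(?!\d) ) \d+ (\.\d+)? ([eE] [+-]? \d+)? )
     (?<boolean>   true | false | null )
     (?<string>    " ([^"\\\\]* | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9a-f]{4} )* " )
     (?<array>     \[  (?:  (?&json)  (?: , (?&json)  )*  )?  \s* \] )
     (?<pair>      \s* (?&string) \s* : (?&json)  )
     (?<object>    \{  (?:  (?&pair)  (?: , (?&pair)  )*  )?  \s* \} )
     (?<json>   \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \s* )
  )
  \A (?&json) \Z
  /six
';

// Hypothetical wrapper: true on a match, false on no match,
// and an exception if PCRE itself reports an error.
function matchesJsonRegex(string $regex, string $text): bool
{
    $result = preg_match($regex, $text);
    if ($result === false) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result === 1;
}

var_dump(matchesJsonRegex($pcre_regex, '{"a": [1, 2.5e3, {"b": null}]}')); // bool(true)
var_dump(matchesJsonRegex($pcre_regex, '{"a": [1, 2,]}'));                 // bool(false), trailing comma
var_dump(matchesJsonRegex($pcre_regex, '[01]'));                           // bool(false), leading zero
```

On large or deeply nested inputs the PCRE limits may be reached, so a thrown exception here should be read as "could not decide", not as "invalid".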
Simpler RFC4627 verification
A simpler approach is the minimal consistency check specified in RFC4627, section 6. However, it is intended only as a security test and a basic precaution against invalid input, not as full validation:
var my_JSON_object = !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
text.replace(/"(\\.|[^"\\])*"/g, ''))) &&
eval('(' + text + ')');
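For comparison with the PHP examples, here is a rough PHP port of the character check only. It is a sketch under stated assumptions: the function name passesRfc4627Check() is hypothetical, the two regular expressions are taken from the JavaScript snippet, and the eval() step is left out on purpose.

```php
<?php
// Hypothetical port of the RFC 4627 section 6 check (character test only).
function passesRfc4627Check(string $text): bool
{
    // Step 1: drop all string literals, as text.replace() does in the snippet.
    $withoutStrings = preg_replace('/"(\\\\.|[^"\\\\])*"/', '', $text);
    if ($withoutStrings === null) {
        return false; // preg_replace() failed; treat the input as unsafe
    }
    // Step 2: reject the input if any character outside the JSON alphabet remains.
    return preg_match('/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/', $withoutStrings) === 0;
}

var_dump(passesRfc4627Check('{"a":[1,2]}')); // bool(true)
var_dump(passesRfc4627Check('alert(1)'));    // bool(false), "(" is not allowed
var_dump(passesRfc4627Check('{"a":[1,2,}')); // bool(true), although this is not valid JSON
```

The last call shows why this is only a precaution: the check looks at which characters occur, not at how they are arranged, so a real parser is still needed afterwards.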