Is there an established representation for the difference between two JSONs?

As Ira pointed out, there are some options along the lines of Levenshtein, but you'd be looking at serializing your object and comparing it lexicographically which as Ira mentioned would not take into account the JSON-specific language diff that you are looking for (two trees could be identical JSON but very different by Levenshtein distance). What you want is definitely the tree edit distance.

So to add some detail around the art of tree edit distance, the known algorithms used in this space are typically Zhang & Shasha or Klein for example and you can find python implementation of Zhang & Shasha. These algorithms will obtain the minimum number of edits to convert one tree to another thus providing your diff. However, they are somewhat slow O(n^2) at absolute best, so if you are comparing a large number of JSON objects or files, you will find yourself perfecting your golf game, doing the dishes, laundry, bathing your pets and other household miscellany.

And this is really where the art that Ira speaks of really is, because these kinds of algorithms are difficult and computationally expensive. So what you may do is get creative. One method is narrowing down the number of objects that have to be compared. For example, why calculate edit distance between two JSON objects that are clearly more similar to an intermediate than to each other? Don't calculate edit distance on objects that are identical via lexicographic comparison, If two objects are somewhat or dramatically different, perhaps forget the diff and just say there needs to be an outright replacement.

In order to apply the "art" of tree edit distance, that is saving yourself unnecessary CPU cycles, what you need is a way to provide metrics around what is meant by "somewhat similar" or "dramatically different".

To that end I've written an implementation of PQ-Gram tree edit distance approximation algorithm (http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf) that you can find on github for use in Node.js or in the browser (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram).

PQ-Gram is much, much faster than true edit distance algorithms operating in O(n log n) time and O(n) space where n is the number of nodes.

So my recommendation is to use jqgram to very quickly get a feel for what you are looking at in terms of JSON object edit distance metrics. Determine which JSON objects should be compared, versus just replaced, and then when you want the true distance to get the diff utilize Klein or Zhang & Shasha to get the actual diff.

Here's the jqgram JSON object tree edit distance approximation example taken straight out of the README for the jqgram implementation on github:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3, depth:10 },
function(result) {
    console.log(result.distance);
});

The lfn and cfn parameters specify how each JSON tree should determine the node label names and the children array for each tree root independently so that you can do fun things like compare JSON objects from different sources. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees.

JSON documents are essentially trees, with leaves containing name/value pairs.

What you want to do is transmit a minimal tree delta: the smallest set of edits that converts one tree into another.

Computing tree deltas is a bit of an art, partly because it depends on the kinds of deltas you allow (just insert/delete leaf? swap subtrees? move subtree? duplicate subtree? rename names? replace values?). You also need to take into account semantic equivalences; if you commute the position of two subtrees, is the result semantically different? (Your delta detector may see such a tree swap; the semantic identity check may eliminate it as not interesting). If you duplicate a subtree, is the answer semantically different? (I think for JSON the effective answer is "no").

You need something like a dynamic programming algorithm to determine such a minimal delta; you can take inspiration from Levenshtein distance on strings.

This is a common issue of interest to programmers regarding source code. Think of a JSON document as source code, and see the answers at https://stackoverflow.com/q/5779160/120163 for further discussion.

Is there an established representation for the difference between two JSONs?

Tags:

Json

Diff

Related

Recent Posts