How do I calculate tree edit distance?

I wrote an implementation (https://github.com/hoonto/jqgram.git) based on the existing PyGram Python code (https://github.com/Sycondaman/PyGram) for those of you who wish to use tree edit distance approximation using PQ-Gram algorithm in the browser and/or in Node.js.

The jqgram tree edit distance approximation module implements the PQ-Gram algorithm for both server-side and browser-side applications; O(n log n) time and O(n) space performant where n is the number of nodes. See the academic paper from which the algorithm comes: http://www.vldb2005.org/program/paper/wed/p301-augsten.pdf

The PQ-Gram approximation is much faster than obtaining the true edit distance via Zhang & Shasha, Klein, or Guha et. al, whom provide true edit distance algorithms that all perform minimum O(n^2) time and are therefore often unsuitable.

Often in real-world applications it is not necessary to know the true edit distance if a relative approximation of multiple trees to a known standard can be obtained. JavaScript, in the browser and now on the server with the advent of Node.js deal frequently with tree structures and end-user performance is usually critical in algorithm implementation and design; thus jqgram.

Example:

var jq = require("jqgram").jqgram;
var root1 = {
    "thelabel": "a",
    "thekids": [
        { "thelabel": "b",
        "thekids": [
            { "thelabel": "c" },
            { "thelabel": "d" }
        ]},
        { "thelabel": "e" },
        { "thelabel": "f" }
    ]
}

var root2 = {
    "name": "a",
    "kiddos": [
        { "name": "b",
        "kiddos": [
            { "name": "c" },
            { "name": "d" },
            { "name": "y" }
        ]},
        { "name": "e" },
        { "name": "x" }
    ]
}

jq.distance({
    root: root1,
    lfn: function(node){ return node.thelabel; },
    cfn: function(node){ return node.thekids; }
},{
    root: root2,
    lfn: function(node){ return node.name; },
    cfn: function(node){ return node.kiddos; }
},{ p:2, q:3, depth:10 },
function(result) {
    console.log(result.distance);
});

Note that the lfn and cfn parameters specify how each tree should determine the node label names and the children array for each tree root independently so that you can do funky things like comparing an object to a browser DOM for example. All you need to do is provide those functions along with each root and jqgram will do the rest, calling your lfn and cfn provided functions to build out the trees. So in that sense it is (in my opinion anyway) much easier to use than PyGram. Plus, it’s JavaScript, so use it client or server-side!

Now one approach you can use is to use jqgram or PyGram to get a few trees that are close and then go on to use a true edit distance algorithm against a smaller set of trees. Why spend all the computation on trees you can already easily determine are very distant, or vice versa? So you can use jqgram to narrow down choices too.


This Python library implements the Zhang-Shasha algorithm correctly: Zhang-Shasha: Tree edit distance in Python

It began as a direct port of the Java source listed in the currently accepted answer (the one with the tarball link), but that implementation is not correct and is nearly impossible to run at all.


Here's some Java source code (gzipped tarball at the bottom) for a tree edit distance algorithm that might be useful to you.

The page includes references and some slides that go through the "Zhang and Shasha" algorithm step-by-step and other useful links to get you up to speed.

The code in the link has bugs. Steve Johnson and tim.tadh have provided working Python code. See Steve Johnson's comment for more details.

Tags:

Algorithm

Tree