Python implementation of a graph-similarity-grading algorithm

What we ended up doing is implementing the algorithm described in "Heuristics for Chemical Compound Matching".

We're using NetworkX to represent the graphs and to find the max clique.

Edit:

Basically, you create a new graph (the association graph), each node (v) representing a possible pairing of a node (a) from graph A with a node (b) from graph B.

If in your application the two nodes (a,b) are either similar or not, you remove the nodes (v) from the new graph which correspond to dissimilar pairings (a,b). You connect two nodes with an edge if they don't contradict each other; for example, the pairings (a,b) and (a,c) contradict each other because they map the same node a twice (see the article for a formal definition). You then find a clique in the new graph which has the maximum number of nodes.

If in your application the two nodes' similarity is not binary, you give the new nodes weights within a range (say (0,1)). You can, heuristically, remove new nodes with similarity grades lower than a predefined threshold. You then find a clique in the new graph which has the maximum total weight (the sum of the nodes' assigned weights).

Either way, you finish up by generating the similarity grade: the size / total weight of the clique divided by a function of the attributes of the original graphs (e.g. the maximum/minimum/average of the sizes / weights of A and B).

A nice feature is that you can deduce the "source" of the similarity from the clique you found: the "stronger" pairings.

Further clarification: the constraints are application dependent. We used the approach to compare pairs of function control-flow graphs. Generally, the approach finds a matching of some nodes in the first graph to some nodes in the second graph (subgraph to subgraph). Each node in the association graph symbolizes a possible matching of a single node from the first graph to a single node in the second graph. Since eventually a clique (a subset of the nodes) is selected, an edge means two matchings don't contradict each other. To apply this to a different application, you should ask what the criteria for possible pairings are (i.e. which nodes to create), and how selecting one pairing influences the selection of another (i.e. how to connect the nodes with edges).
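
Here is a minimal sketch of that construction with NetworkX, for the binary-similarity case. The predicate similar(a, b) is an application-dependent placeholder you would supply, and normalizing by the smaller graph's size is just one of the options mentioned above:

import networkx as nx

def similarity_grade(A, B, similar):
    # Association graph: one node per pairing (a, b) judged similar.
    M = nx.Graph()
    for a in A:
        for b in B:
            if similar(a, b):
                M.add_node((a, b))
    # Connect two pairings iff they don't contradict each other, i.e.
    # they don't map the same node of A or the same node of B twice.
    pairings = list(M)
    for i, (a1, b1) in enumerate(pairings):
        for a2, b2 in pairings[i + 1:]:
            if a1 != a2 and b1 != b2:
                M.add_edge((a1, b1), (a2, b2))
    if len(M) == 0:
        return 0.0, []
    # Exact maximum clique by enumeration (exponential in the worst case).
    clique = max(nx.find_cliques(M), key=len)
    return len(clique) / min(len(A), len(B)), clique

For the weighted variant, you would store each pairing's grade as a node attribute and look for the clique with the maximum total weight instead; if your NetworkX version ships nx.max_weight_clique, note that it expects integer node weights, so grades in (0,1) need scaling first.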


I think the way you define similarity is not very standard, so it is probably easier to find implementations of the isomorphism check, graph edit distance and maximum common subgraph, and then combine those yourself.

One should understand, however, that calculating such a similarity measure might be costly, so if there are a lot of graphs you might want to do some sort of screening first.
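
For instance (a possible screen, not part of the original answer), only run the expensive measure when cheap graph invariants already agree:

import networkx as nx

def passes_screen(G1, G2):
    # Cheap necessary conditions for isomorphism; if any of these differ,
    # the expensive comparison can be skipped outright.
    return (G1.number_of_nodes() == G2.number_of_nodes()
            and G1.number_of_edges() == G2.number_of_edges()
            and sorted(d for _, d in G1.degree())
                == sorted(d for _, d in G2.degree()))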

igraph can check for isomorphism, for example.
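
A minimal check with python-igraph (the graphs here are placeholders):

import igraph as ig

g1 = ig.Graph.Ring(6)
g2 = ig.Graph.Ring(6)
print(g1.isomorphic(g2))  # True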

EDIT: actually, you probably only need the edit distance since, as I pointed out above, the presence of an MCS is usually irrelevant, and two graphs will be isomorphic anyway if the edit distance is 0.
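
If you are already using NetworkX, recent versions also provide a (worst-case exponential) graph edit distance directly:

import networkx as nx

G1 = nx.cycle_graph(4)
G2 = nx.path_graph(4)
# Deleting one edge of the 4-cycle yields the 4-path, so the distance is 1;
# a distance of 0 would mean the two graphs are isomorphic.
print(nx.graph_edit_distance(G1, G2))  # 1.0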


Another method is to use what is called Eigenvector Similarity. Basically, you calculate the Laplacian spectrum (the eigenvalues of the Laplacian matrix) of each graph. For each graph, find the smallest k such that the sum of the k largest eigenvalues constitutes at least 90% of the sum of all of the eigenvalues. If the values of k differ between the two graphs, use the smaller one. The similarity metric is then the sum of the squared differences between the k largest eigenvalues of the two graphs. This produces a similarity metric in the range [0, ∞), where values closer to zero are more similar.

For example, if using networkx:

import networkx as nx
import numpy as np

def select_k(spectrum, minimum_energy=0.9):
    # Smallest k such that the k largest eigenvalues carry at least
    # `minimum_energy` of the spectrum's total energy.
    running_total = 0.0
    total = sum(spectrum)
    if total == 0.0:
        return len(spectrum)
    for i in range(len(spectrum)):
        running_total += spectrum[i]
        if running_total / total >= minimum_energy:
            return i + 1
    return len(spectrum)

# graph1 and graph2 are the two NetworkX graphs being compared.
# laplacian_spectrum returns eigenvalues in ascending order; sort them
# descending so select_k and the slice below use the k *largest* values.
laplacian1 = np.sort(nx.laplacian_spectrum(graph1))[::-1]
laplacian2 = np.sort(nx.laplacian_spectrum(graph2))[::-1]

k1 = select_k(laplacian1)
k2 = select_k(laplacian2)
k = min(k1, k2)

similarity = sum((laplacian1[:k] - laplacian2[:k]) ** 2)

This is an old question, but I would like to share my approach. I had a CVRP (Capacitated Vehicle Routing Problem) task. My heuristic algorithm produced several different graphs in the course of finding a solution, and in order to not get stuck at a local optimum, I used a relax-and-repair procedure.

At this point, I had to filter out solutions that were too similar. Since most heuristic approaches use a systematic change of neighborhoods within a local search procedure to produce solutions, an edit distance (Levenshtein distance) was perfect for me. The Levenshtein algorithm has a complexity of O(n*m), where n and m are the lengths of the two strings, so with a string representation of the graph nodes and routes I was able to measure the similarity. The edit operations can be considered neighborhood operations, so the result can be read as a search-space distance (not a solution-space distance).
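
A minimal sketch of that idea (the route encoding below is hypothetical; the original encoding isn't shown): encode each route as a string of node IDs and compare with the classic O(n*m) dynamic program:

def levenshtein(s1, s2):
    # prev[j] = edit distance between the current prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Two routes encoded as strings of node IDs (the same strings as in the
# Needleman-Wunsch example below):
print(levenshtein('045601230', '062401530'))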

A better / more general approach, at the cost of some speed, would be the Needleman-Wunsch algorithm. Needleman-Wunsch is a hybrid similarity measure that generalizes the Levenshtein distance and considers global alignment between two strings. Specifically, it is computed by assigning a score to each alignment between the two input strings and choosing the score of the best alignment, that is, the maximal score. An alignment between two strings is a set of correspondences between their characters, allowing for gaps.

For example:

import py_stringmatching as sm

# Needleman-Wunsch maximizes the alignment score, so a match should score
# higher than a mismatch.
nw = sm.NeedlemanWunsch(gap_cost=0.5,
                        sim_func=lambda s1, s2: (1.0 if s1 == s2 else 0.0))
print('Needleman-Wunsch:', nw.get_raw_score('045601230', '062401530'))

In the example, sim_func lets you plug in a custom character-level scoring function (here a simple match/mismatch score).

Fast implementations of Levenshtein exist on GitHub (using Cython, NumPy, etc.).
A nice library is py_stringmatching, which contains the following list of similarity algorithms (a quick usage example follows the list):

  • Affine Gap
  • Bag Distance
  • Cosine
  • Dice
  • Editex
  • Generalized Jaccard
  • Hamming Distance
  • Jaccard
  • Jaro
  • Jaro Winkler
  • Levenshtein
  • Monge Elkan
  • Needleman Wunsch
  • Overlap Coefficient
  • Partial Ratio
  • Partial Token Sort
  • Ratio
  • Smith-Waterman
  • Soft TF/IDF
  • Soundex
  • TF/IDF
  • Token Sort
  • Tversky Index
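
For instance, the library's Levenshtein measure exposes both a raw edit distance and a normalized similarity score (same placeholder strings as above):

import py_stringmatching as sm

lev = sm.Levenshtein()
print(lev.get_raw_score('045601230', '062401530'))  # raw edit distance
print(lev.get_sim_score('045601230', '062401530'))  # normalized to [0, 1]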