How to efficiently calculate triad census in undirected graph in python
The idea is simple: Instead of working on the graph directly I use the adjacency matrix. I thought this would be more efficient, and it seems I was right.
In an adjacency matrix a 1 indicates there is an edge between the two nodes, for example the first row can be read as "There is a link between A and B as well as C"
From there I looked at your four types and found the following:
for type 3 there must be an edge between a N1 and N2, N1 and N3 and between N2 and N3. In the adjacency matrix we can find this by going over each row (where each row represents a node and its connections, this is N1) and find nodes it is connected to (that would be N2). Then, in the row of N2 we check all connected nodes (this is N3) and keep those where there is a positive entry in the row of N1. An example of this is "A, B, C", A has a connection to B. B has a connection to C, and A also has a connection to C
for type 2 it works almost identical to type 3. Except now we want to find a 0 for the N3 column in the row of N1. An example of this is "A, B, D". A has a connection to B, B has a 1 in the D column, but A does not.
for type 1 we just look at the row of N2 and find all columns for which both the N1 row and N2 row have a 0.
lastly, for type 0 look at all columns in the N1 row for which the entry is 0, and then check the rows for those, and find all the columns that have a 0 as well.
This code should work for you. For 1000 nodes it took me about 7 minutes (on a machine with a i7-8565U CPU) which is still relatively slow, but a far cry from the multiple days it currently takes you to run your solution. I have included the example from your pictures so you can verify the results. Your code produces a graph that is different from the example you show below by the way. The example graph in the code and the adjacency matrix both refer to the picture you have included.
The example with 1000 nodes uses networkx.generators.random_graphs.fast_gnp_random_graph. 1000 is the number of nodes, 0.1 is the probability for edge creation, and the seed is just for consistency. I have set the probability for edge creation because you mentioned your graph is sparse.
networkx.linalg.graphmatrix.adjacency_matrix: "If you want a pure Python adjacency matrix representation try networkx.convert.to_dict_of_dicts which will return a dictionary-of-dictionaries format that can be addressed as a sparse matrix."
The dictionary structure has M
dictionaries (= rows) with up to M
dictionaries nested in them. Note that the nested dictionaries are empty so checking for the existence of the key in them is equivalent to checking for a 1 or 0 as described above.
import time
import networkx as nx
def triads(m):
out = {0: set(), 1: set(), 2: set(), 3: set()}
nodes = list(m.keys())
for i, (n1, row) in enumerate(m.items()):
print(f"--> Row {i + 1} of {len(m.items())} <--")
# get all the connected nodes = existing keys
for n2 in row.keys():
# iterate over row of connected node
for n3 in m[n2]:
# n1 exists in this row, all 3 nodes are connected to each other = type 3
if n3 in row:
if len({n1, n2, n3}) == 3:
t = tuple(sorted((n1, n2, n3)))
out[3].add(t)
# n2 is connected to n1 and n3 but not n1 to n3 = type 2
else:
if len({n1, n2, n3}) == 3:
t = tuple(sorted((n1, n2, n3)))
out[2].add(t)
# n1 and n2 are connected, get all nodes not connected to either = type 1
for n3 in nodes:
if n3 not in row and n3 not in m[n2]:
if len({n1, n2, n3}) == 3:
t = tuple(sorted((n1, n2, n3)))
out[1].add(t)
for j, n2 in enumerate(nodes):
if n2 not in row:
# n2 not connected to n1
for n3 in nodes[j+1:]:
if n3 not in row and n3 not in m[n2]:
# n3 is not connected to n1 or n2 = type 0
if len({n1, n2, n3}) == 3:
t = tuple(sorted((n1, n2, n3)))
out[0].add(t)
return out
if __name__ == "__main__":
g = nx.Graph()
g.add_edges_from(
[("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
)
_m = nx.convert.to_dict_of_dicts(g)
_out = triads(_m)
print(_out)
start = time.time()
g = nx.generators.fast_gnp_random_graph(1000, 0.1, seed=42)
_m = nx.convert.to_dict_of_dicts(g)
_out = triads(_m)
end = time.time() - start
print(end)
Let's check the numbers. Let n be the number of vertices, e the number of edges.
0 triads are in O(n^3)
1 triads are in O(e * n)
2 + 3 triads are in O(e)
To get the 2 + 3 triads:
For every node a:
For every neighbor of a b:
For every neighbor of b c:
if a and c are connected, [a b c] is a 3 triad
else [a b c] is a 2 triad
remove a from list of nodes (to avoid duplicate triads)
The next step depends on what the goal is. If you just need the number of 1 and 0 triads, then this is sufficient:
Explanation:
The 1 triads are all connected nodes + 1 unconnected node, so we get the number by computing the number of connected nodes + 1 other node, and subtract the cases where the other node is connected (2 and 3 triads)
The 0 triads is just all combinations of nodes minus the other triads.
If you need to actually list the triads, you are pretty much out of luck because no matter what you do, listing the 0 triads is in O(n^3) and will kill you once the graphs get bigger.
The above algo for 2 + 3 triads is in O(e * max(# neighbors)), the other parts are in O(e + n) for counting the nodes and edges. Much better than O (n^3) which you would need to explicitely list the 0 triads. Listing the 1 triads could still be done in O(e * n).