Storing very large graphs on disk/streaming graph partitioning algorithms?

You might want to look at HDF5. Despite the H standing for Hierarchical it can store graphs, check the documentation under the keyword 'Groups', and it is designed for very large datasets. If I understand correctly HDF5 'files' can be spread across multiple o/s 'files'. Now, HDF5 is only a data structure, plus a set of libraries for low- and high-level manipulations of the data structure. Off the top of my head I haven't a clue about streaming graph-partitioning algorithms, but I stick to the notion that if you get the data structure right algorithms become easier to implement.

What do you already know about the mega-graph ? Does it naturally partition into dense subgraphs which themselves are sparsely connected ? Would a topological sort of the graph be a better basis for storage on disk than the existing spatial sort?

Failing crisp answers to such questions, maybe you just have to bite the bullet and read the graph multiple times to build the partitions, in which case you just want the fastest I/O you can manage, and sophisticated layout of partitions on nodes is nice but not as important. If you can partition the graph into sub-graphs which themselves have single edges to the other sub-graphs you maybe able to make the problem more tractable.

You want a good-for-BFS layout, but BFS is usually applied to trees. Does your graph have a unique root from which to start all BFSes? If not, then layout for BFS from one vertex will be suboptimal for BFS from another vertex.


No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is a NP-complete problem, which is "unreasonably long" for most problems that do not even fit in memory.

Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.

If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following:

  • Pick a random set of vertices as starting points from throughout the entire data set.
  • For each vertex, collect all neighboring vertices (takes one sweep through the data set).
  • For each set of neighboring vertices collect the set of neighbors-of-neighbors and rank them according to how many edges connect to them. If you don't have space in a page to store them all, keep the most-connected vertices. If you do have space to save them all, you may wish to throw away the least useful ones (e.g. if the fraction of edges kept within a page / fraction of vertices needing storage ratio drops "too low"--where "too low" will depend on how much breadth your searches really need, and whether you can do any pruning and so on--then don't include those in the neighborhood.
  • Repeat the process of collecting and ranking neighbors until your neighborhood is full (e.g. fills some page size that suits you). Then check for repeats among the randomly chosen starts. If you have a small number of vertices appearing in both, remove them from one or the other, whichever breaks fewer edges. If you have a large number of vertices appearing in both, keep the neighborhood with the best (vertices in neighborhood/broken edge) ratio and throw the other away.

Now you have some local neighborhoods that are approximately locally optimal in that breadth-first searches tend to fall inside. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.

If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods, and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex,neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be close by given the construction.

If you do need adjacent neighborhoods, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors by both how many edges they satisfy within the neighborhood and what fraction of their edges that leave the neighborhood are in an existing group. You might need weighting factors, but something like

score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)

would probably do the trick.

Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.

Again, it depends whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).