Graph Theory - How to find nodes reachable from a given node within certain cost?
I've been looking at "accessibility" or "reachability" but these seem to be wrong keywords.
Yes you are right. These are the wrong keywords.
First, let me restate the problem more precisely. Given a graph G(V,E), a node s and a constant c, we want to find the set R = { (n, d) | d is the shortest-path distance between s and n, and d <= c }.
This is a generalization of the reachability problem (which is the special case c = infinity), and on large graphs it is considerably harder.
Now, as part of the precomputation, answering such a query for an arbitrary s means determining the length of the shortest path from every node to every other node. This is the All-Pairs Shortest Path (APSP) problem in disguise.
So the first step, the precomputation, is to search the research literature for a really fast APSP algorithm that suits the type of graphs you are dealing with. The algorithm and its implementation determine the precomputation running time.
The second step is to store as much of the algorithm's output as you can, as efficiently as you can, because the data structures and algorithms you choose here determine the query time. With a billion vertices, the number of shortest paths computed would be around 10^18 (from, to, distance) triplets. If each stored value is 4 bytes, you will need around 7 exabytes to hold all of this in a distributed hash table (which itself requires additional storage). If you can manage that, you can achieve a query time of a few milliseconds at most.
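As a rough back-of-the-envelope check of those numbers (a sketch; the 4-byte value size comes from above, the hash-table overhead factor is an assumption):

```python
# Rough storage estimate for precomputed all-pairs distances.
n = 10**9                      # one billion vertices
pairs = n * n                  # ~10^18 (from, to) pairs
bytes_per_value = 4            # assumed 32-bit distance per pair
raw = pairs * bytes_per_value              # ~4 * 10^18 bytes = 4 EB
with_overhead = raw * 1.75                 # assumed hash-table overhead factor
print(f"raw: {raw / 1e18:.1f} EB, with overhead: {with_overhead / 1e18:.1f} EB")
# -> raw: 4.0 EB, with overhead: 7.0 EB
```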
Otherwise, if you cannot store all of this data, you will have to compress it and/or discard some of it, and that is a different problem. One option is to partition the graph into many subgraphs of small diameter (the right diameter has to be determined experimentally), store shortest-path information only for the nodes at the centers of the subgraphs, and at query time rerun a very efficient implementation of SSSP (single-source shortest path). There are many other optimization techniques, easily enough to fill a book. Whatever you do, achieving <200 ms query time is very challenging.
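A minimal sketch of the partitioning idea, assuming `centers` comes from whatever partitioning scheme you choose and `sssp(graph, source)` is any fast SSSP routine returning a `{node: distance}` map (both names are placeholders, not a specific library):

```python
def precompute_center_rows(graph, centers, sssp):
    # Store full distance rows only for the partition centers,
    # instead of one row per node.
    return {c: sssp(graph, c) for c in centers}

def query(graph, sssp, source, max_cost):
    # For an arbitrary source, rerun SSSP at query time and keep
    # only the nodes whose distance stays within max_cost.
    return {n: d for n, d in sssp(graph, source).items() if d <= max_cost}
```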
Caching query results in DRAM and local disks is a great idea. This can help a lot if a large percentage of the queries are the same.
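A minimal in-process sketch of such a cache, assuming `reachable_within(source, cost)` is your actual query routine (DRAM only; a disk tier or an external cache would sit behind it):

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_query(source, cost):
    # reachable_within() is assumed to be defined elsewhere (your query routine).
    # The result is frozen into a tuple so it can be cached safely.
    return tuple(sorted(reachable_within(source, cost)))
```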
Regarding the number of users: since the graph is static, you can parallelize all the queries and take advantage of highly parallel CPUs and GPUs. Handling >10,000 queries in parallel is not trivial, but you can exploit queries that are close together in the graph, possibly merging multiple slightly different queries into one and filtering the results afterwards.
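A sketch of the fan-out on CPUs, assuming `query_fn(source, cost)` is your per-query routine (the merging of nearby queries is left out):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_all(queries, query_fn, workers=32):
    # queries: iterable of (source, cost) pairs; the graph is static,
    # so worker threads can share it read-only without locking.
    # Threads help when the query core releases the GIL (native code);
    # for pure-Python cores use ProcessPoolExecutor instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: query_fn(*q), queries))
```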
Finally, the code you write to process queries itself can be optimized. A deep understanding of compiler optimizations and computer architecture can help in reducing query time substantially.
Shortest-path distances in a weighted graph give the graph the structure of a metric space. In metric space terminology you are trying to find the closed ball of radius c centered at s. Perhaps there is some research on treating graphs as computationally tractable discrete metric spaces. The notion of eccentricity can also be brought into play, where the eccentricity of a node is the maximum distance from that node to any other node. It seems like you are trying to find the maximal subgraph with the property that the eccentricity of s in the subgraph is bounded by c.
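In symbols, writing d(u, v) for the shortest-path distance:

```latex
% closed ball of radius c around s, and eccentricity of s in a subgraph H
B(s, c) = \{\, v \in V : d(s, v) \le c \,\}
\qquad
\operatorname{ecc}_H(s) = \max_{v \in V(H)} d_H(s, v)
```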
Dijkstra's algorithm can clearly be modified to give what you are looking for. If at any point Dijkstra's algorithm would have you include a vertex in the set of visited nodes (the nodes for which the final distance is known) but the resulting distance exceeds c, throw that node away rather than adding it to the list of visited nodes. This in effect prunes the tree of nodes reachable from s, and should be reasonably efficient.
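A minimal sketch of that modification (the adjacency-list format is an assumption):

```python
import heapq

def reachable_within(graph, source, max_cost):
    """Return {node: distance} for every node within max_cost of source.

    graph is assumed to be an adjacency list: {u: [(v, weight), ...]}.
    This is ordinary Dijkstra, except that a node whose final distance
    would exceed max_cost is thrown away instead of being settled,
    which prunes the search to the ball of radius max_cost around source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd <= max_cost and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Example: graph = {0: [(1, 5), (2, 12)], 1: [(2, 4)], 2: []}
# reachable_within(graph, 0, 10) -> {0: 0, 1: 5, 2: 9}
```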
Since you are looking for optimizations from pre-computed values and temporary results, I would suggest looking at something like the Floyd-Warshall algorithm, which finds all-pairs shortest paths. Once it has been computed, a query starting from any node can be answered quickly from those results, so this would give you all of the precomputed values you need.
Unfortunately the algorithm is O(V^3) in time and O(V^2) in space, so it is fairly expensive. Given how large your example problem sounds in terms of the number of nodes, you probably won't want to run it on the entire data set unless you have enormous amounts of memory to dedicate to storing the precomputed result.
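For reference, a straightforward dense-matrix version of the algorithm (a sketch, only workable when the V x V matrix fits in memory):

```python
def floyd_warshall(n, edges):
    # edges is assumed to be an iterable of (u, v, weight) triples over
    # nodes 0..n-1. Returns an n x n distance matrix, float('inf') where
    # unreachable. O(V^3) time and O(V^2) space, as noted above.
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

# With dist precomputed, a query is just a scan of row s:
# [v for v, d in enumerate(dist[s]) if d <= c]
```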
Depending on the types of searches people are doing, you could break the road network up into smaller sections and then run the Floyd-Warshall algorithm on each smaller data set, so you don't have to store the entire result all the time. This only works if people search over small distances, as in your example of places one hour away. If people were searching for anything within 1000 miles, breaking the network into pieces wouldn't help, and the computation couldn't be done in real time within the limits you provided. However efficient you make it, a search over a massive radius like that is going to take significant time to compute (likely much more than your time constraint).
Realistically, the best solution is going to depend on how people use it and how spread out the searches are.
Even when people are searching for things more than 100 miles away, highways are likely to be the most efficient route for the majority of the trip, so for long distances you could try ignoring smaller roads except near the start and end of the trip. That would significantly reduce the number of nodes, making the time and memory consumed by Floyd-Warshall much more reasonable. Unfortunately that won't help you here, because you still need to find all of the nodes within the given distance.
If you are concerned about the computation speed and the strain on your server, you might need to restrict how large a distance you are willing to accept.
The correct terminology for the problem you are dealing with is the family of "shortest path algorithms". Reachability algorithms (e.g. Warshall's) deal with the question "is there a path between nodes A and B?", which has a binary answer; you don't need weights in that case, you only look at edges. But in your case you also need to take into consideration the time it takes to travel from node A to node B along each edge.
Now, there is no single "exact" fit for this problem, because a small change to Dijkstra's, Floyd's, BFS or DFS can be used to solve it. The added complexity comes from the size of the graph, and understanding that is important to how you build your solution. Which algorithm to use depends on your specific combination of time constraints and available hardware.
There are 2 algorithmic solutions to your problem (I'm assuming you already have all the edges stored somewhere and that you can query this database somehow):
In an ideal (imaginary) world you would run Floyd's algorithm, save the resulting matrix in something like Redis, and that's it: you can serve your requests in less than 10 ms, and if the number of clients grows you can add more Redis servers as needed, since the graph is not likely to change very often (in your specific problem it is road information). The drawback of this solution is its space complexity; the best thing about it is that everything is precomputed, so your response time per request is minimal. To implement some variation of this you need a lot of space; even a Redis cluster with the DB stored on disk (yes, disk, not memory) and all the servers on SSDs should be enough. This option scales well as the number of concurrent clients grows, and the response time is quite fast. Whether you can use this solution depends on the number of nodes in your graph: even though you need to precompute the distances using every edge, you only need space to store an N x N matrix, with N the number of nodes. If this matrix fits in Redis, you can use this approach and precompute all the "distances" (in your case the sum of the costs, a.k.a. "travel times") between all pairs of nodes. Later on, when you receive a request, you query the resulting matrix to fetch all the nodes with a travel time less than the given value; there is an additional optimization when storing this matrix in Redis that lets you fetch the node numbers quite fast using sorted sets (see the sketch below).
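A sketch of that sorted-set layout with redis-py (the key naming convention is made up; one sorted set per source node, scored by travel time):

```python
import redis

r = redis.Redis(decode_responses=True)

def store_row(source, travel_times):
    # travel_times: {target_node: travel_time} for one row of the matrix.
    r.zadd(f"dist:{source}", {str(t): time for t, time in travel_times.items()})

def nodes_within(source, max_time):
    # ZRANGEBYSCORE returns every member with score <= max_time,
    # in O(log N + M), which is exactly the query you need.
    return r.zrangebyscore(f"dist:{source}", 0, max_time)
```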
Then you have the second solution, which is to modify Dijkstra, BFS or DFS to prune the search based on the cost, and that's it. In this scenario you have already discarded Floyd because of its high space complexity, which means your graph is rather huge both in nodes and edges. The solution is then almost the same with a variation of any of those algorithms; what makes the difference is how you store the graph. All three algorithms can be modified to respond within the time you want, but to support over 10k clients the database you use to store the graph makes the difference. Here you have another two options:
- Using a standard SQL/NoSQL database: this database should be sharded somehow, since it has to serve your code servers (plural, because of the C10K problem) running the searches on the graph. I would suggest continuing your research in this area based on the size of the graph data itself, but if you manage to put all the data in a Cassandra cluster or a similar technology, it can be tuned to the response times you want and it scales really well.
- Make use of the fact that the graph is actually a "map" and find a way to run distance queries on the data: this potentially requires you to change the way you store the graph, adding something like latitude and longitude. Instead of running the algorithm against the full graph, which is absurd, you optimize the whole process using the fact that a given number of minutes from a given node can be translated to an approximate distance D in miles (you can add something like 10-20 miles to be on the safe side). You then run a query against the database to fetch all the nodes up to that distance D, and run the algorithm of your choice against that small graph. It is important to note that the graph you pull from the database this way already gives you an approximation of the solution, provided the actual travel time on the edges is somewhat proportional to the distance traveled (this is not always true, which is why it is only an approximation). To implement this at a small scale you could use PostgreSQL + PostGIS, which lets you run such queries (see the sketch below), but you will need to do some research here to find the solution that scales/works best for you.
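As an illustration of that radius query with PostGIS through psycopg2 (the table and column names are assumptions, not an existing schema):

```python
import psycopg2

# Assumes a table road_nodes(id BIGINT, geom GEOGRAPHY(Point, 4326))
# with a GiST index on geom, so ST_DWithin can use the index.
def nodes_within_radius(conn, lon, lat, radius_meters):
    sql = """
        SELECT id
        FROM road_nodes
        WHERE ST_DWithin(
            geom,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            %s
        )
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lon, lat, radius_meters))
        return [row[0] for row in cur.fetchall()]
```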
Hope it helps!