Using geohash for proximity searches?
Absolutely you can. And it can be quite fast. (The intensive computation bits can ALSO be distributed)
There are several ways, but the one I've been working with uses an ordered list of integer-based geohashes: find all the nearest-neighbour geohash ranges for a specific geohash resolution (the resolution approximates your distance criterion), then query those ranges to get a list of nearby points. I use Redis and Node.js (i.e. JavaScript) for this. Redis is super fast and can retrieve ordered ranges very quickly, but it can't do the kind of index/query manipulation that SQL databases can.
The method is outlined here: https://github.com/yinqiwen/ardb/wiki/Spatial-Index
But the gist of it is (to paraphrase the link):
1. Store all your geohashed points at the best resolution you want (usually a 64-bit integer if that's available, or 52 bits in the case of JavaScript) in an ordered set (i.e. a zset in Redis). Most geohash libraries these days have integer geohash functions built in, and you'll need to use these instead of the more common base32 geohashes.
2. Based on the radius you want to search within, find a bit depth/resolution that matches your search area; it must be less than or equal to your stored geohash bit depth. The linked site has a table that correlates the bit depth of a geohash with the size of its bounding box in metres.
3. Rehash your original coordinate at this lower resolution.
4. At that lower resolution, also find the 8 neighbouring (n, ne, e, se, s, sw, w, nw) geohash areas. You have to use the neighbour method because two coordinates right beside each other can have completely different geohashes, so you need to cover the whole area around the search point.
5. Once you have all the neighbour geohashes at this lower resolution, add your coordinate's geohash from step 3 to the list.
6. Build the ranges of geohash values to search within, covering these 9 areas. The values from step 5 are your lower range limits, and adding 1 to each of them gives your upper range limits. So you should have an array of 9 ranges, each with a lower and an upper geohash limit (18 geohashes in total). These geohashes are still at the lower resolution from step 2.
7. Convert all 18 of these geohashes to whatever bit depth/resolution your geohashes are stored at in your database. Generally you do this by bitshifting them up to the stored bit depth.
8. Now you can do a range query for points within these 9 ranges and you'll get all points approximately within the distance of your original point. There will be no overlap, so you don't need to do any intersections, just pure range queries, which are very fast (i.e. in Redis: ZRANGEBYSCORE zsetname lowerLimit upperLimit, over the 9 ranges produced in this step). A minimal code sketch of these steps follows this list.
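To make the steps concrete, here's a minimal JavaScript sketch of steps 1-8. This is a sketch under assumptions, not the linked module's code: the encoding function, the cell-size maths and the ioredis-style client (a `zrangebyscore` method returning a promise) are my own choices, and edge cases at the poles/antimeridian are ignored.

```javascript
// Step 1 helper: encode a coordinate as an integer geohash of `bitDepth`
// total bits by successive bisection (use an even bitDepth; 52 bits is the
// safe maximum for JavaScript numbers).
function encodeInt(lat, lon, bitDepth) {
  let minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
  let hash = 0;
  for (let i = 0; i < bitDepth; i++) {
    hash *= 2;                                   // arithmetic shift avoids JS 32-bit bitwise limits
    if (i % 2 === 0) {                           // even bits refine longitude
      const mid = (minLon + maxLon) / 2;
      if (lon >= mid) { hash += 1; minLon = mid; } else { maxLon = mid; }
    } else {                                     // odd bits refine latitude
      const mid = (minLat + maxLat) / 2;
      if (lat >= mid) { hash += 1; minLat = mid; } else { maxLat = mid; }
    }
  }
  return hash;
}

// Steps 3-7: build the 9 search ranges, scaled up to the stored resolution.
function searchRanges(lat, lon, searchBits, storedBits) {
  // Cell size at the lower resolution: each axis gets searchBits / 2 bits.
  const lonCell = 360 / Math.pow(2, searchBits / 2);
  const latCell = 180 / Math.pow(2, searchBits / 2);
  const scale = Math.pow(2, storedBits - searchBits);
  const ranges = [];
  for (let dy = -1; dy <= 1; dy++) {
    for (let dx = -1; dx <= 1; dx++) {
      // Steps 3-5: re-hash the centre cell and its 8 neighbours at searchBits.
      const h = encodeInt(lat + dy * latCell, lon + dx * lonCell, searchBits);
      // Steps 6-7: the range [h, h+1) at searchBits, shifted to storedBits.
      ranges.push([h * scale, (h + 1) * scale]);
    }
  }
  return ranges;
}

// Step 8: run the range queries against a sorted set whose scores are the
// stored integer geohashes (ioredis-style client assumed).
async function nearby(client, zsetName, lat, lon, searchBits, storedBits) {
  const results = [];
  for (const [lower, upper] of searchRanges(lat, lon, searchBits, storedBits)) {
    // '(' makes the upper limit exclusive, so adjacent ranges never overlap.
    const members = await client.zrangebyscore(zsetName, lower, '(' + upper);
    results.push(...members);
  }
  return results;
}

// Adding a point (step 1) is a ZADD with the integer geohash as the score:
// await client.zadd('points', encodeInt(43.65, -79.38, 52), 'point:123');
```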
You can further optimize this (speed-wise) by:
- Taking those 9 ranges from step 6 and finding where they run into each other. Usually you can reduce the 9 separate ranges to about 4 or 5, depending on where your coordinate is. This can cut your query time roughly in half (see the merging sketch after this list).
- Once you have your final ranges, holding on to them for reuse. Calculating these ranges can take most of the processing time, so if your original coordinate doesn't change much but you need to make the same distance query again, keep the ranges ready instead of recalculating them every time.
- If you're using redis, try to combine the queries into a MULTI/EXEC so it pipelines them for a bit better performance.
- The BEST part: You can distribute steps 2-7 on clients instead of having that computation done all in one place. This greatly reduces CPU load in situations where millions of requests would be coming in.
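For the first optimization, merging contiguous ranges is just a pass over the sorted pairs; a sketch, assuming the [lower, upper) ranges produced by the example above:

```javascript
// Merge contiguous or overlapping [lower, upper) ranges into fewer queries.
function mergeRanges(ranges) {
  const sorted = [...ranges].sort((a, b) => a[0] - b[0]);
  const merged = [sorted[0].slice()];
  for (const [lower, upper] of sorted.slice(1)) {
    const last = merged[merged.length - 1];
    if (lower <= last[1]) {
      last[1] = Math.max(last[1], upper);   // extend the previous range
    } else {
      merged.push([lower, upper]);          // start a new, disjoint range
    }
  }
  return merged;
}
```

The merged ranges can then be sent in one MULTI/EXEC block rather than one round trip per range.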
You can further improve accuracy by running a great-circle/haversine distance function on the returned results if you care about precision; a sketch of such a filter follows.
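A standard haversine filter over the candidates might look like this (how you look up each member's coordinates is up to you):

```javascript
// Great-circle (haversine) distance in metres between two coordinates.
function haversineMetres(lat1, lon1, lat2, lon2) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const R = 6371000; // mean Earth radius in metres
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Keep only results actually within the requested radius, e.g.:
// const within = candidates.filter(c => haversineMetres(lat, lon, c.lat, c.lon) <= radiusMetres);
```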
Here's a similar technique using ordinary base32 geohashes and a SQL query instead of redis: https://github.com/davetroy/geohash-js
I don't mean to plug my own thing, but I've written a module for Node.js and Redis that makes this really easy to implement. Have a look at the code if you'd like: https://github.com/arjunmehta/node-georedis
The question could be read in several ways. I interpret it to mean you have a large number of points and you intend to probe them repeatedly with arbitrary points, given as coordinate pairs, and wish to obtain the n nearest points to the probe, with n fixed beforehand. (In principle, if n will vary, you could set up a data structure for every possible n and select it in O(1) time with each probe: this could take a very long setup time and require a lot of RAM, but we are told to ignore such concerns.)
Build the order-n Voronoi diagram of all the points. This partitions the plane into connected regions, each of which has the same n nearest points. This reduces the situation to the point-in-polygon problem, which has many efficient solutions.
Using a vector data structure for the Voronoi diagram, point-in-polygon searches will take O(log(n)) time. For practical purposes you can make this O(1) with an extremely small implicit coefficient simply by creating a raster version of the diagram. The values of the cells in the raster are either (i) a pointer to a list of the n nearest points or (ii) an indication that this cell straddles two or more regions in the diagram. The test for an arbitrary point at (x,y) becomes:
1. Fetch the cell value for (x,y).
2. If the value is a list of points, return it.
3. Otherwise, apply a vector point-in-polygon algorithm to (x,y).
To achieve O(1) performance, the raster mesh has to be sufficiently fine that relatively few probe points will fall in cells that straddle multiple Voronoi regions. This can always be accomplished, with a potentially great expense in storage for the grids.
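A minimal sketch of that raster lookup, assuming the grid, its extent and a vector point-in-polygon fallback have already been built from the order-n Voronoi diagram (all names here are illustrative):

```javascript
// Raster lookup: each grid cell holds either the list of n nearest points
// or null, meaning the cell straddles two or more Voronoi regions.
function nearestN(x, y, grid, extent, pointInPolygonSearch) {
  const col = Math.min(grid.cols - 1,
    Math.floor(((x - extent.xmin) / (extent.xmax - extent.xmin)) * grid.cols));
  const row = Math.min(grid.rows - 1,
    Math.floor(((y - extent.ymin) / (extent.ymax - extent.ymin)) * grid.rows));
  const cell = grid.cells[row * grid.cols + col];
  if (cell !== null) {
    return cell;                        // (i) the n nearest points, O(1)
  }
  return pointInPolygonSearch(x, y);    // (ii) straddling cell: fall back to the vector search
}
```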
I use geohashes for exactly this. The reason is that I needed to implement proximity searches using a pyramid-style information system, where geohashes at the 8th level of precision were the 'base' and formed new totals for geohashes of the 7th precision, and so on. These totals were area, types of ground cover, etc. It was a very fancy way to do some very fancy stuff.
So 8th level geohashes would contain information like:
type: grass acres: 1.23
and 7th, 6th, etc. would contain information like:
grass_types: 123 acres: 6502
This was always built up from the most precise level (the bottom of the pyramid). It allowed me to do all sorts of fun statistics very quickly. I was also able to assign a geometry reference to each geohash using GeoJSON.
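As a rough illustration of that roll-up (field names are made up), building the 7th-precision totals from 8th-precision records is a group-by on the truncated geohash:

```javascript
// Roll 8-character geohash records up to 7-character totals by truncating
// the hash and summing (field names here are illustrative only).
function rollUp(level8Records) {
  const level7 = new Map();
  for (const rec of level8Records) {
    const parentHash = rec.geohash.slice(0, 7);     // drop the last character
    const parent = level7.get(parentHash) || { geohash: parentHash, acres: 0 };
    parent.acres += rec.acres;
    level7.set(parentHash, parent);
  }
  return [...level7.values()];
}
```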
I was able to write several functions to find the largest geohashes that make up my current viewport and then use those to find geohashes of the second-largest precision within the viewport. This could easily be extended to indexed range queries, where I would query for a minimum of '86ssaaaa' and a maximum of '86sszzzz' for whatever precision I wanted.
I'm doing this using MongoDB.
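For reference, that kind of viewport query can be expressed as an indexed string range on the geohash field; here's a sketch using the Node.js MongoDB driver, with collection and field names assumed (and the prefix trick as one way to express the range):

```javascript
// Indexed string-range query over base32 geohashes (collection and field
// names are assumptions). Requires an index on { geohash: 1 }.
async function cellsInPrefix(collection, prefix) {
  return collection
    .find({ geohash: { $gte: prefix, $lt: prefix + '\uffff' } })  // everything starting with `prefix`
    .toArray();
}

// e.g. all finer-precision cells under the 4-character cell '86ss':
// const cells = await cellsInPrefix(db.collection('geocells'), '86ss');
```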