Large (>22 trillion items) geospatial dataset with rapid (<1s) read query performance
How up-to-date do your read queries need to be?
You could partition the database by time if the map just needs to show the most recent measurement. This would reduce your query load for the map.
For the history of a given point, you could hold a second store keyed by x and y showing the history. This could be done with a nightly refresh/update, as the historical data won't change.
Then you could pre-compute averages at coarser resolutions for integrating with maps at different zoom levels. This would reduce the number of points to retrieve for large map areas (zoomed out). Finer resolutions would be used for more zoomed-in maps querying smaller areas. If you really need to speed this up, you could compute tiles as blobs and interpret them in your application.
Because these would involve some re-computation of aggregate information, there would be some latency in query results. Depending on how much latency is acceptable, you could use this sort of approach to optimise your reads.
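If it helps, here is a minimal sketch of that pre-computation, assuming the raw measurements arrive as (lat, lon, value) tuples; the grid sizes and names are illustrative rather than tied to any particular product:

    # Minimal sketch of pre-computing averages at coarser resolutions.
    # Assumes raw measurements arrive as (lat, lon, value) tuples.
    from collections import defaultdict

    def cell_index(lat, lon, cells_per_side):
        """Map a lat/lon onto a (row, col) cell of a regular global grid."""
        row = min(int((lat + 90.0) / 180.0 * cells_per_side), cells_per_side - 1)
        col = min(int((lon + 180.0) / 360.0 * cells_per_side), cells_per_side - 1)
        return row, col

    def build_pyramid(points, resolutions=(1024, 256, 64)):
        """Return {cells_per_side: {(row, col): mean value}} per resolution."""
        pyramid = {}
        for cells in resolutions:
            sums, counts = defaultdict(float), defaultdict(int)
            for lat, lon, value in points:
                cell = cell_index(lat, lon, cells)
                sums[cell] += value
                counts[cell] += 1
            pyramid[cells] = {c: sums[c] / counts[c] for c in sums}
        return pyramid

    # A zoomed-out map reads the coarse 64x64 layer; a zoomed-in view reads 1024x1024.
    pyramid = build_pyramid([(51.50, -0.12, 12.0), (51.51, -0.10, 14.0), (-33.9, 18.4, 7.5)])
    print(pyramid[64])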
OK, so your points need to be computed averages over time. With this computation, I guess the number of items your queries actually touch comes down quite a lot from 22 trillion, as the raster values can be pre-calculated for querying.
You could shard by location. Partition the globe into a grid and have each square in that grid on one server. Since you mentioned cloud, this approach is well suited to it. Of course, you will need to merge the results from multiple servers yourself.
That way you can use any database solution you like. It does not need to be scalable on its own.
The individual squares will have different amounts of data. You can use differently sized machines for them (since this is cloud), or put multiple small shards on the same machine.
This sharding scheme is great for the kind of queries you perform because each query will only need to touch very few shards. Sharding by time is worse because all time shards must be touched for each query. Random sharding has the same problem.
All in all this is an easy sharding case because the query pattern fits the sharding scheme so well.
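As a rough sketch of the routing, assuming a regular N x N grid and a placeholder query_shard() call for whatever database each square runs:

    # Rough sketch of location-based sharding: the globe is split into an
    # N x N grid, each square lives on one server, and a map query only fans
    # out to the squares its bounding box overlaps. query_shard() is a
    # placeholder for whatever database call each shard actually serves.
    GRID = 16  # 16 x 16 = 256 shards

    def grid_cell(lat, lon, n=GRID):
        row = min(int((lat + 90.0) / 180.0 * n), n - 1)
        col = min(int((lon + 180.0) / 360.0 * n), n - 1)
        return row, col

    def shards_for_window(lat_min, lat_max, lon_min, lon_max):
        """Shard ids whose square overlaps the query window (usually very few)."""
        r0, c0 = grid_cell(lat_min, lon_min)
        r1, c1 = grid_cell(lat_max, lon_max)
        return [r * GRID + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

    def query_window(lat_min, lat_max, lon_min, lon_max, query_shard):
        """Fan out to the touched shards and merge the rows client-side."""
        rows = []
        for shard in shards_for_window(lat_min, lat_max, lon_min, lon_max):
            rows.extend(query_shard(shard, lat_min, lat_max, lon_min, lon_max))
        return rows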
Actually, I wonder if you need a database at all for this. Maybe you can partition the globe into a 1000x1000 grid of tiles (or finer) and have one flat file in blob storage for each tile. Blob storage does not mind 1M blobs at all.
Executing a query is conceptually very easy with this storage scheme. You can store the data redundantly in multiple grid resolutions as well.
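A sketch of what that could look like, with fetch_blob() standing in for whichever blob store's download call you use (an assumption, not a specific SDK):

    # Sketch of the flat-file-per-tile layout: one blob per grid cell, stored
    # redundantly at several resolutions, with a deterministic name so a query
    # is just "download the few blobs covering the view". fetch_blob() stands
    # in for the blob store's download call.
    RESOLUTIONS = (1000, 250, 50)  # cells per side; 1000 x 1000 -> 1M blobs

    def blob_key(resolution, row, col):
        return f"tiles/{resolution}/{row:04d}_{col:04d}.bin"

    def resolution_for_zoom(zoom):
        """Finest layer when zoomed in, a coarser redundant layer when zoomed out."""
        if zoom >= 8:
            return RESOLUTIONS[0]
        if zoom >= 4:
            return RESOLUTIONS[1]
        return RESOLUTIONS[2]

    def read_window(lat_min, lat_max, lon_min, lon_max, zoom, fetch_blob):
        res = resolution_for_zoom(zoom)
        r0, r1 = (min(int((v + 90.0) / 180.0 * res), res - 1) for v in (lat_min, lat_max))
        c0, c1 = (min(int((v + 180.0) / 360.0 * res), res - 1) for v in (lon_min, lon_max))
        return [fetch_blob(blob_key(res, r, c))
                for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]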
It sounds like there are two classes of query - one to understand which locations lie within the current view window and a second to deliver the desired statistic for those points. My suggestion is to use separate, specialised tools for each.
I'm assuming all measurements relate to the same set of 75Bn points. These lat/longs, once established, are static, so they can be grouped, aggregated and indexed at a one-off cost. I would therefore suggest sharding by region and zoom level. The size of each shard will be driven by the performance that can be achieved from each GIS instance.
The GIS will return a set of points that are passed to a time series database. This holds the measured values and performs aggregates. KDB is one I'm aware of. It targets securities trading, which will have fewer keys but more data points per key than your scenario.
There will be a cost to transferring the key values from the GIS server to the time-series DB. My hypothesis is that this cost will be paid back by the faster processing in the task-specific time-series DB. From the wording of the question it seems that a single instance will not be able to hold all the data, so some cross-server traffic seems inevitable. Given the relative speed of the components, it seems likely that sending a keyset to a remote server which has the data cached will be faster than reading the data off local disk.
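For illustration only, the two-stage query might be wired up like this; gis and tsdb are hypothetical client objects, not a real GIS or KDB API:

    # Hypothetical wiring of the two-stage query: a spatial index finds the
    # point ids inside the view window, and the time-series store aggregates
    # the measurements for that keyset close to the data.
    def window_statistic(gis, tsdb, lat_min, lat_max, lon_min, lon_max, t_start, t_end):
        # Stage 1: the GIS answers "which of the ~75Bn points are in view?"
        point_ids = gis.points_in_view(lat_min, lat_max, lon_min, lon_max)
        # Stage 2: ship the keyset across and let the time-series DB average
        # the values it already holds (ideally cached in memory).
        return tsdb.average(point_ids, t_start, t_end)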
If the point-finding and value-calculation parts can be local to each other, then of course I would expect the response to be faster. My (limited) understanding is that finding the N closest neighbours to a given point is a non-trivial task. This is why I suggested using specific software to perform it. If the point-finding can be reduced to
    where latitude between x1 and x2
      and longitude between y1 and y2
then that part could be handled by the value-storing software and the GIS eliminated from the architecture.
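As a toy, self-contained illustration of that single-store version (SQLite standing in for whatever value store you actually pick):

    # Toy end-to-end version of "let the value store do the point-finding":
    # one SQL query filters by the lat/long window and averages the values,
    # with no separate GIS step. SQLite stands in for the real store.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurements (latitude REAL, longitude REAL, value REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)",
                     [(51.50, -0.12, 12.0), (51.51, -0.10, 14.0), (40.71, -74.01, 9.0)])

    avg = conn.execute(
        """SELECT AVG(value) FROM measurements
           WHERE latitude  BETWEEN ? AND ?
             AND longitude BETWEEN ? AND ?""",
        (51.0, 52.0, -1.0, 0.0),
    ).fetchone()[0]
    print(avg)  # 13.0 - the average over the two points inside the window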
I have not implemented such a system. I'm really just thinking out loud here. At the petabyte scale there are no off-the-shelf solutions. There are, however, many satellite data providers so your problem is tractable. Good luck.