Can MySQL reasonably perform queries on billions of rows?
I am not very familiar with your needs, but perhaps storing each data point in the database is overkill. It sounds almost like storing an image library by saving each pixel as a separate record in a relational database.
As a general rule, storing binary data in databases is wrong most of the time; there is usually a better way to solve the problem. It is not inherently wrong to store binary data in a relational database, but often the disadvantages outweigh the gains. Relational databases, as the name alludes to, are best suited for storing relational data. Binary data is not relational. It adds size (often significantly) to databases, can hurt performance, and may lead to questions about maintaining billion-record MySQL instances. The good news is that there are databases especially well suited to storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, and store those paths in your MySQL DB together with any other data that may yield value through querying.
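A minimal sketch of that idea, with made-up table and column names (the path layout is just an example):

    -- Keep only queryable metadata plus a pointer to the file in MySQL;
    -- the raw datapoint payload lives on the file system.
    CREATE TABLE spectrum_files (
        spectrum_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        run_id       INT UNSIGNED    NOT NULL,
        file_path    VARCHAR(255)    NOT NULL,  -- e.g. /data/<run_id>/<spectrum_id>.bin (hypothetical layout)
        point_count  INT UNSIGNED    NOT NULL,
        recorded_at  DATETIME        NOT NULL,
        KEY idx_run (run_id)
    ) ENGINE=InnoDB;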
Another approach would be using a document-based storage system for your datapoint (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same document store as the rest).
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows. This was using MySQL 5.0, so it's possible that things may have improved.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though. (If you want six sigma-level availability with a terabyte of data, don't use MySQL. We were a startup that had no DBA and limited funds.)
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins against those tables were too time-consuming and would take forever, so we wrote stored procedures to 'walk' the tables and process the joins against ranges of ids. In this way we'd process the data 10,000-100,000 rows at a time (join against ids 1-100,000, then 100,001-200,000, etc.). This was significantly faster than joining against the entire table.
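A minimal sketch of that kind of 'walking' procedure (table and column names are invented; our real ones were application-specific):

    DELIMITER //
    CREATE PROCEDURE walk_join(IN chunk_size INT)
    BEGIN
        DECLARE max_id BIGINT;
        DECLARE cur_id BIGINT DEFAULT 1;
        SELECT MAX(id) INTO max_id FROM big_a;

        -- Join one id range at a time instead of the whole table at once.
        WHILE cur_id <= max_id DO
            INSERT INTO results_tmp (a_id, a_value, b_label)
            SELECT a.id, a.value, b.label
            FROM   big_a a
            JOIN   big_b b ON b.a_id = a.id
            WHERE  a.id BETWEEN cur_id AND cur_id + chunk_size - 1;

            SET cur_id = cur_id + chunk_size;
        END WHILE;
    END //
    DELIMITER ;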
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL 5.0 stores indexes in two pieces: secondary indexes (anything other than the primary index) are stored as references to the primary key values. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values it needs, then it does a second lookup on the primary key index to find where those rows actually are.
The net of this is that for very large tables (1-200 million+ rows), indexing is more restrictive. You need fewer, simpler indexes, and even simple SELECT statements that are not directly on an index may never come back. WHERE clauses must hit indexes or forget about it.
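For illustration, here is what that two-part lookup means in practice on an assumed datapoint table; the first query is answered from the secondary index alone, while the second has to go back to the primary key index for every matching row:

    CREATE TABLE datapoints (
        id          BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE NOT NULL,
        intensity   DOUBLE NOT NULL,
        KEY idx_spectrum_mz (spectrum_id, mz)
    ) ENGINE=InnoDB;

    -- Covered by the secondary index (EXPLAIN shows "Using index"):
    SELECT spectrum_id, mz FROM datapoints WHERE spectrum_id = 42;

    -- Needs a second lookup into the primary key index to fetch intensity:
    SELECT intensity FROM datapoints WHERE spectrum_id = 42;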
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
Trying to do analysis on 200 billion rows of data would require very high-end hardware and a lot of hand-holding and patience. Just keeping the data backed up in a format that you could restore from would be a significant job.
I agree with srini.venigalla's answer that normalizing the data like crazy may not be a good idea here. Doing joins across multiple tables with that much data opens you up to the risk of filesorts, which could mean some of your queries would simply never come back. Denormalizing with simple integer keys would give you a better chance of success.
Everything we had was InnoDB. Regarding MyISAM vs. InnoDB: The main thing would be to not mix the two. You can't really optimize a server for both because of the way MySQL caches keys and other data. Pick one or the other for all the tables in a server if you can. MyISAM may help with some speed issues, but it may not help with the overall DBA work that needs to be done - which can be a killer.
"normalizing the data like crazy"
Normalizing the data like crazy may not be the right strategy in this case. Keep your options open by storing the data both in normalized form and in the form of materialized views highly suited to your application. The key in this type of application is NOT writing ad hoc queries. Query modeling is more important than data modeling: start with your target queries and work towards the optimum data model.
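MySQL has no built-in materialized views, so the usual substitute is a summary table rebuilt in batch; a rough sketch, with assumed table and column names:

    CREATE TABLE spectrum_summary (
        spectrum_id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        run_id        INT UNSIGNED    NOT NULL,
        point_count   INT UNSIGNED    NOT NULL,
        max_intensity DOUBLE          NOT NULL,
        KEY idx_run (run_id)
    ) ENGINE=InnoDB;

    -- Rebuilt in batch (e.g. after each run finishes loading); a real refresh
    -- would truncate first or use INSERT ... ON DUPLICATE KEY UPDATE.
    INSERT INTO spectrum_summary (spectrum_id, run_id, point_count, max_intensity)
    SELECT d.spectrum_id, s.run_id, COUNT(*), MAX(d.intensity)
    FROM   datapoints d
    JOIN   spectra s ON s.id = d.spectrum_id
    GROUP BY d.spectrum_id, s.run_id;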
"Is this reasonable?"
I would also create an additional flat table with all the data:
run_id | spectrum_id | data_id | <data table columns..> |
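One possible shape for that flat table (the datapoint columns beyond the three ids are assumptions):

    CREATE TABLE datapoints_flat (
        run_id      INT UNSIGNED    NOT NULL,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        data_id     BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE          NOT NULL,  -- assumed datapoint columns
        intensity   DOUBLE          NOT NULL,
        PRIMARY KEY (run_id, spectrum_id, data_id)
    ) ENGINE=InnoDB;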
I will use this table as the primary source of all queries. The reason is to avoid having to do any joins: joins without indexes will make your system unusable, and indexes on such huge tables will be equally terrible.
The strategy is: query the above table first, dump the results into a temp table, then join the temp table with the lookup tables Run and Spectrum to get the data you want.
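A sketch of that strategy; the filter, the lookup-table columns, and the flat table name are all placeholders:

    -- Step 1: one pass over the flat table, no joins.
    CREATE TEMPORARY TABLE hits AS
    SELECT run_id, spectrum_id, data_id, mz, intensity
    FROM   datapoints_flat
    WHERE  run_id BETWEEN 100 AND 120
      AND  mz BETWEEN 500.0 AND 510.0;

    -- Step 2: join only the (small) result set to the lookup tables.
    SELECT h.*, r.instrument, s.retention_time
    FROM   hits h
    JOIN   run r      ON r.id = h.run_id
    JOIN   spectrum s ON s.id = h.spectrum_id;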
Have you analyzed your write needs vs. read needs? It will be very tempting to ditch SQL and go to non-standard data storage mechanisms. In my view, that should be the last resort.
To accelerate write speeds, you may want to try the HandlerSocket method. Percona, if I remember correctly, packages HandlerSocket in their install package. (No relation to Percona!)
http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html