Optimal database design in terms of query speed to store matrices from R
I would strongly recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question for how I compare HDF5 and .RDat files.
Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
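To give a feel for this, here is a minimal sketch using the rhdf5 package from Bioconductor (hdf5r would work similarly); the file name, dataset name, and dimensions are just placeholders:

```r
library(rhdf5)

m <- matrix(rnorm(1e6), nrow = 1e4, ncol = 100)

h5createFile("matrices.h5")
# Chunked storage is what makes efficient random access possible
h5createDataset("matrices.h5", "M", dims = dim(m),
                storage.mode = "double", chunk = c(1000, 100))
h5write(m, "matrices.h5", "M")

# Pull out just rows 50-60 and columns 1-5 without reading the rest
sub <- h5read("matrices.h5", "M", index = list(50:60, 1:5))

h5closeAll()
```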
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
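As a rough illustration (not a real format spec), a hand-rolled layout could simply dump the matrix column-major as 8-byte doubles and then seek() to the column you want; the file name and dimensions here are assumptions:

```r
m <- matrix(rnorm(1e6), nrow = 1e4, ncol = 100)

# Write: R matrices are column-major, so the file is column after column
con <- file("matrix.bin", "wb")
writeBin(as.vector(m), con, size = 8)
close(con)

# Random access: jump straight to column 37 without reading anything else
n_rows <- 1e4
col_wanted <- 37
con <- file("matrix.bin", "rb")
seek(con, where = (col_wanted - 1) * n_rows * 8)
col37 <- readBin(con, what = "double", n = n_rows, size = 8)
close(con)
```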
Note 1: If you have just a few types per row, i.e. 1 character and the rest are numeric, you can simply create 2 memory-mapped matrices, one of which is for characters, the other for numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to integrate smoothly with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
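A small sketch of what that looks like with bigmemory (file names and dimensions are made up); the descriptor file is what lets other R sessions or foreach() workers attach the same matrix:

```r
library(bigmemory)

# Create a file-backed matrix once; only the subsets you touch enter RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "x.bin",
                           descriptorfile = "x.desc")
x[, 1] <- rnorm(1e6)

# Any other process (e.g. a foreach() worker) can attach the same files
y <- attach.big.matrix("x.desc")

# mwhich() filters without ever copying the full matrix into memory
idx <- mwhich(y, cols = 1, vals = 0, comps = "gt")
subset_rows <- y[head(idx, 100), ]
```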
Update 1: I overlooked the ff package - this is another good option, though it is a lot more complex than bigmemory.
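For completeness, the ff equivalent looks roughly like this (again, the file name and dimensions are placeholders):

```r
library(ff)

mf <- ff(vmode = "double", dim = c(1e4, 100), filename = "m.ff")
mf[, 1] <- rnorm(1e4)     # written through to the backing file
block <- mf[1:50, 1:5]    # only this block is read into RAM
```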
Perhaps the database design of the TSdbi package can serve as inspiration...
For a NoSQL solution, HDF5 might be an option, though I do not know much about it.