Key: value store in Python for possibly 100 GB of data, without client/server
You can use sqlitedict, which provides a key-value interface to an SQLite database.
The SQLite limits page says that the theoretical maximum is 140 TB, depending on `page_size` and `max_page_count`. However, the default values for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are `page_size=1024` and `max_page_count=1073741823`. This gives ~1100 GB of maximum database size, which fits your requirement.
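If you want to check the limits of your own build, here is a small sketch using the standard PRAGMA statements (the maximum database size is simply `page_size * max_page_count`):

import sqlite3

# Read the defaults of the local SQLite build and compute the resulting
# maximum database size.
conn = sqlite3.connect(":memory:")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
max_page_count = conn.execute("PRAGMA max_page_count").fetchone()[0]
print(page_size, max_page_count, page_size * max_page_count / 1024**3, "GiB")
conn.close()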
You can use the package like:
from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()
Update
About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to `cache_size` pages, which is barely 2 MiB (with the same Python as above). Here's a script you can use to check it with your data. Before running it, install the dependencies:
pip install lipsum psutil matplotlib psrecord sqlitedict
sqlitedct.py
#!/usr/bin/env python3

import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict


def main():
    # write 100k random-length text values under random 10-byte keys
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v


if __name__ == '__main__':
    main()
Run it like `./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!`. In my case it produces a plot of the process's CPU and memory usage over time.
And the database file:
$ du -h my_db.sqlite
84M my_db.sqlite
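For reference, you can also inspect the page-cache limit mentioned above with a small sketch like this (in SQLite, a negative `cache_size` is a limit in KiB, a positive one a limit in pages):

import sqlite3

# Show the default page-cache limit of the local SQLite build.
conn = sqlite3.connect(":memory:")
print(conn.execute("PRAGMA cache_size").fetchone()[0])
conn.close()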
LMDB (Lightning Memory-Mapped Database) is a very fast key-value store which has Python bindings and can handle huge database files easily.
There is also the lmdbm wrapper, which offers the Pythonic `d[key] = value` syntax.
By default it only supports byte values, but it can easily be extended to use a serializer (json, msgpack, pickle) for other kinds of values.
import json
from lmdbm import Lmdb

class JsonLmdb(Lmdb):
    # keys are stored as UTF-8 bytes, values as UTF-8 encoded JSON
    def _pre_key(self, value):
        return value.encode("utf-8")
    def _post_key(self, value):
        return value.decode("utf-8")
    def _pre_value(self, value):
        return json.dumps(value).encode("utf-8")
    def _post_value(self, value):
        return json.loads(value.decode("utf-8"))

with JsonLmdb.open("test.db", "c") as db:
    db["key"] = {"some": "object"}
    obj = db["key"]
    print(obj["some"])  # prints "object"
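If you would rather use the lmdb bindings directly instead of the lmdbm wrapper, here is a minimal sketch (the database path and `map_size` are made-up examples; `map_size` is the maximum size the database may grow to and must be chosen up front):

import lmdb

# Open (or create) an LMDB environment; map_size caps the database size
# and should be set generously up front (example: 200 GiB).
env = lmdb.open("./my_lmdb", map_size=200 * 1024**3)

# All reads and writes happen inside transactions.
with env.begin(write=True) as txn:
    txn.put(b"some_key", b"some_value")

with env.begin() as txn:
    print(txn.get(b"some_key"))  # b'some_value'

env.close()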
Some benchmarks. Batched inserts (1000 items each) were used for lmdbm and sqlitedict. Write performance suffers a lot for non-batched inserts with these libraries because each insert opens a new transaction by default; a batching sketch follows the tables below. dbm refers to the stdlib dbm.dumb. Tested on Windows 7, Python 3.8, SSD.
continuous writes in seconds
| items | lmdbm | pysos |sqlitedict| dbm |
|------:|------:|------:|---------:|--------:|
| 10| 0.0000| 0.0000| 0.01600| 0.01600|
| 100| 0.0000| 0.0000| 0.01600| 0.09300|
| 1000| 0.0320| 0.0460| 0.21900| 0.84200|
| 10000| 0.1560| 2.6210| 2.09100| 8.42400|
| 100000| 1.5130| 4.9140| 20.71700| 86.86200|
|1000000|18.1430|48.0950| 208.88600|878.16000|
random reads in seconds
| items | lmdbm | pysos |sqlitedict| dbm |
|------:|------:|------:|---------:|-------:|
| 10| 0.0000| 0.000| 0.0000| 0.0000|
| 100| 0.0000| 0.000| 0.0630| 0.0150|
| 1000| 0.0150| 0.016| 0.4990| 0.1720|
| 10000| 0.1720| 0.250| 4.2430| 1.7470|
| 100000| 1.7470| 3.588| 49.3120| 18.4240|
|1000000|17.8150| 38.454| 516.3170|196.8730|
For the benchmark script see https://github.com/Dobatymo/lmdb-python-dbm/blob/master/benchmark.py
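The batching mentioned above can look roughly like this (a sketch, not the exact benchmark code; lmdbm is a MutableMapping, so `update()` writes a whole batch in one transaction, and sqlitedict can defer commits with `autocommit=False` plus an explicit `commit()`):

from lmdbm import Lmdb
from sqlitedict import SqliteDict

items = {f"key-{i}": f"value-{i}" for i in range(1000)}

# lmdbm: plain Lmdb stores bytes, so encode; update() is a single batch
with Lmdb.open("batch.db", "c") as db:
    db.update({k.encode(): v.encode() for k, v in items.items()})

# sqlitedict: collect writes and commit once instead of per item
with SqliteDict("./batch.sqlite", autocommit=False) as d:
    for k, v in items.items():
        d[k] = v
    d.commit()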
The shelve module in the standard library does just that:
import shelve

with shelve.open('myfile.db') as d:
    d['hello'] = 17              # Auto serializes any Python object with pickle
    d[str(183)] = [12, 14, 24]   # Keys, however, must be strings
    d.sync()                     # Explicitly write to disc (automatically performed on close)
Under the hood, shelve uses the Python dbm module to save and load data from disk without loading the whole database into memory.
Example with dbm:
import dbm, json

with dbm.open('myfile2.db', 'c') as d:
    d['hello'] = str(17)
    d[str(183)] = json.dumps([12, 14, 24])
    d.sync()
However, there are two considerations when using shelve:
- It uses `pickle` for serialization, which means the data is coupled to Python and possibly to the Python version used to save it. If this is a concern, the `dbm` module can be used directly (same interface, but only strings can be used as keys/values).
- The Windows implementation seems to have bad performance.
For this reason, the following third-party options, copied from here, would be good alternatives:

- semidbm - Faster cross-platform `dbm` implementation
- UnQLite - More feature-filled serverless database
- More are mentioned in the link
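As a quick illustration, a minimal sketch using semidbm, assuming it follows the stdlib dbm-style interface (flag `'c'` to create, bytes keys and values):

import semidbm

# Assumed dbm-style interface: open(path, flag) returns a dict-like
# object that stores bytes keys and values on disk.
db = semidbm.open('mydb', 'c')
db[b'key'] = b'value'
print(db[b'key'])
db.close()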
I would consider HDF5 for this. It has several advantages:
- Usable from many programming languages.
- Usable from Python via the excellent h5py package.
- Battle tested, including with large data sets.
- Supports variable-length string values.
- Values are addressable by a filesystem-like "path" (`/foo/bar`).
- Values can be arrays (and usually are), but do not have to be.
- Optional built-in compression.
- Optional "chunking" to allow writing chunks incrementally.
- Does not require loading the entire data set into memory at once.
It does have some disadvantages too:
- Extremely flexible, to the point of making it hard to define a single approach.
- Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. `h5py`).
- Baroque C/C++ API (the Python one is not).
- Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.
You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) inside a hierarchy inside a single file (or indeed multiple such files). The biggest problem with just storing each of your values in its own disk file is that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file which won't fall down when you put a million values in one "directory."
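A minimal sketch with h5py, storing variable-length string values addressed by a path (the file and path names here are just examples):

import h5py

# Values are datasets addressed by an HDF5 "path"; nothing is read into
# memory until it is accessed.
with h5py.File("store.h5", "a") as f:
    dt = h5py.string_dtype(encoding="utf-8")   # variable-length strings
    if "foo/bar" not in f:
        f.create_dataset("foo/bar", data="some fairly long value", dtype=dt)
    raw = f["foo/bar"][()]                     # bytes in h5py 3.x
    print(raw.decode("utf-8") if isinstance(raw, bytes) else raw)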