Key: value store in Python for possibly 100 GB of data, without client/server

You can use sqlitedict, which provides a key-value interface to an SQLite database.

The SQLite limits page says that the theoretical maximum is 140 TB, depending on page_size and max_page_count. However, the defaults for Python 3.5.2-2ubuntu0~16.04.4 (sqlite3 2.6.0) are page_size=1024 and max_page_count=1073741823. This gives ~1100 GB of maximum database size, which fits your requirement.
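
If you want to verify these defaults on your own build, you can query them directly with the standard sqlite3 module (a small sketch; './my_db.sqlite' is whatever file sqlitedict created for you):

import sqlite3

conn = sqlite3.connect('./my_db.sqlite')
page_size = conn.execute('PRAGMA page_size').fetchone()[0]
max_page_count = conn.execute('PRAGMA max_page_count').fetchone()[0]
conn.close()

# The maximum database size in bytes is page_size * max_page_count.
print(page_size, max_page_count, page_size * max_page_count / 10**9, 'GB')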

You can use the package like this:

from sqlitedict import SqliteDict

mydict = SqliteDict('./my_db.sqlite', autocommit=True)
mydict['some_key'] = any_picklable_object
print(mydict['some_key'])
for key, value in mydict.items():
    print(key, value)
print(len(mydict))
mydict.close()

Update

About memory usage: SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MiB (with the same Python as above). Here's a script you can use to check this with your data. Before running it:

pip install lipsum psutil matplotlib psrecord sqlitedict

sqlitedct.py

#!/usr/bin/env python3

import os
import random
from contextlib import closing

import lipsum
from sqlitedict import SqliteDict

def main():
    with closing(SqliteDict('./my_db.sqlite', autocommit=True)) as d:
        for _ in range(100000):
            v = lipsum.generate_paragraphs(2)[0:random.randint(200, 1000)]
            d[os.urandom(10)] = v

if __name__ == '__main__':
    main()

Run it like ./sqlitedct.py & psrecord --plot=plot.png --interval=0.1 $!. In my case it produces plot.png, a chart of the process's CPU and memory usage over time.

And the database file:

$ du -h my_db.sqlite 
84M my_db.sqlite
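
To check the page cache limit on your build, query it the same way (positive values are a number of pages, negative values a size in KiB, so -2000 corresponds to roughly 2 MiB):

import sqlite3

conn = sqlite3.connect('./my_db.sqlite')
print(conn.execute('PRAGMA cache_size').fetchone()[0])
conn.close()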

LMDB (Lightning Memory-Mapped Database) is a very fast key-value store which has Python bindings and can handle huge database files easily.
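
As a rough illustration of how the raw lmdb bindings look (the path and map_size below are placeholders; map_size just sets an upper bound on how large the memory-mapped file may grow):

import lmdb

# Allow the database to grow up to 200 GiB; LMDB memory-maps the file.
env = lmdb.open('./my_lmdb', map_size=200 * 1024**3)

with env.begin(write=True) as txn:
    txn.put(b'some_key', b'some_value')

with env.begin() as txn:
    print(txn.get(b'some_key'))  # b'some_value'

env.close()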

There is also the lmdbm wrapper which offers the Pythonic d[key] = value syntax.

By default it only supports byte values, but it can easily be extended to use a serializer (json, msgpack, pickle) for other kinds of values.

import json
from lmdbm import Lmdb

class JsonLmdb(Lmdb):
    def _pre_key(self, key):
        return key.encode("utf-8")
    def _post_key(self, key):
        return key.decode("utf-8")
    def _pre_value(self, value):
        return json.dumps(value).encode("utf-8")
    def _post_value(self, value):
        return json.loads(value.decode("utf-8"))

with JsonLmdb.open("test.db", "c") as db:
    db["key"] = {"some": "object"}
    obj = db["key"]
    print(obj["some"])  # prints "object"

Some benchmarks. Batched inserts (1000 items each) were used for lmdbm and sqlitedict, because write performance suffers a lot for non-batched inserts with these two: each insert opens a new transaction by default (a batching sketch follows the tables below). dbm refers to the stdlib dbm.dumb. Tested on Windows 7, Python 3.8, SSD.

continuous writes in seconds

| items | lmdbm | pysos |sqlitedict|   dbm   |
|------:|------:|------:|---------:|--------:|
|     10| 0.0000| 0.0000|   0.01600|  0.01600|
|    100| 0.0000| 0.0000|   0.01600|  0.09300|
|   1000| 0.0320| 0.0460|   0.21900|  0.84200|
|  10000| 0.1560| 2.6210|   2.09100|  8.42400|
| 100000| 1.5130| 4.9140|  20.71700| 86.86200|
|1000000|18.1430|48.0950| 208.88600|878.16000|

random reads in seconds

| items | lmdbm | pysos |sqlitedict|  dbm   |
|------:|------:|------:|---------:|-------:|
|     10| 0.0000|  0.000|    0.0000|  0.0000|
|    100| 0.0000|  0.000|    0.0630|  0.0150|
|   1000| 0.0150|  0.016|    0.4990|  0.1720|
|  10000| 0.1720|  0.250|    4.2430|  1.7470|
| 100000| 1.7470|  3.588|   49.3120| 18.4240|
|1000000|17.8150| 38.454|  516.3170|196.8730|

For the benchmark script see https://github.com/Dobatymo/lmdb-python-dbm/blob/master/benchmark.py
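
For reference, batching with lmdbm amounts to plain dict-style update() calls; the helper below is a made-up sketch of that idea (it reuses the JsonLmdb class defined above; the assumption is that a single update() call avoids opening a separate transaction per key):

from itertools import islice

def batched_write(db, items, batch_size=1000):
    # Group (key, value) pairs into dicts of batch_size items and write each
    # group with a single update() call instead of one assignment per key.
    it = iter(items)
    while True:
        batch = dict(islice(it, batch_size))
        if not batch:
            break
        db.update(batch)

with JsonLmdb.open('test.db', 'c') as db:
    batched_write(db, (('key-%d' % i, {'n': i}) for i in range(10000)))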


The shelve module in the standard library does just that:

import shelve
with shelve.open('myfile.db') as d:
    d['hello'] = 17  # Auto serializes any Python object with pickle
    d[str(183)] = [12, 14, 24]  # Keys, however, must be strings
    d.sync()  # Explicitly write to disc (automatically performed on close)

This uses the Python dbm module under the hood to save and load data from disk without loading the whole thing into memory.

Example with dbm:

import dbm, json
with dbm.open('myfile2.db', 'c') as d:
    d['hello'] = str(17)
    d[str(183)] = json.dumps([12, 14, 24])
    d.sync()

However, there are two considerations when using shelve:

  • It uses pickle for serialization, which means the data is coupled to Python and possibly to the Python version used to save it. If this is a concern, the dbm module can be used directly (same interface, but only strings can be used as keys/values).
  • The Windows implementation seems to have poor performance.

For these reasons, the following third-party options, copied from here, would be good alternatives:

  • semidbm - a faster cross-platform dbm implementation
  • UnQLite - a more feature-filled serverless database
  • More are mentioned in the link

I would consider HDF5 for this. It has several advantages:

  • Usable from many programming languages.
  • Usable from Python via the excellent h5py package.
  • Battle tested, including with large data sets.
  • Supports variable-length string values.
  • Values are addressable by a filesystem-like "path" (/foo/bar).
  • Values can be arrays (and usually are), but do not have to be.
  • Optional built-in compression.
  • Optional "chunking" to allow writing chunks incrementally.
  • Does not require loading the entire data set into memory at once.

It does have some disadvantages too:

  • Extremely flexible, to the point of making it hard to define a single approach.
  • Complex format, not feasible to use without the official HDF5 C library (but there are many wrappers, e.g. h5py).
  • Baroque C/C++ API (though the Python one is not).
  • Little support for concurrent writers (or writer + readers). Writes might need to lock at a coarse granularity.

You can think of HDF5 as a way to store values (scalars or N-dimensional arrays) inside a hierarchy inside a single file (or indeed multiple such files). The biggest problem with just storing each value in its own file on disk would be that you'd overwhelm some filesystems; you can think of HDF5 as a filesystem within a file which won't fall down when you put a million values in one "directory."
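
A minimal sketch of using h5py as such a key-value store, under some assumptions: the file name, the "path" keys, and the sample values below are all placeholders.

import h5py

with h5py.File('kv_store.h5', 'a') as f:
    # Keys are filesystem-like paths; this one holds a variable-length UTF-8 string.
    f.create_dataset('users/alice/bio', data='some long text value',
                     dtype=h5py.string_dtype(encoding='utf-8'))
    # Values can also be arrays, optionally chunked and compressed.
    f.create_dataset('users/alice/scores', data=list(range(1000)),
                     compression='gzip', chunks=True)

with h5py.File('kv_store.h5', 'r') as f:
    bio = f['users/alice/bio'][()]
    # h5py 3.x returns variable-length strings as bytes; decode if needed.
    print(bio.decode('utf-8') if isinstance(bio, bytes) else bio)
    print(f['users/alice/scores'][:10])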