Standard key/value datastore for unix

I don't think there is a standard tool for that. Except for grep/awk/sed etc. But using this you will need to care about lot of other issues like locking, format, special characters, etc.

I suggest to use sqlite. Define a simple table and then create tool_get() and tool_put() shell functions. sqlite is portable, fast.

You will get extra flexibility for free. You can define constrains, index to tweak your script or use that DB in other languages some day.


If your database is small enough, then you can use the filesystem. The advantage of this approach is that it's very low-tech, and will work everywhere with very little code. If the keys are composed of printable characters and do not contain /, then you can use them as file names:

put () { key=$1; value=$2; printf %s "$value" >"datastore.db/$key"; }
get () { key=$1; cat "datastore.db/$key"; }
remove () { key=$1; rm "datastore.db/$key"; }

To accommodate arbitrary keys, use a checksum of the key as the file name, and optionally store a copy of the key (unless you're happy with not being able to list the keys or tell what the key is for a given entry).

put () {
  key=$1; value=$2; set $(printf %s "$key" | sha1sum); sum=$1
  printf %s "$key" >"datastore.db/$sum.key"
  printf %s "$value" >"datastore.db/$sum.value"
}
get () {
  key=$1; set $(printf %s "$key" | sha1sum); sum=$1
  cat "datastore.db/$1.value"
}
remove () {
  key=$1; set $(printf %s "$key" | sha1sum); sum=$1
  rm "datastore.db/$1.key" "datastore.db/$1.value"
}

Note that the toy implementations above are not the whole story: they do not have any useful transactional property such as atomicity. The basic filesystem operations such as file creation and renaming are atomic however, and it is possible to build atomic versions of the functions above.

These direct-to-filesystem implementations are suitable with typical filesystems only for small databases, up to a few thousand files. Beyond this point, most filesystems have a hard time coping with large directories. You can adapt the scheme to larger databases by using a layered layout. For example, instead of storing all the files in one directory, store them in separate subdirectories based on the first few characters of their names. This is what git does, for example: its objects, indexed by SHA-1 hashes, are stored in files called .git/objects/01/2345679abcdef0123456789abcdef01234567. Other examples of programs that use a semantic layering are the web caching proxies Wwwoffle and polipo; both store the cached copy of a page found at a URL in a file called www.example.com/HASH where HASH is some encoding of some hash of the URL.¹

Another source of inefficiency is that most filesystems waste a lot of space when storing small files — there is a waste of up to 2kB per file on typical filesystems, independently of the size of the file.

If you choose to go with a real database, you don't need to forego the convenience of transparent filesystem access. There are several FUSE filesystems to access databases including Berkeley DB (with Jeff Garzik's dbfs), Oracle (with Oracle DBFS), MySQL (with mysqlfs), etc.

¹ For an URL like http://unix.stackexchange.com/questions/21943/standard-key-value-datastore-for-unix, Polipo uses the file unix.stackexchange.com/M0pPbpRufiErf4DLFcWlhw==, with an added header inside the file indicating the actual URL in clear text; the file name is the base64 encoding of the MD5 hash (in binary) of the URL. Wwwoffle uses the file http/unix.stackexchange.com/DM0pPbpRufiErf4DLFcWlhw; the name of the file is a home-grown encoding of the MD5 hash, and a companion file http/unix.stackexchange.com/UM0pPbpRufiErf4DLFcWlhw contains the URL.


dbmutil might get you what you want. It has shell utilities for the operations you describe in the question. I wouldn't say it's exactly standard, but it does have the facilities you want.

Tags:

Database