python threadsafe object cache

Point 1. The GIL does not help you here. An example of a (non-thread-safe) cache for something called "stubs" would be:

stubs = {}

def maybe_new_stub(host):
    """ returns stub from cache and populates the stubs cache if new is created """
    if host not in stubs:
        stub = create_new_stub_for_host(host)
        stubs[host] = stub
    return stubs[host]

What can happen is that Thread 1 calls maybe_new_stub('localhost') and discovers the key is not in the cache yet. Now we switch to Thread 2, which calls the same maybe_new_stub('localhost'), and it also learns the key is not present. Consequently, both threads call create_new_stub_for_host, and both store their result in the cache.

The dictionary itself is protected by the GIL, so we cannot corrupt it by concurrent access. The logic of the cache, however, is not protected, so we may end up creating two or more stubs and dropping all except one on the floor.
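To make the race concrete, here is a minimal sketch; create_new_stub_for_host is a placeholder factory, and the sleep simply widens the race window so the duplicate creation shows up reliably:

import threading
import time

stubs = {}
creation_count = 0

def create_new_stub_for_host(host):
    """Placeholder factory; the sleep simulates slow stub creation."""
    global creation_count
    creation_count += 1  # counter is for demonstration only
    time.sleep(0.1)
    return object()

def maybe_new_stub(host):
    if host not in stubs:
        stubs[host] = create_new_stub_for_host(host)
    return stubs[host]

threads = [threading.Thread(target=maybe_new_stub, args=('localhost',))
           for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(creation_count)  # frequently prints 5, not 1

All five threads pass the `host not in stubs` check before any of them finishes creating a stub, so the factory runs five times and four results are thrown away.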

Point 2. Depending on the nature of the program, you may not want a global cache. Such a shared cache forces synchronization between all your threads. For performance reasons, it is good to make the threads as independent as possible. I believe I do need a shared one; you may not.
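If per-thread caches are acceptable for your use case, a sketch using the standard library's threading.local avoids locking entirely (create_new_stub_for_host is the same placeholder factory as in Point 1):

import threading

_tls = threading.local()

def maybe_new_stub(host):
    """Each thread keeps a private stubs dict, so no lock is needed."""
    if not hasattr(_tls, 'stubs'):
        _tls.stubs = {}
    if host not in _tls.stubs:
        _tls.stubs[host] = create_new_stub_for_host(host)
    return _tls.stubs[host]

The trade-off is that each thread creates its own stub per host, so you exchange lock contention for some duplicated work and memory.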

Point 3. You may use a simple lock. I took inspiration from https://codereview.stackexchange.com/questions/160277/implementing-a-thread-safe-lrucache and came up with the following, which I believe is safe to use for my purposes:

import threading

import grpc
import cli_pb2_grpc  # generated gRPC module that provides BrkStub

stubs = {}
lock = threading.Lock()


def maybe_new_stub(host):
    """Return the stub from the cache, creating and caching it under the lock if missing."""
    with lock:
        if host not in stubs:
            channel = grpc.insecure_channel('%s:6666' % host)
            stub = cli_pb2_grpc.BrkStub(channel)
            stubs[host] = stub
        return stubs[host]
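One caveat: this holds the lock while the channel is being created, so all threads serialize on stub creation. If that matters, a double-checked variant (a sketch only; a thread that loses the race creates a stub that is then discarded) keeps the slow work outside the lock:

def maybe_new_stub(host):
    """Fast path reads without the lock; setdefault under the lock resolves races."""
    stub = stubs.get(host)  # safe under the GIL
    if stub is not None:
        return stub
    channel = grpc.insecure_channel('%s:6666' % host)
    new_stub = cli_pb2_grpc.BrkStub(channel)
    with lock:
        # Another thread may have won the race; keep whichever stub got cached first.
        return stubs.setdefault(host, new_stub)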

Point 4. It would be best to use an existing library. I haven't found one I am prepared to vouch for yet.
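The closest thing in the standard library is functools.lru_cache. Its internal bookkeeping is thread-safe, but note that two threads missing on the same key concurrently may each still call the function once (only one result ends up cached):

import functools

@functools.lru_cache(maxsize=None)
def maybe_new_stub(host):
    # lru_cache keys on the arguments, so one entry per host
    channel = grpc.insecure_channel('%s:6666' % host)
    return cli_pb2_grpc.BrkStub(channel)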


You probably want to use memcached instead. It's very fast, very stable, very popular, has good python libraries, and will allow you to grow to a distributed cache should you need to:

http://www.danga.com/memcached/
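A minimal sketch with the classic python-memcached client, assuming a memcached server on the default port; keep in mind it stores serialized values, so it suits data rather than live objects like sockets or gRPC stubs:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_or_compute(key, compute, ttl=300):
    """Look the value up in memcached, computing and storing it on a miss."""
    value = mc.get(key)
    if value is None:
        value = compute()
        mc.set(key, value, time=ttl)
    return value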


A thread per request is often a bad idea. If your server experiences huge spikes in load, it will bring the box to its knees. Consider using a thread pool that can grow to a limited size during peak usage and shrink to a smaller size when load is light.
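A simple capped pool is available in the standard library's concurrent.futures; handle_request here is a placeholder. It spawns workers lazily up to max_workers rather than shrinking dynamically, but it does enforce the cap:

from concurrent.futures import ThreadPoolExecutor

def handle_request(request):
    pass  # placeholder: handle the request

pool = ThreadPoolExecutor(max_workers=32)

def on_incoming_request(request):
    # Queued if all 32 workers are busy, instead of spawning a new thread per request.
    pool.submit(handle_request, request)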


Well, a lot of operations in Python are thread-safe by default, so a standard dictionary should be OK (at least in certain respects). This is mostly due to the GIL, which helps avoid some of the more serious threading issues.

There's a list here: http://coreygoldberg.blogspot.com/2008/09/python-thread-synchronization-and.html that might be useful.

Though the atomic nature of those operations just means that you won't have an entirely inconsistent state if two threads access a dictionary at the same time. So you won't get a corrupted value. However, you will (as with most multi-threaded programming) not be able to rely on the specific order of those atomic operations.
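For example, dict.setdefault is one of those operations that is atomic in CPython (for keys like strings whose hashing does not call back into Python code): two racing threads may both build a value, but only one ends up in the dictionary, and both callers get that one back:

stubs = {}

def maybe_new_stub(host):
    # Both racing threads may call the (placeholder) factory, but setdefault
    # guarantees that only one result is stored and both callers receive it.
    return stubs.setdefault(host, create_new_stub_for_host(host))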

So to cut a long story short...

If you have fairly simple requirements and aren't too bothered about the ordering of what gets written into the cache, then you can use a dictionary and know that you'll always get a consistent/not-corrupted value (it just might be out of date).

If you want to ensure that things are a bit more consistent with regard to reading and writing then you might want to look at Django's local memory cache:

http://code.djangoproject.com/browser/django/trunk/django/core/cache/backends/locmem.py

It uses a read/write lock for its locking.
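For reference, the usual way to use that backend is through Django's cache framework (assuming a configured Django project; the key name here is illustrative, and note that locmem pickles values, so it suits data rather than live objects):

# settings.py -- locmem is also the default when no CACHES setting is given
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}

# elsewhere in the project
from django.core.cache import cache

cache.set('stub:localhost', value, timeout=300)
cached = cache.get('stub:localhost')  # None if expired or missing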