Threading in Python

It depends on what you're trying to do, but I'm partial to just using the threading module in the standard library because it makes it really easy to take any function and just run it in a separate thread.

from threading import Thread

def f():
    ...

def g(arg1, arg2, arg3=None):
    ....

Thread(target=f).start()
Thread(target=g, args=[5, 6], kwargs={"arg3": 12}).start()

And so on. I often have a producer/consumer setup using a synchronized queue provided by the Queue module

from Queue import Queue
from threading import Thread

q = Queue()
def consumer():
    while True:
        print sum(q.get())

def producer(data_source):
    for line in data_source:
        q.put( map(int, line.split()) )

Thread(target=producer, args=[SOME_INPUT_FILE_OR_SOMETHING]).start()
for i in range(10):
    Thread(target=consumer).start()

You've already gotten a fair variety of answers, from "fake threads" all the way to external frameworks, but I've seen nobody mention Queue.Queue -- the "secret sauce" of CPython threading.

To expand: as long as you don't need to overlap pure-Python CPU-heavy processing (in which case you need multiprocessing -- but it comes with its own Queue implementation, too, so you can with some needed cautions apply the general advice I'm giving;-), Python's built-in threading will do... but it will do it much better if you use it advisedly, e.g., as follows.

"Forget" shared memory, supposedly the main plus of threading vs multiprocessing -- it doesn't work well, it doesn't scale well, never has, never will. Use shared memory only for data structures that are set up once before you spawn sub-threads and never changed afterwards -- for everything else, make a single thread responsible for that resource, and communicate with that thread via Queue.

Devote a specialized thread to every resource you'd normally think to protect by locks: a mutable data structure or cohesive group thereof, a connection to an external process (a DB, an XMLRPC server, etc), an external file, etc, etc. Get a small thread pool going for general purpose tasks that don't have or need a dedicated resource of that kind -- don't spawn threads as and when needed, or the thread-switching overhead will overwhelm you.

Communication between two threads is always via Queue.Queue -- a form of message passing, the only sane foundation for multiprocessing (besides transactional-memory, which is promising but for which I know of no production-worthy implementations except In Haskell).

Each dedicated thread managing a single resource (or small cohesive set of resources) listens for requests on a specific Queue.Queue instance. Threads in a pool wait on a single shared Queue.Queue (Queue is solidly threadsafe and won't fail you in this).

Threads that just need to queue up a request on some queue (shared or dedicated) do so without waiting for results, and move on. Threads that eventually DO need a result or confirmation for a request queue a pair (request, receivingqueue) with an instance of Queue.Queue they just made, and eventually, when the response or confirmation is indispensable in order to proceed, they get (waiting) from their receivingqueue. Be sure you're ready to get error-responses as well as real responses or confirmations (Twisted's deferreds are great at organizing this kind of structured response, BTW!).

You can also use Queue to "park" instances of resources which can be used by any one thread but never be shared among multiple threads at one time (DB connections with some DBAPI compoents, cursors with others, etc) -- this lets you relax the dedicated-thread requirement in favor of more pooling (a pool thread that gets from the shared queue a request needing a queueable resource will get that resource from the apppropriate queue, waiting if necessary, etc etc).

Twisted is actually a good way to organize this minuet (or square dance as the case may be), not just thanks to deferreds but because of its sound, solid, highly scalable base architecture: you may arrange things to use threads or subprocesses only when truly warranted, while doing most things normally considered thread-worthy in a single event-driven thread.

But, I realize Twisted is not for everybody -- the "dedicate or pool resources, use Queue up the wazoo, never do anything needing a Lock or, Guido forbid, any synchronization procedure even more advanced, such as semaphore or condition" approach can still be used even if you just can't wrap your head around async event-driven methodologies, and will still deliver more reliability and performance than any other widely-applicable threading approach I've ever stumbled upon.


In order of increasing complexity:

Use the threading module

Pros:

  • It's really easy to run any function (any callable in fact) in its own thread.
  • Sharing data is if not easy (locks are never easy :), at least simple.

Cons:

  • As mentioned by Juergen Python threads cannot actually concurrently access state in the interpreter (there's one big lock, the infamous Global Interpreter Lock.) What that means in practice is that threads are useful for I/O bound tasks (networking, writing to disk, and so on), but not at all useful for doing concurrent computation.

Use the multiprocessing module

In the simple use case this looks exactly like using threading except each task is run in its own process not its own thread. (Almost literally: If you take Eli's example, and replace threading with multiprocessing, Thread, with Process, and Queue (the module) with multiprocessing.Queue, it should run just fine.)

Pros:

  • Actual concurrency for all tasks (no Global Interpreter Lock).
  • Scales to multiple processors, can even scale to multiple machines.

Cons:

  • Processes are slower than threads.
  • Data sharing between processes is trickier than with threads.
  • Memory is not implicitly shared. You either have to explicitly share it or you have to pickle variables and send them back and forth. This is safer, but harder. (If it matters increasingly the Python developers seem to be pushing people in this direction.)

Use an event model, such as Twisted

Pros:

  • You get extremely fine control over priority, over what executes when.

Cons:

  • Even with a good library, asynchronous programming is usually harder than threaded programming, hard both in terms of understanding what's supposed to happen and in terms of debugging what actually is happening.

In all cases I'm assuming you already understand many of the issues involved with multitasking, specifically the tricky issue of how to share data between tasks. If for some reason you don't know when and how to use locks and conditions you have to start with those. Multitasking code is full of subtleties and gotchas, and it's really best to have a good understanding of concepts before you start.