Parallel Processing in python
mincemeat
is the simplest map/reduce implementation that I've found. Also, it's very light on dependencies - it's a single file and does everything with standard library.
I agree that using Pool
from multiprocessing
is probably the best route if you want to stay within the standard library. If you are interested in doing other types of parallel processing, but not learning anything new (i.e. still using the same interface as multiprocessing
), then you could try pathos
, which provides several forms of parallel maps and has pretty much the same interface as multiprocessing
does.
Python 2.7.6 (default, Nov 12 2013, 13:26:39)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numToFactor = 976
>>> def isFactor(x):
... result = None
... div = (numToFactor / x)
... if div*x == numToFactor:
... result = (x,div)
... return result
...
>>> from pathos.multiprocessing import ProcessingPool as MPool
>>> p = MPool(4)
>>> possible = range(1,int(numpy.floor(numpy.sqrt(numToFactor)))+1)
>>> # standard blocking map
>>> result = [x for x in p.map(isFactor, possible) if x is not None]
>>> print result
[(1, 976), (2, 488), (4, 244), (8, 122), (16, 61)]
>>>
>>> # asynchronous map (there's also iterative maps too)
>>> obj = p.amap(isFactor, possible)
>>> obj
<processing.pool.MapResult object at 0x108efc450>
>>> print [x for x in obj.get() if x is not None]
[(1, 976), (2, 488), (4, 244), (8, 122), (16, 61)]
>>>
>>> # there's also parallel-python maps (blocking, iterative, and async)
>>> from pathos.pp import ParallelPythonPool as PPool
>>> q = PPool(4)
>>> result = [x for x in q.map(isFactor, possible) if x is not None]
>>> print result
[(1, 976), (2, 488), (4, 244), (8, 122), (16, 61)]
Also, pathos
has a sister package with the same interface, called pyina
, which runs mpi4py
, but provides it with parallel maps that run in MPI and can be run using several schedulers.
One other advantage is that pathos
comes with a much better serializer than you can get in standard python, so it's much more capable than multiprocessing
at serializing a range of functions and other things. And you can do everything from the interpreter.
>>> class Foo(object):
... b = 1
... def factory(self, a):
... def _square(x):
... return a*x**2 + self.b
... return _square
...
>>> f = Foo()
>>> f.b = 100
>>> g = f.factory(-1)
>>> p.map(g, range(10))
[100, 99, 96, 91, 84, 75, 64, 51, 36, 19]
>>>
Get the code here: https://github.com/uqfoundation
A good simple way to start with parallel processing in python is just the pool mapping in mutiprocessing -- its like the usual python maps but individual function calls are spread out over the different number of processes.
Factoring is a nice example of this - you can brute-force check all the divisions spreading out over all available tasks:
from multiprocessing import Pool
import numpy
numToFactor = 976
def isFactor(x):
result = None
div = (numToFactor / x)
if div*x == numToFactor:
result = (x,div)
return result
if __name__ == '__main__':
pool = Pool(processes=4)
possibleFactors = range(1,int(numpy.floor(numpy.sqrt(numToFactor)))+1)
print 'Checking ', possibleFactors
result = pool.map(isFactor, possibleFactors)
cleaned = [x for x in result if not x is None]
print 'Factors are', cleaned
This gives me
Checking [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Factors are [(1, 976), (2, 488), (4, 244), (8, 122), (16, 61)]