Parallel Pip install
Would it help to have your build system (e.g. Jenkins) build and install everything into a build-specific virtual environment directory? When the build succeeds, you make the virtual environment relocatable, tarball it and push the resulting tarball to your "released-tarballs" storage. At deploy time, you grab the latest tarball and unpack it on the destination host, and then it is ready to execute. So if it takes 2 seconds to download the tarball and 0.5 seconds to unpack it on the destination host, your deployment will take 2.5 seconds.
The advantage of this approach is that all package installations happen at build time, not at deploy time.
Caveat: the build system worker that builds/compiles/installs things into the virtual env must use the same architecture as the target hardware. Also, your production box provisioning system will need to take care of the various C library dependencies that some Python packages have (e.g. PIL requires that libjpeg be installed before it can compile JPEG-related code, and things will also break if libjpeg is not installed on the target box).
It works well for us.
Making a virtual env relocatable:
virtualenv --relocatable /build/output/dir/build-1123423
In this example build-1123423 is a build-specific virtual env directory.
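A minimal sketch of the whole build-and-deploy flow described above, assuming hypothetical paths, a hypothetical "releases-host" for the released-tarballs storage, and /opt/app as the unpack location on the destination host:

# On the build worker: install into a build-specific virtual env, package it, push it.
virtualenv /build/output/dir/build-1123423
/build/output/dir/build-1123423/bin/pip install -r requirements.txt
virtualenv --relocatable /build/output/dir/build-1123423
tar czf build-1123423.tar.gz -C /build/output/dir build-1123423
scp build-1123423.tar.gz releases-host:/released-tarballs/

# On the destination host at deploy time: grab the latest tarball and unpack it.
scp releases-host:/released-tarballs/build-1123423.tar.gz .
tar xzf build-1123423.tar.gz -C /opt/app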
Building on Fatal's answer, the following code does parallel Pip download, then quickly installs the packages.
First, we download packages in parallel into a distribution ("dist") directory. This runs easily in parallel with no conflicts. Each package name is printed out before download, which helps with debugging. For extra help, change the -P9 to -P1 to download sequentially.
After the download, the second command tells Pip to install/update the packages. Nothing is downloaded at this point; the files are fetched from the fast local directory.
It's compatible with the current version of Pip (1.7), as well as with Pip 1.5.
To install only a subset of packages, replace the 'cat requirements.txt' statement with your custom command, e.g. 'egrep -v github requirements.txt'.
cat requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist
pip install --no-index --find-links=./dist -r ./requirements.txt
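Note that newer pip releases provide a separate pip download command instead of the install --download option; assuming a reasonably recent pip, a rough equivalent of the two steps above would be:

cat requirements.txt | xargs -t -n1 -P9 pip download -q -d ./dist
pip install --no-index --find-links=./dist -r ./requirements.txt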
Parallel pip installation
This example uses xargs to parallelize the build process by approximately 4x. You can increase the parallelization factor with --max-procs below (keep it roughly equal to your number of cores).
If you're trying to speed up, say, an imaging process that you run over and over, it may be easier (and certainly lower in bandwidth consumption) to image directly from the result and reuse it rather than repeat the installation each time, or to build your image using pip -t or virtualenv.
Download and install packages in parallel, four at a time:
xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt
Note: xargs has different parameter names on different Linux distributions. Check your distribution's man page for specifics.
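For example, the short-option form below tends to be more portable across GNU and BSD xargs; it does the same thing as the command above, but it's worth checking that your xargs supports -P first:

xargs -n 1 -P 4 sudo pip install < requires.txt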
Same thing inlined using a here-doc:
cat << EOF | xargs --max-args=1 --max-procs=4 sudo pip install
awscli
bottle
paste
boto
wheel
twine
markdown
python-slugify
python-bcrypt
arrow
redis
psutil
requests
requests-aws
EOF
Warning: there is a remote possibility that the speed of this method might confuse package manifests (depending on your distribution) if multiple pip processes try to install the same dependency at exactly the same time, but it's very unlikely if you're only doing 4 at a time. It could be fixed pretty easily with pip install --force-reinstall depname.
Have you analyzed the deployment process to see where the time really goes? It surprises me that running multiple parallel pip processes does not speed it up much.
If the time goes into querying PyPI and finding the packages (in particular when you also download from GitHub and other sources), then it may be beneficial to set up your own PyPI index. You can host PyPI yourself and add the following to your requirements.txt file (docs):
--extra-index-url YOUR_URL_HERE
or the following if you wish to replace the official PyPI altogether:
--index-url YOUR_URL_HERE
This may speed up download times as all packages are now found on a nearby machine.
A lot of time also goes into compiling packages with C code, such as PIL. If this turns out to be the bottleneck, then it's worth looking into compiling code in multiple processes. You may even be able to share compiled binaries between your machines (but many things would need to match, such as the operating system, CPU word length, et cetera).
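One way to share compiled results between binary-compatible machines is to build wheels once and point pip at the resulting directory. This is only a sketch; the ./wheelhouse path is arbitrary and the machines must be similar enough for the binaries to work:

# On one build machine: compile everything once into a local wheel directory.
pip wheel -r requirements.txt -w ./wheelhouse

# On the other machines: install from the prebuilt wheels, skipping compilation.
pip install --no-index --find-links=./wheelhouse -r requirements.txt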