How to avoid reinstalling dependencies for each job in Gitlab CI
I prefer use cache because removes files when pipeline finished.
Example
image: node
stages:
- install
- test
- compile
cache:
key: modules
paths:
- node_modules/
install:modules:
stage: install
cache:
key: modules
paths:
- node_modules/
after_script:
- node -v && npm -v
script:
- npm i
test:
stage: test
cache:
key: modules
paths:
- node_modules/
policy: pull
before_script:
- node -v && npm -v
script:
- npm run test
compile:
stage: compile
cache:
key: modules
paths:
- node_modules/
policy: pull
script:
- npm run build
EDIT: This solution was recommended in 2016. In 2021, you might consider the caching docs instead.
A better approach these days is to make use of artifacts.
In the following example, the node_modules/
directory is immediately available to the lint
job once the build
stage has completed successfully.
build:
stage: build
script:
- npm install -q
- npm run build
artifacts:
paths:
- node_modules/
expire_in: 1 week
lint:
stage: test
script:
- npm run lint
Update: I now recommend using artifacts
with a short expire_in
. This is superior to cache
because it only has to write the artifact once per pipeline whereas the cache is updated after every job. Also the cache is per runner so if you run your jobs in parallel on multiple runners it's not guaranteed to be populated, unlike artifacts which are stored centrally.
Gitlab CI 8.2 adds runner caching which lets you reuse files between builds. However I've found this to be very slow.
Instead I've implemented my own caching system using a bit of shell scripting:
before_script:
# unique hash of required dependencies
- PACKAGE_HASH=($(md5sum package.json))
# path to cache file
- DEPS_CACHE=/tmp/dependencies_${PACKAGE_HASH}.tar.gz
# Check if cache file exists and if not, create it
- if [ -f $DEPS_CACHE ];
then
tar zxf $DEPS_CACHE;
else
npm install --quiet;
tar zcf - ./node_modules > $DEPS_CACHE;
fi
This will run before every job in your .gitlab-ci.yml
and only install your dependencies if package.json
has changed or the cache file is missing (e.g. first run, or file was manually deleted). Note that if you have several runners on different servers, they will each have their own cache file.
You may want to clear out the cache file on a regular basis in order to get the latest dependencies. We do this with the following cron entry:
@daily find /tmp/dependencies_* -mtime +1 -type f -delete
From docs:
cache
: Use for temporary storage for project dependencies. Not useful for keeping intermediate build results, likejar
orapk
files. Cache was designed to be used to speed up invocations of subsequent runs of a given job, by keeping things like dependencies (e.g., npm packages, Go vendor packages, etc.) so they don’t have to be re-fetched from the public internet. While the cache can be abused to pass intermediate build results between stages, there may be cases where artifacts are a better fit.
artifacts
: Use for stage results that will be passed between stages. Artifacts were designed to upload some compiled/generated bits of the build, and they can be fetched by any number of concurrent Runners. They are guaranteed to be available and are there to pass data between jobs. They are also exposed to be downloaded from the UI. Artifacts can only exist in directories relative to the build directory and specifying paths which don’t comply to this rule trigger an unintuitive and illogical error message (an enhancement is discussed at https://gitlab.com/gitlab-org/gitlab-ce/issues/15530 ). Artifacts need to be uploaded to the GitLab instance (not only the GitLab runner) before the next stage job(s) can start, so you need to evaluate carefully whether your bandwidth allows you to profit from parallelization with stages and shared artifacts before investing time in changes to the setup.
So, I use cache
. When don't need to update de cache (eg. build folder in a test job), I use policy: pull
(see here).