How to use Apache Airflow in a virtual environment?
You can set/override Airflow options specified in ${AIRFLOW_HOME}/airflow.cfg
with environment variables using the format AIRFLOW__{SECTION}__{KEY} (note the double underscores); this is described in the Airflow docs on setting configuration options. So you can simply do
export AIRFLOW__CORE__DAGS_FOLDER=/path/to/dags/folder
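Any other [section] / key pair from airflow.cfg can be overridden the same way; the two variables below are purely illustrative examples (you do not need to set them):
# Illustrative overrides only: AIRFLOW__{SECTION}__{KEY} maps to the key under [section] in airflow.cfg
export AIRFLOW__CORE__LOAD_EXAMPLES=False
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=sqlite:////path/to/airflow/airflow.db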
However, it is tedious and error-prone to do this for different projects. As an alternative, you can consider using pipenv for managing virtual environments instead of Anaconda (there is a nice guide about pipenv and the problems it solves). One of the default features of pipenv is that it automatically loads variables defined in a .env file when you spawn a shell with the virtualenv activated. So here is what your workflow with pipenv could look like:
cd /path/to/my_project
# Creates venv with python 3.7
pipenv install --python=3.7 Flask==1.0.3 apache-airflow==1.10.3
# Set the Airflow home in the root of your project (stored in the .env file)
echo "AIRFLOW_HOME=${PWD}/airflow" >> .env
# Enters the created venv and loads the content of the .env file
pipenv shell
# Initialize airflow
airflow initdb
mkdir -p ${AIRFLOW_HOME}/dags/
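If you want to sanity-check the setup at this point (an optional step, not part of the workflow above), you can confirm inside the spawned shell that the .env variables were picked up:
# Inside the shell opened by pipenv shell
echo $AIRFLOW_HOME    # should print /path/to/my_project/airflow
airflow version       # should report 1.10.3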
Note: I will explain the usage of Flask==1.0.3 at the end, but in short it is needed because pipenv checks whether sub-dependencies are compatible in order to ensure reproducibility.
So after these steps you would get the following project structure:
my_project
├── airflow
│ ├── airflow.cfg
│ ├── airflow.db
│ ├── dags
│ ├── logs
│ │ └── scheduler
│ │ ├── 2019-07-07
│ │ └── latest -> /path/to/my_project/airflow/logs/scheduler/2019-07-07
│ └── unittests.cfg
├── .env
├── Pipfile
└── Pipfile.lock
Now, when you initialize Airflow for the first time, it will create the ${AIRFLOW_HOME}/airflow.cfg file and will use/expand ${AIRFLOW_HOME}/dags as the value for dags_folder. In case you still need a different location for dags_folder, you can use the .env file again:
echo "AIRFLOW__CORE__DAGS_FOLDER=/different/path/to/dags/folder" >> .env
Thus, your .env file will look like:
AIRFLOW_HOME=/path/to/my_project/airflow
AIRFLOW__CORE__DAGS_FOLDER=/different/path/to/dags/folder
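Remember to exit and re-enter the shell so pipenv reloads the updated .env; a quick check (using the hypothetical overridden path from above) could look like this:
# Re-spawn the venv shell so the new variable is exported
exit
pipenv shell
echo $AIRFLOW__CORE__DAGS_FOLDER   # /different/path/to/dags/folder
# DAGs are now discovered from the overridden folder
airflow list_dags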
What we have accomplished and why this works just fine
- Since you installed airflow in a virtual environment, you need to activate it in order to use airflow.
- Since you did it with pipenv, you need to use pipenv shell in order to activate the venv.
- Since you use pipenv shell, you always get the variables defined in .env exported into your venv. On top of that, pipenv shell still spawns a subshell, so when you exit it, all the additional environment variables are cleared as well (see the sketch after this list).
- Different projects that use airflow get different locations for their log files etc.
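A tiny illustration of that subshell behaviour (the echoed values are just what you would expect to see, not captured output):
pipenv shell
echo $AIRFLOW_HOME    # /path/to/my_project/airflow
exit
echo $AIRFLOW_HOME    # empty again, the variable lived only in the subshell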
Additional notes on pipenv
- In order to use the venv created with pipenv as your IDE's project interpreter, use the path provided by pipenv --py.
- By default, pipenv creates all venvs in the same global location, like conda does, but you can change that behavior to creating .venv in the project's root by adding export PIPENV_VENV_IN_PROJECT=1 to your .bashrc (or other rc file). Then PyCharm will be able to pick it up automatically when you go into the project interpreter settings (see the snippet below).
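For example (the printed interpreter path is illustrative):
# Keep each project's venv in .venv next to the code instead of the global cache
echo 'export PIPENV_VENV_IN_PROJECT=1' >> ~/.bashrc
# Print the interpreter path to point your IDE at
pipenv --py    # e.g. /path/to/my_project/.venv/bin/python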
Note on usage of Flask==1.0.3
Airflow 1.10.3 from PyPI depends on flask>=1.0, <2.0 and on jinja2>=2.7.3, <=2.10.0. Today, when I tested the code snippets, the latest available flask was 1.1.0, which depends on jinja2>=2.10.1. This means that although pipenv can install all the required software, it fails to lock the dependencies. So for clean use of my code samples, I had to pin a version of flask that requires a version of jinja2 compatible with Airflow's requirements. But there is nothing to worry about: the latest version of airflow on GitHub has already fixed that.
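If you hit a similar locking conflict with other packages, one way to see which sub-dependency causes it is pipenv's dependency graph; the output in the comments is only a rough sketch of what it looks like:
# Show the resolved dependency tree with version constraints
pipenv graph
# Flask==1.0.3
#   - Jinja2 [required: >=2.10, installed: 2.10]
# ...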