In the last article I described how to deploy Apache Airflow into Kubernetes. In this article I'm going to talk about how to organize an Airflow project in an efficient way, so that it's easy to maintain by both the development team and the DevOps team.
Python Dependency Management
Airflow is written in Python. If you start a fresh project, I recommend using Python 3, as Python 2 will reach end of life in 2020. I also recommend using pipenv to manage Python dependencies rather than pip; you could reference this article to see how to set up a Python development environment with pyenv and pipenv. As this article is written, I'm using Python 3.6 and Airflow 1.10.2 to build my Airflow jobs. To install Airflow with pipenv, we need the environment variable `AIRFLOW_GPL_UNIDECODE=yes` before running `pipenv install apache-airflow`. Developers then have the flexibility to install any Python dependencies locally, and in the end we have a single command to export all the dependencies into `requirements.txt`: `AIRFLOW_GPL_UNIDECODE=yes pipenv lock -r > requirements.txt`. Once we have `requirements.txt`, we can list it as a build artifact and install everything into the customised Airflow image.
Airflow Project Structures
- Airflow's main concept is the Directed Acyclic Graph (DAG). It's recommended to have a folder named `dags` to hold all the DAGs for the Airflow project, and to keep each DAG as simple as possible; many small DAGs are easier to maintain than one large, complex DAG (see the minimal DAG sketch after this list).
- Airflow also has a plugins concept. It's recommended to keep complex logic as a plugin and reuse it in DAGs (see the plugin sketch after this list).
- Apart from these two folders, airflow-kubernetes offers some super handy locations to help developers solve some generic problems. I recommend having a `config` folder under the project root folder.
- In the `config` folder we could have a folder named `init`. I've implemented a SystemV-style init mechanism to help developers initialise the Airflow system after the Airflow webserver has started; these scripts run under the Linux user `airflow`. Developers could leverage this mechanism to initialise Airflow config, e.g. connections (see the connection init sketch after this list).
- In the `config` folder we could also have a folder named `super_init`. The scripts under `super_init` are executed under the Linux user `root`. This is super useful in a Kubernetes environment, because the IP of every Airflow worker changes after each deployment, and the Airflow webserver depends on each worker's hostname to communicate with it and pull task logs. We could use this mechanism to help the Airflow webserver resolve each worker's hostname by having a script like the content below under the `super_init` dir:
```bash
kubectl get po -n airflow -o wide \
  | grep airflow-worker \
  | awk '{printf("%s\t%s\n",$6,$1)}' >> /etc/hosts
```
- It's better to have a Makefile in the root of the project folder, so that developers can put useful commands in it; it's also a good way to treat code as documentation.
- Developers have the flexibility to put any other folders in the project as needed.
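To make the `plugins` recommendation concrete, here is a minimal sketch of a custom operator registered through an Airflow plugin. The operator name, plugin name, and greeting logic are hypothetical placeholders; only the `AirflowPlugin` registration itself is standard Airflow 1.10 behaviour.

```python
# plugins/my_company_plugin.py -- hypothetical file name
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults


class GreetOperator(BaseOperator):
    """Toy operator standing in for reusable business logic."""

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # the real, more complex business logic would live here
        self.log.info("Hello, %s", self.name)


class MyCompanyPlugin(AirflowPlugin):
    # operators listed here become importable from
    # airflow.operators.my_company_plugin in Airflow 1.10
    name = "my_company_plugin"
    operators = [GreetOperator]
```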
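And here is a minimal DAG under `dags` that reuses the plugin operator above; the dag_id, schedule, and task id are made up for the example.

```python
# dags/example_greeting.py -- hypothetical file name
from datetime import datetime

from airflow import DAG
# Airflow 1.10 exposes plugin operators under airflow.operators.<plugin name>
from airflow.operators.my_company_plugin import GreetOperator

default_args = {"owner": "airflow", "retries": 1}

with DAG(
    dag_id="example_greeting",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    greet = GreetOperator(task_id="greet", name="airflow")
```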
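For the `config/init` idea, a connection init script could look roughly like the sketch below; whether the init hook runs a shell script or a Python file depends on your setup, and the connection id, type, and host here are invented for the example. It simply writes a `Connection` row through Airflow's metadata session, and it is idempotent so it can safely run every time the webserver starts.

```python
# config/init/add_connections.py -- hypothetical file name
from airflow import settings
from airflow.models import Connection

session = settings.Session()

conn = Connection(
    conn_id="my_postgres",              # hypothetical connection id
    conn_type="postgres",
    host="postgres.example.internal",   # hypothetical host
    schema="analytics",
    login="airflow",
    port=5432,
)

# only create the connection if it does not exist yet, so re-running is safe
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
```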
Docker Image Management
The image I provide in airflow-kubernetes is a generic Airflow image; it just provides the Python and Airflow runtime on a Linux box. I recommend maintaining a custom Airflow image which extends docker-airflow and pushing it to your own private Docker registry, e.g. GCR or ECR, because:
- developers may want to install custom software in the image
- developers need a way to install the 3rd-party Python dependencies specified in `requirements.txt` into the image
- developers may want to load some custom settings into the Airflow config
- developers may want to put some credentials into the image based on business requirements
- developers may also want a way to version their own Airflow deployment rather than using the base image
- in a multi-worker Airflow cluster, all the Python code for Airflow must be exactly the same across the nodes, so it's good to build the Airflow jobs and package all the Python code as an artifact; when we build a new image we pull the latest artifact and put it into the related dirs, so that when we deploy the image into the Kubernetes cluster we are sure the Python code is exactly the same across the different Airflow worker nodes
Maintaining a custom Airflow image is easy: just have a Dockerfile that starts with `FROM email2liyang/docker-airflow:1.10.2`, then you can start to customise your own image and deploy your own business logic into your Airflow cluster.