In previous articles, I've introduced how to deploy Airflow on Kubernetes and how to organize an Airflow project in an efficient way. In real development work, I've also seen people use Airflow to do big data processing. I think this is a big misunderstanding about Airflow: its official website already declares very clearly that Airflow is a platform to programmatically author, schedule and monitor workflows. It does not declare itself as a data processing engine. In my view, Airflow is missing two critical data processing engine features.
Airflow lacks the capability to pass data between tasks
In a real-world data processing job, there are always multiple steps involved in processing the data (a minimal Airflow sketch of such a flow follows this list):
- load data from an external data source (e.g. a database, CSV files, SQS topics or BigQuery tables)
- pass the loaded data to the next steps for processing, which could involve multiple transformation or filtering steps
- persist the result back to an external data sink
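As a purely illustrative sketch, this is roughly what such a flow tends to look like when written as an Airflow DAG; the DAG id, task callables and scheduling values below are made-up placeholders, not a real pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load(**context):
    """Step 1: load data from an external source (database, CSV, SQS, BigQuery, ...)."""


def transform(**context):
    """Step 2: run the transformation / filtering steps."""


def sink(**context):
    """Step 3: persist the result back to an external data sink."""


with DAG(
    dag_id="orders_pipeline",          # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    sink_task = PythonOperator(task_id="sink", python_callable=sink)

    load_task >> transform_task >> sink_task
```

The DAG structure itself is easy to express; the problem, as described below, is how the data gets from one task to the next.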
Data processing engines like Flink or Spark support these concepts out of the box. To load and sink data, they offer different connectors for different data sources and sinks. Once data is loaded into the engine, it does not need to be persisted into any intermediate storage to be shared between tasks; the intermediate data is passed between tasks by the engine itself (Flink, for example, uses Kryo serialization by default). Airflow does not support this: when we want to share data between tasks, we have to persist the intermediate result somewhere (e.g. a database), then load it again in the next task, which makes the data processing very inefficient.
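To make the contrast concrete, here is a rough sketch (the paths, bucket names and column names are hypothetical). In PySpark the intermediate DataFrame is handed from step to step by the engine itself:

```python
# PySpark: the intermediate DataFrame never leaves the engine between steps.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")   # load
valid = orders.filter(orders.amount > 0)                    # transform (stays in the engine)
valid.write.parquet("s3://my-bucket/curated/orders/")       # sink
```

The Airflow counterpart of the transform step, by contrast, has to reload whatever the previous task persisted and persist its own output again, with XCom only carrying a small reference such as a file path (here I assume the load task wrote its raw output to a shared location and returned that path):

```python
import json


def transform(**context):
    # Path pushed to XCom by the (assumed) load task.
    raw_path = context["ti"].xcom_pull(task_ids="load")
    with open(raw_path) as f:                       # reload the intermediate result
        rows = json.load(f)
    filtered = [r for r in rows if r["amount"] > 0]
    out_path = "/shared-volume/orders_filtered.json"  # placeholder shared location
    with open(out_path, "w") as f:                  # persist again for the next task
        json.dump(filtered, f)
    return out_path                                 # pushed to XCom for the sink task
```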
Airflow is a shared instance or cluster that manages many data jobs
Airflow relies heavily on MySQL and RabbitMQ to manage DAG runs and the task dependencies inside each DAG, and an Airflow deployment is effectively a shared collection of data jobs. This is fine when we just trigger, schedule and monitor data jobs, because in Airflow we can set the retries parameter to retry on failure; a failure may happen, for example, when we need to redeploy Airflow to enable more jobs. But when we use Airflow as a data processing engine, the data processing may be interrupted, and Airflow has no good way to resume an interrupted job without data loss. As we put more and more data processing jobs into Airflow, it becomes more and more frequent that one data processing job is impacted by the others.
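For reference, this is the kind of retry configuration meant here (the values are illustrative); note that a retry re-runs the whole task from the beginning, it does not resume a half-finished data load:

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 3,                          # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between attempts
}

with DAG(
    dag_id="orders_pipeline",              # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ...                                    # tasks as in the earlier sketch
```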
On the other hand, data processing engines like Flink or Spark can be deployed with a single data processing job per cluster; for example, we can deploy a Spark job into Kubernetes just for that one job. This architecture guarantees that each data processing job runs in an isolated environment, so there is no cross-job impact. It gives developers great freedom to develop, manage and run their own data jobs without having to think about impacting others.
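For reference, the one-job-per-cluster approach usually boils down to something like the following spark-submit invocation against the Kubernetes API server (the address, namespace, image and application file are placeholders); Spark then spins up its own driver and executor pods just for this job and tears them down when the job finishes:

```bash
spark-submit \
  --master k8s://https://<kubernetes-api-server>:6443 \
  --deploy-mode cluster \
  --name orders-job \
  --conf spark.kubernetes.namespace=orders-job \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=2 \
  local:///opt/spark/app/orders_job.py
```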
So if you plan to use Airflow, calm down for a second and think about what you really want from it. If you want the ability to trigger, schedule and monitor workflows, go for it; it will do a great job in that area. If you want to process big data: no, no, no. Think about Flink or Spark instead.