ML Orchestration: Why It's Time to Move Past Airflow
Airflow is one of the most successful open-source projects out there, with over 30k stars on GitHub and thousands of businesses using it around the world. It is arguably the first mainstream Directed Acyclic Graph (DAG) orchestrator dedicated to data workflows.
It was started in 2014 by Maxime Beauchemin at Airbnb, and joined the Apache Software Foundation in 2016. Since then, it has been widely deployed by Data Engineering teams to automate ETL processes and other data-related workflows. It is now commercialized by Astronomer.io.
Airflow is an amazing product, very versatile, and well supported. It is likely the DAG orchestrator with the most adoption, and therefore is a serious option to consider when trying to automate workflows.
Is it suitable for Machine Learning workflows though?
What are Machine Learning workflows?
ML workflows are CI/CD for models
ML models are not static assets. They constantly need to be retrained with new data, tested, fine-tuned, debugged, experimented on, etc. For example, once a model is deployed to production, it will start to drift after some time, so it needs to be retrained with more recent data.
Therefore, a model should be reproducible “from scratch”, that is, starting from the production-grade data assets sitting in your Data Warehouse. For example, a typical training workflow can contain the following sequential steps:
- Data mining – select training data
- Data cleaning – e.g. remove outliers, normalize distributions, one-hot encode features, process images, etc.
- Data processing – convert training data into the final feature format for consumption by the model training code
- Train/test split
- Model training
- Model evaluation
- Model testing – regression testing against past failure modes
- Metrics generation – generate a human-digestible summary of metrics
- Model deployment
These steps need to be automated in a workflow so that they can be easily repeated to refresh or debug the model.
A workflow serves as a source of truth of how a model can be produced from scratch.
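To make this concrete, here is a minimal sketch of such a workflow expressed as plain Python steps chained in order. Every function below is a hypothetical stub standing in for your own data and training code; only the overall shape matters.

```python
# Hypothetical sketch: the training workflow above as plain Python steps.
# Every function is a stub standing in for real data and training code.

def mine_data(table: str) -> list[dict]:
    # Data mining: select training rows from the warehouse (stubbed).
    return [{"feature": 1.0, "label": 0}, {"feature": 2.0, "label": 1}]

def clean_data(rows: list[dict]) -> list[dict]:
    # Data cleaning: outliers, normalization, encoding (stubbed as a pass-through).
    return rows

def split_train_test(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    cut = max(1, int(0.8 * len(rows)))
    return rows[:cut], rows[cut:]

def train_model(rows: list[dict]) -> dict:
    # Model training (stubbed as returning a trivial "model").
    return {"weights": [0.1]}

def evaluate_model(model: dict, rows: list[dict]) -> dict:
    # Model evaluation and testing against past failure modes (stubbed).
    return {"accuracy": 0.9}

def deploy_model(model: dict) -> str:
    # Model deployment (stubbed as returning a registry URI).
    return "models:/my-model/staging"

if __name__ == "__main__":
    train_rows, test_rows = split_train_test(clean_data(mine_data("warehouse.training_events")))
    model = train_model(train_rows)
    metrics = evaluate_model(model, test_rows)
    print(metrics, deploy_model(model))
```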
Here are other examples of ML workflows:
- End-to-end training – the workflow described above, which starts from scratch and produces a final deployed model
- Regression testing – just like Continuous Integration, a model can be retrained and tested against a golden dataset every time changes are made to any model-critical code (e.g. training loop, data processing code, etc.) – see the sketch after this list
- Train/Eval loops – to iterate on model architecture with a fixed training dataset
- Hyperparameter tuning
- Data processing chains – to sequence multiple data processing tasks (e.g. Spark, map/reduce, SQL scripts, etc.)
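As an illustration of the regression-testing workflow above, a CI job might retrain on a pinned golden dataset and fail the build if key metrics drop below agreed thresholds. The sketch below is hypothetical: the dataset URI, thresholds, and training entry point are placeholders.

```python
# Hypothetical sketch of an ML regression test: retrain on a pinned "golden"
# dataset and fail CI if key metrics drop. URIs, thresholds, and the training
# entry point are all placeholders.

GOLDEN_DATASET = "s3://my-bucket/golden/v3.parquet"  # pinned, versioned dataset
THRESHOLDS = {"accuracy": 0.92, "recall": 0.85}      # agreed metric floors

def regression_test(train_and_evaluate) -> None:
    # train_and_evaluate is your own entry point returning a metrics dict.
    metrics = train_and_evaluate(GOLDEN_DATASET)
    failures = {
        name: (metrics.get(name, 0.0), floor)
        for name, floor in THRESHOLDS.items()
        if metrics.get(name, 0.0) < floor
    }
    if failures:
        raise AssertionError(f"Model regression detected: {failures}")
```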
Iterative development: how ML Engineers work
ML engineering is more akin to Science than to Engineering. ML work is inherently iterative. ML practitioners experiment continuously with various model architectures, data sampling strategies, hyperparameters, etc. They need a very fast feedback loop: change code, run, view results, repeat. ML Engineering is essentially research work: trial and error.
This is quite different from Data Engineering work, for which Airflow was designed. Take a typical ETL pipeline. Maybe it reads data from a production database, flattens it, standardizes it, and stores it in a data warehouse. Data Engineers can iterate on each individual query in a SQL editor, and then automate them. The space of possible parameters and changes is more narrow than for Machine Learning.
The problem with Airflow
Airflow is an amazing product, but it is not suitable for Machine Learning work.
Main problems
These are blockers that make Airflow a less than ideal tool for ML development.
Lack of iterative development
Airflow pipelines are declared in Python modules, and must be stored in a predetermined $AIRFLOW_HOME/dags directory. That is how the Airflow scheduler is able to find them when attempting to start one. This means that in order to update a DAG, its Python module needs to be shipped to the server where Airflow is running so that it can be picked up. This poses the following issues:
- there is no standardized process to update a DAG, short of literally copying the DAG file at the right location (e.g. through SSH, or a full redeploy of the instance)
- new dependencies (e.g. new Python modules, new pip packages) need to be present on the Airflow instance. This means that adding a new pip dependency likely requires a new deployment of the instance.
- once a DAG is updated, it is updated for everyone. Different users cannot run different versions of the same DAG. This seems absolutely prohibitive in a research environment
For all these reasons, Airflow does not provide any way to quickly iterate on pipelines, make code changes, submit jobs, view results, and keep iterating.
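For context, a DAG is just a Python module like the minimal TaskFlow-style sketch below (Airflow 2.4+ syntax). The scheduler only discovers it once the file lands under $AIRFLOW_HOME/dags on the deployed instance, and anything it imports must already be installed there.

```python
# Minimal TaskFlow-style DAG sketch (Airflow 2.4+ syntax). The scheduler only
# sees it once this module lives under $AIRFLOW_HOME/dags on the deployment,
# and any package it imports must be installed on that instance.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def train_model_dag():
    @task
    def prepare_dataset() -> str:
        return "s3://bucket/datasets/latest"  # hypothetical dataset URI

    @task
    def train(dataset_uri: str) -> str:
        # The training code and its dependencies must be importable here too.
        return "s3://bucket/models/latest"    # hypothetical model URI

    train(prepare_dataset())

train_model_dag()
```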
Lack of local execution
Airflow itself can run on a local machine. Users can set up their own local Airflow instance and run workflows locally. However, workflows cannot be executed locally against a deployed Airflow instance.
Why does that matter?
As part of their iterative development process (aka research), ML Engineers frequently need to test or debug things locally on a small amount of data before scaling their workloads to their cloud clusters. Having to set up and run a local Airflow instance for this purpose is wildly impractical. It also means results from their local runs would only be available on their local machines, impossible to share or compare with results from runs executed on the deployed instance.
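To illustrate the workflow ML Engineers actually want, here is a purely hypothetical sketch; this is not an Airflow API, just the shape of the iteration loop: run the same pipeline locally on a small sample, then submit it at full scale once the code looks right.

```python
# Purely hypothetical sketch: NOT an Airflow API. It only illustrates the
# iteration loop ML Engineers need: run the same pipeline locally on a small
# sample for debugging, then at full scale on the shared cluster, with both
# runs tracked in the same place.

def run_pipeline(dataset_uri: str, sample_fraction: float = 1.0, local: bool = False):
    # Placeholder for a pipeline entry point (hypothetical).
    ...

# Fast local debugging loop on 1% of the data:
run_pipeline("warehouse.training_events", sample_fraction=0.01, local=True)

# Full-scale run on the shared cluster once the code looks right:
run_pipeline("warehouse.training_events")
```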
Lack of Lineage Tracking
Lineage Tracking is the exhaustive tracking of all assets that went into training a model: data, configuration, code, resources, intermediate artifacts, etc. Read our article about Lineage Tracking to learn more.
Airflow does not perform a thorough tracking of all these assets. It has a broad but basic concept of Datasets and their producer tasks, which is a far cry from an exhaustive tracking of all data, configuration, code, and resources.
This means that it does not enable ML teams to answer questions such as: “what hyperparameters were used to train this model?”. ML Platform teams will need to build additional layers and tooling to enable this source-of-truth traceability.
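For reference, here is roughly what Airflow's Dataset primitive covers (Airflow 2.4+): it records that a task updated a URI and lets another DAG be scheduled on that update, but it captures nothing about the code, configuration, hyperparameters, or compute resources involved. The URIs and DAG names below are made up.

```python
# Sketch of Airflow's Dataset primitive (Airflow 2.4+). It records that a task
# updated a URI and can schedule downstream DAGs on that update, but it does
# not capture the code, configuration, hyperparameters, or resources behind it.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

training_set = Dataset("s3://bucket/datasets/training.parquet")  # hypothetical URI

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def produce_training_set():
    @task(outlets=[training_set])
    def build_dataset():
        # Airflow only records that this task updated the Dataset URI.
        ...

    build_dataset()

produce_training_set()

@dag(schedule=[training_set], start_date=datetime(2023, 1, 1), catchup=False)
def retrain_on_new_data():
    @task
    def train():
        ...

    train()

retrain_on_new_data()
```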
Lack of visualizations
Because Airflow does not systematically track inputs and outputs of workflow steps, or any other intermediate data, it is unable to produce visualizations of business logic information such as loss curves, Precision-Recall curves, or confusion matrices. These are crucial for ML Engineers to figure out what the next step should be in their experimentation flow.
Again, ML Platform teams will need to build additional tooling layers on top of Airflow in order to enable visualizations.
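As an example of the kind of glue code this leads to, the hedged sketch below has a task push evaluation metrics to an external experiment tracker (MLflow here, purely as one possible choice) so that curves can be plotted somewhere else. The experiment name and metric payloads are placeholders.

```python
# Hedged sketch of the glue an ML Platform team might bolt onto Airflow: a task
# logs evaluation metrics to an external tracker (MLflow, as one possible
# choice) because Airflow itself will not visualize them. Names are placeholders.
import mlflow
from airflow.decorators import task

@task
def log_evaluation_metrics(metrics: dict, epoch_losses: list[float]) -> None:
    mlflow.set_experiment("my-model-evaluation")        # hypothetical experiment
    with mlflow.start_run():
        for name, value in metrics.items():
            mlflow.log_metric(name, value)              # e.g. accuracy, recall
        for step, loss in enumerate(epoch_losses):
            mlflow.log_metric("loss", loss, step=step)  # points of a loss curve
```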
Minor problems
These are non-blocking annoyances.
UI from another century
A good User Interface is as useful to an infrastructure product as a good SDK. UIs should be visually pleasing but also designed to surface the right information.
The Airflow UI is an eyesore. Some may say aesthetics do not matter, beauty is only skin deep. We argue a visually pleasing product provides more satisfaction and joy to use. Airflow’s UI has not evolved much since its inception a decade ago, and even then it was sub-par.
Beyond the aesthetics, the Airflow UI is very much geared towards job statuses and job management: green squares and dots everywhere. Very little information is provided about the innards of each task: inputs, outputs, logs, code, exceptions, business metrics, etc.
Lack of composability
Because Airflow has different primitives for DAGs (@dag) and Tasks (@task), it is hard to compose long end-to-end pipelines from multiple smaller pipelines. Doing so is useful to enable users to iterate on smaller sub-pipelines (e.g. train_eval, prepare_training_dataset) before piecing them together in an end-to-end pipeline.
Airflow does provide a TriggerDagRunOperator but this does not seem as natural to use as what other tools provide.
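For reference, here is a hedged sketch of composing two smaller pipelines with TriggerDagRunOperator: DAGs are wired together by their string IDs rather than composed like functions. The child DAG IDs below are assumed to exist elsewhere.

```python
# Sketch of composing pipelines with TriggerDagRunOperator: the parent DAG
# triggers child DAGs by their string IDs rather than composing functions.
# The child DAG IDs below are assumed to exist elsewhere.
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def end_to_end_training():
    prepare = TriggerDagRunOperator(
        task_id="prepare_training_dataset",
        trigger_dag_id="prepare_training_dataset",  # must match the child DAG's ID
        wait_for_completion=True,
    )
    train_eval = TriggerDagRunOperator(
        task_id="train_eval",
        trigger_dag_id="train_eval",
        wait_for_completion=True,
    )
    prepare >> train_eval

end_to_end_training()
```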
Dynamic graphs
DAGs are called dynamic when their topology is unknown prior to execution. The outcome of the business logic will dictate the final shape of the DAG. For example, depending on the output of a task, users may want to proceed to different next tasks. Or, users may want to loop over an input (e.g. training configuration) to start multiple training jobs in parallel.
Once again, doing so is not strictly impossible in Airflow, but it requires quite a bit of SDK gymnastics.
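For instance, the fan-out case can be expressed with dynamic task mapping (Airflow 2.3+), sketched below with made-up training configurations; branching on a task's output requires yet another primitive (@task.branch).

```python
# Sketch of a dynamic fan-out with Airflow's dynamic task mapping (2.3+): one
# training task is expanded over a list of configurations produced at runtime.
# The configurations and metric are made up.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def hyperparameter_sweep():
    @task
    def generate_configs() -> list[dict]:
        # Hypothetical sweep over learning rates.
        return [{"learning_rate": lr} for lr in (1e-4, 1e-3, 1e-2)]

    @task
    def train(config: dict) -> float:
        # Train one model per config and return a validation metric (stubbed).
        return 0.0

    train.expand(config=generate_configs())

hyperparameter_sweep()
```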
More power to Airflow!
Despite the shortcomings listed above, we must be fair and also mention Airflow’s qualities.
- Multiple language support – Thanks to its concept of Operators, Airflow can readily execute tasks in many languages (e.g. Python, SQL, Bash, etc.). That can be very useful when working in a monorepo, or with tasks that cannot be easily expressed in Python. See the short sketch after this list.
- Battle tested – Airflow has broad adoption across the industry. It is well supported and very stable.
- Large community – Since it has been around and open-source for almost ten years, and since it was the first mainstream DAG orchestrator, Airflow is immensely popular (over 30,000 GitHub stars) and has gathered a large community of users and supporters.
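As a quick illustration of the Operator model mentioned in the list above, a single DAG can mix a Bash step and a Python step; the command and task names below are placeholders.

```python
# Minimal sketch of mixing languages in one DAG via Operators: a Bash step
# followed by a Python step. The command and task names are placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def mixed_language_pipeline():
    export = BashOperator(
        task_id="export_table",
        bash_command="echo 'exporting table...'",  # placeholder shell command
    )

    @task
    def post_process():
        # Python step consuming the exported data (stubbed).
        ...

    export >> post_process()

mixed_language_pipeline()
```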
Wrap-up
In this article, we exposed the shortcomings that prevent Airflow from being a good fit for orchestrating Machine Learning workflows.
In order to use Airflow for ML, teams will need to build many additional layers: lineage tracking, iterative development, visualizations, local execution, etc.
At Sematic, we have witnessed many ML teams trying their hand at Airflow for ML (because “It’s the most popular”, and “the Data teams are already using it”), and eventually failing to make it a useful tool.
What has been your experience using Airflow for Machine Learning? Let us know on Twitter or Discord.