
Airflow remains a popular choice when teams need to pick an open-source orchestration solution. But it is often set up the wrong way.

Here are a few common mistakes DevOps and Data teams make when they spin up Airflow.

1. The DAG Folder Is In The Same Repo As The Main Airflow Infrastructure - That is to say, the DAG folder is not separated from the main Airflow container (or however you have Airflow set up). That means that when you change a DAG, you need to completely restart all of Airflow. This can cause jobs to fail and is just inconvenient. The DAG folder setting in your airflow.cfg file (dags_folder) should point to a location that is backed by a separate repo.

2. Configuring The Log Folder To A Local Location - This is actually a mistake I made back in 2015 or 2016. Airflow had just come out and I was excited to use it, so I spun it up on an EC2 instance without thinking about where the log files would go. Six months later the EC2 instance crashed because the logs had filled it up. I still see this happening today.

3. Building DAGs That Can’t Be Backfilled - Now let’s dive into the DAGs themselves. One of the benefits of Airflow is that, if a DAG run failed in the past, you can re-run it easily without ruining the current data set, as long as the DAG was built with that in mind. Some of this came from the fact that many people using Airflow used a snapshot approach to data modeling. But whatever your approach is, you need to make it easy to maintain the data your DAGs create.

4. Not Preparing For Scale - If you’re rushed to deploy Airflow, you might not take the time to think through how workers and schedulers will scale out as more jobs are deployed. When you first start deploying Airflow DAGs, you won’t notice this issue. All your DAGs will run (unless you have a few long-running ones) without a problem. But then you start having 20, 30, 100 DAGs, and you’ll notice that DAGs sit in the light green stage for a while before they run. Now, the fix might be as simple as changing some configurations in your airflow.cfg, but there is a chance you will need to get DevOps involved.
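
For points 1, 2 and 4, here is a rough sketch of what the relevant settings could look like. Airflow lets you override airflow.cfg entries with environment variables named AIRFLOW__{SECTION}__{KEY}, so the sketch just builds those names in Python. The section and key names follow Airflow 2.x as I understand it, so check them against your version. Every value (the paths, the bucket, the connection id, the numbers) is a placeholder I made up, and the helper airflow_env is hypothetical. Remote logging to S3 also assumes the Amazon provider package is installed.

```python
# Sketch only: settings from points 1, 2 and 4 expressed as Airflow
# environment overrides (AIRFLOW__{SECTION}__{KEY}). Key names assume
# Airflow 2.x; all values below are made-up placeholders.

def airflow_env(overrides):
    """Turn {(section, key): value} into AIRFLOW__SECTION__KEY -> value."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for (section, key), value in overrides.items()
    }

overrides = {
    # Point 1: DAGs live in a path that is synced from their own repo.
    ("core", "dags_folder"): "/opt/airflow/dags",
    # Point 2: ship task logs to object storage instead of the local disk.
    ("logging", "remote_logging"): "True",
    ("logging", "remote_base_log_folder"): "s3://example-airflow-logs/prod",
    ("logging", "remote_log_conn_id"): "aws_default",
    # Point 4: concurrency settings to revisit as the DAG count grows.
    ("core", "parallelism"): 64,
    ("core", "max_active_tasks_per_dag"): 16,
    ("celery", "worker_concurrency"): 16,
}

env = airflow_env(overrides)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The right numbers depend entirely on your executor and hardware, so treat them as a starting point for a conversation with DevOps, not as recommendations.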

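For point 3, one common way to keep a DAG backfillable is to have each run overwrite only the partition for its own date, so re-running an old date never duplicates rows or touches today's data. This is a minimal sketch of that idea in plain Python, not the only way to do it. The warehouse dict and the extract and load_partition functions are hypothetical stand-ins. In a real Airflow task the date would come from the run's logical date (for example the ds template variable) and not from the current clock.

```python
from datetime import date

# Hypothetical in-memory "warehouse": one partition of rows per logical date.
warehouse = {}

def extract(logical_date):
    """Stand-in for a source query filtered to a single day."""
    return [{"day": logical_date.isoformat(), "orders": logical_date.day * 10}]

def load_partition(logical_date):
    """Replace the whole partition for this date; never append to it."""
    warehouse[logical_date.isoformat()] = extract(logical_date)

# Normal daily runs.
load_partition(date(2024, 4, 7))
load_partition(date(2024, 4, 8))

# Re-running (backfilling) an old date only rewrites that date's partition,
# no matter how many times it is repeated.
load_partition(date(2024, 4, 7))
load_partition(date(2024, 4, 7))

print(sorted(warehouse))
```
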
Are there any other tips or common mistakes you see teams make when setting up Airflow? You can read the full article here:

Mistakes I Have Seen When Data Teams Deploy Airflow
Apr 8, 2024 at 8:36 PM
