Apache Airflow allows you to manage complex data pipelines and run tasks in parallel - even distributed among many nodes - but not by default. The stock Airflow configuration comes with SQLite metadata database and
SeqentialExecutor. As the name suggests, it runs tasks sequentially, one after the other.
In plain English, this means your Airflow installation is not production-ready by default. Today we'll fix that by migrating the Airflow metastore database to Postgres and enabling parallel execution on a single machine with
LocalExecutor. If you plan to distribute your workload among different nodes, try
CeleryExecutor, but that's a topic for another time.
Don't feel like reading? Watch my video instead:
Modify Airflow Configuration File (airflow.cfg)
If you were to open the Airflow webserver homepage, you'd see the following two warning messages:
It complains about the metadata database, as SQLite shouldn't be used in production. It also doesn't fancy the
SequentialExecutor, for obvious reasons. We'll have to migrate the metadata database before addressing the executor, because the SQLite database doesn't support parallelism.
airflow.cfg file that's located in your root Airflow directory. In there, starting at line 24, you'll see the following two values specified:
Change the values as follows:
executor = LocalExecutor sql_alchemy_conn = <postgres_connection>
Yes, you'll need Postgres installed and you'll need to create a new database for Airflow. What you see above is just a template you can use to write the connection string.
Once done, you'll have to initialize the database and create the Admin user. Let's do that next.
Initialize Airflow Database and Create the User
There's no optimal way to stop a process that's running in the daemon mode. However, we know it's running on port 8080. Use the
lsof command to list the processes running on a specific port and then kill them:
lsof -i tcp:8080 kill <pid>
You should only kill the first process, and the others running on the same port will be terminated automatically.
From here, run the following command to initialize the database:
airflow db init
The initialization process will complete after a couple of seconds, which means we can create the Admin user next:
airflow users create --username admin --password admin --role Admin --firstname <fname> --lastname <lname> --email <email>
You should see the following success message after a couple of seconds:
Amazing - we've successfully migrated the database and created the Admin user. The only thing left to do is to restart the Airflow webserver and scheduler, so let's do that next.
Restart Apache Airflow
Run the following two commands to run the Airflow webserver and scheduler in the daemon mode:
airflow webserver -D airflow scheduler -D
Once you open the homepage, you won't see any warning messages:
The metadata database migration and the change of Airflow executor were successful, which means you're ready to run tasks in parallel on a single machine.
But what did Airflow actually store in the metadata database? A lot of tables, as it turns out. These are responsible to keep Airflow up and running:
That's all I wanted to cover today, so let's wrap things up next.
Yes, I know, we haven't written any code today - but that's the point. You already know how to write a basic data pipeline with Airflow, and the number one issue you're facing is speed. For example, there's no point in scraping five individual pages one after the other. You can parallelize the process and save yourself some time.
Well, now you can parallelize the process, as we took care of the maintenance and configuration tasks. In the following article, you'll write your first DAG that leverages parallelism, so stay tuned for that.
- 5 Best Books to Learn Data Science Prerequisites (Math, Stats, and Programming)
- Top 5 Books to Learn Data Science in 2022
- How to Install Apache Airflow Locally