Step-by-Step Process to Building Directed Acyclic Graphs (DAGs) with Airflow

December 13, 2022

Photo by charlesdeluvio on Unsplash

Everyday more and more data is becoming readily available from various forms. With this massive increase in data, it becomes easy for it get lost or get unnoticed as it comes in. To solve this problem, data engineers build data pipelines to control the flow of the data from one point to another. It is frequently referred to in the data engineering world as ETL, Extract, Transform, and Load.

Directed Acyclic Graphs are one tool for controlling the flow of data (DAGs). DAGs are a collection of tasks and operations that are performed on data in a specific order. These tasks dictate what happens to each piece of data as it flows through the pipeline. Implementation of DAGs is done with Apache Airflow which was initially built and developed by the team of Airbnb, who were at the time looking for quicker and more efficient ways to maintain and update their websites. It was later made open-source and transferred to the Apache Software Foundation.

In this article, you will understand what DAGs are all about and implement it following a step by step process in python.

What is Apache Airflow

Apache Airflow is an open-source platform used to create, schedule, and monitor workflows.

Airflow is composed of two elements:

Using the Airflow web server, users can manage and monitor their workflows, as well as perform some administrative actions such as managing users and connections. The CLI tool allows users to manage the Airflow environment and control the execution of workflows.

Workflows in Airflow are defined as directed acyclic graphs (DAGs), which are sets of tasks with dependencies between them. Tasks can be any kind of action, such as executing a Bash script, running a Python function, or calling an API.

Airflow also has a rich ecosystem of plugins that users can install to extend the functionality of the platform. For example, there are plugins for various databases, cloud services, and messaging systems, which allow users to integrate Airflow with those services.

What is Directed Acyclic Graphs (DAGs)

Photo by JJ Ying on Unsplash

A directed acyclic graph (DAG) is a type of graph that consists of a set of vertices (or nodes) connected by directed edges. A directed edge is an arrow that shows the direction of the relationship between two vertices. For example, if there is an edge from vertex A to vertex B, it means that there is a relationship from A to B, but not from B to A.

Unlike a regular graph, a DAG has no cycles, which means that there are no paths in the graph that start and end at the same vertex and follow the edges of the graph. This means that, in a DAG, it is possible to reach any vertex from any other vertex, but there is only one way to do it.

One way to think about a DAG is as a flowchart or a decision tree. In a flowchart, the vertices represent the different stages or steps of a process, and the edges represent the possible paths or decisions that can be made at each stage. In a decision tree, the vertices represent the different choices or options that are available, and the edges represent the relationships or dependencies between those choices.

For example, imagine that you are working on a machine learning project to classify images of animals into different categories (e.g. dogs, cats, birds, etc.). You can represent the different steps of your project as vertices in a DAG, and the connections between those steps as edges. The first vertex could be “Collect data”, and it could be connected to other vertices such as “Preprocess data”, “Train model”, and “Evaluate model”. Each of these vertices could also be connected to other vertices that represent the different options or sub-steps available at each step. For example, the “Preprocess data” vertex could be connected to vertices representing different types of preprocessing (e.g. normalization, augmentation, etc.), while the “Train model” vertex could be connected to vertices representing different types of models (e.g. neural networks, decision trees, etc.).

In this example, the DAG represents the different steps and choices that are involved in building a machine learning model, and the dependencies between those steps and choices. It allows you to see at a glance which steps need to be completed before others, and which choices are available at each step. It also ensures that there are no conflicts or inconsistencies in the workflow, since a DAG cannot have any cycles or loops.

Implementing DAGs with Python

Photo by Artturi Jalli on Unsplash

In this section, you will create a DAG that solves a quadratic equation in three separate tasks.

  • To get started, install the Apache Airflow python package using

pip install “apache-airflow[celery]==2.5.0” — constraint

“https://raw.githubusercontent.com/apache/airflow/constraints

2.5.0/constraints-3.7.txt”

You can check out the official Apache Airflow documentation here if you run into any issues with the installation.

  • Next, open your IDE and create a python file where you will create the DAG.
  • Next, import the DAG and PythonOperator class from airflow and airflow.operators.python respectively.

from airflow import DAG

from airflow.operators.python import PythonOperator,

  • Now you can go ahead to create an instance of your DAG. While creating the DAG, there are specific parameters which you need to pass.

dag = DAG(

dag_id=’quadratic_equation’,

description=’Calculate the quadratic equation (ax² + bx + c = 0)’, schedule_interval=None,

start_date=datetime(2022, 1, 1),

catchup=False

)

  • Next, create the tasks or operators.

square_root_task = PythonOperator(

task_id=’square_root’,

python_callable=calculate_square_root,

op_kwargs={‘a’: 1, ‘b’: 2, ‘c’: -3},

dag=dag

)

quadratic_equation_task = PythonOperator(

task_id=’quadratic_equation’,

python_callable=calculate_quadratic_equation,

op_kwargs={‘a’: 1, ‘b’: 2, ‘c’: -3},

dag=dag

)

Each of the tasks above have a task_id, a callable python function that performs the actions of the tasks, together with keyword arguments to be passed to the functions.

  • Next, define the callable functions to calculate the values of the quadratic equation.

def calculate_square_root(a, b, **kwargs):

return b * b — 4 * a * c

def calculate_quadratic_equation(a, b, c, **kwargs):

square_part = calculate_square_root(a, b, c)

return (-1 * b / (2 * a)) * ((square_part) ** 1/2)

  • Lastly, define the tasks flow. This means which function gets run first and which comes later. To do this, you will use the >> or << sign. The >> sign refers to a downstream, meaning it comes after a particular task, while << refers to an upstream, meaning it comes before any task. Here’s how to implement it in the code above:

square_root_task >> quadratic_equation_task

The code above means quadratic_equation_task is a downstream to

Conclusion

In conclusion, directed acyclic graphs (DAGs) are a powerful tool for representing and managing complex workflows. By using Airflow to build DAGs, you can easily define, schedule, and monitor your workflows in a consistent and reliable manner. Whether you are working on a machine learning project, a data pipeline, or any other type of workflow, Airflow and DAGs can help you streamline your processes and ensure that your workflows are executed in the correct order and with the right dependencies. By following the step-by-step process outlined in this article, you can quickly and easily get started with building DAGs with Airflow and take control of your workflows.