Workflow #1 | Airflow


Introduction

  • Contents
    • Airflow components
    • Architecture Diagrams


1. Airflow Components

  • Airflow’s architecture consists of multiple components, which fall into two groups:

    • (1) components required for a bare-minimum Airflow installation,
    • (2) optional components that improve Airflow’s extensibility, performance, and scalability.
  • Required components

    • scheduler, which both triggers scheduled workflows and submits tasks to the executor to run.
      • The executor is a configuration property of the scheduler, not a separate component, and it runs within the scheduler process.
      • There are several executors available out of the box, and you can also write your own.
    • webserver, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
    • A folder of DAG files, which is read by the scheduler to figure out what tasks to run and when to run them.
    • metadata database, which Airflow components use to store the state of workflows and tasks.
  • Optional components

    • Optional worker, which executes the tasks given to it by the scheduler.
      • In a basic installation, the worker might be part of the scheduler rather than a separate component.
      • It can run as a long-running process with the CeleryExecutor, or as a pod with the KubernetesExecutor.
    • Optional folder of plugins.
      • Plugins are a way to extend Airflow’s functionality (similar to installed packages).
      • Plugins are read by the scheduler, dag processor, triggerer, and webserver. More about plugins can be found in Plugins.
    • Optional triggerer, which executes deferred tasks in an asyncio event loop.
      • In a basic installation where deferred tasks are not used, a triggerer is not necessary.
      • More about deferring tasks can be found in Deferrable Operators & Triggers.
    • Optional dag processor, which parses DAG files and serializes them into the metadata database.
      • By default, the dag processor is part of the scheduler, but it can be run as a separate component for scalability and security reasons.
      • If a dag processor is present, the scheduler does not need to read the DAG files directly. More about processing DAG files can be found in DAG File Processing.
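
As a rough illustration of the components above, the sketch below collects the relevant settings as environment variables, assuming Airflow 2.x (2.3 or later) and its `AIRFLOW__{SECTION}__{KEY}` naming convention. It shows that the executor is chosen through configuration rather than started as its own process. The helper name `build_env`, the folder paths, and the database connection string are placeholders invented for this example, not values from any real deployment.

```python
def build_env(executor="LocalExecutor", standalone_dag_processor=False):
    """Return Airflow settings as environment variables (hypothetical helper)."""
    env = {
        # The executor is a setting read by the scheduler, not a separate process.
        "AIRFLOW__CORE__EXECUTOR": executor,
        # Folder of DAG files read by the scheduler (placeholder path).
        "AIRFLOW__CORE__DAGS_FOLDER": "/opt/airflow/dags",
        # Optional folder of plugins (placeholder path).
        "AIRFLOW__CORE__PLUGINS_FOLDER": "/opt/airflow/plugins",
        # Metadata database shared by the components (placeholder credentials).
        "AIRFLOW__DATABASE__SQL_ALCHEMY_CONN":
            "postgresql+psycopg2://user:password@localhost/airflow",
    }
    if standalone_dag_processor:
        # Ask the scheduler to leave DAG parsing to a separate dag processor.
        env["AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR"] = "True"
    return env


if __name__ == "__main__":
    for key, value in build_env().items():
        print(f"{key}={value}")
```

Passing `standalone_dag_processor=True` only adds the corresponding setting; the dag processor itself would still have to be started as its own process.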


2. Architecture Diagrams

  • Connection types in the diagrams

    • brown solid lines represent DAG files submission and synchronization
    • blue solid lines represent deploying and accessing installed packages and plugins
    • black dashed lines represent control flow of workers by the scheduler (via executor)
    • black solid lines represent accessing the UI to manage execution of the workflows
    • red dashed lines represent accessing the metadata database by all components
  • Basic Airflow Deployment

    • The simplest deployment of Airflow, usually operated and managed on a single machine.
    • Such a deployment usually uses the LocalExecutor.
    • The webserver runs on the same machine as the scheduler.
    • There is no triggerer component, which means that task deferral is not possible.
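
The basic deployment described above comes down to two long-running processes on one machine. The sketch below lists the `airflow` CLI commands that such a deployment, or a larger one, might start, assuming the Airflow 2.x subcommand names. `processes_to_start` and its parameters are hypothetical names used only for this illustration.

```python
def processes_to_start(use_triggerer=False, standalone_dag_processor=False,
                       celery_workers=0):
    """List the `airflow` CLI commands for a deployment (hypothetical helper)."""
    # Required processes: the scheduler (which also hosts the executor)
    # and the webserver.
    commands = [["airflow", "scheduler"], ["airflow", "webserver"]]
    if use_triggerer:
        # Only needed when deferrable tasks are used.
        commands.append(["airflow", "triggerer"])
    if standalone_dag_processor:
        # DAG parsing moved out of the scheduler process.
        commands.append(["airflow", "dag-processor"])
    # One command per Celery worker process (relevant with the CeleryExecutor).
    commands.extend(["airflow", "celery", "worker"] for _ in range(celery_workers))
    return commands


if __name__ == "__main__":
    # Basic deployment: scheduler and webserver only, no triggerer.
    for command in processes_to_start():
        print(" ".join(command))
```

With the default arguments the list contains only the scheduler and the webserver, matching the single-machine setup; the optional flags add the components that a distributed setup may run elsewhere.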

  • Distributed Airflow Architecture
    • Components of Airflow are distributed among multiple machines, and various user roles are introduced: Deployment Manager, DAG author, and Operations User.
    • The webserver does not have access to the DAG files directly.
      • The code in the Code tab of the UI is read from the metadata database.
      • The webserver cannot execute any code submitted by the DAG author.
      • The Operations User only has access to the UI and can only trigger DAGs and tasks, but cannot author DAGs.
    • The DAG files need to be synchronized between all the components that use them: scheduler, triggerer, and workers.
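
To make the synchronization requirement concrete, the sketch below models which components need a copy of the DAG files. It only encodes the statements in these notes: the webserver never reads DAG files, and a separately running dag processor takes over file access from the scheduler. The function names are hypothetical and not part of Airflow.

```python
def dag_file_consumers(standalone_dag_processor=False):
    """Return the components that need the DAG files (hypothetical helper)."""
    # The triggerer and the workers always need the DAG files.
    consumers = {"triggerer", "worker"}
    # With a separate dag processor, the scheduler no longer reads the files.
    consumers.add("dag-processor" if standalone_dag_processor else "scheduler")
    return sorted(consumers)


def needs_dag_files(component, standalone_dag_processor=False):
    """Tell whether DAG files must be synchronized to `component`."""
    return component in dag_file_consumers(standalone_dag_processor)


if __name__ == "__main__":
    print(dag_file_consumers())          # components to sync DAG files to
    print(needs_dag_files("webserver"))  # the webserver reads from the database
```

A file-sync job could iterate over `dag_file_consumers()` to decide where to copy DAG files, leaving the webserver out because it reads the serialized code from the metadata database.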