How Apache Airflow Works

Every data team eventually hits the same wall: dozens of scripts running on cron, failing silently, with no way to see what depends on what. Airflow was built to solve exactly this — it turns pipeline orchestration from a scheduling problem into a dependency graph.

45,091 stars · Python · 10 components · 6-stage pipeline

What airflow Does

Orchestrates distributed DAG execution with pluggable task scheduling and monitoring

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows as Directed Acyclic Graphs (DAGs). The system uses a distributed architecture where schedulers dispatch tasks to workers, with support for multiple execution environments and extensive provider integrations for external systems.
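
To make that concrete, here is a minimal, illustrative DAG definition using Airflow 2.x-style imports; the DAG id, schedule, and task logic are placeholder examples rather than anything from the airflow source itself.

```python
# A minimal, illustrative DAG: two tasks with an explicit dependency.
# The DAG id, schedule, and task bodies are hypothetical examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    return {"rows": 42}          # placeholder task logic


def load():
    print("loading...")          # placeholder task logic


with DAG(
    dag_id="example_pipeline",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",         # evaluated by the scheduler
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task           # a dependency edge in the DAG
```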

Architecture Overview

airflow is organized into 5 layers comprising 10 components.

Control Plane: FastAPI web server and airflowctl CLI for managing deployments, monitoring execution, and user interaction
Orchestration Core: Scheduler daemon that parses DAGs, creates task instances, and coordinates execution across workers
Task Execution: Pluggable executors and task SDK that run actual task code in distributed environments
Provider Ecosystem: 70+ provider packages with operators, hooks, and connections for external systems
Shared Infrastructure: Common utilities for configuration, serialization, logging, and secrets management

How Data Flows Through airflow

DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.

1. DAG Discovery and Parsing

DagFileProcessorManager scans DAG directories, parses Python files in separate processes, validates DAG structure, and serializes valid DAGs to the metadata database

Config: dags_folder, dag_file_processor_timeout, max_active_tasks_per_dag
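
To see what this stage produces, you can point a DagBag at a DAG folder yourself; a small sketch, assuming an Airflow 2.x-style API and a hypothetical /opt/airflow/dags path.

```python
# Sketch: inspect what DAG parsing produces, using a hypothetical dags folder.
from airflow.models import DagBag

bag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)

print(f"parsed {len(bag.dags)} DAGs")
for dag_id, dag in bag.dags.items():
    print(dag_id, [t.task_id for t in dag.tasks])

# Files that failed to parse are reported here rather than disappearing silently.
for path, error in bag.import_errors.items():
    print(f"import error in {path}: {error}")
```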

2. Schedule Evaluation

SchedulerJobRunner compares each DAG's schedule against the current time and creates DagRun instances for eligible executions, respecting catchup and max_active_runs settings

Config: schedule_interval, catchup, max_active_runs_per_dag
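
These knobs live on the DAG definition itself; a hedged sketch with illustrative values.

```python
# Illustrative scheduling settings on a DAG definition (values are examples).
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="nightly_report",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",      # cron expression: run at 02:00 each day
    catchup=False,                      # do not backfill missed intervals
    max_active_runs=1,                  # at most one DagRun in flight at a time
) as dag:
    ...
```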

3. Task Instance Creation

For each DagRun, the scheduler creates TaskInstance records for all tasks, evaluates their dependencies, and marks eligible tasks as queued

Config: depends_on_past, wait_for_downstream, task_concurrency
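
The dependency-related settings are per-task operator arguments; a sketch with illustrative values (max_active_tis_per_dag is the newer name for task_concurrency).

```python
# Illustrative per-task dependency settings (Airflow 2.x-style).
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="deps_demo",                 # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    build = EmptyOperator(
        task_id="build",
        depends_on_past=True,           # wait for this task's previous run to succeed
        wait_for_downstream=True,       # ...and for its downstream tasks in that run
    )
    publish = EmptyOperator(
        task_id="publish",
        max_active_tis_per_dag=2,       # cap concurrent instances of this task
    )

    build >> publish                    # publish is only queued once build succeeds
```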

4. Task Dispatching

BaseExecutor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) receive queued TaskInstances and dispatch them to available workers

Config: executor, parallelism, worker_concurrency
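
Executor choice and parallelism are deployment-level settings (airflow.cfg or environment variables) rather than DAG code; a sketch of reading them at runtime via the standard configuration helper, with hypothetical override values in the comments.

```python
# Sketch: read dispatch-related settings from the active Airflow configuration.
from airflow.configuration import conf

executor = conf.get("core", "executor")            # e.g. LocalExecutor, CeleryExecutor
parallelism = conf.getint("core", "parallelism")   # max running task instances per scheduler

print(f"executor={executor}, parallelism={parallelism}")

# Equivalent environment-variable overrides (hypothetical values):
#   AIRFLOW__CORE__EXECUTOR=CeleryExecutor
#   AIRFLOW__CORE__PARALLELISM=32
#   AIRFLOW__CELERY__WORKER_CONCURRENCY=16
```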

5. Task Execution

TaskSDKRunner or the legacy task execution path loads operator code, executes task logic, handles XCom data exchange, and reports state changes back to the database

Config: task_execution_timeout, xcom_backend, enable_xcom_pickling
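
A sketch of the XCom exchange and an execution timeout using the TaskFlow API; task names and values are illustrative.

```python
# Illustrative XCom exchange via the TaskFlow API: return values are pushed
# to XCom and pulled automatically when passed between tasks.
from datetime import datetime, timedelta
from airflow.decorators import dag, task


@dag(dag_id="xcom_demo", start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
def xcom_demo():
    @task(execution_timeout=timedelta(minutes=10))  # fail the task if it runs too long
    def extract() -> dict:
        return {"row_count": 42}                    # pushed to XCom

    @task
    def report(stats: dict) -> None:
        print(f"extracted {stats['row_count']} rows")  # pulled from XCom

    report(extract())


xcom_demo()
```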

6. State Synchronization

Executors poll task workers for completion, update TaskInstance states in the metadata database, and trigger downstream task evaluation

Config: job_heartbeat_sec, scheduler_heartbeat_sec

System Dynamics

Beyond the pipeline, airflow has runtime behaviors that shape how it responds to load, failures, and configuration changes.

Data Pools

Metadata Database (database)
SQLAlchemy-managed database storing DAGs, task instances, DAG runs, connections, variables, and execution history

DagBag Cache (in-memory)
In-memory cache of parsed DAG objects with lazy loading and refresh mechanisms

Executor Task Queue (queue)
Internal queue of tasks waiting for execution by worker processes

XCom Storage (database)
Cross-communication storage for data exchange between tasks with pluggable backends

Feedback Loops

Scheduler Heartbeat Loop (polling)
Trigger: timer-based interval → SchedulerJobRunner evaluates schedules, creates tasks, and processes state changes (exits on scheduler shutdown signal)

Task Retry Loop (retry)
Trigger: task failure with retry configuration → the TaskInstance state is reset to queued and re-dispatched to the executor with exponential backoff (exits when max retries are reached or the task succeeds); a retry configuration sketch follows this list

DAG File Processing Loop (polling)
Trigger: file system changes or periodic scan → DagFileProcessorManager rescans the DAG directory and reparses modified files (exits on process shutdown)

Executor Heartbeat Loop (polling)
Trigger: configured heartbeat interval → BaseExecutor checks worker status, updates task states, and dispatches new tasks (exits on executor shutdown)
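
The retry loop above is driven entirely by per-task settings; a sketch with illustrative values.

```python
# Illustrative retry settings: on failure the TaskInstance is re-queued
# until retries are exhausted, with an exponentially growing delay.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_demo",                # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    flaky_call = BashOperator(
        task_id="flaky_call",
        bash_command="curl --fail https://example.com/health",  # hypothetical command
        retries=3,                              # re-dispatch up to 3 times
        retry_delay=timedelta(minutes=1),       # base delay between attempts
        retry_exponential_backoff=True,         # 1m, 2m, 4m, ...
        max_retry_delay=timedelta(minutes=30),  # cap the backoff
    )
```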

Control Points

airflow exposes five main control points: Executor Selection, Parallelism Limits, DAG Concurrency, Scheduler Heartbeat, and Task Retry Behavior.

Delays

DAG File Parsing: configurable timeout per file
Schedule Lag: up to the scheduler heartbeat interval
Task Queuing: depends on worker availability
Database Lock Contention: variable based on load

Technology Choices

airflow is built with 8 key technologies. Each serves a specific role in the system.

FastAPI: modern web framework providing REST API endpoints and automatic OpenAPI documentation
SQLAlchemy: ORM layer managing metadata database operations and schema migrations
Celery: distributed task queue system for scaling task execution across multiple workers
Kubernetes: container orchestration platform for running tasks as isolated pods
Pydantic: data validation and serialization for API request/response models
Alembic: database migration tool for managing schema version changes
Rich/Click: enhanced CLI interfaces with better formatting and user experience
Pendulum: timezone-aware datetime handling for schedule calculations

Who Should Read This

Data engineers evaluating orchestration tools, or anyone inheriting an Airflow codebase and trying to understand how the pieces fit together.

This analysis was generated by CodeSea from the apache/airflow source code. For the full interactive visualization — including pipeline graph, architecture diagram, and system behavior map — see the complete analysis.

Frequently Asked Questions

What is airflow?

Orchestrates distributed DAG execution with pluggable task scheduling and monitoring

How does airflow's pipeline work?

airflow processes data through 6 stages: DAG Discovery and Parsing, Schedule Evaluation, Task Instance Creation, Task Dispatching, Task Execution, and State Synchronization. DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.

What tech stack does airflow use?

airflow is built with FastAPI (modern web framework providing REST API endpoints and automatic OpenAPI documentation), SQLAlchemy (ORM layer managing metadata database operations and schema migrations), Celery (distributed task queue system for scaling task execution across multiple workers), Kubernetes (container orchestration platform for running tasks as isolated pods), Pydantic (data validation and serialization for API request/response models), Alembic (database schema migrations), Rich/Click (CLI interfaces), and Pendulum (timezone-aware datetime handling).

How does airflow handle errors and scaling?

airflow uses 4 feedback loops, 5 control points, and 4 data pools to manage its runtime behavior. These mechanisms handle error recovery, load distribution, and configuration changes.

Visualize airflow yourself

See the interactive pipeline graph, architecture diagram, and system behavior map.
