How Apache Airflow Works
Every data team eventually hits the same wall: dozens of scripts running on cron, failing silently, with no way to see what depends on what. Airflow was built to solve exactly this — it turns pipeline orchestration from a scheduling problem into a dependency graph.
What airflow Does
Orchestrates distributed DAG execution with pluggable task scheduling and monitoring
Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows as Directed Acyclic Graphs (DAGs). The system uses a distributed architecture where schedulers dispatch tasks to workers, with support for multiple execution environments and extensive provider integrations for external systems.
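The "workflows as code" model is easiest to see in a DAG file. Below is a minimal sketch in the Airflow 2.x style (the dag_id, schedule, and task callables are illustrative, not taken from the Airflow source):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data")


def load():
    print("writing data")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # evaluated by the scheduler on each heartbeat
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares a dependency edge in the DAG
    extract_task >> load_task
```

Everything described below (parsing, scheduling, dispatch) operates on files like this one.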
Architecture Overview
airflow is organized into 5 layers comprising 10 core components.
How Data Flows Through airflow
DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.
1. DAG Discovery and Parsing
DagFileProcessorManager scans DAG directories, parses Python files in separate processes, validates DAG structure, and serializes valid DAGs to the database
Config: dags_folder, dag_file_processor_timeout, max_active_tasks_per_dag
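These settings live in airflow.cfg and can be read back through Airflow's configuration API. A hedged sketch (the fallback values match common defaults but may differ by version):

```python
from airflow.configuration import conf

dags_folder = conf.get("core", "dags_folder")
parse_timeout = conf.getint("core", "dag_file_processor_timeout", fallback=50)
max_tasks = conf.getint("core", "max_active_tasks_per_dag", fallback=16)
```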
2. Schedule Evaluation
SchedulerJobRunner evaluates DAG schedules against the current time and creates DagRun instances for eligible executions, respecting catchup and max_active_runs settings
Config: schedule_interval, catchup, max_active_runs_per_dag
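Each of these knobs can also be set per DAG, overriding the airflow.cfg defaults. A sketch with illustrative values:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="nightly_report",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",      # cron expression; called schedule_interval before Airflow 2.4
    catchup=False,             # skip runs missed while the DAG was paused
    max_active_runs=1,         # at most one DagRun in flight at a time
):
    pass                       # tasks omitted for brevity
```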
3. Task Instance Creation
For each DagRun, the scheduler creates TaskInstance records for all tasks, evaluates their dependencies, and marks eligible tasks as queued
Config: depends_on_past, wait_for_downstream, task_concurrency
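The first two of these are per-task operator arguments. A sketch of how they attach (dag_id, task, and callable are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(dag_id="deps_demo", start_date=datetime(2024, 1, 1), schedule=None):
    aggregate = PythonOperator(
        task_id="aggregate",
        python_callable=lambda: print("aggregating"),
        depends_on_past=True,      # this task's previous run must have succeeded
        wait_for_downstream=True,  # and so must its immediate downstream tasks
    )
```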
4. Task Dispatching
BaseExecutor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) receive queued TaskInstances and dispatch them to available workers
Config: executor, parallelism, worker_concurrency
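Airflow resolves any config key from an environment variable named AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt;, so the executor choice and parallelism limits can be set before the services start. A sketch with illustrative values:

```python
import os

# Set before launching the scheduler/workers; values are illustrative.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"
os.environ["AIRFLOW__CORE__PARALLELISM"] = "32"           # cap across the whole deployment
os.environ["AIRFLOW__CELERY__WORKER_CONCURRENCY"] = "16"  # cap per Celery worker
```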
5. Task Execution
TaskSDKRunner (or the legacy task execution path) loads operator code, executes the task logic, handles XCom data exchange, and reports state changes back to the database
Config: task_execution_timeout, xcom_backend, enable_xcom_pickling
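XCom exchange is easiest to see through the TaskFlow API, where return values and arguments are the exchange mechanism. A sketch in the Airflow 2.x style (names and payload are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def xcom_demo():
    @task
    def produce() -> dict:
        return {"rows": 1234}    # return value is serialized into XCom

    @task
    def consume(payload: dict) -> None:
        print(payload["rows"])   # resolved from XCom at runtime

    consume(produce())           # data flow also implies produce >> consume


xcom_demo()
```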
6. State Synchronization
Executors poll task workers for completion, update TaskInstance states in the metadata database, and trigger downstream task evaluation
Config: job_heartbeat_sec, scheduler_heartbeat_sec
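Both intervals can be read back the same way as the parsing settings. A sketch (the [scheduler] section and 5-second fallbacks match common defaults but may vary by version):

```python
from airflow.configuration import conf

job_hb = conf.getfloat("scheduler", "job_heartbeat_sec", fallback=5)
scheduler_hb = conf.getfloat("scheduler", "scheduler_heartbeat_sec", fallback=5)
```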
System Dynamics
Beyond the pipeline, airflow has runtime behaviors that shape how it responds to load, failures, and configuration changes.
Data Pools
Metadata Database
SQLAlchemy-managed database storing DAGs, task instances, dag runs, connections, variables, and execution history
Type: database
DagBag Cache
In-memory cache of parsed DAG objects with lazy loading and refresh mechanisms
Type: in-memory
Executor Task Queue
Internal queue of tasks waiting for execution by worker processes
Type: queue
XCom Storage
Cross-communication storage for data exchange between tasks with pluggable backends
Type: database
Feedback Loops
Scheduler Heartbeat Loop
Trigger: Timer-based interval → SchedulerJobRunner evaluates schedules, creates tasks, and processes state changes (exits when: Scheduler shutdown signal)
Type: polling
Task Retry Loop
Trigger: Task failure with retry configuration → TaskInstance state is reset to queued and the task is re-dispatched to the executor with exponential backoff (exits when: Max retries reached or task success; see the sketch after this list)
Type: retry
DAG File Processing Loop
Trigger: File system changes or periodic scan → DagFileProcessorManager rescans the DAG directory and reparses modified files (exits when: Process shutdown)
Type: polling
Executor Heartbeat Loop
Trigger: Configured heartbeat interval → BaseExecutor checks worker status, updates task states, and dispatches new tasks (exits when: Executor shutdown)
Type: polling
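The retry loop is driven by per-task settings. A sketch of the relevant knobs on an operator (dag_id, command, and timings are illustrative):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1), schedule=None):
    flaky = BashOperator(
        task_id="flaky_api_call",
        bash_command="curl --fail https://example.com/health",
        retries=3,                              # loop exits after 3 failed retries
        retry_delay=timedelta(minutes=1),       # base wait between attempts
        retry_exponential_backoff=True,         # double the wait on each retry
        max_retry_delay=timedelta(minutes=10),  # ceiling on the backoff
    )
```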
Control Points
Executor Selection
Parallelism Limits
DAG Concurrency
Scheduler Heartbeat
Task Retry Behavior
Delays
DAG File Parsing
Duration: Configurable timeout per file
Schedule Lag
Duration: Up to scheduler heartbeat interval
Task Queuing
Duration: Depends on worker availability
Database Lock Contention
Duration: Variable based on load
Technology Choices
airflow is built with 8 key technologies. Each serves a specific role in the system.
Key Components
- SchedulerJobRunner (orchestrator): Main scheduler daemon that discovers DAGs, evaluates schedules, creates task instances, and dispatches them to executors
- DagFileProcessorManager (processor): Manages subprocess-based DAG file parsing, handles DAG discovery and serialization with multiprocessing
- BaseExecutor (dispatcher): Abstract base for pluggable task execution backends that queue tasks and manage worker processes
- TaskSDKRunner (executor): Standalone task execution runtime that runs individual tasks with proper isolation and state reporting
- FastAPIApp (gateway): Main web application providing REST API endpoints and mounting UI components for external interaction
- BaseOperator (factory): Base class for all task operators that defines the task execution interface and dependency management (see the subclass sketch after this list)
- BaseHook (adapter): Abstract interface for external system connections used by operators to interact with databases, APIs, and services
- DagBag (registry): Container that holds parsed DAG objects in memory with lazy loading and error handling
- TaskInstanceKeyType (transformer): Serializable identifier for task instances used in executor queues and cross-process communication
- XComManager (store): Cross-communication system for passing data between tasks with pluggable backends
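Most of these are extension points. A hedged sketch of the most common one, subclassing BaseOperator (the class name and behavior are illustrative, not part of Airflow):

```python
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Illustrative custom operator; not part of Airflow itself."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() runs on the worker; its return value is stored as an XCom
        self.log.info("Hello, %s", self.name)
        return self.name
```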
Who Should Read This
Data engineers evaluating orchestration tools, or anyone inheriting an Airflow codebase and trying to understand how the pieces fit together.
This analysis was generated by CodeSea from the apache/airflow source code. For the full interactive visualization — including pipeline graph, architecture diagram, and system behavior map — see the complete analysis.
Explore Further
Full Analysis
Interactive architecture map for airflow
How Prefect Works (Data Pipelines)
How dbt Works (Data Pipelines)
Frequently Asked Questions
What is airflow?
airflow orchestrates distributed DAG execution with pluggable task scheduling and monitoring.
How does airflow's pipeline work?
airflow processes data through 6 stages: DAG Discovery and Parsing, Schedule Evaluation, Task Instance Creation, Task Dispatching, Task Execution, and State Synchronization. DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.
What tech stack does airflow use?
airflow is built with FastAPI (Modern web framework providing REST API endpoints and automatic OpenAPI documentation), SQLAlchemy (ORM layer managing metadata database operations and schema migrations), Celery (Distributed task queue system for scaling task execution across multiple workers), Kubernetes (Container orchestration platform for running tasks as isolated pods), Pydantic (Data validation and serialization for API request/response models), and 3 more technologies.
How does airflow handle errors and scaling?
airflow uses 4 feedback loops, 5 control points, and 4 data pools to manage its runtime behavior. These mechanisms handle error recovery, load distribution, and configuration changes.