apache/airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

45,091 stars · Python · 10 components

Orchestrates distributed DAG execution with pluggable task scheduling and monitoring

DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.

Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 10-component repository; 8,016 files analyzed. No cross-component connections were detected — components operate mostly in isolation.

How Data Flows Through the System

  1. DAG Discovery and Parsing — DagFileProcessorManager scans DAG directories, parses Python files in separate processes, validates DAG structure, and serializes valid DAGs to database (config: dags_folder, dag_file_processor_timeout, max_active_tasks_per_dag)
  2. Schedule Evaluation — SchedulerJobRunner examines DAG schedules against current time, creates DagRun instances for eligible executions, respecting catchup and max_active_runs settings [DAG → DagRun] (config: schedule_interval, catchup, max_active_runs_per_dag)
  3. Task Instance Creation — For each DagRun, scheduler creates TaskInstance records for all tasks, evaluates dependencies, and marks eligible tasks as queued [DagRun → TaskInstance] (config: depends_on_past, wait_for_downstream, task_concurrency)
  4. Task Dispatching — BaseExecutor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) receive queued TaskInstances and dispatch them to available workers [TaskInstance → TaskInstanceKeyType] (config: executor, parallelism, worker_concurrency)
  5. Task Execution — TaskSDKRunner or legacy task execution loads operator code, executes task logic, handles XCom data exchange, and reports state changes back to database [TaskInstanceKeyType → Task execution results] (config: task_execution_timeout, xcom_backend, enable_xcom_pickling)
  6. State Synchronization — Executors poll task workers for completion, update TaskInstance states in metadata database, trigger downstream task evaluation [Task execution results → TaskInstance] (config: job_heartbeat_sec, scheduler_heartbeat_sec)
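The schedule-evaluation step above (stage 2) can be sketched in plain Python. This is an illustrative model, not Airflow's actual implementation: the `Dag`, `DagRun`, `due_logical_dates`, and `create_runs` names are hypothetical, the schedule is a fixed `timedelta` rather than a real timetable, and only the `catchup` and `max_active_runs` behaviors are modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class DagRun:
    dag_id: str
    logical_date: datetime
    state: str = "running"

@dataclass
class Dag:
    dag_id: str
    schedule: timedelta          # stand-in for a real timetable
    start_date: datetime
    catchup: bool = True
    max_active_runs: int = 16
    runs: list = field(default_factory=list)

def due_logical_dates(dag: Dag, now: datetime) -> list:
    """Logical dates whose data interval [date, date + schedule) has fully elapsed."""
    dates = []
    d = dag.start_date
    while d + dag.schedule <= now:
        dates.append(d)
        d += dag.schedule
    if not dag.catchup and dates:
        dates = dates[-1:]       # catchup=False: skip the backlog, keep only the latest
    return dates

def create_runs(dag: Dag, now: datetime) -> list:
    """Create DagRuns for due intervals, capped by max_active_runs."""
    existing = {r.logical_date for r in dag.runs}
    active = sum(r.state == "running" for r in dag.runs)
    created = []
    for d in due_logical_dates(dag, now):
        if d in existing:
            continue
        if active >= dag.max_active_runs:
            break                # back-pressure: wait for running DagRuns to finish
        run = DagRun(dag.dag_id, d)
        dag.runs.append(run)
        created.append(run)
        active += 1
    return created
```

With `catchup=True` the backlog is created oldest-first up to the cap; with `catchup=False` only the most recent completed interval gets a run, which matches the semantics the stage list describes.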

Data Models

The data structures that flow between stages — the contracts that hold the system together.

DAG (airflow-core/src/airflow/models/dag.py)
Python class with dag_id: str, schedule: ScheduleInterval, start_date: datetime, tasks: Dict[str, BaseOperator], dependencies: Dict[str, List[str]]
Created from Python files during DAG parsing, stored in metadata database, used by scheduler for task creation
TaskInstance (airflow-core/src/airflow/models/taskinstance.py)
SQLAlchemy model with task_id: str, dag_id: str, run_id: str, state: TaskInstanceState, start_date: datetime, end_date: datetime, executor_config: Dict
Generated by scheduler from DAG tasks, queued for execution, updated during task run lifecycle
DagRun (airflow-core/src/airflow/models/dagrun.py)
SQLAlchemy model with dag_id: str, run_id: str, execution_date: datetime, state: DagRunState, conf: Dict, data_interval_start/end: datetime
Created by scheduler for each DAG execution, tracks overall run state, completed when all tasks finish
Connection (airflow-core/src/airflow/models/connection.py)
SQLAlchemy model with conn_id: str, conn_type: str, host: str, port: int, login: str, password: str, extra: Dict
Stored in metadata database or secrets backend, retrieved by hooks during task execution
Variable (airflow-core/src/airflow/models/variable.py)
SQLAlchemy model with key: str, val: str, description: str, is_encrypted: bool
Stored in metadata database with optional encryption, accessed during DAG parsing and task execution
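The TaskInstance lifecycle implied by these models can be sketched as a small state machine. A hedged illustration: the field names mirror the listing above, but the `mark` helper and the reduced state set are invented for this sketch; Airflow's real TaskInstanceState has many more states.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class TaskInstanceState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"

@dataclass
class TaskInstance:
    task_id: str
    dag_id: str
    run_id: str
    state: TaskInstanceState = TaskInstanceState.QUEUED
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None

def mark(ti: TaskInstance, state: TaskInstanceState, now: datetime) -> None:
    """Record a state change, stamping start/end dates as the lifecycle advances."""
    if state is TaskInstanceState.RUNNING and ti.start_date is None:
        ti.start_date = now
    if state in (TaskInstanceState.SUCCESS, TaskInstanceState.FAILED):
        ti.end_date = now
    ti.state = state
```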

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment unguarded

Configuration value 'api.base_url' is a valid URL path that can be safely appended with '/' and parsed with urlsplit() without validation

If this fails: Invalid URL configurations could cause routing failures, security issues with malformed paths, or crashes during URL parsing

airflow-core/src/airflow/api_fastapi/app.py:API_BASE_URL
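
A guard for this assumption could look like the following sketch. `validated_base_path` is a hypothetical helper, not code from app.py; it only shows the kind of validation the analysis says is missing, using `urllib.parse.urlsplit` from the standard library.

```python
from urllib.parse import urlsplit

def validated_base_path(base_url: str) -> str:
    """Normalize a configured base URL into a path safe to mount routes under.

    Rejects relative paths instead of assuming the config value is well-formed.
    """
    path = urlsplit(base_url).path or "/"
    if not path.startswith("/"):
        raise ValueError(f"api.base_url path must be absolute, got {base_url!r}")
    return path.rstrip("/") + "/"
```
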
critical Temporal unguarded

BaseAuthManager instance remains valid throughout the application lifetime once set in the global _AuthManagerState

If this fails: Stale auth manager references could lead to authentication bypass or crashes if the manager becomes invalid during runtime

airflow-core/src/airflow/api_fastapi/app.py:_AuthManagerState
critical Contract unguarded

All mounted FastAPI apps under app.routes have properly functioning lifespan_context managers that don't raise exceptions

If this fails: Exception in any mounted app's lifespan context causes the entire application startup to fail, making Airflow unavailable

airflow-core/src/airflow/api_fastapi/app.py:lifespan
warning Environment weakly guarded

Global ChakraUISystem object exists and is properly initialized by Airflow Core UI before plugin components are rendered

If this fails: Plugin falls back to localSystem but may have inconsistent theming or crash if localSystem is also unavailable

dev/react-plugin-tools/react_plugin_template/src/main.tsx:globalThis.ChakraUISystem
warning Ordering unguarded

URL prefix reservation is enforced consistently across all app mounting operations without race conditions

If this fails: Conflicting URL prefixes could cause routing collisions, overridden endpoints, or security bypasses if auth/api routes are shadowed

airflow-core/src/airflow/api_fastapi/app.py:RESERVED_URL_PREFIXES
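
One way to make this guard explicit is a registry that fails fast on collisions. The `PrefixRegistry` class below is invented for illustration (Airflow's actual mechanism in app.py may differ), and its `threading.Lock` only protects a single process, which is exactly the multi-process caveat raised later in this list.

```python
import threading

class PrefixRegistry:
    """Reserve URL prefixes before mounting sub-apps so collisions fail fast."""

    def __init__(self, reserved=("/api", "/auth")):
        self._taken = set(reserved)
        self._lock = threading.Lock()  # guards one process only

    def reserve(self, prefix: str) -> None:
        norm = "/" + prefix.strip("/")
        with self._lock:
            # reject exact matches and nesting in either direction
            if any(norm == t or norm.startswith(t + "/") or t.startswith(norm + "/")
                   for t in self._taken):
                raise ValueError(f"prefix {norm!r} collides with a reserved prefix")
            self._taken.add(norm)
```
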
warning Resource weakly guarded

Single threading lock is sufficient for all auth manager state mutations across potentially multiple worker processes

If this fails: In multi-process deployments, each process gets its own lock, leading to inconsistent auth manager state across workers

airflow-core/src/airflow/api_fastapi/app.py:threading.Lock()
warning Contract unguarded

DagBag creation succeeds and returns a valid instance that can be used throughout the API lifecycle

If this fails: Failed DagBag creation could cause DAG-related API endpoints to return errors or incorrect data without clear error indication

airflow-core/src/airflow/api_fastapi/app.py:create_dag_bag
warning Environment unguarded

Provider configuration loading completes successfully and all required providers are available before API initialization

If this fails: Missing or failed provider loading could cause runtime errors when API endpoints try to use provider-specific functionality

airflow-core/src/airflow/api_fastapi/app.py:providers_configuration_loaded
warning Domain weakly guarded

Cookie path derived from API_ROOT_PATH is a valid HTTP cookie path that browsers will accept and scope correctly

If this fails: Invalid cookie paths could prevent session management, authentication tokens from working, or cause security issues with cookie scoping

airflow-core/src/airflow/api_fastapi/app.py:get_cookie_path
info Contract unguarded

ChakraProvider initialization always succeeds with the provided system configuration regardless of its structure or completeness

If this fails: Malformed or incomplete Chakra system configs could cause React rendering failures, breaking the entire plugin UI

dev/react-plugin-tools/react_plugin_template/src/main.tsx:PluginComponent

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Metadata Database (database)
SQLAlchemy-managed database storing DAGs, task instances, dag runs, connections, variables, and execution history
DagBag Cache (in-memory)
In-memory cache of parsed DAG objects with lazy loading and refresh mechanisms
Executor Task Queue (queue)
Internal queue of tasks waiting for execution by worker processes
XCom Storage (database)
Cross-communication storage for data exchange between tasks with pluggable backends
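The XCom pool can be illustrated with a toy dict-backed backend. The class and method names here are invented for the sketch; real XCom backends are pluggable via the xcom_backend setting mentioned in the data-flow stages and persist values to the metadata database rather than process memory.

```python
from typing import Any

class InMemoryXCom:
    """Toy XCom backend: values keyed by (dag_id, run_id, task_id, key)."""

    def __init__(self):
        self._store: dict = {}

    def push(self, dag_id, run_id, task_id, key, value) -> None:
        self._store[(dag_id, run_id, task_id, key)] = value

    def pull(self, dag_id, run_id, task_id, key="return_value", default=None) -> Any:
        return self._store.get((dag_id, run_id, task_id, key), default)
```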

Technology Stack

FastAPI (framework)
Modern web framework providing REST API endpoints and automatic OpenAPI documentation
SQLAlchemy (database)
ORM layer managing metadata database operations and schema migrations
Celery (runtime)
Distributed task queue system for scaling task execution across multiple workers
Kubernetes (infra)
Container orchestration platform for running tasks as isolated pods
Pydantic (serialization)
Data validation and serialization for API request/response models
Alembic (database)
Database migration tool for managing schema version changes
Rich/Click (library)
Enhanced CLI interfaces with better formatting and user experience
Pendulum (library)
Timezone-aware datetime handling for schedule calculations

Package Structure

airflow-core (app)
Core orchestration engine with scheduler, executor, webserver, and DAG processing capabilities
airflow-ctl (tooling)
Command-line management tool for controlling Airflow deployments
task-sdk (library)
Task execution SDK for writing and running Airflow tasks
providers (library)
Extensible provider ecosystem with 70+ integrations for external systems
dev (tooling)
Development tools including Breeze CLI for local testing and build automation
shared (shared)
Common utilities shared across packages for configuration, logging, and serialization

Frequently Asked Questions

What is airflow used for?

Apache Airflow orchestrates distributed DAG execution with pluggable task scheduling and monitoring. apache/airflow is a 10-component repository written in Python, with 8,016 files analyzed; its components are minimally connected and operate mostly in isolation.

How is airflow architected?

airflow is organized into 5 architecture layers: Control Plane, Orchestration Core, Task Execution, Provider Ecosystem, and 1 more. Minimal connections — components operate mostly in isolation. This layered structure keeps concerns separated and modules independent.

How does data flow through airflow?

Data moves through 6 stages: DAG Discovery and Parsing → Schedule Evaluation → Task Instance Creation → Task Dispatching → Task Execution → State Synchronization. DAG files are parsed into DAG objects, the scheduler creates DagRuns and TaskInstances, executors dispatch work to workers, and task state flows back through the metadata database. This pipeline design reflects a multi-stage processing system.

What technologies does airflow use?

The core stack includes FastAPI (modern web framework providing REST API endpoints and automatic OpenAPI documentation), SQLAlchemy (ORM layer managing metadata database operations and schema migrations), Celery (distributed task queue system for scaling task execution across multiple workers), Kubernetes (container orchestration platform for running tasks as isolated pods), Pydantic (data validation and serialization for API request/response models), Alembic (database migration tool for managing schema version changes), Rich/Click (enhanced CLI interfaces), and Pendulum (timezone-aware datetime handling). A focused set of dependencies that keeps the build manageable.

What system dynamics does airflow have?

airflow exhibits 4 data pools (Metadata Database, DagBag Cache, Executor Task Queue, XCom Storage), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle polling and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does airflow use?

5 design patterns detected: Plugin Architecture, Multiprocessing DAG Parsing, Pluggable Executors, XCom Data Exchange, SQLAlchemy ORM Layer.
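The Pluggable Executors pattern, for instance, can be reduced to a small base-class sketch. Names here are hypothetical; Airflow's real BaseExecutor has a richer interface (heartbeats, parallelism slots, state callbacks), but the basic shape is similar: the scheduler queues task keys and a subclass decides where the work actually runs.

```python
from abc import ABC, abstractmethod
from collections import deque

class BaseExecutor(ABC):
    """Scheduler-facing interface: queue work, flush it on each heartbeat."""

    def __init__(self):
        self.queued: deque = deque()

    def queue_task(self, key) -> None:
        self.queued.append(key)

    def heartbeat(self) -> None:
        while self.queued:
            self.execute(self.queued.popleft())

    @abstractmethod
    def execute(self, key) -> None:
        """Subclasses dispatch to a subprocess, Celery worker, or K8s pod."""

class LocalExecutor(BaseExecutor):
    def __init__(self):
        super().__init__()
        self.completed = []

    def execute(self, key) -> None:
        # a real executor would spawn a worker; here we just record the key
        self.completed.append(key)
```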

Analyzed on April 19, 2026 by CodeSea.