apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Orchestrates distributed DAG execution with pluggable task scheduling and monitoring
DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records.
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 10-component repository; 8016 files analyzed. Minimal cross-component connections were detected, so components operate mostly in isolation.
How Data Flows Through the System
- DAG Discovery and Parsing — DagFileProcessorManager scans DAG directories, parses Python files in separate processes, validates DAG structure, and serializes valid DAGs to database (config: dags_folder, dag_file_processor_timeout, max_active_tasks_per_dag)
- Schedule Evaluation — SchedulerJobRunner examines DAG schedules against current time, creates DagRun instances for eligible executions, respecting catchup and max_active_runs settings [DAG → DagRun] (config: schedule_interval, catchup, max_active_runs_per_dag)
- Task Instance Creation — For each DagRun, scheduler creates TaskInstance records for all tasks, evaluates dependencies, and marks eligible tasks as queued [DagRun → TaskInstance] (config: depends_on_past, wait_for_downstream, task_concurrency)
- Task Dispatching — BaseExecutor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) receive queued TaskInstances and dispatch them to available workers [TaskInstance → TaskInstanceKeyType] (config: executor, parallelism, worker_concurrency)
- Task Execution — TaskSDKRunner or legacy task execution loads operator code, executes task logic, handles XCom data exchange, and reports state changes back to database [TaskInstanceKeyType → Task execution results] (config: task_execution_timeout, xcom_backend, enable_xcom_pickling)
- State Synchronization — Executors poll task workers for completion, update TaskInstance states in metadata database, trigger downstream task evaluation [Task execution results → TaskInstance] (config: job_heartbeat_sec, scheduler_heartbeat_sec)
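The Schedule Evaluation step can be sketched in plain Python. This is a simplified model, not the SchedulerJobRunner implementation: real scheduling goes through timetables and the metadata database, and `pending_run_dates` and its parameters are illustrative names only.

```python
from datetime import datetime, timedelta
from typing import List, Optional

def pending_run_dates(last_run: Optional[datetime], start_date: datetime,
                      interval: timedelta, now: datetime,
                      catchup: bool, max_active_runs: int) -> List[datetime]:
    """Return logical dates for which new DagRuns should be created."""
    # First candidate: the DAG's start_date, or one interval past the last run.
    nxt = start_date if last_run is None else last_run + interval
    due = []
    while nxt + interval <= now:  # the data interval must have fully elapsed
        due.append(nxt)
        nxt += interval
    if not catchup and due:
        due = due[-1:]            # catchup=False: only the most recent missed run
    return due[:max_active_runs]  # respect max_active_runs_per_dag

runs = pending_run_dates(None, datetime(2026, 1, 1), timedelta(days=1),
                         datetime(2026, 1, 4), catchup=True, max_active_runs=16)
print([d.day for d in runs])  # [1, 2, 3]
```

With catchup enabled, every elapsed interval since start_date gets a run; with catchup disabled, only the latest one does.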
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- airflow-core/src/airflow/models/dag.py — Python class with dag_id: str, schedule: ScheduleInterval, start_date: datetime, tasks: Dict[str, BaseOperator], dependencies: Dict[str, List[str]]. Created from Python files during DAG parsing, stored in the metadata database, used by the scheduler for task creation.
- airflow-core/src/airflow/models/taskinstance.py — SQLAlchemy model with task_id: str, dag_id: str, run_id: str, state: TaskInstanceState, start_date: datetime, end_date: datetime, executor_config: Dict. Generated by the scheduler from DAG tasks, queued for execution, updated during the task run lifecycle.
- airflow-core/src/airflow/models/dagrun.py — SQLAlchemy model with dag_id: str, run_id: str, execution_date: datetime, state: DagRunState, conf: Dict, data_interval_start/end: datetime. Created by the scheduler for each DAG execution, tracks overall run state, completed when all tasks finish.
- airflow-core/src/airflow/models/connection.py — SQLAlchemy model with conn_id: str, conn_type: str, host: str, port: int, login: str, password: str, extra: Dict. Stored in the metadata database or a secrets backend, retrieved by hooks during task execution.
- airflow-core/src/airflow/models/variable.py — SQLAlchemy model with key: str, val: str, description: str, is_encrypted: bool. Stored in the metadata database with optional encryption, accessed during DAG parsing and task execution.
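The shape these contracts share can be sketched with a hypothetical dataclass (the real models are SQLAlchemy-mapped classes with many more columns; `TaskInstanceSketch` is an illustrative name, not an Airflow class):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class TaskInstanceState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"

@dataclass
class TaskInstanceSketch:
    """Subset of the columns on the TaskInstance model described above."""
    task_id: str
    dag_id: str
    run_id: str
    state: TaskInstanceState = TaskInstanceState.QUEUED
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None

ti = TaskInstanceSketch(task_id="extract", dag_id="etl", run_id="manual__2026-01-01")
print(ti.state.value)  # queued
```

The (dag_id, task_id, run_id) triple is what makes a task instance addressable across the scheduler, executor queues, and workers.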
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- Configuration value 'api.base_url' is a valid URL path that can be safely appended with '/' and parsed with urlsplit() without validation (airflow-core/src/airflow/api_fastapi/app.py:API_BASE_URL). If this fails: invalid URL configurations could cause routing failures, security issues with malformed paths, or crashes during URL parsing.
- BaseAuthManager instance remains valid throughout the application lifetime once set in the global _AuthManagerState (airflow-core/src/airflow/api_fastapi/app.py:_AuthManagerState). If this fails: stale auth manager references could lead to authentication bypass or crashes if the manager becomes invalid during runtime.
- All mounted FastAPI apps under app.routes have properly functioning lifespan_context managers that don't raise exceptions (airflow-core/src/airflow/api_fastapi/app.py:lifespan). If this fails: an exception in any mounted app's lifespan context causes the entire application startup to fail, making Airflow unavailable.
- Global ChakraUISystem object exists and is properly initialized by Airflow Core UI before plugin components are rendered (dev/react-plugin-tools/react_plugin_template/src/main.tsx:globalThis.ChakraUISystem). If this fails: the plugin falls back to localSystem but may have inconsistent theming, or crash if localSystem is also unavailable.
- URL prefix reservation is enforced consistently across all app mounting operations without race conditions (airflow-core/src/airflow/api_fastapi/app.py:RESERVED_URL_PREFIXES). If this fails: conflicting URL prefixes could cause routing collisions, overridden endpoints, or security bypasses if auth/api routes are shadowed.
- A single threading lock is sufficient for all auth manager state mutations across potentially multiple worker processes (airflow-core/src/airflow/api_fastapi/app.py:threading.Lock()). If this fails: in multi-process deployments, each process gets its own lock, leading to inconsistent auth manager state across workers.
- DagBag creation succeeds and returns a valid instance that can be used throughout the API lifecycle (airflow-core/src/airflow/api_fastapi/app.py:create_dag_bag). If this fails: failed DagBag creation could cause DAG-related API endpoints to return errors or incorrect data without clear error indication.
- Provider configuration loading completes successfully and all required providers are available before API initialization (airflow-core/src/airflow/api_fastapi/app.py:providers_configuration_loaded). If this fails: missing or failed provider loading could cause runtime errors when API endpoints try to use provider-specific functionality.
- Cookie path derived from API_ROOT_PATH is a valid HTTP cookie path that browsers will accept and scope correctly (airflow-core/src/airflow/api_fastapi/app.py:get_cookie_path). If this fails: invalid cookie paths could prevent session management and authentication tokens from working, or cause security issues with cookie scoping.
- ChakraProvider initialization always succeeds with the provided system configuration regardless of its structure or completeness (dev/react-plugin-tools/react_plugin_template/src/main.tsx:PluginComponent). If this fails: malformed or incomplete Chakra system configs could cause React rendering failures, breaking the entire plugin UI.
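The first assumption is the kind that is cheap to check explicitly. A hypothetical guard using the same urlsplit() call (`validate_base_url` is not part of Airflow; it only illustrates what validating the config value might look like):

```python
from urllib.parse import urlsplit

def validate_base_url(base_url: str) -> str:
    """Reject base URLs whose path would break URL joining or cookie scoping."""
    path = urlsplit(base_url).path or "/"
    if not path.startswith("/"):
        raise ValueError(f"base_url path must start with '/': {base_url!r}")
    if "//" in path:
        raise ValueError(f"base_url path has empty segments: {base_url!r}")
    # Normalize away the trailing slash so later code can append '/' safely.
    return path.rstrip("/") or "/"

print(validate_base_url("http://localhost:8080/airflow/"))  # /airflow
```

Failing fast at startup turns a silent routing or cookie-scoping bug into an immediate, attributable configuration error.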
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Metadata Database — SQLAlchemy-managed database storing DAGs, task instances, dag runs, connections, variables, and execution history
- DagBag Cache — In-memory cache of parsed DAG objects with lazy loading and refresh mechanisms
- Task Queue — Internal queue of tasks waiting for execution by worker processes
- XCom Storage — Cross-communication storage for data exchange between tasks with pluggable backends
Feedback Loops
- Scheduler Heartbeat Loop (polling, balancing) — Trigger: Timer-based interval. Action: SchedulerJobRunner evaluates schedules, creates tasks, processes state changes. Exit: Scheduler shutdown signal.
- Task Retry Loop (retry, balancing) — Trigger: Task failure with retry configuration. Action: TaskInstance state reset to queued, re-dispatched to executor with exponential backoff. Exit: Max retries reached or task success.
- DAG File Processing Loop (polling, balancing) — Trigger: File system changes or periodic scan. Action: DagFileProcessorManager rescans DAG directory, reparses modified files. Exit: Process shutdown.
- Executor Heartbeat Loop (polling, balancing) — Trigger: Configured heartbeat interval. Action: BaseExecutor checks worker status, updates task states, dispatches new tasks. Exit: Executor shutdown.
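The Task Retry Loop above can be sketched as a plain-Python backoff loop. This is illustrative only: Airflow persists try_number on the TaskInstance and re-queues through the executor rather than looping in-process, and `run_with_retries` is a made-up name.

```python
import time

def run_with_retries(task, max_retries: int = 3, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Re-run a failing callable with exponential backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise                         # exit: max retries reached
            sleep(base_delay * 2 ** attempt)  # backoff: 1s, 2s, 4s, ...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "success"

# sleep is injected so the example runs instantly.
print(run_with_retries(flaky, sleep=lambda s: None))  # success
```

The injected `sleep` also makes the loop unit-testable, which is the same reason schedulers externalize their clocks.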
Delays
- DAG File Parsing (compilation, ~Configurable timeout per file) — New DAG changes not visible until parsing completes
- Schedule Lag (eventual-consistency, ~Up to scheduler heartbeat interval) — Tasks may not start immediately when dependencies are met
- Task Queuing (queue-drain, ~Depends on worker availability) — Tasks wait in queue when all workers are busy
- Database Lock Contention (eventual-consistency, ~Variable based on load) — High concurrency scenarios may experience scheduling delays
Control Points
- Executor Selection (architecture-switch) — Controls: Task execution backend (Local, Celery, Kubernetes, etc.). Default: LocalExecutor
- Parallelism Limits (threshold) — Controls: Maximum concurrent task instances across entire Airflow deployment. Default: 32
- DAG Concurrency (rate-limit) — Controls: Maximum concurrent task instances per DAG. Default: 16
- Scheduler Heartbeat (runtime-toggle) — Controls: Frequency of schedule evaluation cycles. Default: 5 seconds
- Task Retry Behavior (hyperparameter) — Controls: Default retry count and delay patterns for failed tasks. Default: 0 retries
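The two concurrency thresholds above (deployment-wide parallelism and per-DAG concurrency) can be modeled as a toy slot pool. This is a sketch of the policy, not Airflow's implementation, which enforces these limits via metadata database queries; `SlotPool` is an illustrative name.

```python
from typing import Dict

class SlotPool:
    """Toy model of two-level concurrency limits: global and per-DAG."""

    def __init__(self, parallelism: int = 32, max_per_dag: int = 16):
        self.parallelism = parallelism          # deployment-wide cap
        self.max_per_dag = max_per_dag          # per-DAG cap
        self.running_total = 0
        self.running_per_dag: Dict[str, int] = {}

    def try_acquire(self, dag_id: str) -> bool:
        """Grant a slot only if both limits allow another running task."""
        if self.running_total >= self.parallelism:
            return False
        if self.running_per_dag.get(dag_id, 0) >= self.max_per_dag:
            return False
        self.running_total += 1
        self.running_per_dag[dag_id] = self.running_per_dag.get(dag_id, 0) + 1
        return True

pool = SlotPool(parallelism=2, max_per_dag=1)
print(pool.try_acquire("etl"), pool.try_acquire("etl"), pool.try_acquire("ml"))
# True False True
```

A task must clear both gates, which is why raising per-DAG concurrency alone does nothing once the global parallelism cap is saturated.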
Technology Stack
- FastAPI — Modern web framework providing REST API endpoints and automatic OpenAPI documentation
- SQLAlchemy — ORM layer managing metadata database operations and schema migrations
- Celery — Distributed task queue system for scaling task execution across multiple workers
- Kubernetes — Container orchestration platform for running tasks as isolated pods
- Pydantic — Data validation and serialization for API request/response models
- Alembic — Database migration tool for managing schema version changes
- Rich — Enhanced CLI interfaces with better formatting and user experience
- Pendulum — Timezone-aware datetime handling for schedule calculations
Key Components
- SchedulerJobRunner (orchestrator) — Main scheduler daemon that discovers DAGs, evaluates schedules, creates task instances, and dispatches them to executors (airflow-core/src/airflow/jobs/scheduler_job_runner.py)
- DagFileProcessorManager (processor) — Manages subprocess-based DAG file parsing, handles DAG discovery and serialization with multiprocessing (airflow-core/src/airflow/dag_processing/manager.py)
- BaseExecutor (dispatcher) — Abstract base for pluggable task execution backends that queue tasks and manage worker processes (airflow-core/src/airflow/executors/base_executor.py)
- TaskSDKRunner (executor) — Standalone task execution runtime that runs individual tasks with proper isolation and state reporting (task-sdk/src/airflow/task_sdk/runner.py)
- FastAPIApp (gateway) — Main web application providing REST API endpoints and mounting UI components for external interaction (airflow-core/src/airflow/api_fastapi/app.py)
- BaseOperator (factory) — Base class for all task operators that defines task execution interface and dependency management (airflow-core/src/airflow/models/baseoperator.py)
- BaseHook (adapter) — Abstract interface for external system connections used by operators to interact with databases, APIs, and services (airflow-core/src/airflow/hooks/base.py)
- DagBag (registry) — Container that holds parsed DAG objects in memory with lazy loading and error handling (airflow-core/src/airflow/models/dagbag.py)
- TaskInstanceKeyType (transformer) — Serializable identifier for task instances used in executor queues and cross-process communication (airflow-core/src/airflow/models/taskinstance.py)
- XComManager (store) — Cross-communication system for passing data between tasks with pluggable backends (airflow-core/src/airflow/models/xcom.py)
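The pluggable-backend idea behind XCom can be sketched with a minimal in-memory store. This is a hypothetical class for illustration; the real XCom in airflow-core/src/airflow/models/xcom.py is a SQLAlchemy model that serializes values into the metadata database.

```python
import json
from typing import Dict, Optional, Tuple

class InMemoryXComBackend:
    """Toy XCom store: values keyed by (dag_id, run_id, task_id, key)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str, str, str], str] = {}

    def push(self, dag_id: str, run_id: str, task_id: str,
             key: str, value) -> None:
        # Serialize on write, mirroring how XCom backends serialize values.
        self._store[(dag_id, run_id, task_id, key)] = json.dumps(value)

    def pull(self, dag_id: str, run_id: str, task_id: str,
             key: str = "return_value") -> Optional[object]:
        raw = self._store.get((dag_id, run_id, task_id, key))
        return None if raw is None else json.loads(raw)

xcom = InMemoryXComBackend()
xcom.push("etl", "run_1", "extract", "return_value", {"rows": 42})
print(xcom.pull("etl", "run_1", "extract"))  # {'rows': 42}
```

Serializing at the boundary is what lets tasks on different workers, or in different pods, exchange data without sharing memory.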
Package Structure
- airflow-core — Core orchestration engine with scheduler, executor, webserver, and DAG processing capabilities
- Command-line management tool for controlling Airflow deployments
- task-sdk — Task execution SDK for writing and running Airflow tasks
- providers — Extensible provider ecosystem with 70+ integrations for external systems
- dev — Development tools including Breeze CLI for local testing and build automation
- Common utilities shared across packages for configuration, logging, and serialization
Frequently Asked Questions
What is airflow used for?
Airflow orchestrates distributed DAG execution with pluggable task scheduling and monitoring. apache/airflow is a 10-component repository written in Python; components operate mostly in isolation. The codebase contains 8016 files.
How is airflow architected?
airflow is organized into 5 architecture layers: Control Plane, Orchestration Core, Task Execution, Provider Ecosystem, and 1 more. Minimal connections — components operate mostly in isolation. This layered structure keeps concerns separated and modules independent.
How does data flow through airflow?
Data moves through 6 stages: DAG Discovery and Parsing → Schedule Evaluation → Task Instance Creation → Task Dispatching → Task Execution → State Synchronization. DAG files are parsed by DagFileProcessorManager into DAG objects stored in DagBag. SchedulerJobRunner evaluates schedules to create DagRuns and TaskInstances, which are queued to BaseExecutor implementations. Executors dispatch tasks to workers (local processes, Celery, Kubernetes pods) where TaskSDKRunner executes the actual task code. Task state changes flow back through the metadata database to update TaskInstance records. This pipeline design reflects a complex multi-stage processing system.
What technologies does airflow use?
The core stack includes FastAPI (Modern web framework providing REST API endpoints and automatic OpenAPI documentation), SQLAlchemy (ORM layer managing metadata database operations and schema migrations), Celery (Distributed task queue system for scaling task execution across multiple workers), Kubernetes (Container orchestration platform for running tasks as isolated pods), Pydantic (Data validation and serialization for API request/response models), Alembic (Database migration tool for managing schema version changes), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does airflow have?
airflow exhibits 4 data pools (Metadata Database, DagBag Cache), 4 feedback loops, 5 control points, 4 delays. The feedback loops handle polling and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does airflow use?
5 design patterns detected: Plugin Architecture, Multiprocessing DAG Parsing, Pluggable Executors, XCom Data Exchange, SQLAlchemy ORM Layer.
Analyzed on April 19, 2026 by CodeSea. Written by Karolina Sarna.