airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

21,107 stars · Python · 8 components

Orchestrates data movement from APIs, databases and files to warehouses and lakes

Data flows from CLI arguments through configuration parsing and validation into connector operations. The ConnectorCommandLinePropertySource transforms CLI args into Micronaut properties, factories create validated configuration and catalog objects, and the AirbyteConnectorRunnable executes the specific operation (check, discover, read, write) using the configured instances. State flows in for incremental syncs and flows out as checkpoints during execution.

Under the hood, the system uses 2 feedback loops, 2 data pools, and 3 control points to manage its runtime behavior.

An 8-component data pipeline. 5,156 files analyzed. Data flows through 5 distinct pipeline stages.

How Data Flows Through the System


  1. Parse CLI arguments into Micronaut properties — ConnectorCommandLinePropertySource extracts operation type and file paths from CLI args and converts to configuration properties like airbyte.connector.config.json
  2. Load and validate connector configuration — ConfigurationSpecificationSupplier parses JSON config, validates against JSON schema, and creates typed configuration POJOs
  3. Parse and validate configured catalog — ConfiguredCatalogFactory parses catalog JSON, validates stream names and sync modes, creates ConfiguredAirbyteCatalog with stream configurations
  4. Parse input state for incremental syncs — InputStateFactory transforms AirbyteStateMessage list into typed InputState (Empty, Global, or Stream-based) for resuming incremental operations [AirbyteStateMessage → InputState]
  5. Execute connector operation — AirbyteConnectorRunnable gets Operation instance from Micronaut, executes with error handling, and outputs protocol messages via OutputConsumer [ConfigurationSpecification]
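The first stage above can be sketched as a minimal, hypothetical driver. All names here (`parseCliArgs`, the property keys' exact shapes) are illustrative stand-ins, not the CDK's actual API:

```kotlin
// Hypothetical sketch of stage 1: CLI args -> Micronaut-style property map.
// A stand-in for ConnectorCommandLinePropertySource, not the real implementation.
fun parseCliArgs(args: List<String>): Map<String, String> {
    val props = mutableMapOf<String, String>()
    // Exactly one operation (check, discover, read, write) is assumed per invocation.
    val operation = args.firstOrNull { it in setOf("check", "discover", "read", "write") }
        ?: error("exactly one operation must be specified")
    props["airbyte.connector.operation"] = operation
    // Map file-path flags to configuration properties.
    args.zipWithNext().forEach { (flag, value) ->
        if (flag == "--config") props["airbyte.connector.config.json"] = value
        if (flag == "--catalog") props["airbyte.connector.catalog.json"] = value
        if (flag == "--state") props["airbyte.connector.state.json"] = value
    }
    return props
}

fun main() {
    val props = parseCliArgs(listOf("read", "--config", "config.json", "--state", "state.json"))
    println(props["airbyte.connector.operation"]) // read
}
```

Later stages would then read these properties to load configuration, catalog, and state before dispatching to the selected operation.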

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ConfiguredAirbyteCatalog airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/ConfiguredCatalogFactory.kt
Protocol object containing list of ConfiguredAirbyteStream with stream metadata, sync modes, and destination sync mode configurations
Parsed from JSON input during connector initialization, validated for required fields, passed to sync operations
InputState airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/InputState.kt
Sealed interface with EmptyInputState, GlobalInputState(global: JsonNode, globalStreams: Map<StreamIdentifier, JsonNode>), or StreamInputState(streams: Map<StreamIdentifier, JsonNode>)
Restored from persistent storage at sync start, updated during data extraction, persisted for next incremental run
StreamIdentifier airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/StreamIdentifier.kt
Data class with namespace: String? and name: String identifying a data stream uniquely across namespaces
Created from AirbyteStream or StreamDescriptor objects, used as map keys throughout sync operations
ConfigurationSpecification airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/ConfigurationSpecificationSupplier.kt
Generic connector configuration POJO with connector-specific fields like host, credentials, and batch sizes; serialized to/from JSON with JSON Schema validation
Loaded from JSON or Micronaut properties, validated against JSON schema, injected into connector components
AirbyteStateMessage airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/InputStateFactory.kt
Protocol message with type field and state data for global or per-stream incremental sync checkpointing
Parsed from JSON state input, transformed to InputState sealed types, emitted during sync execution
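Based on the signatures listed above, the state model can be sketched roughly as follows. This is a simplification: `JsonNode` is replaced with `String` to keep the example self-contained, and the real definitions live in the files cited above.

```kotlin
// Sketch of StreamIdentifier and the InputState sealed hierarchy described above.
// String stands in for Jackson's JsonNode to keep this self-contained.
data class StreamIdentifier(val namespace: String?, val name: String)

sealed interface InputState
object EmptyInputState : InputState
data class GlobalInputState(
    val global: String,
    val globalStreams: Map<StreamIdentifier, String>,
) : InputState
data class StreamInputState(
    val streams: Map<StreamIdentifier, String>,
) : InputState

fun main() {
    val id = StreamIdentifier(namespace = null, name = "users")
    val state: InputState = StreamInputState(mapOf(id to """{"cursor": 42}"""))
    // A sealed hierarchy lets `when` dispatch exhaustively over the variants.
    val label = when (state) {
        is EmptyInputState -> "empty"
        is GlobalInputState -> "global"
        is StreamInputState -> "stream"
    }
    println(label) // stream
}
```

The sealed interface is what makes the Empty/Global/Stream distinction a compile-time guarantee rather than a runtime convention.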

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Resource unguarded

metadata.yaml resource exists and is readable at startup time in classpath root

If this fails: ResourceNotFoundException causing immediate connector failure with no graceful degradation

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/MetadataYamlPropertySource.kt:loadFromResource
critical Shape weakly guarded

micronautPropertiesFallback object is serializable to valid JSON and matches the same schema as jsonPropertyValue

If this fails: ConfigErrorException with misleading error message when fallback serialization fails or schema mismatch causes validation errors

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/ConfigurationSpecificationSupplier.kt:get
critical Contract weakly guarded

CommandLine.optionValue() returns non-null for options that were actually provided on command line

If this fails: Silent null values passed to downstream components or incorrect operation selection when CLI parsing is inconsistent

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/ConnectorCommandLinePropertySource.kt:resolveValues
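One defensive pattern against this assumption is to fail fast when an option that was explicitly provided resolves to null, instead of letting the null flow downstream. This is a hypothetical sketch, not the CDK's code; `requireOptionValue` is an invented helper:

```kotlin
// Hypothetical guard: treat a null value for an explicitly provided CLI option
// as a hard configuration error rather than passing null to downstream components.
fun requireOptionValue(options: Map<String, String?>, name: String): String {
    require(name in options) { "option --$name was not provided" }
    return options[name]
        ?: throw IllegalStateException("option --$name was provided but resolved to null")
}

fun main() {
    val parsed = mapOf<String, String?>("config" to "/secrets/config.json", "state" to null)
    println(requireOptionValue(parsed, "config")) // /secrets/config.json
}
```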
warning Temporal unguarded

operationProvider.get() is idempotent and safe to call multiple times during error recovery

If this fails: Resource leaks or initialization side effects if operation creation is retried after partial failure

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/AirbyteConnectorRunnable.kt:run
warning Environment unguarded

System.getenv() and Instant.now() are available and functional when creating offset clock in test environment

If this fails: Clock factory fails to initialize in containerized or restricted environments where system time access is limited

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/ClockFactory.kt:offset
warning Shape weakly guarded

AirbyteStateMessage objects with type=null are always safe to filter out and represent empty/initial state

If this fails: Loss of valid state data if null type actually represents a legitimate state format from older protocol versions

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/InputStateFactory.kt:make
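The filtering assumption can be illustrated with a hedged sketch; `StateMessage` here is a simplified stand-in for the protocol's `AirbyteStateMessage`, reduced to the one field that matters:

```kotlin
// Simplified stand-in for AirbyteStateMessage: only the type field matters here.
data class StateMessage(val type: String?, val payload: String)

// Mirrors the assumption above: messages with type == null are silently dropped.
// If null ever denotes a legacy-but-valid state format, this loses real state.
fun filterState(messages: List<StateMessage>): List<StateMessage> =
    messages.filter { it.type != null }

fun main() {
    val input = listOf(
        StateMessage(type = "STREAM", payload = "{...}"),
        StateMessage(type = null, payload = "{legacy}"),
    )
    println(filterState(input).size) // 1
}
```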
warning Contract unguarded

Operation.execute() handles its own error reporting and never throws exceptions that should be propagated to users differently than ConfigErrorException

If this fails: Important connector-specific errors get wrapped in generic failure messages, losing debugging context

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/AirbyteConnectorRunnable.kt:run
warning Ordering weakly guarded

CLI operations are mutually exclusive and exactly one operation will be specified per invocation

If this fails: Undefined behavior if multiple operations somehow get through validation, potentially executing wrong operation

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/command/ConnectorCommandLinePropertySource.kt:resolveValues
info Domain unguarded

Epoch timestamp 3133641600 (year 2069) is always safe as a future test timestamp that won't conflict with real data

If this fails: Test failures or incorrect behavior in systems that validate timestamps against reasonable ranges or perform date arithmetic

airbyte-cdk/bulk/core/base/src/main/kotlin/io/airbyte/cdk/ClockFactory.kt:fakeNow
info Environment weakly guarded

window.analytics Segment SDK is injected by Cloudflare and available when vote tracking executes

If this fails: Silent tracking failure with no user feedback when analytics is unavailable, potentially losing valuable user feedback data

docusaurus/src/theme/TOCItems/index.js:onVote

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Connector metadata registry (registry)
Stores connector identity, version, and capability information loaded at startup
Micronaut application context (registry)
Dependency injection container holding configured instances of connectors, operations, and supporting services

Feedback Loops

Two loops are detected: retry and checkpoint-save.

Delays

Control Points

Technology Stack

Kotlin/Java (runtime)
Primary language for connector framework and implementation with null safety and functional constructs
Micronaut (framework)
Dependency injection and application framework providing property binding and environment-based configuration
Gradle (build)
Build automation with multi-project support for CDK, connectors, and documentation components
Jackson (serialization)
JSON parsing and serialization for protocol messages and configuration with schema validation
Docusaurus (framework)
Documentation site generation with React components for connector specifications and tutorials
PicoCLI (library)
Command line argument parsing integrated with Micronaut for connector operation selection

Key Components

Explore the interactive analysis

See the full architecture map, data flow, and code patterns visualization.

Analyze on CodeSea


Frequently Asked Questions

What is airbyte used for?

airbytehq/airbyte orchestrates data movement from APIs, databases, and files to warehouses and lakes. It is an 8-component data pipeline written in Python. Data flows through 5 distinct pipeline stages, and the codebase contains 5,156 files.

How is airbyte architected?

airbyte is organized into 3 architecture layers: Connector Development Kit, Connector Implementations, Documentation Platform. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through airbyte?

Data moves through 5 stages: Parse CLI arguments into Micronaut properties → Load and validate connector configuration → Parse and validate configured catalog → Parse input state for incremental syncs → Execute connector operation. Data flows from CLI arguments through configuration parsing and validation into connector operations. The ConnectorCommandLinePropertySource transforms CLI args into Micronaut properties, factories create validated configuration and catalog objects, and the AirbyteConnectorRunnable executes the specific operation (check, discover, read, write) using the configured instances. State flows in for incremental syncs and flows out as checkpoints during execution. This pipeline design reflects a complex multi-stage processing system.

What technologies does airbyte use?

The core stack includes Kotlin/Java (Primary language for connector framework and implementation with null safety and functional constructs), Micronaut (Dependency injection and application framework providing property binding and environment-based configuration), Gradle (Build automation with multi-project support for CDK, connectors, and documentation components), Jackson (JSON parsing and serialization for protocol messages and configuration with schema validation), Docusaurus (Documentation site generation with React components for connector specifications and tutorials), PicoCLI (Command line argument parsing integrated with Micronaut for connector operation selection). A focused set of dependencies that keeps the build manageable.

What system dynamics does airbyte have?

airbyte exhibits 2 data pools (Connector metadata registry, Micronaut application context), 2 feedback loops, 3 control points, 2 delays. The feedback loops handle retry and checkpoint-save. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does airbyte use?

4 design patterns detected: Micronaut Dependency Injection Factory Pattern, Sealed Interface State Machine, Protocol Message Transformation Pipeline, Exception-based Error Classification.

How does airbyte compare to alternatives?

CodeSea has side-by-side architecture comparisons of airbyte with meltano. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See CodeSea's comparison pages for detailed analysis.

Analyzed on April 19, 2026 by CodeSea.