sodadata/soda-core
Data Contracts engine for the modern data stack. https://www.soda.io
Multi-datasource data quality and contract verification engine with YAML-based configuration
Under the hood, the system uses 2 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.
Structural Verdict
A 10-component data pipeline with 9 connections, analyzed across 263 files. Well connected, with clear data flow between components.
How Data Flows Through the System
YAML contracts are parsed into check definitions, translated to SQL queries via the AST system, and executed against target databases through adapter plugins, with results either returned locally or sent to Soda Cloud.
- Load YAML Contract — Parse data contract YAML files into structured check definitions (a parsing sketch follows this list)
- Generate SQL AST — Convert checks into database-agnostic SQL Abstract Syntax Tree
- Render Database SQL — Transform AST into database-specific SQL using dialect adapters
- Execute Queries — Run SQL queries against target database via connection adapters
- Process Results — Aggregate query results and validate against contract expectations
- Report Outcomes — Send results to Soda Cloud or return locally for pipeline integration
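To make the first stage concrete, here is a minimal parsing sketch. The contract shape, the CheckDefinition class, and load_contract are illustrative assumptions for this page, not the exact soda-core contract schema or public API.

```python
# Illustrative sketch: the contract shape and names below are assumptions,
# not the exact soda-core contract schema or public API.
from dataclasses import dataclass

import yaml  # PyYAML

CONTRACT_YAML = """
dataset: dim_customer
columns:
  - name: id
    checks:
      - type: no_missing_values
  - name: created_at
    checks:
      - type: no_missing_values
"""

@dataclass
class CheckDefinition:
    column: str
    check_type: str

def load_contract(text: str) -> list[CheckDefinition]:
    """Parse contract YAML into structured check definitions."""
    contract = yaml.safe_load(text)
    return [
        CheckDefinition(column=column["name"], check_type=check["type"])
        for column in contract.get("columns", [])
        for check in column.get("checks", [])
    ]

print(load_contract(CONTRACT_YAML))
```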
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Soda Cloud — external cloud service storing check results, contracts, and monitoring data
- Database Connections — the target databases being validated: BigQuery, Snowflake, PostgreSQL, etc.
- Contract Files — local or remote YAML files defining data contracts and quality checks
Feedback Loops
- Contract Verification Loop (polling, balancing) — Trigger: Scheduled or pipeline-triggered execution. Action: Execute all checks in contract and compare results to expectations. Exit: All checks pass or maximum retries reached.
- Plugin Discovery (recursive, reinforcing) — Trigger: System startup. Action: Scan for and load available data source adapter packages (a discovery sketch follows this list). Exit: All candidate packages are either loaded or have failed to import.
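The plugin discovery loop can be pictured as an import scan over candidate adapter packages. This is a hedged sketch: the package-name convention follows the adapter packages listed further down, but the real load_plugins in soda_core/plugins.py may use a different discovery mechanism.

```python
# Hedged sketch of adapter discovery; the real load_plugins() in
# soda_core/plugins.py may discover packages differently.
import importlib
import logging

logger = logging.getLogger(__name__)

# Candidate names assume the soda_<datasource> convention.
CANDIDATE_PACKAGES = ["soda_postgres", "soda_bigquery", "soda_snowflake", "soda_duckdb"]

def load_plugins() -> dict[str, object]:
    """Scan for installed adapter packages, skipping failed imports."""
    loaded = {}
    for name in CANDIDATE_PACKAGES:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as error:
            # Exit condition: package not installed; record and move on.
            logger.debug("Adapter package %s not available: %s", name, error)
    return loaded
```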
Delays & Async Processing
- Database Query Execution (async processing; duration varies with database size and query complexity) — Contract verification waits for all SQL queries to complete (see the wait-pattern sketch below)
- Soda Cloud Upload (async processing; network-dependent) — Results may be available locally before cloud synchronization completes
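The first delay is simply a blocking join on query execution. Whether soda-core parallelizes its queries is not stated here; the sketch below only illustrates the wait pattern, with run_query standing in for a real database round-trip.

```python
# Wait-pattern illustration only; run_query and the thread pool are
# assumptions, not soda-core's actual scheduling strategy.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> int:
    time.sleep(0.1)  # stands in for database round-trip latency
    return 0         # stands in for an aggregated metric value

queries = ["SELECT COUNT(*) FROM t1", "SELECT COUNT(*) FROM t2"]

with ThreadPoolExecutor() as executor:
    # Verification blocks here until every query has returned.
    results = list(executor.map(run_query, queries))

print(results)  # only now can results be validated and uploaded
```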
Control Points
- Data Source Type (env-var) — Controls: Which database adapter plugin is loaded and used
- Connection Properties (env-var) — Controls: Database connection parameters such as host, credentials, and timeouts (both configuration control points are sketched after this list)
- Logging Level (runtime-toggle) — Controls: Verbosity of system logging and debugging output
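A minimal sketch of how the first two control points might interact, assuming a hypothetical adapter registry and a "type" key in the data source configuration; the real key names and wiring in soda-core may differ:

```python
# Illustrative control-point wiring; the registry, key names, and
# environment variables here are assumptions, not soda-core's own.
import os

ADAPTER_REGISTRY = {
    "postgres": "soda_postgres",
    "bigquery": "soda_bigquery",
    "snowflake": "soda_snowflake",
}

def resolve_adapter(config: dict) -> str:
    """Data Source Type controls which adapter plugin is used."""
    ds_type = config["type"]
    if ds_type not in ADAPTER_REGISTRY:
        raise ValueError(f"No adapter installed for data source type {ds_type!r}")
    return ADAPTER_REGISTRY[ds_type]

config = {
    "type": "postgres",
    # Connection Properties: host, credentials, timeouts.
    "host": os.environ.get("PG_HOST", "localhost"),
    "password": os.environ.get("PG_PASSWORD", ""),
    "connect_timeout": 10,
}
print(resolve_adapter(config))  # -> soda_postgres
```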
Package Structure
This monorepo contains 14 packages:
- Core data contracts and quality verification engine with CLI, YAML parsing, SQL generation, and cloud integration.
- Test suite for the entire Soda ecosystem with fixtures, helpers, and integration tests.
- AWS Athena data source adapter implementing S3-backed query execution.
- Google BigQuery data source adapter with project/dataset namespace support.
- Databricks data source adapter with Hive metadata support and Delta Lake integration.
- DuckDB data source adapter supporting both file-based and in-memory databases.
- Microsoft Fabric data source adapter extending SQL Server functionality.
- PostgreSQL data source adapter with comprehensive SSL and authentication support.
- Amazon Redshift data source adapter with AWS IAM integration.
- Snowflake data source adapter supporting multiple authentication methods.
- Apache Spark DataFrame data source adapter for in-memory data validation.
- Microsoft SQL Server data source adapter with Active Directory authentication.
- Azure Synapse Analytics data source adapter.
- Trino data source adapter with OAuth2 authentication support.
Technology Stack
- Pydantic — Configuration validation and data models
- PyProject.toml — Package management and workspace configuration
- UV — Fast Python package manager
- psycopg — PostgreSQL database connectivity
- google-cloud-bigquery — BigQuery database connectivity
- snowflake-connector-python — Snowflake database connectivity
- Testing framework
- Code formatting
Key Components
- DataSourceImpl (class) — Abstract base class defining the interface for all database adapters (soda-core/src/soda_core/common/data_source_impl.py); an adapter-pattern sketch follows this list
- BigQueryDataSourceImpl (class) — BigQuery-specific implementation handling project namespaces and the BigQuery SQL dialect (soda-bigquery/src/soda_bigquery/common/data_sources/bigquery_data_source.py)
- PostgresDataSourceImpl (class) — PostgreSQL adapter with metadata queries and regex pattern support (soda-postgres/src/soda_postgres/common/data_sources/postgres_data_source.py)
- SqlDialect (class) — Base class for database-specific SQL generation and type mapping (soda-core/src/soda_core/common/sql_dialect.py)
- ContractVerificationImpl (class) — Core contract verification engine executing checks against YAML contracts (soda-core/src/soda_core/contracts/impl/contract_verification_impl.py)
- DataSourceConnection (class) — Abstract connection manager handling database connectivity and query execution (soda-core/src/soda_core/common/data_source_connection.py)
- SQL_AST (module) — SQL Abstract Syntax Tree classes for building database-agnostic queries (soda-core/src/soda_core/common/sql_ast.py)
- load_plugins (function) — Discovers and registers all available data source adapter packages (soda-core/src/soda_core/plugins.py)
- MetadataTablesQuery (class) — Base class for querying database metadata such as table and column information (soda-core/src/soda_core/common/statements/metadata_tables_query.py)
- DuckDBDataSourceImpl (class) — DuckDB adapter supporting both file-based and in-memory database connections (soda-duckdb/src/soda_duckdb/common/data_sources/duckdb_data_source.py)
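The relationship between DataSourceImpl, SqlDialect, and the per-database subclasses is a textbook adapter pattern. The sketch below assumes method names like sql_dialect and execute_query for illustration; the verbatim soda-core interface may differ.

```python
# Adapter-pattern sketch; method names and signatures are assumptions
# for illustration, not the verbatim soda-core interface.
from abc import ABC, abstractmethod

class SqlDialect:
    """Base for database-specific SQL generation and type mapping."""
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'  # ANSI default

class DataSourceImpl(ABC):
    """Abstract base defining the interface all database adapters implement."""

    @abstractmethod
    def sql_dialect(self) -> SqlDialect: ...

    @abstractmethod
    def execute_query(self, sql: str) -> list[tuple]: ...

class PostgresDialect(SqlDialect):
    pass  # PostgreSQL keeps the ANSI double-quote convention

class PostgresDataSourceImpl(DataSourceImpl):
    """Adapter for PostgreSQL; the real one executes via psycopg."""

    def sql_dialect(self) -> SqlDialect:
        return PostgresDialect()

    def execute_query(self, sql: str) -> list[tuple]:
        raise NotImplementedError("the real adapter runs this via psycopg")
```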
Configuration
- soda-bigquery/src/soda_bigquery/common/data_sources/bigquery_data_source.py (python-dataclass) — project_id: str; dataset: str; location: Optional[str] = None
- soda-core/src/soda_core/common/metadata_types.py (python-dataclass) — database: str; schema: str
- soda-core/src/soda_core/common/metadata_types.py (python-dataclass) — name: str; character_maximum_length: Optional[int] = None; numeric_precision: Optional[int] = None; numeric_scale: Optional[int] = None; datetime_precision: Optional[int] = None
- soda-core/src/soda_core/common/soda_cloud_dto.py (python-pydantic) — check_attributes: list[CheckAttribute] = Field(..., alias="resourceAttributes")
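Rendered as Python, the models above look roughly like this. The field names, types, and defaults come straight from the list; the class names are hypothetical stand-ins, since the originals are not shown here.

```python
# Field names, types, and defaults are taken from the configuration list
# above; the class names are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel, Field

@dataclass
class BigQueryProperties:  # bigquery_data_source.py
    project_id: str
    dataset: str
    location: Optional[str] = None

@dataclass
class DatasetIdentifier:  # metadata_types.py
    database: str
    schema: str

@dataclass
class ColumnMetadata:  # metadata_types.py
    name: str
    character_maximum_length: Optional[int] = None
    numeric_precision: Optional[int] = None
    numeric_scale: Optional[int] = None
    datetime_precision: Optional[int] = None

class CheckAttribute(BaseModel):
    """Placeholder; the real model lives in soda_cloud_dto.py."""

class CheckResultDto(BaseModel):  # soda_cloud_dto.py, name hypothetical
    # The alias maps the Soda Cloud JSON key to the Python field name.
    check_attributes: list[CheckAttribute] = Field(..., alias="resourceAttributes")
```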
Pipeline Stages
- Parse Contract YAML — YAML parsing with schema validation [YAML text → ContractVerification objects] (soda-core/src/soda_core/contracts/contract_verification.py)
- Build SQL AST — Convert checks to a database-agnostic SQL tree [ContractVerification → SQL AST nodes] (soda-core/src/soda_core/common/sql_ast.py); a toy AST-and-dialect sketch follows this list
- Render Database SQL — Transform the AST using dialect-specific rules [SQL AST → database-specific SQL strings] (soda-core/src/soda_core/common/sql_dialect.py)
- Execute Database Queries — Run SQL against the target database [SQL strings → QueryResult objects with rows] (soda-core/src/soda_core/common/data_source_connection.py)
- Validate Results — Compare actual vs. expected values [QueryResult → check pass/fail status] (soda-core/src/soda_core/contracts/impl/contract_verification_impl.py)
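Stages two and three can be illustrated with a toy AST and two dialects. The node classes and render method below are simplified assumptions; soda-core's sql_ast.py and sql_dialect.py define the real ones.

```python
# Toy AST-to-SQL rendering; node and dialect classes are simplified
# assumptions, not soda-core's actual sql_ast.py / sql_dialect.py classes.
from dataclasses import dataclass

@dataclass
class Column:
    name: str

@dataclass
class CountMissing:
    column: Column

@dataclass
class SelectFromTable:
    expressions: list
    table: str

class SqlDialect:
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'  # ANSI default

    def render(self, node) -> str:
        if isinstance(node, Column):
            return self.quote(node.name)
        if isinstance(node, CountMissing):
            return f"SUM(CASE WHEN {self.render(node.column)} IS NULL THEN 1 ELSE 0 END)"
        if isinstance(node, SelectFromTable):
            exprs = ", ".join(self.render(e) for e in node.expressions)
            return f"SELECT {exprs} FROM {self.quote(node.table)}"
        raise TypeError(f"Unknown AST node: {node!r}")

class BigQueryDialect(SqlDialect):
    def quote(self, identifier: str) -> str:
        return f"`{identifier}`"  # BigQuery uses backtick quoting

ast = SelectFromTable([CountMissing(Column("id"))], "dim_customer")
print(SqlDialect().render(ast))       # ANSI-quoted SQL
print(BigQueryDialect().render(ast))  # BigQuery-quoted SQL
```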
Assumptions & Constraints
- [warning] Assumes sampling parameters are positive numbers, but no validation enforces reasonable limits (value-range); a possible guard is sketched after this list
- [info] Assumes table names follow database-specific naming conventions but validation varies by adapter (format)
- [info] Assumes column metadata fields like precision/scale are always integers when present (shape)
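As a purely illustrative guard for the value-range warning above (soda-core does not currently enforce such limits, and the sample_size field name is a hypothetical example):

```python
# Hypothetical value-range guard; soda-core does not enforce this today,
# and sample_size is an example field name, not a real setting.
from pydantic import BaseModel, Field

class SamplingConfig(BaseModel):
    # ge=1 rejects zero or negative sizes; le caps runaway sample sizes.
    sample_size: int = Field(default=100, ge=1, le=1_000_000)

print(SamplingConfig(sample_size=500).sample_size)
```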
Frequently Asked Questions
What is soda-core used for?
sodadata/soda-core is a multi-datasource data quality and contract verification engine with YAML-based configuration, organized as a 10-component data pipeline written in Python. It is well connected, with clear data flow between components, across a codebase of 263 files.
How is soda-core architected?
soda-core is organized into 4 architecture layers: Core Engine, Data Source Adapters, SQL Dialect System, and Testing Framework. The layers are well connected, with clear data flow between components, and this layered structure enables tight integration between them.
How does data flow through soda-core?
Data moves through 6 stages: Load YAML Contract → Generate SQL AST → Render Database SQL → Execute Queries → Process Results → Report Outcomes. YAML contracts are parsed into check definitions, translated to SQL queries via the AST system, and executed against target databases through adapter plugins, with results either returned locally or sent to Soda Cloud. This pipeline design reflects a complex multi-stage processing system.
What technologies does soda-core use?
The core stack includes Pydantic (Configuration validation and data models), PyProject.toml (Package management and workspace configuration), UV (Fast Python package manager), psycopg (PostgreSQL database connectivity), google-cloud-bigquery (BigQuery database connectivity), snowflake-connector-python (Snowflake database connectivity), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does soda-core have?
soda-core exhibits 3 data pools (Soda Cloud, Database Connections, and contract YAML files), 2 feedback loops, 3 control points, and 2 delays. One feedback loop is a polling contract verification loop; the other is a recursive plugin discovery loop. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does soda-core use?
4 design patterns detected: Adapter Pattern, Plugin System, SQL AST Builder, Pydantic Configuration.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.