sodadata/soda-core

Data Contracts engine for the modern data stack. https://www.soda.io

2,317 stars · Python · 10 components · 9 connections

Multi-datasource data quality and contract verification engine with YAML-based configuration

YAML contracts are parsed into check definitions, translated into SQL queries via the AST system, and executed against target databases through adapter plugins; results are either returned locally or sent to Soda Cloud.

Under the hood, the system's runtime behavior is shaped by 2 feedback loops, 3 data pools, and 3 control points.

Structural Verdict

A 10-component data pipeline with 9 connections. 263 files analyzed. Well-connected — clear data flow between components.

How Data Flows Through the System


  1. Load YAML Contract — Parse data contract YAML files into structured check definitions
  2. Generate SQL AST — Convert checks into database-agnostic SQL Abstract Syntax Tree
  3. Render Database SQL — Transform AST into database-specific SQL using dialect adapters
  4. Execute Queries — Run SQL queries against target database via connection adapters
  5. Process Results — Aggregate query results and validate against contract expectations
  6. Report Outcomes — Send results to Soda Cloud or return locally for pipeline integration
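The six stages above can be sketched end-to-end. This is a minimal illustration, not soda-core's real API: the contract is shown as an already-parsed dict (stage 1), SQL generation and dialect rendering are collapsed into one function (stages 2–3), and an in-memory SQLite database stands in for the target warehouse (stage 4). All names here are invented for the example.

```python
import sqlite3

# Stage 1 (assumed shape): a contract as it might look after YAML parsing.
contract = {
    "dataset": "orders",
    "checks": [
        {"type": "row_count", "must_be_greater_than": 0},
        {"type": "missing_count", "column": "customer_id", "must_be": 0},
    ],
}

# Stages 2-3: translate each check into SQL (dialect handling omitted).
def to_sql(dataset: str, check: dict) -> str:
    if check["type"] == "row_count":
        return f"SELECT COUNT(*) FROM {dataset}"
    if check["type"] == "missing_count":
        return f"SELECT COUNT(*) FROM {dataset} WHERE {check['column']} IS NULL"
    raise ValueError(f"unknown check type: {check['type']}")

# Stage 4: execute against a stand-in target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20), (3, None)])

# Stages 5-6: compare measured values to expectations and report locally.
results = []
for check in contract["checks"]:
    measured = conn.execute(to_sql(contract["dataset"], check)).fetchone()[0]
    if "must_be" in check:
        passed = measured == check["must_be"]
    else:
        passed = measured > check["must_be_greater_than"]
    results.append((check["type"], measured, passed))

for name, measured, passed in results:
    print(f"{name}: measured={measured} {'PASS' if passed else 'FAIL'}")
```

In the real engine, stage 6 can also push these results to Soda Cloud instead of printing them.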

System Behavior

How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Soda Cloud (database)
External cloud service storing check results, contracts, and monitoring data
Database Connections (database)
Target databases being validated: BigQuery, Snowflake, PostgreSQL, etc.
YAML Contract Files (file-store)
Local or remote YAML files defining data contracts and quality checks

Feedback Loops

2 loops detected, handling polling and recursive processing.

Delays & Async Processing

2 delays identified in the runtime analysis.

Control Points

3 control points govern runtime behavior.

Package Structure

This monorepo contains 14 packages:

soda-core (library)
Core data contracts and quality verification engine with CLI, YAML parsing, SQL generation, and cloud integration.
soda-tests (tooling)
Test suite for the entire Soda ecosystem with fixtures, helpers, and integration tests.
soda-athena (library)
AWS Athena data source adapter implementing S3-backed query execution.
soda-bigquery (library)
Google BigQuery data source adapter with project/dataset namespace support.
soda-databricks (library)
Databricks data source adapter with Hive metadata support and Delta Lake integration.
soda-duckdb (library)
DuckDB data source adapter supporting both file-based and in-memory databases.
soda-fabric (library)
Microsoft Fabric data source adapter extending SQL Server functionality.
soda-postgres (library)
PostgreSQL data source adapter with comprehensive SSL and authentication support.
soda-redshift (library)
Amazon Redshift data source adapter with AWS IAM integration.
soda-snowflake (library)
Snowflake data source adapter supporting multiple authentication methods.
soda-sparkdf (library)
Apache Spark DataFrame data source adapter for in-memory data validation.
soda-sqlserver (library)
Microsoft SQL Server data source adapter with Active Directory authentication.
soda-synapse (library)
Azure Synapse Analytics data source adapter.
soda-trino (library)
Trino data source adapter with OAuth2 authentication support.
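Each `soda-*` package above plugs a database-specific adapter into the core engine. The sketch below shows one plausible shape of such a plugin registry; the decorator, class names, and registration API are invented for illustration and do not match soda-core's actual code.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a data source type name to its adapter class.
_REGISTRY: dict[str, type] = {}

def register(type_name: str):
    """Decorator a plugin package might use to register its adapter."""
    def wrap(cls):
        _REGISTRY[type_name] = cls
        return cls
    return wrap

class DataSource(ABC):
    @abstractmethod
    def quote(self, identifier: str) -> str:
        """Quote an identifier in this database's dialect."""

@register("postgres")
class PostgresDataSource(DataSource):
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'   # ANSI double quotes

@register("bigquery")
class BigQueryDataSource(DataSource):
    def quote(self, identifier: str) -> str:
        return f"`{identifier}`"   # BigQuery backticks

def create_data_source(type_name: str) -> DataSource:
    # The core engine looks adapters up by the type declared in configuration.
    return _REGISTRY[type_name]()

print(create_data_source("bigquery").quote("orders"))  # `orders`
```

The payoff of this design is that adding a new database is a matter of shipping one more package that registers itself, with no changes to the core engine.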

Technology Stack

Pydantic (library)
Configuration validation and data models
PyProject.toml (build)
Package management and workspace configuration
UV (build)
Fast Python package manager
psycopg (database)
PostgreSQL database connectivity
google-cloud-bigquery (database)
BigQuery database connectivity
snowflake-connector-python (database)
Snowflake database connectivity
pytest (testing)
Testing framework
black (build)
Code formatting

Key Components

Configuration

soda-bigquery/src/soda_bigquery/common/data_sources/bigquery_data_source.py (python-dataclass)

soda-core/src/soda_core/common/metadata_types.py (python-dataclass)

soda-core/src/soda_core/common/soda_cloud_dto.py (python-pydantic)

Verification Pipeline

  1. Parse Contract YAML — YAML parsing with schema validation [YAML text → ContractVerification objects] soda-core/src/soda_core/contracts/contract_verification.py
  2. Build SQL AST — Convert checks to database-agnostic SQL tree [ContractVerification → SQL AST nodes] soda-core/src/soda_core/common/sql_ast.py
  3. Render Database SQL — Transform AST using dialect-specific rules [SQL AST → Database-specific SQL strings] soda-core/src/soda_core/common/sql_dialect.py
  4. Execute Database Queries — Run SQL against target database [SQL strings → QueryResult objects with rows] soda-core/src/soda_core/common/data_source_connection.py
  5. Validate Results — Compare actual vs expected values [QueryResult → Check pass/fail status] soda-core/src/soda_core/contracts/impl/contract_verification_impl.py
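Stages 2 and 3 above hinge on a database-agnostic AST that each dialect renders differently. The following is an illustrative sketch only: the node classes and dialect names are invented, not soda-core's real `sql_ast.py`/`sql_dialect.py` types, but they show the idea of one tree producing different SQL per database.

```python
from dataclasses import dataclass

# Hypothetical database-agnostic AST nodes (stage 2 output).
@dataclass
class Column:
    name: str

@dataclass
class IsNull:
    expr: Column

@dataclass
class CountIf:
    condition: IsNull

@dataclass
class Select:
    expression: CountIf
    table: str

class AnsiDialect:
    """Renders the AST into portable ANSI SQL (stage 3)."""
    quote = '"'

    def render(self, node) -> str:
        if isinstance(node, Column):
            return f"{self.quote}{node.name}{self.quote}"
        if isinstance(node, IsNull):
            return f"{self.render(node.expr)} IS NULL"
        if isinstance(node, CountIf):
            # ANSI has no COUNTIF; emulate with COUNT(CASE WHEN ...).
            return f"COUNT(CASE WHEN {self.render(node.condition)} THEN 1 END)"
        if isinstance(node, Select):
            return (f"SELECT {self.render(node.expression)} "
                    f"FROM {self.quote}{node.table}{self.quote}")
        raise TypeError(type(node))

class BigQueryDialect(AnsiDialect):
    """Overrides only what differs: backtick quoting and native COUNTIF."""
    quote = "`"

    def render(self, node) -> str:
        if isinstance(node, CountIf):
            return f"COUNTIF({self.render(node.condition)})"
        return super().render(node)

ast = Select(CountIf(IsNull(Column("customer_id"))), "orders")
print(AnsiDialect().render(ast))
print(BigQueryDialect().render(ast))
```

One AST, two SQL strings: the ANSI dialect emulates the check with `COUNT(CASE WHEN ...)`, while the BigQuery dialect uses its native `COUNTIF` and backtick quoting.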

Explore the interactive analysis

See the full architecture map, data flow, and code patterns visualization.

Analyze on CodeSea

Frequently Asked Questions

What is soda-core used for?

sodadata/soda-core is a multi-datasource data quality and contract verification engine with YAML-based configuration, structured as a 10-component data pipeline written in Python. It is well-connected, with clear data flow between components. The codebase contains 263 files.

How is soda-core architected?

soda-core is organized into 4 architecture layers: Core Engine, Data Source Adapters, SQL Dialect System, and Testing Framework. This layered structure enables tight integration and clear data flow between components.

How does data flow through soda-core?

Data moves through 6 stages: Load YAML Contract → Generate SQL AST → Render Database SQL → Execute Queries → Process Results → Report Outcomes. YAML contracts are parsed into check definitions, translated into SQL queries via the AST system, and executed against target databases through adapter plugins; results are either returned locally or sent to Soda Cloud. This pipeline design reflects a complex multi-stage processing system.

What technologies does soda-core use?

The core stack includes Pydantic (configuration validation and data models), PyProject.toml (package management and workspace configuration), UV (fast Python package manager), psycopg (PostgreSQL database connectivity), google-cloud-bigquery (BigQuery database connectivity), snowflake-connector-python (Snowflake database connectivity), pytest (testing framework), and black (code formatting). A focused set of dependencies keeps the build manageable.

What system dynamics does soda-core have?

soda-core exhibits 3 data pools (Soda Cloud, Database Connections, YAML Contract Files), 2 feedback loops, 3 control points, and 2 delays. The feedback loops handle polling and recursive processing. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does soda-core use?

4 design patterns detected: Adapter Pattern, Plugin System, SQL AST Builder, Pydantic Configuration.

Analyzed on March 31, 2026 by CodeSea.