sodadata/soda-core
Data Contracts engine for the modern data stack. https://www.soda.io
Multi-datasource data quality and contract verification engine with YAML-based configuration
Under the hood, the system uses 2 feedback loops, 3 data pools, and 3 control points to manage its runtime behavior.
Structural Verdict
A 10-component data pipeline with 9 connections, analyzed across 263 files. Well connected, with clear data flow between components.
How Data Flows Through the System
YAML contracts are parsed into check definitions, translated to SQL queries via the AST system, and executed against target databases through adapter plugins, with results either returned locally or sent to Soda Cloud.
- Load YAML Contract — Parse data contract YAML files into structured check definitions (a parsing sketch follows this list)
- Generate SQL AST — Convert checks into database-agnostic SQL Abstract Syntax Tree
- Render Database SQL — Transform AST into database-specific SQL using dialect adapters
- Execute Queries — Run SQL queries against target database via connection adapters
- Process Results — Aggregate query results and validate against contract expectations
- Report Outcomes — Send results to Soda Cloud or return locally for pipeline integration
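To make the first stage concrete, here is a minimal parsing sketch. The contract shape, the CheckDefinition class, and load_contract are illustrative assumptions for this page, not the exact soda-core contract schema or public API.

```python
# Illustrative sketch: the contract shape and names below are assumptions,
# not the exact soda-core contract schema or public API.
from dataclasses import dataclass

import yaml  # PyYAML

CONTRACT_YAML = """
dataset: dim_customer
columns:
  - name: id
    checks:
      - type: no_missing_values
  - name: created_at
    checks:
      - type: no_missing_values
"""

@dataclass
class CheckDefinition:
    column: str
    check_type: str

def load_contract(text: str) -> list[CheckDefinition]:
    """Parse contract YAML into structured check definitions."""
    contract = yaml.safe_load(text)
    return [
        CheckDefinition(column=column["name"], check_type=check["type"])
        for column in contract.get("columns", [])
        for check in column.get("checks", [])
    ]

print(load_contract(CONTRACT_YAML))
```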
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Soda Cloud — external cloud service storing check results, contracts, and monitoring data
- Database Connections — the target databases being validated: BigQuery, Snowflake, PostgreSQL, etc.
- Contract Files — local or remote YAML files defining data contracts and quality checks
Feedback Loops
- Contract Verification Loop (polling, balancing) — Trigger: Scheduled or pipeline-triggered execution. Action: Execute all checks in contract and compare results to expectations. Exit: All checks pass or maximum retries reached.
- Plugin Discovery (recursive, reinforcing) — Trigger: System startup. Action: Scan for and load available data source adapter packages (a discovery sketch follows this list). Exit: All candidate packages are either loaded or have failed to import.
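The plugin discovery loop can be pictured as an import scan over candidate adapter packages. This is a hedged sketch: the package-name convention follows the adapter packages listed further down, but the real load_plugins in soda_core/plugins.py may use a different discovery mechanism.

```python
# Hedged sketch of adapter discovery; the real load_plugins() in
# soda_core/plugins.py may discover packages differently.
import importlib
import logging

logger = logging.getLogger(__name__)

# Candidate names assume the soda_<datasource> convention.
CANDIDATE_PACKAGES = ["soda_postgres", "soda_bigquery", "soda_snowflake", "soda_duckdb"]

def load_plugins() -> dict[str, object]:
    """Scan for installed adapter packages, skipping failed imports."""
    loaded = {}
    for name in CANDIDATE_PACKAGES:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as error:
            # Exit condition: package not installed; record and move on.
            logger.debug("Adapter package %s not available: %s", name, error)
    return loaded
```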
Delays & Async Processing
- Database Query Execution (async processing; duration varies with database size and query complexity) — Contract verification waits for all SQL queries to complete (see the wait-pattern sketch below)
- Soda Cloud Upload (async processing; network-dependent) — Results may be available locally before cloud synchronization completes
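The first delay is simply a blocking join on query execution. Whether soda-core parallelizes its queries is not stated here; the sketch below only illustrates the wait pattern, with run_query standing in for a real database round-trip.

```python
# Wait-pattern illustration only; run_query and the thread pool are
# assumptions, not soda-core's actual scheduling strategy.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql: str) -> int:
    time.sleep(0.1)  # stands in for database round-trip latency
    return 0         # stands in for an aggregated metric value

queries = ["SELECT COUNT(*) FROM t1", "SELECT COUNT(*) FROM t2"]

with ThreadPoolExecutor() as executor:
    # Verification blocks here until every query has returned.
    results = list(executor.map(run_query, queries))

print(results)  # only now can results be validated and uploaded
```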
Control Points
- Data Source Type (env-var) — Controls: Which database adapter plugin is loaded and used
- Connection Properties (env-var) — Controls: Database connection parameters such as host, credentials, and timeouts (both configuration control points are sketched after this list)
- Logging Level (runtime-toggle) — Controls: Verbosity of system logging and debugging output
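A minimal sketch of how the first two control points might interact, assuming a hypothetical adapter registry and a "type" key in the data source configuration; the real key names and wiring in soda-core may differ:

```python
# Illustrative control-point wiring; the registry, key names, and
# environment variables here are assumptions, not soda-core's own.
import os

ADAPTER_REGISTRY = {
    "postgres": "soda_postgres",
    "bigquery": "soda_bigquery",
    "snowflake": "soda_snowflake",
}

def resolve_adapter(config: dict) -> str:
    """Data Source Type controls which adapter plugin is used."""
    ds_type = config["type"]
    if ds_type not in ADAPTER_REGISTRY:
        raise ValueError(f"No adapter installed for data source type {ds_type!r}")
    return ADAPTER_REGISTRY[ds_type]

config = {
    "type": "postgres",
    # Connection Properties: host, credentials, timeouts.
    "host": os.environ.get("PG_HOST", "localhost"),
    "password": os.environ.get("PG_PASSWORD", ""),
    "connect_timeout": 10,
}
print(resolve_adapter(config))  # -> soda_postgres
```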
Package Structure
This monorepo contains 14 packages:
- Core data contracts and quality verification engine with CLI, YAML parsing, SQL generation, and cloud integration.
- Test suite for the entire Soda ecosystem with fixtures, helpers, and integration tests.
- AWS Athena data source adapter implementing S3-backed query execution.
- Google BigQuery data source adapter with project/dataset namespace support.
- Databricks data source adapter with Hive metadata support and Delta Lake integration.
- DuckDB data source adapter supporting both file-based and in-memory databases.
- Microsoft Fabric data source adapter extending SQL Server functionality.
- PostgreSQL data source adapter with comprehensive SSL and authentication support.
- Amazon Redshift data source adapter with AWS IAM integration.
- Snowflake data source adapter supporting multiple authentication methods.
- Apache Spark DataFrame data source adapter for in-memory data validation.
- Microsoft SQL Server data source adapter with Active Directory authentication.
- Azure Synapse Analytics data source adapter.
- Trino data source adapter with OAuth2 authentication support.
Technology Stack
- Pydantic — Configuration validation and data models
- PyProject.toml — Package management and workspace configuration
- UV — Fast Python package manager
- psycopg — PostgreSQL database connectivity
- google-cloud-bigquery — BigQuery database connectivity
- snowflake-connector-python — Snowflake database connectivity
- Testing framework
- Code formatting
Key Components
- DataSourceImpl (class) — Abstract base class defining the interface for all database adapters (soda-core/src/soda_core/common/data_source_impl.py); an adapter-pattern sketch follows this list
- BigQueryDataSourceImpl (class) — BigQuery-specific implementation handling project namespaces and the BigQuery SQL dialect (soda-bigquery/src/soda_bigquery/common/data_sources/bigquery_data_source.py)
- PostgresDataSourceImpl (class) — PostgreSQL adapter with metadata queries and regex pattern support (soda-postgres/src/soda_postgres/common/data_sources/postgres_data_source.py)
- SqlDialect (class) — Base class for database-specific SQL generation and type mapping (soda-core/src/soda_core/common/sql_dialect.py)
- ContractVerificationImpl (class) — Core contract verification engine executing checks against YAML contracts (soda-core/src/soda_core/contracts/impl/contract_verification_impl.py)
- DataSourceConnection (class) — Abstract connection manager handling database connectivity and query execution (soda-core/src/soda_core/common/data_source_connection.py)
- SQL_AST (module) — SQL Abstract Syntax Tree classes for building database-agnostic queries (soda-core/src/soda_core/common/sql_ast.py)
- load_plugins (function) — Discovers and registers all available data source adapter packages (soda-core/src/soda_core/plugins.py)
- MetadataTablesQuery (class) — Base class for querying database metadata such as table and column information (soda-core/src/soda_core/common/statements/metadata_tables_query.py)
- DuckDBDataSourceImpl (class) — DuckDB adapter supporting both file-based and in-memory database connections (soda-duckdb/src/soda_duckdb/common/data_sources/duckdb_data_source.py)
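The relationship between DataSourceImpl, SqlDialect, and the per-database subclasses is a textbook adapter pattern. The sketch below assumes method names like sql_dialect and execute_query for illustration; the verbatim soda-core interface may differ.

```python
# Adapter-pattern sketch; method names and signatures are assumptions
# for illustration, not the verbatim soda-core interface.
from abc import ABC, abstractmethod

class SqlDialect:
    """Base for database-specific SQL generation and type mapping."""
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'  # ANSI default

class DataSourceImpl(ABC):
    """Abstract base defining the interface all database adapters implement."""

    @abstractmethod
    def sql_dialect(self) -> SqlDialect: ...

    @abstractmethod
    def execute_query(self, sql: str) -> list[tuple]: ...

class PostgresDialect(SqlDialect):
    pass  # PostgreSQL keeps the ANSI double-quote convention

class PostgresDataSourceImpl(DataSourceImpl):
    """Adapter for PostgreSQL; the real one executes via psycopg."""

    def sql_dialect(self) -> SqlDialect:
        return PostgresDialect()

    def execute_query(self, sql: str) -> list[tuple]:
        raise NotImplementedError("the real adapter runs this via psycopg")
```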
Configuration
- soda-bigquery/src/soda_bigquery/common/data_sources/bigquery_data_source.py (python-dataclass) — project_id: str; dataset: str; location: Optional[str] = None
- soda-core/src/soda_core/common/metadata_types.py (python-dataclass) — database: str; schema: str
- soda-core/src/soda_core/common/metadata_types.py (python-dataclass) — name: str; character_maximum_length: Optional[int] = None; numeric_precision: Optional[int] = None; numeric_scale: Optional[int] = None; datetime_precision: Optional[int] = None
- soda-core/src/soda_core/common/soda_cloud_dto.py (python-pydantic) — check_attributes: list[CheckAttribute] = Field(..., alias="resourceAttributes")
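Rendered as Python, the models above look roughly like this. The field names, types, and defaults come straight from the list; the class names are hypothetical stand-ins, since the originals are not shown here.

```python
# Field names, types, and defaults are taken from the configuration list
# above; the class names are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel, Field

@dataclass
class BigQueryProperties:  # bigquery_data_source.py
    project_id: str
    dataset: str
    location: Optional[str] = None

@dataclass
class DatasetIdentifier:  # metadata_types.py
    database: str
    schema: str

@dataclass
class ColumnMetadata:  # metadata_types.py
    name: str
    character_maximum_length: Optional[int] = None
    numeric_precision: Optional[int] = None
    numeric_scale: Optional[int] = None
    datetime_precision: Optional[int] = None

class CheckAttribute(BaseModel):
    """Placeholder; the real model lives in soda_cloud_dto.py."""

class CheckResultDto(BaseModel):  # soda_cloud_dto.py, name hypothetical
    # The alias maps the Soda Cloud JSON key to the Python field name.
    check_attributes: list[CheckAttribute] = Field(..., alias="resourceAttributes")
```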
Pipeline Stages
- Parse Contract YAML — YAML parsing with schema validation [YAML text → ContractVerification objects] (soda-core/src/soda_core/contracts/contract_verification.py)
- Build SQL AST — Convert checks to a database-agnostic SQL tree [ContractVerification → SQL AST nodes] (soda-core/src/soda_core/common/sql_ast.py); a toy AST-and-dialect sketch follows this list
- Render Database SQL — Transform the AST using dialect-specific rules [SQL AST → database-specific SQL strings] (soda-core/src/soda_core/common/sql_dialect.py)
- Execute Database Queries — Run SQL against the target database [SQL strings → QueryResult objects with rows] (soda-core/src/soda_core/common/data_source_connection.py)
- Validate Results — Compare actual vs. expected values [QueryResult → check pass/fail status] (soda-core/src/soda_core/contracts/impl/contract_verification_impl.py)
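Stages two and three can be illustrated with a toy AST and two dialects. The node classes and render method below are simplified assumptions; soda-core's sql_ast.py and sql_dialect.py define the real ones.

```python
# Toy AST-to-SQL rendering; node and dialect classes are simplified
# assumptions, not soda-core's actual sql_ast.py / sql_dialect.py classes.
from dataclasses import dataclass

@dataclass
class Column:
    name: str

@dataclass
class CountMissing:
    column: Column

@dataclass
class SelectFromTable:
    expressions: list
    table: str

class SqlDialect:
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'  # ANSI default

    def render(self, node) -> str:
        if isinstance(node, Column):
            return self.quote(node.name)
        if isinstance(node, CountMissing):
            return f"SUM(CASE WHEN {self.render(node.column)} IS NULL THEN 1 ELSE 0 END)"
        if isinstance(node, SelectFromTable):
            exprs = ", ".join(self.render(e) for e in node.expressions)
            return f"SELECT {exprs} FROM {self.quote(node.table)}"
        raise TypeError(f"Unknown AST node: {node!r}")

class BigQueryDialect(SqlDialect):
    def quote(self, identifier: str) -> str:
        return f"`{identifier}`"  # BigQuery uses backtick quoting

ast = SelectFromTable([CountMissing(Column("id"))], "dim_customer")
print(SqlDialect().render(ast))       # ANSI-quoted SQL
print(BigQueryDialect().render(ast))  # BigQuery-quoted SQL
```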
Assumptions & Constraints
- [warning] Assumes sampling parameters are positive numbers, but no validation enforces reasonable limits (value-range); a possible guard is sketched after this list
- [info] Assumes table names follow database-specific naming conventions but validation varies by adapter (format)
- [info] Assumes column metadata fields like precision/scale are always integers when present (shape)
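As a purely illustrative guard for the value-range warning above (soda-core does not currently enforce such limits, and the sample_size field name is a hypothetical example):

```python
# Hypothetical value-range guard; soda-core does not enforce this today,
# and sample_size is an example field name, not a real setting.
from pydantic import BaseModel, Field

class SamplingConfig(BaseModel):
    # ge=1 rejects zero or negative sizes; le caps runaway sample sizes.
    sample_size: int = Field(default=100, ge=1, le=1_000_000)

print(SamplingConfig(sample_size=500).sample_size)
```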
Frequently Asked Questions
What is soda-core used for?
sodadata/soda-core is a multi-datasource data quality and contract verification engine with YAML-based configuration, organized as a 10-component data pipeline written in Python. It is well connected, with clear data flow between components, across a codebase of 263 files.
How is soda-core architected?
soda-core is organized into 4 architecture layers: Core Engine, Data Source Adapters, SQL Dialect System, and Testing Framework. The layers are well connected, with clear data flow between components, and this layered structure enables tight integration between them.
How does data flow through soda-core?
Data moves through 6 stages: Load YAML Contract → Generate SQL AST → Render Database SQL → Execute Queries → Process Results → Report Outcomes. YAML contracts are parsed into check definitions, translated to SQL queries via the AST system, and executed against target databases through adapter plugins, with results either returned locally or sent to Soda Cloud. This pipeline design reflects a complex multi-stage processing system.
What technologies does soda-core use?
The core stack includes Pydantic (Configuration validation and data models), PyProject.toml (Package management and workspace configuration), UV (Fast Python package manager), psycopg (PostgreSQL database connectivity), google-cloud-bigquery (BigQuery database connectivity), snowflake-connector-python (Snowflake database connectivity), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does soda-core have?
soda-core exhibits 3 data pools (Soda Cloud, Database Connections, and contract YAML files), 2 feedback loops, 3 control points, and 2 delays. One feedback loop is a polling contract verification loop; the other is a recursive plugin discovery loop. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does soda-core use?
4 design patterns detected: Adapter Pattern, Plugin System, SQL AST Builder, Pydantic Configuration.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.