great-expectations/great_expectations
Always know what to expect from your data.
Validates data quality through configurable expectations with multi-backend execution
Users define expectations through DataContext, which manages datasources that load data into Batches. Validators execute expectations against batches using ExecutionEngines that compute metrics. Results flow to Renderers for documentation and Stores for persistence. Checkpoints orchestrate this pipeline for production workflows.
Under the hood, the system uses 3 feedback loops, 4 data pools, and 4 control points to manage its runtime behavior.
An 8-component library. 1728 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
- Initialize DataContext — AbstractDataContext loads configuration from stores, initializes datasources, and sets up validation infrastructure [DataContextConfig → DataContext]
- Define Expectations — Users create ExpectationConfiguration objects defining data quality rules via expect_* methods on Validator instances
- Load Data Batches — Datasource queries data sources and creates Batch objects containing data and metadata using BatchDefinition [BatchRequest → Batch]
- Execute Validations — Validator.validate() converts expectations to MetricConfiguration, ExecutionEngine computes metrics, results compared to expectation thresholds [ExpectationConfiguration → ValidationResult]
- Aggregate Results — Checkpoint collects ValidationResults from multiple batches and expectations into CheckpointResult [ValidationResult → CheckpointResult]
- Render Documentation — Renderer transforms ValidationResults into HTML reports, data docs, and other human-readable formats [ValidationResult → RenderedContent]
- Persist Results — Store implementations save ValidationResults, ExpectationSuites, and documentation to filesystem, databases, or cloud storage [Serializable objects]
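The pipeline above can be sketched as a minimal, hypothetical validation loop. The class names (`SimpleExpectation`, `ValidationResult`) and the single hard-coded expectation type are illustrative stand-ins, not the real GX API, which carries far more metadata and dispatch machinery:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical stand-ins for ExpectationConfiguration and the validation
# result object; the real GX classes are much richer.
@dataclass
class SimpleExpectation:
    expectation_type: str
    kwargs: Dict[str, Any]

@dataclass
class ValidationResult:
    success: bool
    result: Dict[str, Any]

def validate_batch(batch: List[Dict[str, Any]],
                   expectations: List[SimpleExpectation]) -> List[ValidationResult]:
    """Run each expectation against an in-memory batch of rows."""
    results = []
    for exp in expectations:
        if exp.expectation_type == "expect_column_values_to_not_be_null":
            col = exp.kwargs["column"]
            nulls = sum(1 for row in batch if row.get(col) is None)
            results.append(ValidationResult(success=nulls == 0,
                                            result={"unexpected_count": nulls}))
    return results

batch = [{"id": 1}, {"id": None}, {"id": 3}]
suite = [SimpleExpectation("expect_column_values_to_not_be_null", {"column": "id"})]
print(validate_batch(batch, suite))  # one failing result with unexpected_count == 1
```

In the real library, a Checkpoint would drive this loop across many batches and hand the results to renderers and stores.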
Data Models
The data structures that flow between stages — the contracts that hold the system together.
great_expectations/core/expectation_configuration.py — dict with expectation_type: str, kwargs: Dict[str, Any], meta: Dict[str, Any]
Created by users or auto-generated, stored in context, executed by validators, results rendered to documentation
great_expectations/core/expectation_validation_result.py — dict with success: bool, result: Dict[str, Any], exception_info: Optional[Dict], meta: Dict[str, Any]
Generated by expectation execution, aggregated into checkpoint results, rendered for documentation or stored for monitoring
great_expectations/core/batch.py — object with data: Union[DataFrame, Dataset], batch_definition: BatchDefinition, batch_spec: BatchSpec
Created from datasource queries, passed to execution engines, used by validators to run expectations against specific data subsets
great_expectations/core/metric_domain_types.py — dict with metric_name: str, metric_domain_kwargs: Dict, metric_value_kwargs: Dict
Derived from expectation configurations, executed by metrics engines, results used for expectation validation
great_expectations/data_context/data_context/abstract_data_context.py — object with config: DataContextConfig, datasources: Dict, stores: Dict, checkpoints: Dict
Instantiated at startup, manages all GX resources, persists configuration and validation results
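As a concrete illustration of the first two contracts above, here are approximate serialized shapes written as plain dicts. The field values are invented examples; the real GX objects add identifiers, rendering metadata, and schema versions:

```python
# Approximate serialized shape of an expectation configuration
# (illustrative values, not taken from a real project).
expectation_config = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "passenger_count", "min_value": 1, "max_value": 6},
    "meta": {"notes": "Taxi rides carry 1-6 passengers"},
}

# Approximate shape of the matching validation result.
validation_result = {
    "success": True,
    "result": {"element_count": 10000, "unexpected_count": 0},
    "exception_info": None,
    "meta": {},
}

# Renderers pair a result back with its expectation's kwargs when
# producing data docs, so both shapes must stay in sync.
assert set(expectation_config) == {"expectation_type", "kwargs", "meta"}
```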
Hidden Assumptions
Things this code relies on but never validates — the conditions that cause silent failures when the system changes.
useThemeConfig().gxCard exists and has the structure {title: string, description: string, buttons: {primary: {href: string, label: string}, secondary: {href: string, label: string}}}
If this fails: Runtime error 'Cannot read properties of undefined' when gxCard is not configured in theme config, breaking card rendering
docs/docusaurus/src/components/GXCard/index.js:useGXCardConfig
GitHub API at 'https://api.github.com/repos/${owner}/${repository}' returns JSON with stargazers_count and forks_count numeric fields
If this fails: formatCompactNumber crashes with TypeError if API returns null/string for counts, or component displays 'NaN' for invalid numeric values
docs/docusaurus/src/components/GithubNavbarItem/index.js:useEffect
GitHub API is accessible and responds within reasonable time without CORS issues
If this fails: Component renders without star/fork counts and setShowGithubBadgeInfo(false) hides badge info, but no visible error to user about network failure
docs/docusaurus/src/components/GithubNavbarItem/index.js:fetch
Netlify Functions endpoint '/.netlify/functions/createJiraTicketInDocsBoard' exists and accepts the form data structure
If this fails: Feedback submission fails with 404 or 500 errors if endpoint is missing or expects different data shape, silently failing user feedback
docs/docusaurus/src/components/WasThisHelpful/index.js:CREATE_JIRA_TICKET_IN_DOCS_BOARD_ENDPOINT_URL
Form inputs have 'name' attribute matching formData keys (name, email, selectedValue, description)
If this fails: Form state becomes inconsistent if input name attributes don't match, causing submission to send undefined/old values
docs/docusaurus/src/components/WasThisHelpful/index.js:handleChange
Intl.NumberFormat with 'compact' notation is supported in all target browsers
If this fails: TypeError in older browsers that don't support compact notation, breaking the entire navbar component
docs/docusaurus/src/components/GithubNavbarItem/index.js:formatCompactNumber
announcementBar.content is safe HTML string that won't execute malicious scripts
If this fails: XSS vulnerability if content contains malicious JavaScript, allowing arbitrary code execution in user browsers
docs/docusaurus/src/theme/AnnouncementBar/Content/index.js:dangerouslySetInnerHTML
DOM access is available and buttonElement's parent contains code with 'code-block-hide-line' class elements
If this fails: Copy function falls back to originalCode with potentially sensitive hidden lines visible in clipboard if DOM traversal fails
docs/docusaurus/src/theme/CodeBlock/Buttons/CopyButton/index.js:filterHiddenLines
The 'to' prop starts with '/' and useVersionedPath hook correctly resolves versioned paths from current page context
If this fails: Navigation links break with incorrect paths if 'to' prop doesn't start with '/' or version context is unavailable
docs/docusaurus/src/components/VersionedLink/index.js:useVersionedPath
Icon URLs in 'icon' prop are accessible and load successfully
If this fails: Broken image icons display if URLs are invalid, but component continues to function without visual feedback about the failure
docs/docusaurus/src/components/LinkCard/index.js:VersionedLink
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- ExpectationStore — Persists ExpectationSuite definitions for reuse across validation runs
- ValidationResultsStore — Accumulates historical validation results for monitoring and trend analysis
- CheckpointStore — Stores checkpoint configurations for repeatable validation workflows
- Metric cache — Caches computed metrics to avoid redundant computation across similar expectations
Feedback Loops
- Expectation Refinement (self-correction, balancing) — Trigger: Validation failures or unexpected results. Action: Users adjust expectation parameters based on validation outcomes. Exit: Expectations match data reality.
- Profiler Iteration (convergence, balancing) — Trigger: Data profiling to automatically generate expectations. Action: ProfilerRunner iteratively refines expectation parameters based on data samples. Exit: Statistical convergence or iteration limit reached.
- Checkpoint Retry (retry, balancing) — Trigger: Transient failures in data access or computation. Action: Checkpoint retries failed validations with exponential backoff. Exit: Success or maximum retry count reached.
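The Checkpoint Retry loop above can be sketched generically. This is not GX's actual retry implementation, just the standard exponential-backoff shape with the same trigger and exit conditions:

```python
import time

def run_with_retry(task, max_attempts=4, base_delay=0.5):
    """Retry a flaky task with exponential backoff: 0.5s, 1s, 2s, ..."""
    for attempt in range(max_attempts):
        try:
            return task()  # exit condition: success
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exit condition: maximum retry count reached
            time.sleep(base_delay * (2 ** attempt))

# Simulate a transient data-source failure that succeeds on the third call.
calls = {"n": 0}
def flaky_validation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient data-source failure")
    return "validation succeeded"

print(run_with_retry(flaky_validation, base_delay=0.01))  # "validation succeeded"
```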
Delays
- Lazy Metric Computation (async-processing, ~Variable based on data size) — Metrics computed only when needed by expectations, improving performance for large datasets
- Store Synchronization (eventual-consistency, ~Depends on storage backend) — Stores may have eventual consistency, affecting immediate availability of saved artifacts
- Data Source Connection (warmup, ~1-30 seconds) — Initial connection establishment to databases or cloud storage before data access
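Lazy metric computation, the first delay above, amounts to compute-on-first-use plus caching. A minimal sketch, assuming a hypothetical `resolve_metric` resolver (GX's real metric resolution goes through MetricConfiguration objects and the execution engine):

```python
from functools import lru_cache

# Hypothetical lazy metric resolver: a metric is computed only the first
# time an expectation asks for it, then served from cache.
DATA = [3, 1, 4, 1, 5, 9, 2, 6]
COMPUTED = []  # track which metrics actually ran

@lru_cache(maxsize=None)
def resolve_metric(name: str) -> float:
    COMPUTED.append(name)
    if name == "column.mean":
        return sum(DATA) / len(DATA)
    if name == "column.max":
        return max(DATA)
    raise KeyError(name)

# Two expectations share "column.mean": it is computed once, reused once,
# and "column.max" is never computed because nothing requested it.
assert resolve_metric("column.mean") == resolve_metric("column.mean")
print(COMPUTED)  # ["column.mean"]
```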
Control Points
- progress_bars (feature-flag) — Controls: Whether to display progress indicators during long-running operations. Default: True
- result_format (runtime-toggle) — Controls: Detail level of validation results (BOOLEAN_ONLY, BASIC, COMPLETE, SUMMARY). Default: SUMMARY
- catch_exceptions (runtime-toggle) — Controls: Whether to catch and return exceptions or let them propagate. Default: True
- evaluation_parameters (runtime-toggle) — Controls: Dynamic parameter substitution in expectation configurations. Default: null
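The `result_format` control point determines how much detail survives into a validation result. The level names below match GX's documented options, but the key-filtering logic and the specific keys per level are a simplified, hypothetical illustration:

```python
# Full detail a validation run could produce (illustrative values).
FULL_RESULT = {
    "element_count": 100,
    "unexpected_count": 2,
    "unexpected_percent": 2.0,
    "partial_unexpected_list": [42, 87],
    "unexpected_list": [42, 87],
}

# Hypothetical mapping of result_format level to retained keys.
LEVELS = {
    "BOOLEAN_ONLY": [],
    "BASIC": ["element_count", "unexpected_count", "partial_unexpected_list"],
    "SUMMARY": ["element_count", "unexpected_count", "unexpected_percent",
                "partial_unexpected_list"],
    "COMPLETE": list(FULL_RESULT),
}

def format_result(success: bool, result_format: str) -> dict:
    keys = LEVELS[result_format]
    return {"success": success,
            "result": {k: FULL_RESULT[k] for k in keys}}

print(format_result(False, "BOOLEAN_ONLY"))  # {'success': False, 'result': {}}
```

The trade-off is payload size and privacy versus debuggability: BOOLEAN_ONLY is cheapest to store, while COMPLETE can leak row-level values into results stores.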
Technology Stack
- Pydantic — Provides type-safe configuration models and validation for DataContext and other core objects
- SQLAlchemy — Database abstraction layer for SQL datasources and stores
- Pandas — Primary data manipulation engine for in-memory data processing
- Apache Spark — Distributed computing engine for large-scale data validation
- Jinja2 — Template engine for generating documentation and reports from validation results
- Click — Command-line interface framework for GX CLI tools
- Marshmallow — Schema serialization and deserialization for configuration objects
- pytest — Testing framework for unit and integration tests across the codebase
Key Components
- AbstractDataContext (orchestrator) — Central coordinator that manages datasources, expectations, checkpoints, and validation workflows
  great_expectations/data_context/data_context/abstract_data_context.py
- Validator (executor) — Executes expectations against data batches using configured execution engines
  great_expectations/validator/validator.py
- ExecutionEngine (adapter) — Abstracts computation across pandas, Spark, and SQL databases for running metrics and validations
  great_expectations/execution_engine/execution_engine.py
- Expectation (validator) — Defines data quality validation logic and converts to executable metric configurations
  great_expectations/expectations/expectation.py
- Checkpoint (orchestrator) — Orchestrates batch validation workflows across multiple data assets with consistent configuration
  great_expectations/checkpoint/checkpoint.py
- Datasource (adapter) — Provides a unified interface to different data sources and manages data asset discovery
  great_expectations/datasource/fluent/datasource.py
- Store (store) — Persists expectations, validation results, and other GX artifacts to various backends
  great_expectations/data_context/store/store.py
- Renderer (transformer) — Converts validation results into human-readable documentation and reports
  great_expectations/render/renderer/renderer.py
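The ExecutionEngine's adapter role can be illustrated with a minimal strategy sketch. The classes here (`ExecutionEngineSketch`, `InMemoryEngine`, `SqlEngine`) are hypothetical simplifications, not the real GX interfaces:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class ExecutionEngineSketch(ABC):
    """Hypothetical mini-interface: every backend computes the same metric."""
    @abstractmethod
    def null_count(self, values: Sequence) -> int: ...

class InMemoryEngine(ExecutionEngineSketch):
    def null_count(self, values):
        # Computes directly over Python objects, like a pandas backend would.
        return sum(1 for v in values if v is None)

class SqlEngine(ExecutionEngineSketch):
    def null_count(self, values):
        # A real SQL engine would instead emit:
        #   SELECT COUNT(*) FROM table WHERE col IS NULL
        return sum(1 for v in values if v is None)

def expect_no_nulls(engine: ExecutionEngineSketch, values) -> bool:
    # The expectation logic is backend-agnostic; only the engine varies.
    return engine.null_count(values) == 0

print(expect_no_nulls(InMemoryEngine(), [1, 2, None]))  # False
```

This is why expectations written once can run against pandas, Spark, or SQL data: the engine, not the expectation, owns the computation.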
Frequently Asked Questions
What is great_expectations used for?
great-expectations/great_expectations validates data quality through configurable expectations with multi-backend execution. It is an 8-component library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 1728 files.
How is great_expectations architected?
great_expectations is organized into 6 architecture layers: Data Context Layer, Expectation Layer, Execution Engine Layer, Validator Layer, and 2 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through great_expectations?
Data moves through 7 stages: Initialize DataContext → Define Expectations → Load Data Batches → Execute Validations → Aggregate Results → .... Users define expectations through DataContext, which manages datasources that load data into Batches. Validators execute expectations against batches using ExecutionEngines that compute metrics. Results flow to Renderers for documentation and Stores for persistence. Checkpoints orchestrate this pipeline for production workflows. This pipeline design reflects a complex multi-stage processing system.
What technologies does great_expectations use?
The core stack includes Pydantic (Provides type-safe configuration models and validation for DataContext and other core objects), SQLAlchemy (Database abstraction layer for SQL datasources and stores), Pandas (Primary data manipulation engine for in-memory data processing), Apache Spark (Distributed computing engine for large-scale data validation), Jinja2 (Template engine for generating documentation and reports from validation results), Click (Command-line interface framework for GX CLI tools), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does great_expectations have?
great_expectations exhibits 4 data pools (including ExpectationStore and ValidationResultsStore), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle self-correction and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does great_expectations use?
5 design patterns detected: Plugin Architecture, Strategy Pattern, Builder Pattern, Template Method, Observer Pattern.
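The first of these, the plugin architecture, is what lets new expectation types be looked up by name. A minimal sketch of a name-based registry, with a hypothetical decorator and expectation (simplified far below GX's actual registration machinery):

```python
# Minimal plugin-style registry: expectations register themselves under a
# string name and are later resolved purely by that name.
EXPECTATION_REGISTRY = {}

def register_expectation(name):
    def decorator(fn):
        EXPECTATION_REGISTRY[name] = fn
        return fn
    return decorator

@register_expectation("expect_values_positive")
def expect_values_positive(values):
    # Hypothetical expectation: every value must be strictly positive.
    return all(v > 0 for v in values)

# Look up and run an expectation by its registered name, as a plugin
# host would after scanning installed packages.
check = EXPECTATION_REGISTRY["expect_values_positive"]
print(check([1, 2, 3]))  # True
```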
Analyzed on April 19, 2026 by CodeSea. Written by Karolina Sarna.