great-expectations/great_expectations
Always know what to expect from your data.
Validates data quality through configurable expectations with multi-backend execution
Users define expectations through DataContext, which manages datasources that load data into Batches. Validators execute expectations against batches using ExecutionEngines that compute metrics. Results flow to Renderers for documentation and Stores for persistence. Checkpoints orchestrate this pipeline for production workflows.
Under the hood, the system uses 3 feedback loops, 4 data pools, and 4 control points to manage its runtime behavior.
An 8-component library. 1728 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
- Initialize DataContext — AbstractDataContext loads configuration from stores, initializes datasources, and sets up validation infrastructure [DataContextConfig → DataContext]
- Define Expectations — Users create ExpectationConfiguration objects defining data quality rules via expect_* methods on Validator instances
- Load Data Batches — Datasource queries data sources and creates Batch objects containing data and metadata using BatchDefinition [BatchRequest → Batch]
- Execute Validations — Validator.validate() converts expectations to MetricConfiguration, ExecutionEngine computes metrics, results compared to expectation thresholds [ExpectationConfiguration → ValidationResult]
- Aggregate Results — Checkpoint collects ValidationResults from multiple batches and expectations into CheckpointResult [ValidationResult → CheckpointResult]
- Render Documentation — Renderer transforms ValidationResults into HTML reports, data docs, and other human-readable formats [ValidationResult → RenderedContent]
- Persist Results — Store implementations save ValidationResults, ExpectationSuites, and documentation to filesystem, databases, or cloud storage [Serializable objects]
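The pipeline above can be sketched as a minimal, hypothetical validation loop. The class names (`SimpleExpectation`, `ValidationResult`) and the single hard-coded expectation type are illustrative stand-ins, not the real GX API, which carries far more metadata and dispatch machinery:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical stand-ins for ExpectationConfiguration and the validation
# result object; the real GX classes are much richer.
@dataclass
class SimpleExpectation:
    expectation_type: str
    kwargs: Dict[str, Any]

@dataclass
class ValidationResult:
    success: bool
    result: Dict[str, Any]

def validate_batch(batch: List[Dict[str, Any]],
                   expectations: List[SimpleExpectation]) -> List[ValidationResult]:
    """Run each expectation against an in-memory batch of rows."""
    results = []
    for exp in expectations:
        if exp.expectation_type == "expect_column_values_to_not_be_null":
            col = exp.kwargs["column"]
            nulls = sum(1 for row in batch if row.get(col) is None)
            results.append(ValidationResult(success=nulls == 0,
                                            result={"unexpected_count": nulls}))
    return results

batch = [{"id": 1}, {"id": None}, {"id": 3}]
suite = [SimpleExpectation("expect_column_values_to_not_be_null", {"column": "id"})]
print(validate_batch(batch, suite))  # one failing result with unexpected_count == 1
```

In the real library, a Checkpoint would drive this loop across many batches and hand the results to renderers and stores.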
Data Models
The data structures that flow between stages — the contracts that hold the system together.
great_expectations/core/expectation_configuration.py — dict with expectation_type: str, kwargs: Dict[str, Any], meta: Dict[str, Any]
Created by users or auto-generated, stored in context, executed by validators, results rendered to documentation
great_expectations/core/expectation_validation_result.py — dict with success: bool, result: Dict[str, Any], exception_info: Optional[Dict], meta: Dict[str, Any]
Generated by expectation execution, aggregated into checkpoint results, rendered for documentation or stored for monitoring
great_expectations/core/batch.py — object with data: Union[DataFrame, Dataset], batch_definition: BatchDefinition, batch_spec: BatchSpec
Created from datasource queries, passed to execution engines, used by validators to run expectations against specific data subsets
great_expectations/core/metric_domain_types.py — dict with metric_name: str, metric_domain_kwargs: Dict, metric_value_kwargs: Dict
Derived from expectation configurations, executed by metrics engines, results used for expectation validation
great_expectations/data_context/data_context/abstract_data_context.py — object with config: DataContextConfig, datasources: Dict, stores: Dict, checkpoints: Dict
Instantiated at startup, manages all GX resources, persists configuration and validation results
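As a concrete illustration of the first two contracts above, here are approximate serialized shapes written as plain dicts. The field values are invented examples; the real GX objects add identifiers, rendering metadata, and schema versions:

```python
# Approximate serialized shape of an expectation configuration
# (illustrative values, not taken from a real project).
expectation_config = {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {"column": "passenger_count", "min_value": 1, "max_value": 6},
    "meta": {"notes": "Taxi rides carry 1-6 passengers"},
}

# Approximate shape of the matching validation result.
validation_result = {
    "success": True,
    "result": {"element_count": 10000, "unexpected_count": 0},
    "exception_info": None,
    "meta": {},
}

# Renderers pair a result back with its expectation's kwargs when
# producing data docs, so both shapes must stay in sync.
assert set(expectation_config) == {"expectation_type", "kwargs", "meta"}
```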
Hidden Assumptions
Things this code relies on but never validates — the conditions that cause silent failures when the system changes.
useThemeConfig().gxCard exists and has the structure {title: string, description: string, buttons: {primary: {href: string, label: string}, secondary: {href: string, label: string}}}
If this fails: Runtime error 'Cannot read properties of undefined' when gxCard is not configured in theme config, breaking card rendering
docs/docusaurus/src/components/GXCard/index.js:useGXCardConfig
GitHub API at 'https://api.github.com/repos/${owner}/${repository}' returns JSON with stargazers_count and forks_count numeric fields
If this fails: formatCompactNumber crashes with TypeError if API returns null/string for counts, or component displays 'NaN' for invalid numeric values
docs/docusaurus/src/components/GithubNavbarItem/index.js:useEffect
GitHub API is accessible and responds within reasonable time without CORS issues
If this fails: Component renders without star/fork counts and setShowGithubBadgeInfo(false) hides badge info, but no visible error to user about network failure
docs/docusaurus/src/components/GithubNavbarItem/index.js:fetch
Netlify Functions endpoint '/.netlify/functions/createJiraTicketInDocsBoard' exists and accepts the form data structure
If this fails: Feedback submission fails with 404 or 500 errors if endpoint is missing or expects different data shape, silently failing user feedback
docs/docusaurus/src/components/WasThisHelpful/index.js:CREATE_JIRA_TICKET_IN_DOCS_BOARD_ENDPOINT_URL
Form inputs have 'name' attribute matching formData keys (name, email, selectedValue, description)
If this fails: Form state becomes inconsistent if input name attributes don't match, causing submission to send undefined/old values
docs/docusaurus/src/components/WasThisHelpful/index.js:handleChange
Intl.NumberFormat with 'compact' notation is supported in all target browsers
If this fails: TypeError in older browsers that don't support compact notation, breaking the entire navbar component
docs/docusaurus/src/components/GithubNavbarItem/index.js:formatCompactNumber
announcementBar.content is safe HTML string that won't execute malicious scripts
If this fails: XSS vulnerability if content contains malicious JavaScript, allowing arbitrary code execution in user browsers
docs/docusaurus/src/theme/AnnouncementBar/Content/index.js:dangerouslySetInnerHTML
DOM access is available and buttonElement's parent contains code with 'code-block-hide-line' class elements
If this fails: Copy function falls back to originalCode with potentially sensitive hidden lines visible in clipboard if DOM traversal fails
docs/docusaurus/src/theme/CodeBlock/Buttons/CopyButton/index.js:filterHiddenLines
The 'to' prop starts with '/' and useVersionedPath hook correctly resolves versioned paths from current page context
If this fails: Navigation links break with incorrect paths if 'to' prop doesn't start with '/' or version context is unavailable
docs/docusaurus/src/components/VersionedLink/index.js:useVersionedPath
Icon URLs in 'icon' prop are accessible and load successfully
If this fails: Broken image icons display if URLs are invalid, but component continues to function without visual feedback about the failure
docs/docusaurus/src/components/LinkCard/index.js:VersionedLink
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- ExpectationStore — Persists ExpectationSuite definitions for reuse across validation runs
- ValidationResultsStore — Accumulates historical validation results for monitoring and trend analysis
- CheckpointStore — Stores checkpoint configurations for repeatable validation workflows
- Metric cache — Caches computed metrics to avoid redundant computation across similar expectations
Feedback Loops
- Expectation Refinement (self-correction, balancing) — Trigger: Validation failures or unexpected results. Action: Users adjust expectation parameters based on validation outcomes. Exit: Expectations match data reality.
- Profiler Iteration (convergence, balancing) — Trigger: Data profiling to automatically generate expectations. Action: ProfilerRunner iteratively refines expectation parameters based on data samples. Exit: Statistical convergence or iteration limit reached.
- Checkpoint Retry (retry, balancing) — Trigger: Transient failures in data access or computation. Action: Checkpoint retries failed validations with exponential backoff. Exit: Success or maximum retry count reached.
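The Checkpoint Retry loop above can be sketched generically. This is not GX's actual retry implementation, just the standard exponential-backoff shape with the same trigger and exit conditions:

```python
import time

def run_with_retry(task, max_attempts=4, base_delay=0.5):
    """Retry a flaky task with exponential backoff: 0.5s, 1s, 2s, ..."""
    for attempt in range(max_attempts):
        try:
            return task()  # exit condition: success
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exit condition: maximum retry count reached
            time.sleep(base_delay * (2 ** attempt))

# Simulate a transient data-source failure that succeeds on the third call.
calls = {"n": 0}
def flaky_validation():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient data-source failure")
    return "validation succeeded"

print(run_with_retry(flaky_validation, base_delay=0.01))  # "validation succeeded"
```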
Delays
- Lazy Metric Computation (async-processing, ~Variable based on data size) — Metrics computed only when needed by expectations, improving performance for large datasets
- Store Synchronization (eventual-consistency, ~Depends on storage backend) — Stores may have eventual consistency, affecting immediate availability of saved artifacts
- Data Source Connection (warmup, ~1-30 seconds) — Initial connection establishment to databases or cloud storage before data access
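Lazy metric computation, the first delay above, amounts to compute-on-first-use plus caching. A minimal sketch, assuming a hypothetical `resolve_metric` resolver (GX's real metric resolution goes through MetricConfiguration objects and the execution engine):

```python
from functools import lru_cache

# Hypothetical lazy metric resolver: a metric is computed only the first
# time an expectation asks for it, then served from cache.
DATA = [3, 1, 4, 1, 5, 9, 2, 6]
COMPUTED = []  # track which metrics actually ran

@lru_cache(maxsize=None)
def resolve_metric(name: str) -> float:
    COMPUTED.append(name)
    if name == "column.mean":
        return sum(DATA) / len(DATA)
    if name == "column.max":
        return max(DATA)
    raise KeyError(name)

# Two expectations share "column.mean": it is computed once, reused once,
# and "column.max" is never computed because nothing requested it.
assert resolve_metric("column.mean") == resolve_metric("column.mean")
print(COMPUTED)  # ["column.mean"]
```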
Control Points
- progress_bars (feature-flag) — Controls: Whether to display progress indicators during long-running operations. Default: True
- result_format (runtime-toggle) — Controls: Detail level of validation results (BOOLEAN_ONLY, BASIC, COMPLETE, SUMMARY). Default: SUMMARY
- catch_exceptions (runtime-toggle) — Controls: Whether to catch and return exceptions or let them propagate. Default: True
- evaluation_parameters (runtime-toggle) — Controls: Dynamic parameter substitution in expectation configurations. Default: null
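The `result_format` control point determines how much detail survives into a validation result. The level names below match GX's documented options, but the key-filtering logic and the specific keys per level are a simplified, hypothetical illustration:

```python
# Full detail a validation run could produce (illustrative values).
FULL_RESULT = {
    "element_count": 100,
    "unexpected_count": 2,
    "unexpected_percent": 2.0,
    "partial_unexpected_list": [42, 87],
    "unexpected_list": [42, 87],
}

# Hypothetical mapping of result_format level to retained keys.
LEVELS = {
    "BOOLEAN_ONLY": [],
    "BASIC": ["element_count", "unexpected_count", "partial_unexpected_list"],
    "SUMMARY": ["element_count", "unexpected_count", "unexpected_percent",
                "partial_unexpected_list"],
    "COMPLETE": list(FULL_RESULT),
}

def format_result(success: bool, result_format: str) -> dict:
    keys = LEVELS[result_format]
    return {"success": success,
            "result": {k: FULL_RESULT[k] for k in keys}}

print(format_result(False, "BOOLEAN_ONLY"))  # {'success': False, 'result': {}}
```

The trade-off is payload size and privacy versus debuggability: BOOLEAN_ONLY is cheapest to store, while COMPLETE can leak row-level values into results stores.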
Technology Stack
- Pydantic — Provides type-safe configuration models and validation for DataContext and other core objects
- SQLAlchemy — Database abstraction layer for SQL datasources and stores
- Pandas — Primary data manipulation engine for in-memory data processing
- Apache Spark — Distributed computing engine for large-scale data validation
- Jinja2 — Template engine for generating documentation and reports from validation results
- Click — Command-line interface framework for GX CLI tools
- Marshmallow — Schema serialization and deserialization for configuration objects
- pytest — Testing framework for unit and integration tests across the codebase
Key Components
- AbstractDataContext (orchestrator) — Central coordinator that manages datasources, expectations, checkpoints, and validation workflows
  great_expectations/data_context/data_context/abstract_data_context.py
- Validator (executor) — Executes expectations against data batches using configured execution engines
  great_expectations/validator/validator.py
- ExecutionEngine (adapter) — Abstracts computation across pandas, Spark, and SQL databases for running metrics and validations
  great_expectations/execution_engine/execution_engine.py
- Expectation (validator) — Defines data quality validation logic and converts to executable metric configurations
  great_expectations/expectations/expectation.py
- Checkpoint (orchestrator) — Orchestrates batch validation workflows across multiple data assets with consistent configuration
  great_expectations/checkpoint/checkpoint.py
- Datasource (adapter) — Provides a unified interface to different data sources and manages data asset discovery
  great_expectations/datasource/fluent/datasource.py
- Store (store) — Persists expectations, validation results, and other GX artifacts to various backends
  great_expectations/data_context/store/store.py
- Renderer (transformer) — Converts validation results into human-readable documentation and reports
  great_expectations/render/renderer/renderer.py
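The ExecutionEngine's adapter role can be illustrated with a minimal strategy sketch. The classes here (`ExecutionEngineSketch`, `InMemoryEngine`, `SqlEngine`) are hypothetical simplifications, not the real GX interfaces:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class ExecutionEngineSketch(ABC):
    """Hypothetical mini-interface: every backend computes the same metric."""
    @abstractmethod
    def null_count(self, values: Sequence) -> int: ...

class InMemoryEngine(ExecutionEngineSketch):
    def null_count(self, values):
        # Computes directly over Python objects, like a pandas backend would.
        return sum(1 for v in values if v is None)

class SqlEngine(ExecutionEngineSketch):
    def null_count(self, values):
        # A real SQL engine would instead emit:
        #   SELECT COUNT(*) FROM table WHERE col IS NULL
        return sum(1 for v in values if v is None)

def expect_no_nulls(engine: ExecutionEngineSketch, values) -> bool:
    # The expectation logic is backend-agnostic; only the engine varies.
    return engine.null_count(values) == 0

print(expect_no_nulls(InMemoryEngine(), [1, 2, None]))  # False
```

This is why expectations written once can run against pandas, Spark, or SQL data: the engine, not the expectation, owns the computation.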
Frequently Asked Questions
What is great_expectations used for?
great-expectations/great_expectations validates data quality through configurable expectations with multi-backend execution. It is an 8-component library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 1728 files.
How is great_expectations architected?
great_expectations is organized into 6 architecture layers: Data Context Layer, Expectation Layer, Execution Engine Layer, Validator Layer, and 2 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through great_expectations?
Data moves through 7 stages: Initialize DataContext → Define Expectations → Load Data Batches → Execute Validations → Aggregate Results → .... Users define expectations through DataContext, which manages datasources that load data into Batches. Validators execute expectations against batches using ExecutionEngines that compute metrics. Results flow to Renderers for documentation and Stores for persistence. Checkpoints orchestrate this pipeline for production workflows. This pipeline design reflects a complex multi-stage processing system.
What technologies does great_expectations use?
The core stack includes Pydantic (Provides type-safe configuration models and validation for DataContext and other core objects), SQLAlchemy (Database abstraction layer for SQL datasources and stores), Pandas (Primary data manipulation engine for in-memory data processing), Apache Spark (Distributed computing engine for large-scale data validation), Jinja2 (Template engine for generating documentation and reports from validation results), Click (Command-line interface framework for GX CLI tools), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does great_expectations have?
great_expectations exhibits 4 data pools (including ExpectationStore and ValidationResultsStore), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle self-correction and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does great_expectations use?
5 design patterns detected: Plugin Architecture, Strategy Pattern, Builder Pattern, Template Method, Observer Pattern.
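The first of these, the plugin architecture, is what lets new expectation types be looked up by name. A minimal sketch of a name-based registry, with a hypothetical decorator and expectation (simplified far below GX's actual registration machinery):

```python
# Minimal plugin-style registry: expectations register themselves under a
# string name and are later resolved purely by that name.
EXPECTATION_REGISTRY = {}

def register_expectation(name):
    def decorator(fn):
        EXPECTATION_REGISTRY[name] = fn
        return fn
    return decorator

@register_expectation("expect_values_positive")
def expect_values_positive(values):
    # Hypothetical expectation: every value must be strictly positive.
    return all(v > 0 for v in values)

# Look up and run an expectation by its registered name, as a plugin
# host would after scanning installed packages.
check = EXPECTATION_REGISTRY["expect_values_positive"]
print(check([1, 2, 3]))  # True
```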
Analyzed on April 19, 2026 by CodeSea. Written by Karolina Sarna.