databendlabs/databend
Data Agent Ready Warehouse: One for Analytics, Search, AI, Python Sandbox — rebuilt from scratch. Unified architecture on your S3.
Cloud-native data warehouse built in Rust with vector search, full-text search, and agent orchestration
Data flows from external sources through the query engine's planning and execution layers, with metadata managed by the distributed meta layer, and results returned via Arrow format
Under the hood, the system uses three feedback loops, three data pools, and three control points to manage its runtime behavior.
Structural Verdict
A 10-component system with 6 connections. 3,733 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
- SQL Input — SQL queries received via Python API or network protocols
- Query Planning — SQL parsed and optimized into execution plans by Planner
- Meta Lookup — Schema and cluster metadata retrieved from meta layer
- Data Execution — Query executed against storage backends with parallel processing
- Result Format — Results formatted as Arrow format or DataBlocks for return
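The five stages above can be sketched as a toy pipeline. None of these functions are Databend's real API; they only model the shape of the flow: SQL text in, formatted rows out.

```python
# Illustrative sketch of the five-stage flow described above.
# All functions here are hypothetical, for illustration only.

def parse_and_plan(sql: str) -> dict:
    """Query Planning: turn SQL text into a (toy) execution plan."""
    return {"sql": sql.strip(), "stages": ["scan", "filter", "project"]}

def lookup_metadata(plan: dict) -> dict:
    """Meta Lookup: attach the schema info the executor would need."""
    plan["schema"] = {"columns": ["id", "name"]}
    return plan

def execute(plan: dict) -> list:
    """Data Execution: pretend to run the plan against storage."""
    return [(1, "alpha"), (2, "beta")]

def format_result(rows: list, schema: dict) -> dict:
    """Result Format: package rows with column names, Arrow-style."""
    return {"columns": schema["columns"], "rows": rows}

def run_query(sql: str) -> dict:
    plan = lookup_metadata(parse_and_plan(sql))
    return format_result(execute(plan), plan["schema"])

result = run_query("SELECT id, name FROM t")
print(result["columns"])  # ['id', 'name']
```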
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Meta Store — Raft-based distributed metadata storage for schema and cluster state
- Query Results Cache — Cached query results and intermediate data
- Object Storage — Object storage for data files and backups
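The cached-results pool above can be sketched as a small, size-bounded LRU cache keyed by SQL text. This is illustrative only, not Databend's implementation.

```python
from collections import OrderedDict

# Minimal LRU-style query-results cache, illustrating the
# cached-results data pool. Hypothetical sketch, not Databend code.

class QueryResultsCache:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, sql: str):
        if sql in self._store:
            self._store.move_to_end(sql)  # mark as recently used
            return self._store[sql]
        return None  # cache miss

    def put(self, sql: str, rows):
        self._store[sql] = rows
        self._store.move_to_end(sql)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = QueryResultsCache(capacity=2)
cache.put("SELECT 1", [(1,)])
cache.put("SELECT 2", [(2,)])
cache.get("SELECT 1")          # touch: now most recently used
cache.put("SELECT 3", [(3,)])  # evicts "SELECT 2"
print(cache.get("SELECT 2"))   # None
```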
Feedback Loops
- Raft Consensus (convergence, balancing) — Trigger: Meta node state changes. Action: Replicate log entries across meta cluster. Exit: Majority consensus reached.
- Query Retry (retry, balancing) — Trigger: Query execution failure. Action: Retry query with backoff. Exit: Success or max retries reached.
- Health Check Polling (polling, balancing) — Trigger: Process startup. Action: Check port availability and service health. Exit: Service ready or timeout.
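The Query Retry loop above follows a common shape: trigger on failure, retry with exponential backoff, exit on success or when retries are exhausted. A minimal sketch (the helper is hypothetical, not Databend's code):

```python
import time

# Retry-with-backoff sketch matching the "Query Retry" loop:
# trigger on failure, back off exponentially, exit on success
# or after max_retries. Hypothetical helper for illustration.

def retry_query(run, max_retries: int = 3, base_delay: float = 0.01):
    for attempt in range(max_retries + 1):
        try:
            return run()
        except Exception:
            if attempt == max_retries:
                raise  # exit: max retries reached
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Usage: a query that fails twice, then succeeds on the third try.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry_query(flaky))  # ok
```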
Delays & Async Processing
- Cluster Formation (eventual-consistency, ~2-5 seconds) — Meta cluster nodes synchronize before accepting queries
- Port Binding Wait (async-processing, ~10 seconds timeout) — Test helper waits for services to bind to ports
- Backup Operations (batch-window, variable with data size) — Cluster backup and restore operations run as batch jobs
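The Port Binding Wait behaves like a polling loop with a deadline: try to connect until the service binds the port, or give up at the timeout. A self-contained sketch (function name and defaults are illustrative, not the actual test helper):

```python
import socket
import time

# Sketch of the "Port Binding Wait": poll a TCP port until a service
# accepts connections, or return False after the timeout.
# Hypothetical helper, not Databend's test helper code.

def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.1) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # service is accepting connections
        except OSError:
            time.sleep(interval)  # not ready yet: poll again
    return False  # timed out: service never became ready

# Usage: bind a listener on an ephemeral port, confirm the wait succeeds.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # OS picks a free port
server.listen(1)
host, port = server.getsockname()
print(wait_for_port(host, port, timeout=1.0))  # True
server.close()
```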
Control Points
- Embedded Data Path (env-var) — Controls: Storage location for embedded Databend instance. Default: .databend
- Build Profile (runtime-toggle) — Controls: Debug vs release binary selection in test helper. Default: null
- Progress Reporting (feature-flag) — Controls: Enable/disable startup progress messages. Default: true
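An env-var control point like Embedded Data Path typically reads the variable if set and falls back to the default otherwise. The variable name `DATABEND_DATA_PATH` below is an assumption for illustration; only the default `.databend` comes from the analysis above.

```python
import os

# Env-var control point sketch: read the variable if present,
# otherwise use the documented default (.databend).
# DATABEND_DATA_PATH is a hypothetical name for illustration.

def embedded_data_path(env=os.environ) -> str:
    return env.get("DATABEND_DATA_PATH", ".databend")

print(embedded_data_path({}))                                  # .databend
print(embedded_data_path({"DATABEND_DATA_PATH": "/tmp/dbd"}))  # /tmp/dbd
```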
Technology Stack
- Rust — Core runtime and query engine
- PyO3 — Python binding generation
- Tokio — Async runtime
- Apache Arrow — Columnar data format
- OpenDAL — Storage abstraction layer
- Python — Test infrastructure and bindings
- TOML — Configuration file format
Key Components
- init_embedded (function) — Initializes embedded Databend instance for Python usage with configuration and global services — src/bendpy/src/lib.rs
- PySessionContext (class) — Python-accessible query execution context for running SQL statements — src/bendpy/src/context.rs
- PyDataFrame (class) — Python wrapper for query results with Arrow format support and display methods — src/bendpy/src/dataframe.rs
- backup (function) — Creates backups of cluster metadata and data to external storage — src/bendsave/src/backup.rs
- restore (function) — Restores Databend cluster from backup manifests — src/bendsave/src/restore.rs
- DatabendCluster (class) — Manages complete Databend clusters with meta and query nodes for testing — scripts/databend_test_helper/src/databend_test_helper/cluster.py
- GlobalServices (service) — Initializes and manages all query engine services and components — src/query/
- MetaCluster (class) — Manages multiple databend-meta processes in a Raft cluster configuration — scripts/databend_test_helper/src/databend_test_helper/meta_cluster.py
- QueryCluster (class) — Manages multiple databend-query processes for distributed query execution — scripts/databend_test_helper/src/databend_test_helper/query_cluster.py
- GlobalInstance (utility) — Singleton pattern for global Databend instance initialization — src/common/base/src/base/singleton_instance.rs
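GlobalInstance's singleton pattern can be sketched in Python. The Rust original lives in src/common/base/src/base/singleton_instance.rs; this is only an analogue of the pattern, not a port of that code.

```python
import threading

# Thread-safe singleton analogue of GlobalInstance: one global slot,
# initialized exactly once, readable from anywhere afterwards.
# Mirrors the pattern only, not the actual Rust implementation.

class GlobalInstance:
    _lock = threading.Lock()
    _value = None

    @classmethod
    def init(cls, value):
        with cls._lock:
            if cls._value is not None:
                raise RuntimeError("GlobalInstance already initialized")
            cls._value = value

    @classmethod
    def get(cls):
        if cls._value is None:
            raise RuntimeError("GlobalInstance not initialized")
        return cls._value

GlobalInstance.init({"config": "embedded"})
print(GlobalInstance.get())  # {'config': 'embedded'}
```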
Sub-Modules
- src/bendpy — Python bindings for embedded Databend execution with SessionContext API
- src/bendsave — Backup and restore CLI tool for cluster data and metadata
- scripts/databend_test_helper — Python library for managing Databend clusters during testing
Configuration
- benchmark/benchmark_cloud.py (python-dataclass) — benchmark_id: str, dataset: str, size: str, cache_size: str, version: str, database: str, tries: int, user: str, +6 more parameters
- benchmark/benchmark_cloud.py (python-dataclass) — date: str, dataset: str, database: str, version: str, warehouse: str, machine: str, tags: List[str], result: List[List[float]], +10 more parameters
- tests/nox/suites/copy/conftest.py (python-dataclass) — conn: object, uniq_name: str
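A dataclass in the style of the benchmark configs above looks like this. Field names echo benchmark/benchmark_cloud.py, but the defaults and the class name are guesses for illustration only.

```python
from dataclasses import dataclass, field, asdict
from typing import List

# Hypothetical dataclass in the style of benchmark/benchmark_cloud.py.
# Field names mirror the report above; defaults are illustrative.

@dataclass
class BenchmarkRun:
    benchmark_id: str
    dataset: str
    database: str
    version: str
    tries: int = 3
    tags: List[str] = field(default_factory=list)

run = BenchmarkRun("bm-001", "hits", "default", "v1.2.0", tags=["cloud"])
print(asdict(run)["tries"])  # 3
```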
Frequently Asked Questions
What is databend used for?
databendlabs/databend is a cloud-native data warehouse built in Rust with vector search, full-text search, and agent orchestration. It is a 10-component system, loosely coupled, with relatively independent components. The codebase contains 3,733 files.
How is databend architected?
databend is organized into 5 architecture layers: Python API, Query Engine, Meta Layer, Common Libraries, and 1 more. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through databend?
Data moves through 5 stages: SQL Input → Query Planning → Meta Lookup → Data Execution → Result Format. Data flows from external sources through the query engine's planning and execution layers, with metadata managed by the distributed meta layer, and results returned via Arrow format. This pipeline design reflects a complex multi-stage processing system.
What technologies does databend use?
The core stack includes Rust (Core runtime and query engine), PyO3 (Python binding generation), Tokio (Async runtime), Apache Arrow (Columnar data format), OpenDAL (Storage abstraction layer), Python (Test infrastructure and bindings), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does databend have?
databend exhibits 3 data pools (Meta Store, Query Results Cache, Object Storage), 3 feedback loops, 3 control points, and 3 delays. The feedback loops handle convergence and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does databend use?
5 design patterns detected: Cloud-Native Architecture, Python Bindings, Workspace Organization, Test Infrastructure, Async Runtime.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.