databendlabs/databend
Data Agent Ready Warehouse: One for Analytics, Search, AI, Python Sandbox — rebuilt from scratch. Unified architecture on your S3.
Cloud-native data warehouse built in Rust with vector search, full-text search, and agent orchestration
Data flows from external sources through the query engine's planning and execution layers, with metadata managed by the distributed meta layer, and results returned via Arrow format
Under the hood, the system uses three feedback loops, three data pools, and three control points to manage its runtime behavior.
Structural Verdict
A 10-component system with 6 connections. 3,733 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
- SQL Input — SQL queries received via Python API or network protocols
- Query Planning — SQL parsed and optimized into execution plans by Planner
- Meta Lookup — Schema and cluster metadata retrieved from meta layer
- Data Execution — Query executed against storage backends with parallel processing
- Result Format — Results formatted as Arrow format or DataBlocks for return
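The five stages above can be sketched as a toy pipeline. None of these functions are Databend's real API; they only model the shape of the flow: SQL text in, formatted rows out.

```python
# Illustrative sketch of the five-stage flow described above.
# All functions here are hypothetical, for illustration only.

def parse_and_plan(sql: str) -> dict:
    """Query Planning: turn SQL text into a (toy) execution plan."""
    return {"sql": sql.strip(), "stages": ["scan", "filter", "project"]}

def lookup_metadata(plan: dict) -> dict:
    """Meta Lookup: attach the schema info the executor would need."""
    plan["schema"] = {"columns": ["id", "name"]}
    return plan

def execute(plan: dict) -> list:
    """Data Execution: pretend to run the plan against storage."""
    return [(1, "alpha"), (2, "beta")]

def format_result(rows: list, schema: dict) -> dict:
    """Result Format: package rows with column names, Arrow-style."""
    return {"columns": schema["columns"], "rows": rows}

def run_query(sql: str) -> dict:
    plan = lookup_metadata(parse_and_plan(sql))
    return format_result(execute(plan), plan["schema"])

result = run_query("SELECT id, name FROM t")
print(result["columns"])  # ['id', 'name']
```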
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Meta Store — Raft-based distributed metadata storage for schema and cluster state
- Query Results Cache — Cached query results and intermediate data
- Object Storage — Object storage for data files and backups
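The cached-results pool above can be sketched as a small, size-bounded LRU cache keyed by SQL text. This is illustrative only, not Databend's implementation.

```python
from collections import OrderedDict

# Minimal LRU-style query-results cache, illustrating the
# cached-results data pool. Hypothetical sketch, not Databend code.

class QueryResultsCache:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, sql: str):
        if sql in self._store:
            self._store.move_to_end(sql)  # mark as recently used
            return self._store[sql]
        return None  # cache miss

    def put(self, sql: str, rows):
        self._store[sql] = rows
        self._store.move_to_end(sql)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = QueryResultsCache(capacity=2)
cache.put("SELECT 1", [(1,)])
cache.put("SELECT 2", [(2,)])
cache.get("SELECT 1")          # touch: now most recently used
cache.put("SELECT 3", [(3,)])  # evicts "SELECT 2"
print(cache.get("SELECT 2"))   # None
```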
Feedback Loops
- Raft Consensus (convergence, balancing) — Trigger: Meta node state changes. Action: Replicate log entries across meta cluster. Exit: Majority consensus reached.
- Query Retry (retry, balancing) — Trigger: Query execution failure. Action: Retry query with backoff. Exit: Success or max retries reached.
- Health Check Polling (polling, balancing) — Trigger: Process startup. Action: Check port availability and service health. Exit: Service ready or timeout.
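The Query Retry loop above follows a common shape: trigger on failure, retry with exponential backoff, exit on success or when retries are exhausted. A minimal sketch (the helper is hypothetical, not Databend's code):

```python
import time

# Retry-with-backoff sketch matching the "Query Retry" loop:
# trigger on failure, back off exponentially, exit on success
# or after max_retries. Hypothetical helper for illustration.

def retry_query(run, max_retries: int = 3, base_delay: float = 0.01):
    for attempt in range(max_retries + 1):
        try:
            return run()
        except Exception:
            if attempt == max_retries:
                raise  # exit: max retries reached
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# Usage: a query that fails twice, then succeeds on the third try.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(retry_query(flaky))  # ok
```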
Delays & Async Processing
- Cluster Formation (eventual-consistency, ~2-5 seconds) — Meta cluster nodes synchronize before accepting queries
- Port Binding Wait (async-processing, ~10 seconds timeout) — Test helper waits for services to bind to ports
- Backup Operations (batch-window, variable with data size) — Cluster backup and restore operations run as batch jobs
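The Port Binding Wait behaves like a polling loop with a deadline: try to connect until the service binds the port, or give up at the timeout. A self-contained sketch (function name and defaults are illustrative, not the actual test helper):

```python
import socket
import time

# Sketch of the "Port Binding Wait": poll a TCP port until a service
# accepts connections, or return False after the timeout.
# Hypothetical helper, not Databend's test helper code.

def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.1) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # service is accepting connections
        except OSError:
            time.sleep(interval)  # not ready yet: poll again
    return False  # timed out: service never became ready

# Usage: bind a listener on an ephemeral port, confirm the wait succeeds.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # OS picks a free port
server.listen(1)
host, port = server.getsockname()
print(wait_for_port(host, port, timeout=1.0))  # True
server.close()
```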
Control Points
- Embedded Data Path (env-var) — Controls: Storage location for embedded Databend instance. Default: .databend
- Build Profile (runtime-toggle) — Controls: Debug vs release binary selection in test helper. Default: null
- Progress Reporting (feature-flag) — Controls: Enable/disable startup progress messages. Default: true
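An env-var control point like Embedded Data Path typically reads the variable if set and falls back to the default otherwise. The variable name `DATABEND_DATA_PATH` below is an assumption for illustration; only the default `.databend` comes from the analysis above.

```python
import os

# Env-var control point sketch: read the variable if present,
# otherwise use the documented default (.databend).
# DATABEND_DATA_PATH is a hypothetical name for illustration.

def embedded_data_path(env=os.environ) -> str:
    return env.get("DATABEND_DATA_PATH", ".databend")

print(embedded_data_path({}))                                  # .databend
print(embedded_data_path({"DATABEND_DATA_PATH": "/tmp/dbd"}))  # /tmp/dbd
```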
Technology Stack
- Rust — Core runtime and query engine
- PyO3 — Python binding generation
- Tokio — Async runtime
- Apache Arrow — Columnar data format
- OpenDAL — Storage abstraction layer
- Python — Test infrastructure and bindings
- TOML — Configuration file format
Key Components
- init_embedded (function) — Initializes embedded Databend instance for Python usage with configuration and global services — src/bendpy/src/lib.rs
- PySessionContext (class) — Python-accessible query execution context for running SQL statements — src/bendpy/src/context.rs
- PyDataFrame (class) — Python wrapper for query results with Arrow format support and display methods — src/bendpy/src/dataframe.rs
- backup (function) — Creates backups of cluster metadata and data to external storage — src/bendsave/src/backup.rs
- restore (function) — Restores Databend cluster from backup manifests — src/bendsave/src/restore.rs
- DatabendCluster (class) — Manages complete Databend clusters with meta and query nodes for testing — scripts/databend_test_helper/src/databend_test_helper/cluster.py
- GlobalServices (service) — Initializes and manages all query engine services and components — src/query/
- MetaCluster (class) — Manages multiple databend-meta processes in a Raft cluster configuration — scripts/databend_test_helper/src/databend_test_helper/meta_cluster.py
- QueryCluster (class) — Manages multiple databend-query processes for distributed query execution — scripts/databend_test_helper/src/databend_test_helper/query_cluster.py
- GlobalInstance (utility) — Singleton pattern for global Databend instance initialization — src/common/base/src/base/singleton_instance.rs
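GlobalInstance's singleton pattern can be sketched in Python. The Rust original lives in src/common/base/src/base/singleton_instance.rs; this is only an analogue of the pattern, not a port of that code.

```python
import threading

# Thread-safe singleton analogue of GlobalInstance: one global slot,
# initialized exactly once, readable from anywhere afterwards.
# Mirrors the pattern only, not the actual Rust implementation.

class GlobalInstance:
    _lock = threading.Lock()
    _value = None

    @classmethod
    def init(cls, value):
        with cls._lock:
            if cls._value is not None:
                raise RuntimeError("GlobalInstance already initialized")
            cls._value = value

    @classmethod
    def get(cls):
        if cls._value is None:
            raise RuntimeError("GlobalInstance not initialized")
        return cls._value

GlobalInstance.init({"config": "embedded"})
print(GlobalInstance.get())  # {'config': 'embedded'}
```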
Sub-Modules
- src/bendpy — Python bindings for embedded Databend execution with SessionContext API
- src/bendsave — Backup and restore CLI tool for cluster data and metadata
- scripts/databend_test_helper — Python library for managing Databend clusters during testing
Configuration
- benchmark/benchmark_cloud.py (python-dataclass) — benchmark_id: str, dataset: str, size: str, cache_size: str, version: str, database: str, tries: int, user: str, +6 more parameters
- benchmark/benchmark_cloud.py (python-dataclass) — date: str, dataset: str, database: str, version: str, warehouse: str, machine: str, tags: List[str], result: List[List[float]], +10 more parameters
- tests/nox/suites/copy/conftest.py (python-dataclass) — conn: object, uniq_name: str
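A dataclass in the style of the benchmark configs above looks like this. Field names echo benchmark/benchmark_cloud.py, but the defaults and the class name are guesses for illustration only.

```python
from dataclasses import dataclass, field, asdict
from typing import List

# Hypothetical dataclass in the style of benchmark/benchmark_cloud.py.
# Field names mirror the report above; defaults are illustrative.

@dataclass
class BenchmarkRun:
    benchmark_id: str
    dataset: str
    database: str
    version: str
    tries: int = 3
    tags: List[str] = field(default_factory=list)

run = BenchmarkRun("bm-001", "hits", "default", "v1.2.0", tags=["cloud"])
print(asdict(run)["tries"])  # 3
```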
Frequently Asked Questions
What is databend used for?
databendlabs/databend is a cloud-native data warehouse built in Rust with vector search, full-text search, and agent orchestration. It is a 10-component system, loosely coupled, with relatively independent components. The codebase contains 3,733 files.
How is databend architected?
databend is organized into 5 architecture layers: Python API, Query Engine, Meta Layer, Common Libraries, and 1 more. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through databend?
Data moves through 5 stages: SQL Input → Query Planning → Meta Lookup → Data Execution → Result Format. Data flows from external sources through the query engine's planning and execution layers, with metadata managed by the distributed meta layer, and results returned via Arrow format. This pipeline design reflects a complex multi-stage processing system.
What technologies does databend use?
The core stack includes Rust (Core runtime and query engine), PyO3 (Python binding generation), Tokio (Async runtime), Apache Arrow (Columnar data format), OpenDAL (Storage abstraction layer), Python (Test infrastructure and bindings), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does databend have?
databend exhibits 3 data pools (Meta Store, Query Results Cache, Object Storage), 3 feedback loops, 3 control points, and 3 delays. The feedback loops handle convergence and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does databend use?
5 design patterns detected: Cloud-Native Architecture, Python Bindings, Workspace Organization, Test Infrastructure, Async Runtime.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.