chroma-core/chroma
Data infrastructure for AI
Manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization
Under the hood, the system uses 4 feedback loops, 5 data pools, and 6 control points to manage its runtime behavior.
It is a 10-component data pipeline spanning 1161 analyzed files, with data flowing through 9 distinct pipeline stages.
How Data Flows Through the System
Documents enter through client libraries, get converted to embeddings, stored in segments with metadata, and queried through distributed vector search. The ChromaDB client creates collections and adds documents with text content. The EmbeddingFunction transforms text into vector embeddings while preserving document metadata. These records flow through the SegmentAPI to LocalSegments for indexing and the LogService for durability. Query vectors follow the same embedding path, then the DistributedExecutor routes them to relevant segments, performs HNSW similarity search, and aggregates ranked results back to the client.
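For orientation, a minimal sketch of that round trip through the public Python client (the collection name and documents are placeholders):

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) for disk

collection = client.create_collection(name="articles")

# Ingestion: documents are validated, batched, embedded, and persisted.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Vector databases index embeddings.", "HNSW trades recall for speed."],
    metadatas=[{"source": "intro"}, {"source": "ann"}],
)

# Query text follows the same embedding path, then distributed HNSW search.
results = collection.query(query_texts=["how are embeddings indexed?"], n_results=2)
print(results["ids"], results["distances"])
```

Each call above maps onto the pipeline stages below.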
- Document ingestion — Client libraries call collection.add() with documents, metadata, and IDs, which get validated and batched by the FastAPI server [Collection → EmbeddingRecord]
- Embedding generation — EmbeddingFunction implementations (Ollama, BM25) convert document text to vector representations with configurable model parameters (a custom implementation is sketched after this list) [EmbeddingRecord → EmbeddingRecord]
- Log persistence — LogService writes operation records to distributed WAL with sequence numbers and replication across cluster nodes [EmbeddingRecord → LogRecord]
- Segment assignment — AssignmentPolicy uses consistent hashing to determine which nodes should store collection segments based on collection ID and cluster topology [Collection → Segment]
- Vector indexing — LocalSegment builds HNSW indices from embedding vectors, maintaining approximate nearest neighbor graph structures in memory [EmbeddingRecord]
- Block storage — BlockManager persists segment data as Arrow-format blocks with delta compression, managing file I/O and transaction isolation [EmbeddingRecord → BlockDelta]
- Query planning — SegmentAPI analyzes query parameters (vector, filters, n_results) and creates execution plans targeting relevant segments [QueryResult]
- Distributed search — DistributedExecutor sends query vectors to assigned segments, performs parallel HNSW searches, and collects distance-ranked results
- Result aggregation — Query executor merges segment results by distance score, applies global k-limit, and formats response with requested fields (documents, metadata, embeddings) [QueryResult → QueryResult]
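The embedding-generation stage is pluggable. Here is a toy sketch against chromadb's documented EmbeddingFunction interface; HashEmbedder is hypothetical and deterministic, standing in for a real model such as Ollama:

```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class HashEmbedder(EmbeddingFunction):
    """Hypothetical stand-in for a real model: maps each document to a
    fixed-size vector derived from its first bytes. Useless for semantics;
    only the plumbing is the point."""

    def __call__(self, input: Documents) -> Embeddings:
        # One embedding per input, in input order: the 1:1 contract the
        # rest of the pipeline relies on.
        return [
            [b / 255.0 for b in doc.encode("utf-8")[:8].ljust(8, b"\0")]
            for doc in input
        ]

# Wire it in at collection creation:
# collection = client.create_collection("docs", embedding_function=HashEmbedder())
```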
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Collection (chromadb/api/types.py) — Python dataclass with id: str, name: str, metadata: Optional[CollectionMetadata], tenant: str, database: str. Created via the API with an embedding function; stores documents with vector embeddings; queried for similarity search.
- EmbeddingRecord (rust/types/src/record.rs) — Rust struct with id: String, embedding: Option<Vec<f32>>, document: Option<String>, metadata: Option<UpdateMetadata>. Generated from user documents during add/upsert operations; persisted in segments; retrieved during queries.
- Segment (chromadb/types.py) — Python dataclass with id: SegmentUUID, type: str, scope: SegmentScope, collection: Optional[CollectionUUID], metadata: Optional[Dict]. Created when collection data is partitioned; assigned to nodes for processing; garbage collected when obsolete.
- QueryResult (chromadb/api/types.py) — Python dataclass with ids: List[List[str]], distances: Optional[List[List[float]]], metadatas: Optional[List[List[Dict]]], documents: Optional[List[List[str]]], embeddings: Optional[List[List[Embedding]]]. Built by query executors from segment results; ranked by distance; returned to the client with requested fields.
- BlockDelta (rust/blockstore/src/arrow/block/delta/types.rs) — Rust enum containing DataRecordDelta, QuantizedClusterDelta, or SpannPostingListDelta with operation type and data changes. Created during segment updates; accumulated in memory; flushed to persistent blocks during compaction.
- LogRecord (rust/log/src/log.rs) — Rust struct with collection_id: CollectionUuid, log_offset: u64, record: Box<OperationRecord>. Written during every collection operation; replicated to the distributed log service; read during segment building and recovery.
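The nested-list layout of QueryResult is easy to misread: the outer lists are indexed by query, and the inner lists are ranked by distance. A short access sketch, reusing the collection from the earlier example:

```python
# One outer entry per query; inner lists are parallel and distance-ranked.
results = collection.query(query_texts=["embedding index"], n_results=3)

for doc_id, distance, meta in zip(
    results["ids"][0], results["distances"][0], results["metadatas"][0]
):
    print(f"{doc_id}: distance={distance:.4f} metadata={meta}")
```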
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- DEFAULT_NUM_PARTITIONS, a constant of 32768, is assumed appropriate for all workloads and memory constraints, regardless of system resources or concurrent access patterns. If this fails: on memory-constrained systems, allocating 32K mutexes wastes significant memory; on high-concurrency systems, hash collisions create bottlenecks and false sharing. (rust/cache/src/async_partitioned_mutex.rs:DEFAULT_NUM_PARTITIONS)
- All embedding functions return embeddings in the same order as the input strings, with exact 1:1 correspondence and no filtering or reordering by the underlying model (a defensive wrapper is sketched after this list). If this fails: an embedding service that internally filters invalid inputs or reorders for batching efficiency misaligns the returned embeddings with document metadata, causing wrong similarity matches. (rust/chroma/src/embed/mod.rs:embed_strs)
- CPU feature detection via compile-time target_feature flags accurately reflects runtime capabilities, and the selected SIMD implementation will be available when the binary executes. If this fails: a binary compiled with AVX512 flags crashes with an illegal instruction on CPUs without AVX512 support; on feature mismatches, performance silently falls back to the generic implementation. (rust/distance/src/lib.rs:target_feature)
- Vector normalization uses a hardcoded epsilon of 1e-32 regardless of vector magnitude, dimensionality, or numerical precision requirements. If this fails: for high-dimensional vectors or very small magnitudes, the 1e-32 epsilon causes numerical instability; in low-precision contexts it is unnecessarily small, wasting computation. (rust/distance/src/lib.rs:normalize)
- Only the RendezvousHashing assignment policy is implemented, although the trait suggests multiple policies should be supported. If this fails: code expecting to handle different assignment strategies hard-fails on any policy configuration other than RendezvousHashing, breaking extensibility promises. (rust/config/src/assignment/mod.rs:AssignmentPolicy)
- The default may_contain implementation, which calls get(), is assumed acceptable for all cache types, ignoring that some cache implementations offer more efficient bloom filters or probabilistic checks. If this fails: cache implementations that could answer existence checks with a cheap probabilistic test instead perform full lookups, degrading performance. (rust/cache/src/lib.rs:may_contain)
- All EmbeddingFunction implementations are thread-safe and can handle concurrent embed_strs calls without coordination or rate limiting. If this fails: embedding services with rate limits or connection pools get overwhelmed by concurrent requests, causing failures or degraded service for all clients. (rust/chroma/src/embed/mod.rs:EmbeddingFunction)
- Benchmark vectors are fixed at 786 dimensions, although the distance functions should work with arbitrary vector sizes. If this fails: the benchmarks only measure performance for 786-dimensional vectors, hiding performance cliffs at other dimensions and giving misleading performance expectations. (rust/distance/benches/distance_metrics.rs:x,y)
- Disk-based cache configurations assume a writable filesystem with sufficient space and proper permissions at the configured path. If this fails: cache initialization silently fails or degrades to memory-only mode when disk space is exhausted or the filesystem is read-only, causing unexpected memory pressure. (rust/cache/src/foyer.rs)
- BM25 tokenization and scoring parameters (the k1 and b values) are treated as universal across all document collections and languages. If this fails: BM25 quality degrades significantly for document types with different length distributions, or for languages whose tokenization differs from what the hardcoded parameters expect. (rust/chroma/src/embed/bm25.rs)
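The embedding-order assumption above is cheap to defend against. A hypothetical wrapper (checked_embed is not part of the codebase) that enforces the 1:1 contract before records flow downstream:

```python
from typing import Callable, List

def checked_embed(
    embed_fn: Callable[[List[str]], List[List[float]]], texts: List[str]
) -> List[List[float]]:
    """Reject any embedding batch that is not aligned 1:1 with its inputs."""
    embeddings = embed_fn(texts)
    if len(embeddings) != len(texts):
        # A service that silently filtered or reordered inputs would
        # misalign embeddings with document metadata downstream.
        raise ValueError(
            f"embedding count {len(embeddings)} != input count {len(texts)}"
        )
    return embeddings
```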
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- SysDB — PostgreSQL database storing collection metadata, tenant information, and segment assignments
- Distributed Log — Write-ahead log ensuring durability and ordering of all collection operations across cluster nodes (a toy single-node analogue follows this list)
- Block storage — Persistent Arrow-format storage for segment data with delta compression and version tracking
- HNSW indices — In-memory approximate nearest neighbor graphs for fast similarity search within segments
- Cluster memberlist — Cluster membership and health status information for distributed coordination
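As a rough mental model of the distributed log pool (not Chroma's actual LogService, which adds replication and quorum on top), a toy single-node write-ahead log:

```python
import json
import os

class TinyWal:
    """Toy append-only log: monotonically increasing offsets, fsync before
    acknowledging, offsets recovered by rereading the file on restart."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "a+", encoding="utf-8")
        self._file.seek(0)
        self._offset = sum(1 for _ in self._file)  # resume after restart

    def append(self, record: dict) -> int:
        offset = self._offset
        self._file.write(json.dumps({"offset": offset, "record": record}) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # durable before the write is acknowledged
        self._offset += 1
        return offset
```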
Feedback Loops
- Garbage Collection (polling, balancing) — Trigger: Timer-based schedule or manual kickoff. Action: GarbageCollector scans version metadata to identify and delete obsolete segments and log entries. Exit: All unreferenced data cleaned up.
- Segment Compaction (auto-scale, balancing) — Trigger: Block delta accumulation exceeds threshold. Action: BlockManager merges delta changes into base Arrow blocks and updates segment files. Exit: Delta count falls below threshold.
- Query Retry (retry, balancing) — Trigger: Network or node failure during distributed query. Action: DistributedExecutor re-routes query to backup segments with exponential backoff. Exit: Successful response or max retry limit.
- Log Replication (convergence, reinforcing) — Trigger: Write operation to primary log node. Action: LogService replicates operation records to follower nodes until consensus achieved. Exit: Quorum acknowledgment received.
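The Query Retry loop above follows a standard pattern. A generic sketch with exponential backoff and jitter; the ConnectionError handling is illustrative, not Chroma's actual error taxonomy:

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 0.1):
    """Re-run a failing operation with exponential backoff and full jitter
    until it succeeds or the retry cap is hit."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exit condition: max retry limit reached
            # Backoff windows grow as 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, base_delay * 2**attempt))
```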
Delays
- Embedding Generation (async-processing; model-dependent, ~50ms–5s) — Document ingestion blocks until vector representations are computed
- Segment Building (batch-window; configurable interval) — New documents are not searchable until the next segment build cycle
- Log Compaction (scheduled-job; background process) — Log storage grows until compaction removes processed entries
- HNSW Index Build (compilation; linear with segment size) — Query performance degrades until index construction completes
Control Points
- CHROMA_API_IMPL (architecture-switch) — Controls: Switches between local SegmentAPI and distributed backend. Default: chromadb.api.segment.SegmentAPI
- IS_PERSISTENT (feature-flag) — Controls: Enables persistent storage versus in-memory only mode. Default: TRUE
- CHROMA_SEGMENT_CACHE_POLICY (runtime-toggle) — Controls: Segment caching strategy (LRU, unbounded, disk-based)
- embedding_function (architecture-switch) — Controls: Choice of embedding model (default, Ollama, BM25, custom)
- HNSW_SPACE (distance-metric) — Controls: Vector distance calculation method (cosine, euclidean, inner product). Default: cosine
- n_results (threshold) — Controls: Maximum number of results returned per query. Default: 10
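A few of these control points as they surface in the Python client; the path and collection name are placeholders:

```python
import chromadb

# IS_PERSISTENT-style switch: PersistentClient stores data on disk instead
# of keeping everything in memory.
client = chromadb.PersistentClient(path="./chroma-data")

# HNSW_SPACE: the distance metric is set per collection via metadata.
collection = client.get_or_create_collection(
    name="articles",
    metadata={"hnsw:space": "cosine"},  # or "l2", "ip"
)

# n_results: the per-query k-limit applied after distributed aggregation.
hits = collection.query(query_texts=["durability"], n_results=10)
```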
Technology Stack
- FastAPI — Provides the HTTP REST API with automatic OpenAPI documentation and async request handling
- Apache Arrow — Columnar storage format for efficient segment persistence and analytics-optimized data layouts
- HNSW — Hierarchical navigable small world algorithm for approximate nearest neighbor vector search (a brute-force reference follows this list)
- Tokio — Async runtime powering all Rust components with cooperative multitasking and I/O
- gRPC/Tonic — Inter-service communication protocol for distributed components with type-safe APIs
- SQLx — Database connection pooling and query execution for PostgreSQL metadata storage
- OpenTelemetry — Distributed tracing and metrics collection across Python and Rust components
- Pydantic — Runtime type validation and serialization for Python API request/response models
- Object storage abstraction — Uniform interface over S3/GCS/local filesystem for segment and log storage backends
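HNSW approximates exact nearest-neighbor search. As a reference for what it approximates, a brute-force cosine search in NumPy (the epsilon guard here is an assumption, not the codebase's actual value):

```python
import numpy as np

def brute_force_cosine(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact top-k by cosine distance (1 - cosine similarity; smaller is
    closer). HNSW trades a little recall to avoid this O(n) scan."""
    eps = 1e-12  # guard against zero-magnitude vectors
    q = query / (np.linalg.norm(query) + eps)
    v = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + eps)
    distances = 1.0 - v @ q
    top = np.argsort(distances)[:k]
    return top, distances[top]
```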
Key Components
- SegmentAPI (orchestrator) — Main API coordinator that routes collection operations to appropriate segments and manages distributed execution (chromadb/api/segment.py)
- FastAPI (gateway) — HTTP REST server providing collection management, authentication, and tenant isolation endpoints (chromadb/server/fastapi/fastapi.py)
- LocalSegment (processor) — In-memory vector index using the HNSW algorithm for similarity search within a single segment (chromadb/segment/impl/vector/local_hnsw.py)
- BlockManager (store) — Manages persistent Arrow-format blocks with delta compression and transaction isolation (rust/blockstore/src/arrow/blockfile.rs)
- LogService (scheduler) — Distributed write-ahead log ensuring operation ordering and durability across the cluster (rust/log-service/src/lib.rs)
- DistributedExecutor (dispatcher) — Routes query plans to appropriate worker nodes and aggregates results from multiple segments (rust/frontend/src/executor/distributed.rs)
- GarbageCollector (monitor) — Removes obsolete segments, log entries, and storage blocks based on version tracking and reference counting (rust/garbage_collector/src/garbage_collector_component.rs)
- MemberlistProvider (registry) — Maintains cluster topology and node health status for distributed coordination (rust/memberlist/src/memberlist_provider.rs)
- AssignmentPolicy (allocator) — Determines which nodes should handle specific collections and segments using consistent hashing (a rendezvous-hashing sketch follows this list) (rust/config/src/assignment/assignment_policy.rs)
- EmbeddingFunction (transformer) — Converts text documents to vector embeddings using configurable models like Ollama or BM25 (rust/chroma/src/embed/mod.rs)
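For the AssignmentPolicy above, a sketch of rendezvous (highest-random-weight) hashing, the idea behind the RendezvousHashing policy rather than the crate's actual implementation:

```python
import hashlib

def rendezvous_assign(key: str, nodes: list[str]) -> str:
    """Every node gets a deterministic score for the key; the highest score
    wins. Adding or removing a node only moves the keys that node wins."""
    def score(node: str) -> int:
        return int.from_bytes(
            hashlib.sha256(f"{node}:{key}".encode()).digest()[:8], "big"
        )
    return max(nodes, key=score)

# The same collection id always maps to the same node for a given topology.
assert rendezvous_assign("collection-42", ["a", "b", "c"]) == \
       rendezvous_assign("collection-42", ["a", "b", "c"])
```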
Package Structure
- chromadb/ — Python API server and client library providing REST endpoints and embedding management
- rust/ — High-performance storage, indexing, and distributed execution engine
- clients/js/ — JavaScript client library for browser and Node.js environments
- go/ — Go microservices for specific distributed system components
Frequently Asked Questions
What is chroma used for?
chroma manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization. chroma-core/chroma is a 10-component data pipeline written in Rust; data flows through 9 distinct pipeline stages, and the codebase contains 1161 files.
How is chroma architected?
chroma is organized into 5 architecture layers: Client Libraries, API Layer, Execution Engine, Storage Layer, and 1 more. Data flows through 9 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through chroma?
Data moves through 9 stages: Document ingestion → Embedding generation → Log persistence → Segment assignment → Vector indexing → .... Documents enter through client libraries, are converted to embeddings, stored in segments with metadata, and queried through distributed vector search, as described in "How Data Flows Through the System" above. This pipeline design reflects a complex multi-stage processing system.
What technologies does chroma use?
The core stack includes FastAPI (Provides HTTP REST API with automatic OpenAPI documentation and async request handling), Apache Arrow (Columnar storage format for efficient segment persistence and analytics-optimized data layouts), HNSW (Hierarchical navigable small world algorithm for approximate nearest neighbor vector search), Tokio (Async runtime powering all Rust components with cooperative multitasking and I/O), gRPC/Tonic (Inter-service communication protocol for distributed components with type-safe APIs), SQLx (Database connection pooling and query execution for PostgreSQL metadata storage), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does chroma have?
chroma exhibits 5 data pools (including SysDB and the Distributed Log), 4 feedback loops, 6 control points, and 4 delays. The feedback loops handle polling, auto-scaling, retry, and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does chroma use?
6 design patterns detected: Segment-based Partitioning, Write-Ahead Logging, Delta Compression, Consistent Hashing, Pluggable Execution, and 1 more.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.