chroma-core/chroma

Data infrastructure for AI

27,516 stars · Rust · 10 components

Manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization

Documents enter through client libraries, are converted to embeddings, stored in segments with metadata, and queried through distributed vector search. The ChromaDB client creates collections and adds documents with text content. The EmbeddingFunction transforms text into vector embeddings while preserving document metadata. These records flow through the SegmentAPI to LocalSegments for indexing and to the LogService for durability. Query vectors follow the same embedding path; the DistributedExecutor then routes them to the relevant segments, performs HNSW similarity search, and aggregates ranked results back to the client.

Under the hood, the system uses 4 feedback loops, 5 data pools, and 6 control points to manage its runtime behavior.

A 10-component data pipeline. 1161 files analyzed. Data flows through 9 distinct pipeline stages.

How Data Flows Through the System

  1. Document ingestion — Client libraries call collection.add() with documents, metadata, and IDs, which get validated and batched by the FastAPI server [Collection → EmbeddingRecord]
  2. Embedding generation — EmbeddingFunction implementations (Ollama, BM25) convert document text to vector representations with configurable model parameters [EmbeddingRecord → EmbeddingRecord]
  3. Log persistence — LogService writes operation records to distributed WAL with sequence numbers and replication across cluster nodes [EmbeddingRecord → LogRecord]
  4. Segment assignment — AssignmentPolicy uses consistent hashing to determine which nodes should store collection segments based on collection ID and cluster topology [Collection → Segment]
  5. Vector indexing — LocalSegment builds HNSW indices from embedding vectors, maintaining approximate nearest neighbor graph structures in memory [EmbeddingRecord]
  6. Block storage — BlockManager persists segment data as Arrow-format blocks with delta compression, managing file I/O and transaction isolation [EmbeddingRecord → BlockDelta]
  7. Query planning — SegmentAPI analyzes query parameters (vector, filters, n_results) and creates execution plans targeting relevant segments [QueryResult]
  8. Distributed search — DistributedExecutor sends query vectors to assigned segments, performs parallel HNSW searches, and collects distance-ranked results
  9. Result aggregation — Query executor merges segment results by distance score, applies global k-limit, and formats response with requested fields (documents, metadata, embeddings) [QueryResult → QueryResult]
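
The final stage, result aggregation, can be sketched in a few lines of stdlib Python. The function name and tuple shape here are illustrative, not Chroma's actual types; the point is the streaming k-way merge of per-segment, distance-sorted hits with a global k-limit:

```python
import heapq

def aggregate_results(segment_results, n_results):
    """Merge per-segment hits (each already sorted by ascending distance)
    into a single globally ranked top-k list."""
    # heapq.merge performs a streaming k-way merge over the sorted inputs,
    # so we never materialize more than n_results merged hits.
    merged = heapq.merge(*segment_results, key=lambda hit: hit[0])
    return [hit for _, hit in zip(range(n_results), merged)]

# Two segments return distance-ranked (distance, id) pairs.
seg_a = [(0.10, "doc-3"), (0.42, "doc-1")]
seg_b = [(0.25, "doc-7"), (0.90, "doc-2")]
top = aggregate_results([seg_a, seg_b], n_results=3)
# → [(0.10, 'doc-3'), (0.25, 'doc-7'), (0.42, 'doc-1')]
```

Because each segment's results are already sorted, the global merge costs O(k log s) for s segments rather than a full re-sort.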

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Collection chromadb/api/types.py
Python dataclass with id: str, name: str, metadata: Optional[CollectionMetadata], tenant: str, database: str
Created via API with embedding function, stores documents with vector embeddings, queried for similarity search
EmbeddingRecord rust/types/src/record.rs
Rust struct with id: String, embedding: Option<Vec<f32>>, document: Option<String>, metadata: Option<UpdateMetadata>
Generated from user documents during add/upsert operations, persisted in segments, retrieved during queries
Segment chromadb/types.py
Python dataclass with id: SegmentUUID, type: str, scope: SegmentScope, collection: Optional[CollectionUUID], metadata: Optional[Dict]
Created when collection data is partitioned, assigned to nodes for processing, garbage collected when obsolete
QueryResult chromadb/api/types.py
Python dataclass with ids: List[List[str]], distances: Optional[List[List[float]]], metadatas: Optional[List[List[Dict]]], documents: Optional[List[List[str]]], embeddings: Optional[List[List[Embedding]]]
Built by query executors from segment results, ranked by distance, returned to client with requested fields
BlockDelta rust/blockstore/src/arrow/block/delta/types.rs
Rust enum containing DataRecordDelta, QuantizedClusterDelta, or SpannPostingListDelta with operation type and data changes
Created during segment updates, accumulated in memory, flushed to persistent blocks during compaction
LogRecord rust/log/src/log.rs
Rust struct with collection_id: CollectionUuid, log_offset: u64, record: Box<OperationRecord>
Written during every collection operation, replicated to distributed log service, read during segment building and recovery
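
To make the nested-list contract concrete, here is a simplified Python mirror of the QueryResult shape described above (the embeddings field is omitted for brevity). The outer lists index the queries in a batch; the inner lists index the distance-ranked hits for each query:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class QueryResult:
    """Simplified mirror of the described contract: one outer entry per
    query vector, one inner entry per ranked hit for that query."""
    ids: List[List[str]]
    distances: Optional[List[List[float]]] = None
    metadatas: Optional[List[List[Dict]]] = None
    documents: Optional[List[List[str]]] = None

# A single query returning two hits:
result = QueryResult(ids=[["doc-3", "doc-7"]], distances=[[0.10, 0.25]])
```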

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

warning Resource unguarded

DEFAULT_NUM_PARTITIONS constant of 32768 is appropriate for all workloads and memory constraints, regardless of system resources or concurrent access patterns

If this fails: On memory-constrained systems, allocating 32K mutexes wastes significant memory; on high-concurrency systems, hash collisions create bottlenecks and false sharing

rust/cache/src/async_partitioned_mutex.rs:DEFAULT_NUM_PARTITIONS
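
The pattern in question, a partitioned mutex, can be sketched as follows. The class name and partition count are illustrative (Chroma's default is 32768); the trade-off is the same at any size: more partitions cost more memory, fewer partitions raise the chance that unrelated keys contend on the same lock:

```python
import threading

NUM_PARTITIONS = 32  # illustrative; the real default is 32768

class PartitionedMutex:
    """Spread lock contention across a fixed pool of mutexes keyed by hash."""
    def __init__(self, num_partitions=NUM_PARTITIONS):
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def lock_for(self, key):
        # Keys hashing to the same partition share a lock: correct, but a
        # source of false contention when the pool is too small.
        return self._locks[hash(key) % len(self._locks)]

pm = PartitionedMutex()
with pm.lock_for("collection-42"):
    pass  # critical section guarded by one of the pooled locks
```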
critical Shape unguarded

All embedding functions return embeddings in the same order as input strings, with exact 1:1 correspondence and no filtering or reordering by the underlying model

If this fails: If an embedding service internally filters invalid inputs or reorders for batching efficiency, the returned embeddings get misaligned with document metadata, causing wrong similarity matches

rust/chroma/src/embed/mod.rs:embed_strs
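
A cheap guard for this assumption is a length check at the wrapper boundary. The wrapper below is a sketch, not Chroma's code; it catches count mismatches, though it cannot detect a silent reorder:

```python
def embed_checked(embed_fn, texts):
    """Wrap an embedding function and verify 1:1 input/output alignment,
    since downstream metadata pairing silently assumes it."""
    embeddings = embed_fn(texts)
    if len(embeddings) != len(texts):
        raise ValueError(
            f"embedding count {len(embeddings)} != input count {len(texts)}"
        )
    return embeddings

# A toy embedding function: one-dimensional "vectors" from string length.
fake_embed = lambda texts: [[float(len(t))] for t in texts]
print(embed_checked(fake_embed, ["hi", "chroma"]))  # [[2.0], [6.0]]
```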
critical Environment unguarded

CPU feature detection via compile-time target_feature flags accurately reflects runtime capabilities, and the selected SIMD implementation will be available when the binary executes

If this fails: Binary compiled with AVX512 flags crashes with illegal instruction on CPUs without AVX512 support; performance falls back to generic implementation silently on feature mismatches

rust/distance/src/lib.rs:target_feature
warning Scale unguarded

Vector normalization uses hardcoded epsilon value of 1e-32 regardless of vector magnitude, dimensionality, or numerical precision requirements

If this fails: For high-dimensional vectors or very small magnitudes, 1e-32 epsilon causes numerical instability; for low precision contexts, the epsilon is unnecessarily small, wasting computation

rust/distance/src/lib.rs:normalize
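
A sketch of this normalization pattern (generic, not the actual Rust code) shows the failure mode. When the true norm falls below the epsilon floor, the "normalized" output is nowhere near unit length:

```python
import math

EPS = 1e-32  # the hardcoded floor described above

def normalize(vector, eps=EPS):
    """L2-normalize, clamping the norm to eps to avoid division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

print(normalize([3.0, 4.0]))      # [0.6, 0.8] — a proper unit vector
# A near-zero vector gets divided by eps instead of its true norm,
# producing components with magnitude ~1e-8 rather than unit length:
print(normalize([1e-40, 0.0]))    # [1e-08, 0.0]
```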
warning Domain weakly guarded

Only RendezvousHashing assignment policy is implemented, but the trait suggests multiple policies should be supported

If this fails: Code expects to handle different assignment strategies but hard-fails on any policy configuration other than RendezvousHashing, breaking extensibility promises

rust/config/src/assignment/mod.rs:AssignmentPolicy
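
For readers unfamiliar with the one policy that is implemented, rendezvous (highest-random-weight) hashing can be sketched like this. Names and hash choice are illustrative; the key property is that removing a node only remaps the keys that node owned:

```python
import hashlib

def rendezvous_assign(key, nodes):
    """Rendezvous hashing: every node scores the key with a deterministic
    hash; the highest score wins."""
    def score(node):
        digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=score)

nodes = ["node-a", "node-b", "node-c"]
owner = rendezvous_assign("collection-123", nodes)
# The same key always maps to the same node while that node survives:
assert rendezvous_assign("collection-123", nodes) == owner
```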
info Temporal unguarded

Default may_contain implementation using get() is acceptable for all cache types, ignoring that some cache implementations have more efficient bloom filters or probabilistic checks

If this fails: Cache implementations that could provide O(1) probabilistic checks instead perform expensive O(log n) or O(1) full lookups, degrading performance for existence checks

rust/cache/src/lib.rs:may_contain
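
The shape of this default can be sketched as a base class whose existence check falls back to a full lookup (hypothetical names, mirroring the described trait):

```python
class Cache:
    """Base cache with a pessimistic default: existence check = full lookup."""
    def get(self, key):
        raise NotImplementedError

    def may_contain(self, key):
        # Default: pay for a full get() just to answer yes/no. Subclasses
        # with bloom filters should override this with a cheaper check.
        return self.get(key) is not None

class DictCache(Cache):
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)

c = DictCache()
c.data["k"] = "v"
print(c.may_contain("k"), c.may_contain("missing"))  # True False
```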
warning Contract weakly guarded

All EmbeddingFunction implementations are thread-safe and can handle concurrent embed_strs calls without coordination or rate limiting

If this fails: Embedding services with rate limits or connection pools get overwhelmed by concurrent requests, causing failures or degraded service for all clients

rust/chroma/src/embed/mod.rs:EmbeddingFunction
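
One way to guard this assumption at the call site is a concurrency cap around the embedding function. The wrapper below is a sketch (not Chroma's API) using a semaphore to bound in-flight calls:

```python
import threading

class ThrottledEmbedder:
    """Cap in-flight calls to a wrapped embedding function, protecting
    backends that rate-limit requests or pool connections."""
    def __init__(self, embed_fn, max_concurrent=4):
        self._embed_fn = embed_fn
        self._sem = threading.Semaphore(max_concurrent)

    def embed(self, texts):
        # Blocks when max_concurrent calls are already running.
        with self._sem:
            return self._embed_fn(texts)

throttled = ThrottledEmbedder(lambda texts: [[0.0] for _ in texts],
                              max_concurrent=2)
print(throttled.embed(["a", "b"]))  # [[0.0], [0.0]]
```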
info Shape unguarded

Benchmark vectors are fixed at 786 dimensions, but the actual distance functions should work with arbitrary vector sizes

If this fails: Benchmarks only measure performance for 786-dimensional vectors, hiding performance cliffs at different dimensions or providing misleading performance expectations

rust/distance/benches/distance_metrics.rs:x,y
critical Environment unguarded

Disk-based cache configurations assume writable filesystem with sufficient space and proper permissions at the configured path

If this fails: Cache initialization silently fails or degrades to memory-only mode when disk space is exhausted or filesystem is read-only, causing unexpected memory pressure

rust/cache/src/foyer.rs
warning Domain unguarded

BM25 tokenization and scoring parameters (k1, b values) are universal across all document collections and languages

If this fails: BM25 performance degrades significantly for document types with different length distributions or languages with different tokenization characteristics than the hardcoded parameters expect

rust/chroma/src/embed/bm25.rs
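
To show what k1 and b control, here is a generic BM25 scoring sketch (textbook formula, not Chroma's implementation): k1 dampens term-frequency saturation, and b scales how strongly longer-than-average documents are penalized.

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term for one document with classic BM25."""
    # Inverse document frequency: rare terms score higher.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores length, b=1 fully normalizes.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# A term appearing 3 times in an average-length document:
s = bm25_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
```

Collections with unusual length distributions or languages that tokenize differently shift the effective behavior of these defaults, which is exactly the risk flagged above.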

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

SysDB (database)
PostgreSQL database storing collection metadata, tenant information, and segment assignments
Distributed Log (queue)
Write-ahead log ensuring durability and ordering of all collection operations across cluster nodes
Block Storage (file-store)
Persistent Arrow-format storage for segment data with delta compression and version tracking
HNSW Index (in-memory)
In-memory approximate nearest neighbor graphs for fast similarity search within segments
Member Registry (registry)
Cluster membership and health status information for distributed coordination

Technology Stack

FastAPI (framework)
Provides HTTP REST API with automatic OpenAPI documentation and async request handling
Apache Arrow (serialization)
Columnar storage format for efficient segment persistence and analytics-optimized data layouts
HNSW (library)
Hierarchical navigable small world algorithm for approximate nearest neighbor vector search
Tokio (runtime)
Async runtime powering all Rust components with cooperative multitasking and I/O
gRPC/Tonic (framework)
Inter-service communication protocol for distributed components with type-safe APIs
SQLx (database)
Database connection pooling and query execution for PostgreSQL metadata storage
OpenTelemetry (infra)
Distributed tracing and metrics collection across Python and Rust components
Pydantic (serialization)
Runtime type validation and serialization for Python API request/response models
Object Store (storage)
Abstraction over S3/GCS/local filesystem for segment and log storage backends

Key Components

Package Structure

chromadb (app)
Python API server and client library providing REST endpoints and embedding management
rust-core (library)
High-performance storage, indexing, and distributed execution engine
js-client (library)
JavaScript client library for browser and Node.js environments
go-services (app)
Go microservices for specific distributed system components

Frequently Asked Questions

What is chroma used for?

chroma manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization. chroma-core/chroma is a 10-component data pipeline written in Rust; data flows through 9 distinct pipeline stages, and the codebase contains 1161 files.

How is chroma architected?

chroma is organized into 5 architecture layers: Client Libraries, API Layer, Execution Engine, Storage Layer, and 1 more. Data flows through 9 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through chroma?

Data moves through 9 stages: Document ingestion → Embedding generation → Log persistence → Segment assignment → Vector indexing → .... Documents enter through client libraries, are converted to embeddings, stored in segments with metadata, and queried through distributed vector search. This pipeline design reflects a complex multi-stage processing system.

What technologies does chroma use?

The core stack includes FastAPI (Provides HTTP REST API with automatic OpenAPI documentation and async request handling), Apache Arrow (Columnar storage format for efficient segment persistence and analytics-optimized data layouts), HNSW (Hierarchical navigable small world algorithm for approximate nearest neighbor vector search), Tokio (Async runtime powering all Rust components with cooperative multitasking and I/O), gRPC/Tonic (Inter-service communication protocol for distributed components with type-safe APIs), SQLx (Database connection pooling and query execution for PostgreSQL metadata storage), and 3 more. This broad technology surface reflects a mature project with many integration points.

What system dynamics does chroma have?

chroma exhibits 5 data pools (including SysDB and the Distributed Log), 4 feedback loops, 6 control points, and 4 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does chroma use?

6 design patterns detected: Segment-based Partitioning, Write-Ahead Logging, Delta Compression, Consistent Hashing, Pluggable Execution, and 1 more.

Analyzed on April 20, 2026 by CodeSea.