chroma-core/chroma
Data infrastructure for AI
Manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization
Under the hood, the system uses 4 feedback loops, 5 data pools, and 6 control points to manage its runtime behavior.
It is a 10-component data pipeline spanning 1161 analyzed files, with data flowing through 9 distinct pipeline stages.
How Data Flows Through the System
Documents enter through client libraries, get converted to embeddings, stored in segments with metadata, and queried through distributed vector search. The ChromaDB client creates collections and adds documents with text content. The EmbeddingFunction transforms text into vector embeddings while preserving document metadata. These records flow through the SegmentAPI to LocalSegments for indexing and the LogService for durability. Query vectors follow the same embedding path, then the DistributedExecutor routes them to relevant segments, performs HNSW similarity search, and aggregates ranked results back to the client.
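For orientation, a minimal sketch of that round trip through the public Python client (the collection name and documents are placeholders):

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) for disk

collection = client.create_collection(name="articles")

# Ingestion: documents are validated, batched, embedded, and persisted.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Vector databases index embeddings.", "HNSW trades recall for speed."],
    metadatas=[{"source": "intro"}, {"source": "ann"}],
)

# Query text follows the same embedding path, then distributed HNSW search.
results = collection.query(query_texts=["how are embeddings indexed?"], n_results=2)
print(results["ids"], results["distances"])
```

Each call above maps onto the pipeline stages below.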
- Document ingestion — Client libraries call collection.add() with documents, metadata, and IDs, which get validated and batched by the FastAPI server [Collection → EmbeddingRecord]
- Embedding generation — EmbeddingFunction implementations (Ollama, BM25) convert document text to vector representations with configurable model parameters (a custom implementation is sketched after this list) [EmbeddingRecord → EmbeddingRecord]
- Log persistence — LogService writes operation records to distributed WAL with sequence numbers and replication across cluster nodes [EmbeddingRecord → LogRecord]
- Segment assignment — AssignmentPolicy uses consistent hashing to determine which nodes should store collection segments based on collection ID and cluster topology [Collection → Segment]
- Vector indexing — LocalSegment builds HNSW indices from embedding vectors, maintaining approximate nearest neighbor graph structures in memory [EmbeddingRecord]
- Block storage — BlockManager persists segment data as Arrow-format blocks with delta compression, managing file I/O and transaction isolation [EmbeddingRecord → BlockDelta]
- Query planning — SegmentAPI analyzes query parameters (vector, filters, n_results) and creates execution plans targeting relevant segments [QueryResult]
- Distributed search — DistributedExecutor sends query vectors to assigned segments, performs parallel HNSW searches, and collects distance-ranked results
- Result aggregation — Query executor merges segment results by distance score, applies global k-limit, and formats response with requested fields (documents, metadata, embeddings) [QueryResult → QueryResult]
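The embedding-generation stage is pluggable. Here is a toy sketch against chromadb's documented EmbeddingFunction interface; HashEmbedder is hypothetical and deterministic, standing in for a real model such as Ollama:

```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class HashEmbedder(EmbeddingFunction):
    """Hypothetical stand-in for a real model: maps each document to a
    fixed-size vector derived from its first bytes. Useless for semantics;
    only the plumbing is the point."""

    def __call__(self, input: Documents) -> Embeddings:
        # One embedding per input, in input order: the 1:1 contract the
        # rest of the pipeline relies on.
        return [
            [b / 255.0 for b in doc.encode("utf-8")[:8].ljust(8, b"\0")]
            for doc in input
        ]

# Wire it in at collection creation:
# collection = client.create_collection("docs", embedding_function=HashEmbedder())
```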
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Collection (chromadb/api/types.py) — Python dataclass with id: str, name: str, metadata: Optional[CollectionMetadata], tenant: str, database: str. Created via the API with an embedding function; stores documents with vector embeddings; queried for similarity search.
- EmbeddingRecord (rust/types/src/record.rs) — Rust struct with id: String, embedding: Option<Vec<f32>>, document: Option<String>, metadata: Option<UpdateMetadata>. Generated from user documents during add/upsert operations; persisted in segments; retrieved during queries.
- Segment (chromadb/types.py) — Python dataclass with id: SegmentUUID, type: str, scope: SegmentScope, collection: Optional[CollectionUUID], metadata: Optional[Dict]. Created when collection data is partitioned; assigned to nodes for processing; garbage collected when obsolete.
- QueryResult (chromadb/api/types.py) — Python dataclass with ids: List[List[str]], distances: Optional[List[List[float]]], metadatas: Optional[List[List[Dict]]], documents: Optional[List[List[str]]], embeddings: Optional[List[List[Embedding]]]. Built by query executors from segment results; ranked by distance; returned to the client with requested fields.
- BlockDelta (rust/blockstore/src/arrow/block/delta/types.rs) — Rust enum containing DataRecordDelta, QuantizedClusterDelta, or SpannPostingListDelta with operation type and data changes. Created during segment updates; accumulated in memory; flushed to persistent blocks during compaction.
- LogRecord (rust/log/src/log.rs) — Rust struct with collection_id: CollectionUuid, log_offset: u64, record: Box<OperationRecord>. Written during every collection operation; replicated to the distributed log service; read during segment building and recovery.
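The nested-list layout of QueryResult is easy to misread: the outer lists are indexed by query, and the inner lists are ranked by distance. A short access sketch, reusing the collection from the earlier example:

```python
# One outer entry per query; inner lists are parallel and distance-ranked.
results = collection.query(query_texts=["embedding index"], n_results=3)

for doc_id, distance, meta in zip(
    results["ids"][0], results["distances"][0], results["metadatas"][0]
):
    print(f"{doc_id}: distance={distance:.4f} metadata={meta}")
```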
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- DEFAULT_NUM_PARTITIONS, a constant of 32768, is assumed appropriate for all workloads and memory constraints, regardless of system resources or concurrent access patterns. If this fails: on memory-constrained systems, allocating 32K mutexes wastes significant memory; on high-concurrency systems, hash collisions create bottlenecks and false sharing. (rust/cache/src/async_partitioned_mutex.rs:DEFAULT_NUM_PARTITIONS)
- All embedding functions return embeddings in the same order as the input strings, with exact 1:1 correspondence and no filtering or reordering by the underlying model (a defensive wrapper is sketched after this list). If this fails: an embedding service that internally filters invalid inputs or reorders for batching efficiency misaligns the returned embeddings with document metadata, causing wrong similarity matches. (rust/chroma/src/embed/mod.rs:embed_strs)
- CPU feature detection via compile-time target_feature flags accurately reflects runtime capabilities, and the selected SIMD implementation will be available when the binary executes. If this fails: a binary compiled with AVX512 flags crashes with an illegal instruction on CPUs without AVX512 support; on feature mismatches, performance silently falls back to the generic implementation. (rust/distance/src/lib.rs:target_feature)
- Vector normalization uses a hardcoded epsilon of 1e-32 regardless of vector magnitude, dimensionality, or numerical precision requirements. If this fails: for high-dimensional vectors or very small magnitudes, the 1e-32 epsilon causes numerical instability; in low-precision contexts it is unnecessarily small, wasting computation. (rust/distance/src/lib.rs:normalize)
- Only the RendezvousHashing assignment policy is implemented, although the trait suggests multiple policies should be supported. If this fails: code expecting to handle different assignment strategies hard-fails on any policy configuration other than RendezvousHashing, breaking extensibility promises. (rust/config/src/assignment/mod.rs:AssignmentPolicy)
- The default may_contain implementation, which calls get(), is assumed acceptable for all cache types, ignoring that some cache implementations offer more efficient bloom filters or probabilistic checks. If this fails: cache implementations that could answer existence checks with a cheap probabilistic test instead perform full lookups, degrading performance. (rust/cache/src/lib.rs:may_contain)
- All EmbeddingFunction implementations are thread-safe and can handle concurrent embed_strs calls without coordination or rate limiting. If this fails: embedding services with rate limits or connection pools get overwhelmed by concurrent requests, causing failures or degraded service for all clients. (rust/chroma/src/embed/mod.rs:EmbeddingFunction)
- Benchmark vectors are fixed at 786 dimensions, although the distance functions should work with arbitrary vector sizes. If this fails: the benchmarks only measure performance for 786-dimensional vectors, hiding performance cliffs at other dimensions and giving misleading performance expectations. (rust/distance/benches/distance_metrics.rs:x,y)
- Disk-based cache configurations assume a writable filesystem with sufficient space and proper permissions at the configured path. If this fails: cache initialization silently fails or degrades to memory-only mode when disk space is exhausted or the filesystem is read-only, causing unexpected memory pressure. (rust/cache/src/foyer.rs)
- BM25 tokenization and scoring parameters (the k1 and b values) are treated as universal across all document collections and languages. If this fails: BM25 quality degrades significantly for document types with different length distributions, or for languages whose tokenization differs from what the hardcoded parameters expect. (rust/chroma/src/embed/bm25.rs)
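The embedding-order assumption above is cheap to defend against. A hypothetical wrapper (checked_embed is not part of the codebase) that enforces the 1:1 contract before records flow downstream:

```python
from typing import Callable, List

def checked_embed(
    embed_fn: Callable[[List[str]], List[List[float]]], texts: List[str]
) -> List[List[float]]:
    """Reject any embedding batch that is not aligned 1:1 with its inputs."""
    embeddings = embed_fn(texts)
    if len(embeddings) != len(texts):
        # A service that silently filtered or reordered inputs would
        # misalign embeddings with document metadata downstream.
        raise ValueError(
            f"embedding count {len(embeddings)} != input count {len(texts)}"
        )
    return embeddings
```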
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- SysDB — PostgreSQL database storing collection metadata, tenant information, and segment assignments
- Distributed Log — Write-ahead log ensuring durability and ordering of all collection operations across cluster nodes (a toy single-node analogue follows this list)
- Block storage — Persistent Arrow-format storage for segment data with delta compression and version tracking
- HNSW indices — In-memory approximate nearest neighbor graphs for fast similarity search within segments
- Cluster memberlist — Cluster membership and health status information for distributed coordination
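As a rough mental model of the distributed log pool (not Chroma's actual LogService, which adds replication and quorum on top), a toy single-node write-ahead log:

```python
import json
import os

class TinyWal:
    """Toy append-only log: monotonically increasing offsets, fsync before
    acknowledging, offsets recovered by rereading the file on restart."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "a+", encoding="utf-8")
        self._file.seek(0)
        self._offset = sum(1 for _ in self._file)  # resume after restart

    def append(self, record: dict) -> int:
        offset = self._offset
        self._file.write(json.dumps({"offset": offset, "record": record}) + "\n")
        self._file.flush()
        os.fsync(self._file.fileno())  # durable before the write is acknowledged
        self._offset += 1
        return offset
```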
Feedback Loops
- Garbage Collection (polling, balancing) — Trigger: Timer-based schedule or manual kickoff. Action: GarbageCollector scans version metadata to identify and delete obsolete segments and log entries. Exit: All unreferenced data cleaned up.
- Segment Compaction (auto-scale, balancing) — Trigger: Block delta accumulation exceeds threshold. Action: BlockManager merges delta changes into base Arrow blocks and updates segment files. Exit: Delta count falls below threshold.
- Query Retry (retry, balancing) — Trigger: Network or node failure during distributed query. Action: DistributedExecutor re-routes query to backup segments with exponential backoff. Exit: Successful response or max retry limit.
- Log Replication (convergence, reinforcing) — Trigger: Write operation to primary log node. Action: LogService replicates operation records to follower nodes until consensus achieved. Exit: Quorum acknowledgment received.
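The Query Retry loop above follows a standard pattern. A generic sketch with exponential backoff and jitter; the ConnectionError handling is illustrative, not Chroma's actual error taxonomy:

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5, base_delay: float = 0.1):
    """Re-run a failing operation with exponential backoff and full jitter
    until it succeeds or the retry cap is hit."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exit condition: max retry limit reached
            # Backoff windows grow as 0.1s, 0.2s, 0.4s, ...
            time.sleep(random.uniform(0, base_delay * 2**attempt))
```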
Delays
- Embedding Generation (async-processing; model-dependent, ~50ms–5s) — Document ingestion blocks until vector representations are computed
- Segment Building (batch-window; configurable interval) — New documents are not searchable until the next segment build cycle
- Log Compaction (scheduled-job; background process) — Log storage grows until compaction removes processed entries
- HNSW Index Build (compilation; linear with segment size) — Query performance degrades until index construction completes
Control Points
- CHROMA_API_IMPL (architecture-switch) — Controls: Switches between local SegmentAPI and distributed backend. Default: chromadb.api.segment.SegmentAPI
- IS_PERSISTENT (feature-flag) — Controls: Enables persistent storage versus in-memory only mode. Default: TRUE
- CHROMA_SEGMENT_CACHE_POLICY (runtime-toggle) — Controls: Segment caching strategy (LRU, unbounded, disk-based)
- embedding_function (architecture-switch) — Controls: Choice of embedding model (default, Ollama, BM25, custom)
- HNSW_SPACE (distance-metric) — Controls: Vector distance calculation method (cosine, euclidean, inner product). Default: cosine
- n_results (threshold) — Controls: Maximum number of results returned per query. Default: 10
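A few of these control points as they surface in the Python client; the path and collection name are placeholders:

```python
import chromadb

# IS_PERSISTENT-style switch: PersistentClient stores data on disk instead
# of keeping everything in memory.
client = chromadb.PersistentClient(path="./chroma-data")

# HNSW_SPACE: the distance metric is set per collection via metadata.
collection = client.get_or_create_collection(
    name="articles",
    metadata={"hnsw:space": "cosine"},  # or "l2", "ip"
)

# n_results: the per-query k-limit applied after distributed aggregation.
hits = collection.query(query_texts=["durability"], n_results=10)
```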
Technology Stack
- FastAPI — Provides the HTTP REST API with automatic OpenAPI documentation and async request handling
- Apache Arrow — Columnar storage format for efficient segment persistence and analytics-optimized data layouts
- HNSW — Hierarchical navigable small world algorithm for approximate nearest neighbor vector search (a brute-force reference follows this list)
- Tokio — Async runtime powering all Rust components with cooperative multitasking and I/O
- gRPC/Tonic — Inter-service communication protocol for distributed components with type-safe APIs
- SQLx — Database connection pooling and query execution for PostgreSQL metadata storage
- OpenTelemetry — Distributed tracing and metrics collection across Python and Rust components
- Pydantic — Runtime type validation and serialization for Python API request/response models
- Object storage abstraction — Uniform interface over S3/GCS/local filesystem for segment and log storage backends
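HNSW approximates exact nearest-neighbor search. As a reference for what it approximates, a brute-force cosine search in NumPy (the epsilon guard here is an assumption, not the codebase's actual value):

```python
import numpy as np

def brute_force_cosine(query: np.ndarray, vectors: np.ndarray, k: int):
    """Exact top-k by cosine distance (1 - cosine similarity; smaller is
    closer). HNSW trades a little recall to avoid this O(n) scan."""
    eps = 1e-12  # guard against zero-magnitude vectors
    q = query / (np.linalg.norm(query) + eps)
    v = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + eps)
    distances = 1.0 - v @ q
    top = np.argsort(distances)[:k]
    return top, distances[top]
```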
Key Components
- SegmentAPI (orchestrator) — Main API coordinator that routes collection operations to appropriate segments and manages distributed execution (chromadb/api/segment.py)
- FastAPI (gateway) — HTTP REST server providing collection management, authentication, and tenant isolation endpoints (chromadb/server/fastapi/fastapi.py)
- LocalSegment (processor) — In-memory vector index using the HNSW algorithm for similarity search within a single segment (chromadb/segment/impl/vector/local_hnsw.py)
- BlockManager (store) — Manages persistent Arrow-format blocks with delta compression and transaction isolation (rust/blockstore/src/arrow/blockfile.rs)
- LogService (scheduler) — Distributed write-ahead log ensuring operation ordering and durability across the cluster (rust/log-service/src/lib.rs)
- DistributedExecutor (dispatcher) — Routes query plans to appropriate worker nodes and aggregates results from multiple segments (rust/frontend/src/executor/distributed.rs)
- GarbageCollector (monitor) — Removes obsolete segments, log entries, and storage blocks based on version tracking and reference counting (rust/garbage_collector/src/garbage_collector_component.rs)
- MemberlistProvider (registry) — Maintains cluster topology and node health status for distributed coordination (rust/memberlist/src/memberlist_provider.rs)
- AssignmentPolicy (allocator) — Determines which nodes should handle specific collections and segments using consistent hashing (a rendezvous-hashing sketch follows this list) (rust/config/src/assignment/assignment_policy.rs)
- EmbeddingFunction (transformer) — Converts text documents to vector embeddings using configurable models like Ollama or BM25 (rust/chroma/src/embed/mod.rs)
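For the AssignmentPolicy above, a sketch of rendezvous (highest-random-weight) hashing, the idea behind the RendezvousHashing policy rather than the crate's actual implementation:

```python
import hashlib

def rendezvous_assign(key: str, nodes: list[str]) -> str:
    """Every node gets a deterministic score for the key; the highest score
    wins. Adding or removing a node only moves the keys that node wins."""
    def score(node: str) -> int:
        return int.from_bytes(
            hashlib.sha256(f"{node}:{key}".encode()).digest()[:8], "big"
        )
    return max(nodes, key=score)

# The same collection id always maps to the same node for a given topology.
assert rendezvous_assign("collection-42", ["a", "b", "c"]) == \
       rendezvous_assign("collection-42", ["a", "b", "c"])
```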
Package Structure
- chromadb/ — Python API server and client library providing REST endpoints and embedding management
- rust/ — High-performance storage, indexing, and distributed execution engine
- clients/js/ — JavaScript client library for browser and Node.js environments
- go/ — Go microservices for specific distributed system components
Frequently Asked Questions
What is chroma used for?
chroma manages vector embeddings at scale with distributed search, persistent storage, and collection-based organization. chroma-core/chroma is a 10-component data pipeline written in Rust; data flows through 9 distinct pipeline stages, and the codebase contains 1161 files.
How is chroma architected?
chroma is organized into 5 architecture layers: Client Libraries, API Layer, Execution Engine, Storage Layer, and 1 more. Data flows through 9 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through chroma?
Data moves through 9 stages: Document ingestion → Embedding generation → Log persistence → Segment assignment → Vector indexing → .... Documents enter through client libraries, are converted to embeddings, stored in segments with metadata, and queried through distributed vector search, as described in "How Data Flows Through the System" above. This pipeline design reflects a complex multi-stage processing system.
What technologies does chroma use?
The core stack includes FastAPI (Provides HTTP REST API with automatic OpenAPI documentation and async request handling), Apache Arrow (Columnar storage format for efficient segment persistence and analytics-optimized data layouts), HNSW (Hierarchical navigable small world algorithm for approximate nearest neighbor vector search), Tokio (Async runtime powering all Rust components with cooperative multitasking and I/O), gRPC/Tonic (Inter-service communication protocol for distributed components with type-safe APIs), SQLx (Database connection pooling and query execution for PostgreSQL metadata storage), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does chroma have?
chroma exhibits 5 data pools (including SysDB and the Distributed Log), 4 feedback loops, 6 control points, and 4 delays. The feedback loops handle polling, auto-scaling, retry, and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does chroma use?
6 design patterns detected: Segment-based Partitioning, Write-Ahead Logging, Delta Compression, Consistent Hashing, Pluggable Execution, and 1 more.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.