chroma-core/chroma
Data infrastructure for AI
Multi-language vector database infrastructure for AI applications with embedded and distributed modes
Documents flow through embedding generation, multi-modal indexing, and storage, then similarity search retrieves results by combining vector, text, and metadata queries
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
Structural Verdict
A 12-component fullstack with 9 connections. 1156 files analyzed. Well-connected — clear data flow between components.
How Data Flows Through the System
- Ingest — Documents and metadata enter via Client.add() with automatic embedding generation (config: collection.embedding_function, api.batch_size)
- Index — Embeddings indexed in HNSW, documents in full-text index, metadata in inverted indexes (config: index.hnsw_config, index.fts_enabled)
- Store — Data written to Arrow-based BlockStore with S3 persistence and WAL durability (config: storage.s3_config, log.retention_policy)
- Query — Client.query() triggers multi-modal search across vector, text, and metadata indexes (config: query.n_results, query.include_fields)
- Retrieve — Results merged from all indexes and returned with distances/ranks (config: distance.metric, search.merge_strategy)
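The five stages above can be sketched as a toy in-memory pipeline. This is a minimal stand-in, not Chroma's implementation: the hash-based `embed` function is a placeholder for a real EmbeddingFunction, and the brute-force cosine scan stands in for HNSW traversal over Arrow-backed storage.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Toy stand-in for an EmbeddingFunction: hash bytes into a fixed-size vector.
    h = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in h[:dim]]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cosine similarity; 0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

class ToyCollection:
    """Minimal sketch of the add/query halves of the documented 4-function API."""

    def __init__(self):
        self.store = {}  # id -> (document, embedding, metadata)

    def add(self, ids, documents, metadatas=None):
        # Ingest + Index + Store collapsed into one step for illustration.
        metadatas = metadatas or [{} for _ in ids]
        for i, doc, meta in zip(ids, documents, metadatas):
            self.store[i] = (doc, embed(doc), meta)

    def query(self, query_text, n_results=2):
        # Query + Retrieve: brute-force scan ranked by distance.
        q = embed(query_text)
        ranked = sorted(
            (cosine_distance(q, emb), i, doc)
            for i, (doc, emb, _) in self.store.items()
        )
        return ranked[:n_results]  # (distance, id, document) tuples

col = ToyCollection()
col.add(ids=["a", "b"], documents=["hello world", "goodbye world"])
results = col.query("hello world", n_results=1)
```

An exact-match query returns its own document at distance ~0, which mirrors how the Retrieve stage returns results with distances attached.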
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- BlockStore — Column-oriented Arrow data blocks stored in S3
- SysDB — Collection metadata and segment mappings
- Write-ahead log — Durability log for uncommitted operations
- Cache — Hot vector indexes and query results
Feedback Loops
- Compaction Loop (scheduled-job, balancing) — Trigger: Block size thresholds or time intervals. Action: Merge small blocks into larger ones for query efficiency. Exit: Target block size achieved.
- Garbage Collection (scheduled-job, balancing) — Trigger: Version retention policies. Action: Clean up unused data versions and files. Exit: Storage quota maintained.
- Index Rebuild (auto-scale, balancing) — Trigger: Index degradation or data drift. Action: Rebuild HNSW index from current data. Exit: Query performance restored.
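The trigger/action/exit shape of the compaction loop can be sketched as follows. Block sizes and the threshold here are hypothetical; real compaction operates on Arrow blocks in the BlockStore.

```python
def compact(blocks: list[int], target_size: int) -> list[int]:
    """Merge undersized blocks until blocks meet the target size.

    `blocks` holds block sizes (e.g. in records). Exit condition mirrors
    the documented loop: stop once the target block size is achieved.
    """
    merged: list[int] = []
    buffer = 0
    for size in sorted(blocks):
        if size >= target_size:
            merged.append(size)  # already large enough, leave untouched
            continue
        buffer += size  # Action: accumulate small blocks into one
        if buffer >= target_size:
            merged.append(buffer)
            buffer = 0
    if buffer:
        merged.append(buffer)  # leftover partial block survives until next run
    return merged

out = compact([10, 20, 500, 30, 40], target_size=100)
```

Four small blocks merge into one 100-record block while the already-large 500-record block passes through unchanged.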
Delays & Async Processing
- Embedding Generation (async-processing, ~100ms-5s per batch) — Documents unavailable for search until embeddings computed
- Index Sync (eventual-consistency, ~seconds to minutes) — New data may not appear in search results immediately
- S3 Write Latency (async-processing, ~10-100ms) — Data persistence delays
- Cache TTL (cache-ttl, configurable) — Index rebuilds when cache expires
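Because of the index-sync delay, newly added data may not be immediately visible to queries, so client code often polls until a write shows up. A hedged sketch of such a wait loop (the helper name and defaults are hypothetical, not part of Chroma's API):

```python
import time

def wait_until_visible(check, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `check()` until it returns True or the timeout elapses.

    Accommodates the documented eventual-consistency window between a
    write being accepted and appearing in search results.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# With a check that succeeds immediately, the poll returns on the first pass.
ok = wait_until_visible(lambda: True, timeout_s=1.0)
```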
Control Points
- Embedding Function (feature-flag) — Controls: Which embedding model processes text documents. Default: default_ef
- HNSW Parameters (threshold) — Controls: Index construction and search quality vs speed tradeoff. Default: M=16, ef_construction=200
- Batch Size (threshold) — Controls: Number of documents processed in single operation. Default: 1000
- Distance Metric (feature-flag) — Controls: Vector similarity calculation method. Default: cosine
- Compaction Trigger (threshold) — Controls: When to merge data blocks for efficiency. Default: size or time based
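The Batch Size threshold above implies splitting large payloads before submission. A sketch of that split at the documented default of 1000 (the helper name is hypothetical):

```python
def batched(items: list, batch_size: int = 1000) -> list[list]:
    # Split an add() payload into server-sized chunks; the default mirrors
    # the documented Batch Size control point.
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

chunks = batched(list(range(2500)), batch_size=1000)
```

2500 items split into two full batches and one remainder of 500.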
Package Structure
This monorepo contains 5 packages:
- High-performance Rust implementation of Chroma's core database engine with 35+ microservice components
- Python API and server implementation providing the primary user interface and orchestration layer
- JavaScript/TypeScript client library for browser and Node.js environments
- Next-generation JavaScript client with improved performance and features
- Go client library and utilities for Chroma database operations
Technology Stack
- Rust — High-performance storage and indexing engine
- Python — Client API and orchestration layer
- Apache Arrow — Column-oriented data format and zero-copy serialization
- HNSW — Approximate nearest neighbor vector search algorithm
- Tantivy — Full-text search engine
- SQLite/PostgreSQL — Metadata catalog and system database
- S3 — Object storage for data blocks and logs
- Inter-service communication in distributed mode
- FastAPI — HTTP REST API server
- Distributed tracing and metrics
Key Components
- Client (class) — Main Python client providing the 4-function API (add, query, get, delete) (chromadb/__init__.py)
- SegmentAPI (class) — Embedded mode implementation handling collections and segments directly (chromadb/api/segment.py)
- FastAPI (module) — HTTP server providing REST API for client-server mode (chromadb/server/fastapi/app.py)
- Executor (module) — Query execution engine routing to local or distributed workers (rust/frontend/src/executor/mod.rs)
- Worker (service) — Distributed worker process handling query execution and indexing (rust/worker/src/lib.rs)
- BlockStore (service) — Column-oriented storage system for vectors, metadata, and documents (rust/blockstore/src/lib.rs)
- HnswIndex (class) — HNSW approximate nearest neighbor index for vector similarity search (rust/index/src/hnsw/mod.rs)
- FullTextIndex (module) — Tantivy-based full-text search index for document content (rust/index/src/fulltext/mod.rs)
- LogService (service) — Write-ahead log service for durability and replication (rust/log-service/src/lib.rs)
- SysDB (service) — Metadata catalog storing collection definitions and segment mappings (rust/sysdb/src/lib.rs)
- GarbageCollector (service) — Background service cleaning up unused data versions and files (rust/garbage_collector/src/lib.rs)
- EmbeddingFunction (trait) — Abstraction for converting text to embeddings (BM25, Ollama, etc.) (rust/chroma/src/embed/mod.rs)
Sub-Modules
- Performance testing framework for various vector search datasets and algorithms
- Command-line utilities for database administration and operations
- Python and JavaScript bindings for Rust components
- Memory profiling HTTP server for jemalloc heap analysis
Configuration
bandit.yaml (yaml)
- exclude_dirs (array) — default: chromadb/test,bin,build,build,.git,.venv,venv,env,.github,examples,clients/js,.vscode
- tests (array)
- skips (array)
docker-compose.server.example.yml (yaml)
- version (string) — default: 3.9
- networks.net.driver (string) — default: bridge
- services.server.image (string) — default: ghcr.io/chroma-core/chroma:latest
- services.server.environment (array) — default: IS_PERSISTENT=TRUE
- services.server.volumes (array) — default: chroma-data:/chroma/chroma/
- services.server.ports (array) — default: 8000:8000
- services.server.networks (array) — default: net
- volumes.chroma-data.driver (string) — default: local
docker-compose.test-auth.yml (yaml)
- networks.test_net.driver (string) — default: bridge
- services.test_server.build.context (string) — default: .
- services.test_server.build.dockerfile (string) — default: Dockerfile
- services.test_server.volumes (array) — default: chroma-data:/chroma/chroma
- services.test_server.command (string) — default: --workers 1 --host 0.0.0.0 --port 8000 --proxy-headers --log-config chromadb/log_config.yml --timeout-keep-alive 30
- services.test_server.environment (array) — default: ANONYMIZED_TELEMETRY=False, ALLOW_RESET=True, IS_PERSISTENT=True, CHROMA_SERVER_AUTHN_CREDENTIALS_FILE=${CHROMA_SERVER_AUTHN_CREDENTIALS_FILE}, CHROMA_SERVER_AUTHN_CREDENTIALS=${CHROMA_SERVER_AUTHN_CREDENTIALS}, CHROMA_SERVER_AUTHN_PROVIDER=${CHROMA_SERVER_AUTHN_PROVIDER}, CHROMA_AUTH_TOKEN_TRANSPORT_HEADER=${CHROMA_AUTH_TOKEN_TRANSPORT_HEADER}
- services.test_server.ports (array) — default: ${CHROMA_PORT:-8000}:8000
- services.test_server.networks (array) — default: test_net
- +1 more parameter
docker-compose.test.yml (yaml)
- services.test_server.build.context (string) — default: .
- services.test_server.build.dockerfile (string) — default: ${DOCKERFILE:-Dockerfile}
- services.test_server.platform (string) — default: ${PLATFORM:-linux/amd64}
- services.test_server.command (string) — default: --workers 1 --host 0.0.0.0 --port 8000 --proxy-headers --log-config chromadb/log_config.yml --timeout-keep-alive 30
- services.test_server.environment (array) — default: ANONYMIZED_TELEMETRY=False, ALLOW_RESET=True, IS_PERSISTENT=True, CHROMA_API_IMPL=chromadb.api.segment.SegmentAPI
- services.test_server.ports (array) — default: ${CHROMA_PORT:-8000}:8000
- volumes.chroma-data.driver (string) — default: local
Science Pipeline
- Document Ingestion — Text strings converted to embeddings via EmbeddingFunction [(batch_size, text) → (batch_size, embedding_dim)] (rust/chroma/src/embed/mod.rs)
- Vector Indexing — HNSW index construction from embedding vectors [(n_vectors, embedding_dim) → graph structure] (rust/index/src/hnsw/index.rs)
- Block Storage — Column-oriented storage in Arrow format with delta compression [(n_records, n_columns) → compressed blocks] (rust/blockstore/src/arrow/block/)
- Similarity Search — HNSW traversal with distance calculations [(query_embedding_dim,) → (k_results, embedding_dim + metadata)] (rust/index/src/hnsw/search.rs)
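The similarity-search stage's traversal can be illustrated as a greedy neighbor walk over a proximity graph. This is a single-layer toy with Euclidean distance; real HNSW is multi-layer and tuned by the `M` and `ef_construction` parameters noted under Control Points.

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, vectors, entry, query):
    """Walk to the neighbor closest to `query` until no neighbor improves.

    graph:   node id -> list of neighbor node ids
    vectors: node id -> coordinates
    Returns (closest node id, its distance to the query).
    """
    current = entry
    best = dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = dist(vectors[nb], query)
            if d < best:
                best, current, improved = d, nb, True  # greedy step
    return current, best

# Four points on a line, each linked to its immediate neighbors.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
node, d = greedy_search(graph, vectors, entry=0, query=(2.9, 0.0))
```

Starting from node 0, the walk hops 0 → 1 → 2 → 3 and stops at node 3, the point nearest the query.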
Assumptions & Constraints
- [critical] Vector distance functions assume equal-length vectors but no runtime validation enforces this (shape)
- [warning] HNSW index assumes normalized vectors for cosine distance but normalization is optional (value-range)
- [warning] Arrow schema assumes specific column layouts for embeddings (f32 arrays) without schema validation (format)
- [info] Embedding functions assumed to be deterministic and stateless but no enforcement exists (dependency)
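The critical shape assumption above is exactly the kind of gap a runtime guard closes. A hedged sketch of the length validation the source says is missing (this is not Chroma's code, just an illustration of the check):

```python
import math

def cosine_distance_checked(a: list[float], b: list[float]) -> float:
    # Guard against the documented gap: distance functions assume
    # equal-length vectors, but nothing enforces it at runtime.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("cosine distance undefined for zero vectors")
    return 1.0 - dot / (na * nb)

# Identical unit vectors are at distance 0; mismatched lengths now fail
# loudly instead of silently truncating via zip().
same = cosine_distance_checked([1.0, 0.0], [1.0, 0.0])
```

Without the check, Python's `zip` would silently truncate the longer vector, producing a plausible-looking but wrong distance.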
Frequently Asked Questions
What is chroma used for?
chroma-core/chroma provides multi-language vector database infrastructure for AI applications, with embedded and distributed modes. It is a 12-component fullstack written in Rust, well-connected with clear data flow between components. The codebase contains 1156 files.
How is chroma architected?
chroma is organized into 4 architecture layers: Client APIs, Orchestration, Rust Engine, Storage. Well-connected — clear data flow between components. This layered structure enables tight integration between components.
How does data flow through chroma?
Data moves through 5 stages: Ingest → Index → Store → Query → Retrieve. Documents flow through embedding generation, multi-modal indexing, and storage; similarity search then retrieves results by combining vector, text, and metadata queries. This pipeline design reflects a complex multi-stage processing system.
What technologies does chroma use?
The core stack includes Rust (High-performance storage and indexing engine), Python (Client API and orchestration layer), Apache Arrow (Column-oriented data format and zero-copy serialization), HNSW (Approximate nearest neighbor vector search algorithm), Tantivy (Full-text search engine), SQLite/PostgreSQL (Metadata catalog and system database), and 4 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does chroma have?
chroma exhibits 4 data pools (including BlockStore and SysDB), 3 feedback loops, 5 control points, and 4 delays. The feedback loops are driven by scheduled jobs and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does chroma use?
5 design patterns detected: Hybrid Architecture, Arrow-based Storage, Multi-Modal Indexing, Rust-Python Bridge, Component System.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.