Hidden Assumptions in chroma
12 assumptions this code never checks · 4 critical · spanning Resource, Shape, Environment, Scale, Domain, Temporal, Contract, Ordering
Every codebase relies on things it never checks, and most of them are routine. CodeSea analyzed chroma-core/chroma and picked out the 3 assumptions most likely to cause trouble; what goes wrong when each one fails is described first. The rest are minor; they're under "Show everything".
If an embedding service internally filters invalid inputs or reorders for batching efficiency, the returned embeddings get misaligned with document metadata, causing wrong similarity matches
A binary compiled with AVX512 flags crashes with an illegal-instruction fault on CPUs without AVX512 support; on other feature mismatches, performance silently falls back to the generic implementation
Cache initialization silently fails or degrades to memory-only mode when disk space is exhausted or filesystem is read-only, causing unexpected memory pressure
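The first of these, embeddings drifting out of step with their documents, can be partly guarded at the call site. The following is a minimal sketch, not Chroma's API: `pair_embeddings` and `CountMismatch` are hypothetical names introduced for illustration. A length check of this kind catches a service that filters inputs, but it cannot detect a pure reordering.

```rust
/// Hypothetical error type for this sketch; not part of Chroma.
#[derive(Debug, PartialEq)]
pub struct CountMismatch {
    pub expected: usize,
    pub actual: usize,
}

/// Pairs each input document with its embedding, refusing to proceed when
/// the service returned a different number of vectors than it was given.
/// This detects dropped inputs; it does not detect reordering.
pub fn pair_embeddings<'a>(
    docs: &[&'a str],
    embeddings: Vec<Vec<f32>>,
) -> Result<Vec<(&'a str, Vec<f32>)>, CountMismatch> {
    if docs.len() != embeddings.len() {
        return Err(CountMismatch {
            expected: docs.len(),
            actual: embeddings.len(),
        });
    }
    Ok(docs.iter().copied().zip(embeddings).collect())
}
```

Detecting reordering would need the service to echo an identifier per input, which this sketch does not assume is available.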
Show everything (9 more)
DEFAULT_NUM_PARTITIONS constant of 32768 is appropriate for all workloads and memory constraints, regardless of system resources or concurrent access patterns
If this fails: On memory-constrained systems, allocating 32K mutexes wastes significant memory; on high-concurrency systems, hash collisions create bottlenecks and false sharing
rust/cache/src/async_partitioned_mutex.rs:DEFAULT_NUM_PARTITIONS
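One way to avoid a single fixed partition count is to derive it from the expected worker count. This is a sketch under stated assumptions: `partitions_for` and all of its parameters are hypothetical and do not appear in the Chroma source.

```rust
/// Picks a partition count scaled to the number of workers, rounded up to a
/// power of two and clamped to `[min, max]`. All names are illustrative.
pub fn partitions_for(workers: usize, per_worker: usize, min: usize, max: usize) -> usize {
    assert!(min <= max, "min must not exceed max");
    workers
        .saturating_mul(per_worker)
        // Falls back to `max` if rounding up would overflow usize.
        .checked_next_power_of_two()
        .unwrap_or(max)
        .clamp(min, max)
}
```

The `workers` argument could come from `std::thread::available_parallelism()`, with the reported 32768 kept as the upper bound.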
Vector normalization uses hardcoded epsilon value of 1e-32 regardless of vector magnitude, dimensionality, or numerical precision requirements
If this fails: For high-dimensional vectors or very small magnitudes, 1e-32 epsilon causes numerical instability; for low precision contexts, the epsilon is unnecessarily small, wasting computation
rust/distance/src/lib.rs:normalize
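To make the role of the epsilon concrete, here is a minimal sketch that takes it as a parameter. It assumes `f32` vectors and that normalization divides by (norm + epsilon); `normalize_with_eps` is a hypothetical name and is not the signature of the real `normalize`.

```rust
/// Returns a copy of `v` scaled by 1 / (norm + eps).
/// `eps` is caller-supplied in this sketch; the analysis reports a fixed
/// 1e-32 in the real code.
pub fn normalize_with_eps(v: &[f32], eps: f32) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / (norm + eps)).collect()
}
```

In `f32`, adding 1e-32 changes the denominator only when the norm is itself extremely small (below roughly 1e-25), so under these assumptions its practical effect is to turn a zero vector into zeros instead of NaN.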
Only RendezvousHashing assignment policy is implemented, but the trait suggests multiple policies should be supported
If this fails: Code expects to handle different assignment strategies but hard-fails on any policy configuration other than RendezvousHashing, breaking extensibility promises
rust/config/src/assignment/mod.rs:AssignmentPolicy
Default may_contain implementation using get() is acceptable for all cache types, ignoring that some cache implementations have more efficient bloom filters or probabilistic checks
If this fails: Cache implementations that could answer existence checks with a cheap probabilistic test (such as a bloom filter) instead perform a full O(log n) or O(1) lookup that retrieves the value, degrading performance for existence checks
rust/cache/src/lib.rs:may_contain
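The pattern described here, a default `may_contain` built on `get()` that implementations may override, can be sketched as follows. The `Lookup` trait is a simplified, synchronous stand-in invented for this example, not Chroma's cache trait, and the `HashSet` key index stands in for a bloom filter.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a cache trait (hypothetical).
pub trait Lookup {
    fn get(&self, key: &str) -> Option<String>;

    /// Default existence check: performs a full lookup and clones the value.
    fn may_contain(&self, key: &str) -> bool {
        self.get(key).is_some()
    }
}

/// Relies on the default `may_contain`.
pub struct PlainCache {
    pub entries: HashMap<String, String>,
}

impl Lookup for PlainCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }
}

/// Keeps a separate key index and overrides `may_contain` so that an
/// existence check never touches or clones the stored value.
pub struct IndexedCache {
    pub entries: HashMap<String, String>,
    pub keys: HashSet<String>,
}

impl Lookup for IndexedCache {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }

    fn may_contain(&self, key: &str) -> bool {
        self.keys.contains(key)
    }
}
```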
All EmbeddingFunction implementations are thread-safe and can handle concurrent embed_strs calls without coordination or rate limiting
If this fails: Embedding services with rate limits or connection pools get overwhelmed by concurrent requests, causing failures or degraded service for all clients
rust/chroma/src/embed/mod.rs:EmbeddingFunction
Benchmark vectors are fixed at 786 dimensions, but the actual distance functions should work with arbitrary vector sizes
If this fails: Benchmarks only measure performance for 786-dimensional vectors, hiding performance cliffs at different dimensions or providing misleading performance expectations
rust/distance/benches/distance_metrics.rs:x,y
BM25 tokenization and scoring parameters (k1, b values) are universal across all document collections and languages
If this fails: BM25 performance degrades significantly for document types with different length distributions or languages with different tokenization characteristics than the hardcoded parameters expect
rust/chroma/src/embed/bm25.rs
Batch size for embed_strs is unlimited and embedding services can handle arbitrarily large batches without memory exhaustion or timeout
If this fails: Large document batches cause embedding service OOM or request timeouts, with no automatic chunking or backoff mechanism to handle the failure gracefully
rust/chroma/src/embed/mod.rs:embed_strs
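A caller can bound batch size without changing the embedding function itself. The sketch below is an assumption-laden illustration: `embed_in_chunks` is a hypothetical helper, and the `embed` closure stands in for whatever call actually reaches the embedding service.

```rust
/// Splits `docs` into batches of at most `max_batch` items, passes each
/// batch to `embed`, and concatenates the results in input order.
/// Stops at the first batch that returns an error.
pub fn embed_in_chunks<F, E>(
    docs: &[&str],
    max_batch: usize,
    mut embed: F,
) -> Result<Vec<Vec<f32>>, E>
where
    F: FnMut(&[&str]) -> Result<Vec<Vec<f32>>, E>,
{
    // `chunks(0)` would panic, so reject it with a clearer message.
    assert!(max_batch > 0, "max_batch must be non-zero");
    let mut out = Vec::with_capacity(docs.len());
    for batch in docs.chunks(max_batch) {
        out.extend(embed(batch)?);
    }
    Ok(out)
}
```

This addresses only the batch-size limit; retry and backoff on timeouts would have to be layered on separately.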
Registry component initialization and dependency resolution happen in the correct order, with no circular dependencies between configured components
If this fails: Component initialization deadlocks or fails with cryptic errors when services have circular dependencies that aren't detected at configuration time
rust/config/src/registry.rs
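Circular dependencies can be detected before any component is initialized by running a topological sort over the declared dependencies. The sketch below uses Kahn's algorithm; `init_order` and the name-to-dependencies map are hypothetical and do not reflect how the Chroma registry stores its configuration.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Returns an order in which every component comes after its dependencies,
/// or `Err` with the components that are part of, or blocked behind, a cycle.
/// `deps` maps a component name to the names it depends on.
pub fn init_order<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
) -> Result<Vec<String>, Vec<String>> {
    // Count unmet dependencies per component and record reverse edges.
    let mut unmet: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&name, needs) in deps {
        unmet.entry(name).or_insert(0);
        for &dep in needs {
            unmet.entry(dep).or_insert(0);
            *unmet.entry(name).or_insert(0) += 1;
            dependents.entry(dep).or_default().push(name);
        }
    }

    // Start with components that depend on nothing.
    let mut ready: VecDeque<&str> = unmet
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(&name, _)| name)
        .collect();

    let mut order = Vec::new();
    while let Some(name) = ready.pop_front() {
        order.push(name.to_string());
        // Each dependent now has one fewer unmet dependency.
        for &child in dependents.get(name).into_iter().flatten() {
            let n = unmet.get_mut(child).unwrap();
            *n -= 1;
            if *n == 0 {
                ready.push_back(child);
            }
        }
    }

    if order.len() == unmet.len() {
        Ok(order)
    } else {
        Err(unmet
            .into_iter()
            .filter(|(_, n)| *n > 0)
            .map(|(name, _)| name.to_string())
            .collect())
    }
}
```

Running a check like this at configuration time turns a potential deadlock into an error that names the components involved.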
See the full structural analysis of chroma: the pipeline, data models, and system behavior that put these assumptions in context.
Full analysis of chroma-core/chroma →
Frequently Asked Questions
What does chroma assume that could break in production?
The one most likely to cause trouble: all embedding functions return embeddings in the same order as the input strings, with exact 1:1 correspondence and no filtering or reordering by the underlying model. If this fails, for example because an embedding service internally filters invalid inputs or reorders them for batching efficiency, the returned embeddings become misaligned with document metadata, causing wrong similarity matches.
How many hidden assumptions does chroma have?
CodeSea found 12 assumptions chroma relies on but never validates, 4 of them critical, spanning Resource, Shape, Environment, Scale, Domain, Temporal, Contract, and Ordering. Most are routine; the analysis flags the 3 most likely to cause real trouble.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.