Hidden Assumptions in chroma

12 assumptions this code never checks · 4 critical · spanning Resource, Shape, Environment, Scale, Domain, Temporal, Contract, Ordering

Every codebase relies on things it never checks, and most of them are routine. CodeSea analyzed chroma-core/chroma and picked out the 3 most likely to cause trouble here. The rest are minor; they're under "Show everything".

Worth your attention first

If an embedding service internally filters invalid inputs or reorders for batching efficiency, the returned embeddings get misaligned with document metadata, causing wrong similarity matches
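A count check cannot detect reordering, but it does catch the filtering case before embeddings are paired with the wrong documents. A minimal sketch, assuming a hypothetical helper `pair_embeddings` that is not part of chroma:

```rust
/// Hypothetical guard, not part of chroma: pairs each input with its
/// embedding and refuses to continue when the counts differ.
fn pair_embeddings<'a>(
    inputs: &[&'a str],
    embeddings: Vec<Vec<f32>>,
) -> Result<Vec<(&'a str, Vec<f32>)>, String> {
    if inputs.len() != embeddings.len() {
        return Err(format!(
            "expected {} embeddings, got {}",
            inputs.len(),
            embeddings.len()
        ));
    }
    // Counts match, so positional pairing is at least well-defined.
    Ok(inputs.iter().copied().zip(embeddings).collect())
}
```

Detecting a reordered response would need something the service echoes back per item, such as an index or ID, which this sketch does not assume exists.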

A binary compiled with AVX512 flags crashes with an illegal-instruction error on CPUs without AVX512 support; on other feature mismatches, the code silently falls back to the slower generic implementation
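One common way to avoid the illegal-instruction crash is to choose the kernel at run time instead of at compile time. A minimal sketch, assuming hypothetical names (`Kernel`, `select_kernel`) that are not taken from chroma; it relies on the standard library's `is_x86_feature_detected!` macro and returns the chosen path so the caller can log it:

```rust
/// Hypothetical kernel tag, not chroma's type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kernel {
    Avx512,
    Avx2,
    Generic,
}

/// Picks the widest SIMD path the running CPU reports, falling back to
/// the generic implementation everywhere else.
fn select_kernel() -> Kernel {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Kernel::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return Kernel::Avx2;
        }
    }
    Kernel::Generic
}
```

Logging the returned value once at startup would make a fallback to `Kernel::Generic` visible rather than silent.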

Cache initialization silently fails or degrades to memory-only mode when disk space is exhausted or filesystem is read-only, causing unexpected memory pressure
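A pre-flight write probe is one way to turn this silent degradation into an explicit error. The sketch below is an assumption about how such a check could look, not chroma's cache code; `probe_writable` and the probe file name are hypothetical:

```rust
/// Hypothetical pre-flight check, not part of chroma: verifies that `dir`
/// exists (creating it if needed) and accepts a small write.
fn probe_writable(dir: &std::path::Path) -> std::io::Result<()> {
    std::fs::create_dir_all(dir)?;
    let probe = dir.join(".cache_write_probe");
    // Fails on a read-only filesystem or when no space is left at all.
    std::fs::write(&probe, b"probe")?;
    std::fs::remove_file(&probe)
}
```

A probe this small only proves that the directory is writable right now; it says nothing about whether enough space remains for the cache itself.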

Show everything (9 more)
Resource

DEFAULT_NUM_PARTITIONS constant of 32768 is appropriate for all workloads and memory constraints, regardless of system resources or concurrent access patterns

If this fails: On memory-constrained systems, allocating 32K mutexes wastes significant memory; on high-concurrency systems, hash collisions create bottlenecks and false sharing

rust/cache/src/async_partitioned_mutex.rs:DEFAULT_NUM_PARTITIONS
Scale

Vector normalization uses hardcoded epsilon value of 1e-32 regardless of vector magnitude, dimensionality, or numerical precision requirements

If this fails: For high-dimensional vectors or very small magnitudes, 1e-32 epsilon causes numerical instability; for low precision contexts, the epsilon is unnecessarily small, wasting computation

rust/distance/src/lib.rs:normalize
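To make the role of the epsilon concrete, here is a sketch of a normalization routine that takes the epsilon as a parameter. It is illustrative only: `normalize_with_eps` is a hypothetical name, and chroma's own `normalize` may apply the epsilon differently.

```rust
/// Illustrative only, not chroma's implementation: L2-normalizes `v`,
/// adding `eps` to the norm to avoid dividing by zero.
fn normalize_with_eps(v: &[f32], eps: f32) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt() + eps;
    v.iter().map(|x| x / norm).collect()
}
```

In `f32`, adding 1e-32 to a norm near 1 leaves the value unchanged, so in this sketch the epsilon only has an effect for zero or near-zero vectors.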
Domain

Only RendezvousHashing assignment policy is implemented, but the trait suggests multiple policies should be supported

If this fails: Code expects to handle different assignment strategies but hard-fails on any policy configuration other than RendezvousHashing, breaking extensibility promises

rust/config/src/assignment/mod.rs:AssignmentPolicy
Temporal

Default may_contain implementation using get() is acceptable for all cache types, ignoring that some cache implementations have more efficient bloom filters or probabilistic checks

If this fails: Cache implementations that could answer existence checks with a cheap probabilistic test instead perform a full O(1) or O(log n) lookup through get(), degrading performance for those checks

rust/cache/src/lib.rs:may_contain
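The pattern described here can be sketched with a simplified, synchronous trait. The names below (`Cache`, `MapCache`) are stand-ins and do not reproduce chroma's actual trait; the point is only that an implementation can override the default when it has a cheaper check available.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Simplified stand-in for a cache trait, not chroma's definition.
trait Cache<K, V> {
    fn get(&self, key: &K) -> Option<V>;

    /// Default: a full lookup, which may clone or load the value.
    fn may_contain(&self, key: &K) -> bool {
        self.get(key).is_some()
    }
}

struct MapCache<K, V> {
    inner: HashMap<K, V>,
}

impl<K: Eq + Hash, V: Clone> Cache<K, V> for MapCache<K, V> {
    fn get(&self, key: &K) -> Option<V> {
        self.inner.get(key).cloned()
    }

    /// Override: checks key presence without cloning the value.
    fn may_contain(&self, key: &K) -> bool {
        self.inner.contains_key(key)
    }
}
```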
Contract

All EmbeddingFunction implementations are thread-safe and can handle concurrent embed_strs calls without coordination or rate limiting

If this fails: Embedding services with rate limits or connection pools get overwhelmed by concurrent requests, causing failures or degraded service for all clients

rust/chroma/src/embed/mod.rs:EmbeddingFunction
Shape

Benchmark vectors are fixed at 786 dimensions, but the actual distance functions should work with arbitrary vector sizes

If this fails: Benchmarks only measure performance for 786-dimensional vectors, hiding performance cliffs at different dimensions or providing misleading performance expectations

rust/distance/benches/distance_metrics.rs:x,y
Domain

BM25 tokenization and scoring parameters (k1, b values) are universal across all document collections and languages

If this fails: BM25 performance degrades significantly for document types with different length distributions or languages with different tokenization characteristics than the hardcoded parameters expect

rust/chroma/src/embed/bm25.rs
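For reference, the standard Okapi BM25 term score shows where k1 and b enter the calculation. This is the textbook formula with both parameters passed in explicitly, not a copy of chroma's bm25.rs; the function name is hypothetical.

```rust
/// Textbook Okapi BM25 score for one term in one document.
/// `k1` controls term-frequency saturation; `b` controls how strongly
/// document length is normalized (0.0 disables length normalization).
fn bm25_term_score(tf: f32, doc_len: f32, avg_doc_len: f32, idf: f32, k1: f32, b: f32) -> f32 {
    let length_norm = 1.0 - b + b * (doc_len / avg_doc_len);
    idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
}
```

With b above zero, a document longer than average scores lower for the same term frequency, which is why a fixed b fits some length distributions better than others.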
Resource

Batch size for embed_strs is unlimited and embedding services can handle arbitrarily large batches without memory exhaustion or timeout

If this fails: Large document batches cause embedding service OOM or request timeouts, with no automatic chunking or backoff mechanism to handle the failure gracefully

rust/chroma/src/embed/mod.rs:embed_strs
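Client-side chunking is one way to bound the batch size. The sketch below assumes a hypothetical wrapper `embed_in_chunks` around any embedding call, modeled here as a closure; it does not use chroma's `EmbeddingFunction` trait and leaves out retries and backoff.

```rust
/// Hypothetical wrapper, not part of chroma: splits `inputs` into chunks
/// of at most `max_batch` items and concatenates the results in order.
fn embed_in_chunks<F>(
    inputs: &[&str],
    max_batch: usize,
    mut embed: F,
) -> Result<Vec<Vec<f32>>, String>
where
    F: FnMut(&[&str]) -> Result<Vec<Vec<f32>>, String>,
{
    // `chunks(0)` would panic, so reject it up front.
    if max_batch == 0 {
        return Err("max_batch must be greater than zero".to_string());
    }
    let mut out = Vec::with_capacity(inputs.len());
    for chunk in inputs.chunks(max_batch) {
        // Stop at the first failing chunk and surface its error.
        out.extend(embed(chunk)?);
    }
    Ok(out)
}
```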
Ordering

Registry component initialization and dependency resolution happen in the correct order, with no circular dependencies between configured components

If this fails: Component initialization deadlocks or fails with cryptic errors when services have circular dependencies that aren't detected at configuration time

rust/config/src/registry.rs
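Circular dependencies can be detected at configuration time with a depth-first search over the dependency graph. The sketch below is a generic illustration, not chroma's registry API: `find_cycle`, `visit`, and the map-of-names representation are all assumptions.

```rust
use std::collections::HashMap;

/// DFS step. State values: 1 = on the current path, 2 = fully explored.
fn visit<'a>(
    node: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    state: &mut HashMap<&'a str, u8>,
    path: &mut Vec<&'a str>,
) -> bool {
    match state.get(node).copied() {
        Some(1) => {
            // Reached a node already on the current path: a cycle.
            path.push(node);
            return true;
        }
        Some(_) => return false,
        None => {}
    }
    state.insert(node, 1);
    path.push(node);
    if let Some(children) = deps.get(node) {
        for &child in children {
            if visit(child, deps, state, path) {
                return true;
            }
        }
    }
    path.pop();
    state.insert(node, 2);
    false
}

/// Hypothetical check: returns one dependency cycle as a list of
/// component names (first and last entries are equal), or `None`.
fn find_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Option<Vec<&'a str>> {
    let mut state = HashMap::new();
    let mut names: Vec<&'a str> = deps.keys().copied().collect();
    names.sort(); // deterministic traversal order
    for name in names {
        let mut path = Vec::new();
        if visit(name, deps, &mut state, &mut path) {
            // Trim any lead-in so the result starts where the cycle starts.
            let last = *path.last()?;
            let start = path.iter().position(|n| *n == last)?;
            return Some(path[start..].to_vec());
        }
    }
    None
}
```

Running a check like this before initialization would let a circular configuration fail with a named cycle instead of a deadlock.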

See the full structural analysis of chroma: the pipeline, data models, and system behavior that put these assumptions in context.


Frequently Asked Questions

What does chroma assume that could break in production?

The one most likely to cause trouble: all embedding functions return embeddings in the same order as the input strings, with exact 1:1 correspondence and no filtering or reordering by the underlying model. If this fails: an embedding service that internally filters invalid inputs or reorders for batching efficiency returns embeddings that are misaligned with document metadata, causing wrong similarity matches.

How many hidden assumptions does chroma have?

CodeSea found 12 assumptions chroma relies on but never validates, 4 of them critical, spanning Resource, Shape, Environment, Scale, Domain, Temporal, Contract, Ordering. Most are routine; the analysis flags the 3 most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.