Hidden Assumptions in spaCy
13 assumptions this code never checks · 5 critical · spanning Environment, Domain, Resource, Contract, Scale, Temporal, Ordering, Shape
Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at explosion/spacy and picked out the 3 most likely to cause trouble here. The rest are minor; they're under "Show everything".
CLI training commands crash with unhelpful errors if GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted
Processing user-generated content or files with unknown encoding causes segmentation faults that terminate the Python process
Training crashes with OOM when dataset size approaches system memory limits, especially on cloud instances with limited RAM
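For the unknown-encoding item above, one defensive option is to decode raw bytes explicitly before any text reaches the pipeline. This is a minimal sketch, not spaCy code: `safe_decode` is a hypothetical helper, and it assumes UTF-8 is the encoding you expect.

```python
def safe_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode strictly first; on failure, replace undecodable bytes with U+FFFD."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Substitute the replacement character rather than passing bad input on.
        return raw.decode(encoding, errors="replace")


clean = safe_decode("café".encode("utf-8"))
damaged = safe_decode(b"caf\xe9")  # a Latin-1 byte that is invalid as UTF-8
```

The resulting `str` is always valid Unicode, so whatever consumes it next never sees malformed input. Whether replacement is acceptable, or the input should be rejected instead, depends on the application.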
Show everything (10 more)
Example objects have aligned predicted and reference Doc objects with matching tokenization - misalignment causes silent gradient corruption
If this fails: Training produces models that perform worse than random on misaligned data, but loss curves look normal so the problem goes undetected
spacy/pipeline/TrainablePipe.update()
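A cheap pre-training check for this kind of mismatch is to confirm that both tokenizations cover the same characters. The sketch below works on plain lists of token strings; `same_surface_text` is a hypothetical helper, and it tests a necessary condition only, not full alignment.

```python
def same_surface_text(predicted_tokens, reference_tokens):
    """Return True if both token lists spell out the same text, ignoring whitespace."""

    def squash(tokens):
        # Join the tokens and drop all whitespace so token boundaries do not matter.
        return "".join("".join(tok.split()) for tok in tokens)

    return squash(predicted_tokens) == squash(reference_tokens)


ok = same_surface_text(["New", "York"], ["New York"])
bad = same_surface_text(["New", "York"], ["New", "Jersey"])
```

With spaCy installed, the token lists could be built from `example.predicted` and `example.reference` using each token's `.text`.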
String store won't exceed 2^32 unique strings - uses 32-bit integer IDs for string interning
If this fails: Processing corpora with more than 2^32 (about 4.3 billion) unique strings causes integer overflow, leading to hash collisions and corrupted vocabulary mappings
spacy/vocab.py:Vocab
Model checkpoints remain valid across training restarts - no version checking for config schema changes or component updates
If this fails: Resuming training from old checkpoints after spaCy updates silently loads incompatible weights, causing training instability or wrong predictions
spacy/cli/train.py:training_loop
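One way to catch a changed config before resuming is to store a fingerprint of the config next to the checkpoint and compare it on restart. The following is a sketch under that assumption; `config_fingerprint` and `check_resume` are hypothetical names, and the config is assumed to be a JSON-serializable dict.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Hash a canonical JSON form of the config so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_resume(saved_fingerprint: str, current_config: dict) -> None:
    """Raise if the current config differs from the one the checkpoint was saved with."""
    if config_fingerprint(current_config) != saved_fingerprint:
        raise ValueError("Config changed since the checkpoint was written")
```

A fingerprint only says that something changed, not what. Storing the full config alongside the checkpoint would allow a readable diff.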
Pipeline components are applied in the exact order specified in config.nlp.pipeline - no dependency resolution or validation
If this fails: Incorrectly ordered components (e.g., NER before tagger) produce degraded results without errors, making debugging pipeline performance issues difficult
spacy/language.py:pipeline
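An ordering check of this kind can be sketched in a few lines. The dependency map below is hypothetical and for illustration only; it is not taken from spaCy. The list of component names could come from `nlp.pipe_names`.

```python
def find_order_problems(pipe_names, requires):
    """Return (component, dependency) pairs where the dependency has not run earlier."""
    problems = []
    seen = set()
    for name in pipe_names:
        for dep in requires.get(name, ()):
            if dep not in seen:
                problems.append((name, dep))
        seen.add(name)
    return problems


# Hypothetical dependency map, for illustration only.
REQUIRES = {"lemmatizer": ["tagger"]}

issues = find_order_problems(["tok2vec", "lemmatizer", "tagger"], REQUIRES)
```

An empty result means every listed dependency appears before the component that needs it.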
Shell command execution inherits safe PATH and environment variables - no sanitization of subprocess execution context
If this fails: Malicious model packages or config files can execute arbitrary shell commands during spacy download or training setup
spacy/cli/_util.py:run_command
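To illustrate what a sanitized execution context could look like, the sketch below runs a command with an explicit allowlist of environment variables and without a shell. It uses only the standard library; `run_with_clean_env` and the allowlist are assumptions for illustration, not how spaCy's `run_command` works.

```python
import os
import subprocess
import sys

# Hypothetical allowlist; adjust to what the child process really needs.
ALLOWED_ENV = ("PATH", "HOME", "LANG", "SYSTEMROOT")


def run_with_clean_env(args, extra_env=None):
    """Run a command without a shell, passing only allowlisted environment variables."""
    env = {key: os.environ[key] for key in ALLOWED_ENV if key in os.environ}
    env.update(extra_env or {})
    return subprocess.run(
        list(args),
        env=env,
        shell=False,          # arguments are passed as a list, never parsed by a shell
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )


os.environ["DEMO_SECRET"] = "do-not-leak"
result = run_with_clean_env(
    [sys.executable, "-c", "import os; print(os.environ.get('DEMO_SECRET', 'unset'))"]
)
```

This narrows what a child process inherits. It does not help if the command itself, or an executable earlier on PATH, is untrusted.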
All Examples in a training batch have compatible tensor shapes after padding - batch creation doesn't validate dimension consistency
If this fails: Mixed sequence lengths or incompatible feature dimensions in batches cause cryptic CUDA/PyTorch errors during forward pass
spacy/training/:batch_formation
Serialized Doc objects won't exceed available disk space or memory during bulk operations - no size estimation or chunking
If this fails: Large corpus processing fills up disk space or exhausts memory during serialization, causing data loss if the process is interrupted
spacy/tokens/docbin.py:DocBin.to_bytes()
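Chunking is the usual way around this: serialize a bounded number of documents at a time instead of the whole corpus. A minimal sketch of the chunking step, where `chunked` is a hypothetical helper and the chunk size is something you would tune:

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


batches = list(chunked(range(5), 2))
```

Each chunk could then go into its own `DocBin` and be written out separately, so an interrupted run loses at most one chunk.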
Matcher patterns reference valid token attributes and POS tags that exist in the loaded model's vocab - no validation against model capabilities
If this fails: Patterns using attributes not supported by the current model fail silently, returning empty matches instead of raising informative errors
spacy/matcher/:pattern_matching
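A simple pre-flight check can at least catch misspelled attribute names before patterns are registered. The key set below is an illustrative subset of token attributes, not the complete list spaCy accepts, and `unknown_pattern_keys` is a hypothetical helper.

```python
# Illustrative subset of attribute names; not the complete list spaCy accepts.
KNOWN_KEYS = {
    "ORTH", "TEXT", "LOWER", "LENGTH", "LEMMA", "POS", "TAG", "DEP",
    "SHAPE", "ENT_TYPE", "IS_ALPHA", "IS_DIGIT", "IS_PUNCT", "OP",
}


def unknown_pattern_keys(pattern):
    """Return the sorted attribute names in a token pattern that are not recognized."""
    return sorted(
        {key for token_spec in pattern for key in token_spec if key.upper() not in KNOWN_KEYS}
    )


typos = unknown_pattern_keys([{"LOWER": "hello"}, {"POSS": "NOUN"}])
```

spaCy's `Matcher` also accepts a `validate=True` argument that checks patterns against a schema. Neither approach confirms that the loaded model actually sets a given attribute.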
Component factory functions registered in the global registry are thread-safe and stateless - no isolation between concurrent model loading
If this fails: Loading multiple models concurrently or using spaCy in multi-threaded applications causes race conditions in component initialization
spacy/util.py:registry
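If concurrent loading is a concern, the caller can serialize it with a lock. The sketch below is an assumption about how an application might guard loading, not a description of spaCy internals; `load_once` is a hypothetical helper and `loader` stands in for something like `spacy.load`.

```python
import threading

_load_lock = threading.Lock()
_loaded = {}


def load_once(name, loader):
    """Load each named model at most once, even when called from several threads."""
    with _load_lock:
        if name not in _loaded:
            # Only one thread at a time reaches the loader.
            _loaded[name] = loader(name)
        return _loaded[name]
```

Holding the lock during the load means other threads wait. That is the point here: loading is serialized, and only later use of the loaded object runs in parallel.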
Model packages are compatible with the currently installed spaCy version - compatibility checks use only major.minor version matching
If this fails: Patch version mismatches between models and spaCy installation can cause subtle behavior changes or performance regressions
spacy/about.py:__version__
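The difference between a major.minor match and an exact match is easy to surface explicitly. This sketch assumes plain dotted version strings; `parse_version` and `compatibility` are hypothetical helpers, not spaCy functions.

```python
import re


def parse_version(version: str):
    """Extract (major, minor, patch) from a dotted version string; patch defaults to 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def compatibility(model_version: str, installed_version: str) -> str:
    """Classify a pair of versions as 'match', 'patch_differs' or 'incompatible'."""
    model = parse_version(model_version)
    installed = parse_version(installed_version)
    if model[:2] != installed[:2]:
        return "incompatible"
    return "match" if model[2] == installed[2] else "patch_differs"
```

Logging the `patch_differs` case at load time would make a later behavior change much easier to trace.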
See the full structural analysis of spaCy: the pipeline, data models, and system behavior that put these assumptions in context.
Full analysis of explosion/spacy →
Frequently Asked Questions
What does spaCy assume that could break in production?
The one most likely to cause trouble: a GPU is available and functional when require_gpu() is called, and no fallback mechanism exists. If this fails, CLI training commands crash with unhelpful errors when GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted.
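A fallback can be added on the calling side. The sketch below wraps whatever function activates the GPU and falls back to CPU if it raises. `activate_device` is a hypothetical helper, and catching `Exception` broadly is an assumption made to keep the example short.

```python
def activate_device(require_gpu, log=print):
    """Try to activate the GPU; report the problem and fall back to CPU on failure."""
    try:
        require_gpu()
    except Exception as err:  # broad on purpose: driver and CUDA errors vary
        log(f"GPU unavailable ({err}); continuing on CPU")
        return "cpu"
    return "gpu"
```

With spaCy installed, this could be called as `activate_device(spacy.require_gpu)`. spaCy also provides `spacy.prefer_gpu()`, which returns a boolean rather than raising. Neither covers running out of GPU memory later, during training.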
How many hidden assumptions does spaCy have?
CodeSea found 13 assumptions spaCy relies on but never validates, 5 of them critical, spanning Environment, Domain, Resource, Contract, Scale, Temporal, Ordering, Shape. Most are routine; the analysis flags the 3 most likely to cause trouble.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.