Hidden Assumptions in spaCy

13 assumptions this code never checks · 5 critical · spanning Environment, Domain, Resource, Contract, Scale, Temporal, Ordering, Shape

Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at explosion/spacy and picked out the few most likely to cause trouble. The full list is just below.

These 3 are the ones most likely to cause trouble here. The rest are minor; they're under "Show everything".

Worth your attention first

CLI training commands crash with unhelpful errors if GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted
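One way to soften this failure is to treat GPU activation as optional and fall back to CPU with a clear message. The sketch below is illustrative only and is not how spaCy's CLI behaves: `activate_gpu_or_fall_back` is a hypothetical helper, and the activation function is passed in as an argument so the example runs without spaCy installed. In real use you would pass `spacy.require_gpu`.

```python
from typing import Callable


def activate_gpu_or_fall_back(require_gpu: Callable[[], object]) -> bool:
    """Return True if the GPU was activated, False if we fell back to CPU.

    `require_gpu` is any callable that raises when no usable GPU exists.
    It is a hypothetical injection point; in real code, pass spacy.require_gpu.
    """
    try:
        require_gpu()
    except Exception as err:  # driver, CUDA, or allocation problems
        print(f"GPU unavailable ({err!r}); continuing on CPU")
        return False
    return True
```

spaCy also ships `spacy.prefer_gpu()`, which returns a boolean rather than raising when no GPU is found, so a wrapper like this is mainly useful when you want your own log message or a different fallback policy.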

Processing user-generated content or files with unknown encoding causes segmentation faults that terminate the Python process
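Whatever the exact failure mode, a common precaution is to decode untrusted bytes into a valid Python string before any text reaches the pipeline. The following is a minimal sketch under that assumption; `decode_untrusted` and its list of candidate encodings are hypothetical and would need tuning for real data.

```python
from typing import Sequence


def decode_untrusted(raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252")) -> str:
    """Decode bytes of unknown encoding into a str without raising."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

The result is always a well-formed `str`, so downstream code never sees raw bytes of unknown origin. The trade-off is that a wrong guess can produce mojibake instead of an error.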

Training crashes with OOM when dataset size approaches system memory limits, especially on cloud instances with limited RAM
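A general mitigation is to stream examples in fixed-size batches instead of materializing the whole dataset. This sketch shows the idea with a plain generator; `iter_batches` is a hypothetical helper and not part of spaCy's training loop.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` items, reading lazily from `items`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch
```

Only one batch is held in memory at a time, provided `items` is itself lazy (for example, a generator reading from disk). spaCy provides `spacy.util.minibatch` for similar batching over an iterable.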

Show everything (10 more)

Contract

Example objects have aligned predicted and reference Doc objects with matching tokenization - misalignment causes silent gradient corruption

If this fails: Training produces models that perform worse than random on misaligned data, but loss curves look normal so the problem goes undetected

spacy/pipeline/TrainablePipe.update()
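A cheap sanity check before training is to confirm that both tokenizations cover the same characters. The sketch below works on plain lists of token strings so it runs without spaCy; `same_underlying_text` is a hypothetical helper, and it only catches text mismatches, not every kind of misalignment.

```python
from typing import Sequence


def _strip_whitespace(tokens: Sequence[str]) -> str:
    # Join the tokens, then drop every whitespace character.
    return "".join("".join(tokens).split())


def same_underlying_text(predicted: Sequence[str], reference: Sequence[str]) -> bool:
    """True when both tokenizations spell out the same characters in order."""
    return _strip_whitespace(predicted) == _strip_whitespace(reference)
```

With spaCy objects, the token lists could be built as `[t.text for t in example.predicted]` and `[t.text for t in example.reference]`.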

Scale

String store won't exceed 2^32 unique strings - uses 32-bit integer IDs for string interning

If this fails: Processing very large corpora with more than 2^32 (about 4.3 billion) unique strings causes integer overflow, leading to hash collisions and corrupted vocabulary mappings

spacy/vocab.py:Vocab

Temporal

Model checkpoints remain valid across training restarts - no version checking for config schema changes or component updates

If this fails: Resuming training from old checkpoints after spaCy updates silently loads incompatible weights, causing training instability or wrong predictions

spacy/cli/train.py:training_loop
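One way to make this assumption explicit is to store a fingerprint of the config and library version next to each checkpoint and compare it on resume. This is a sketch of the idea, not spaCy's checkpoint format; `fingerprint` and `can_resume` are hypothetical names.

```python
import hashlib
import json


def fingerprint(config: dict, library_version: str) -> str:
    """Hash the config and library version into a stable hex digest."""
    payload = json.dumps(
        {"config": config, "version": library_version},
        sort_keys=True,  # key order must not change the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_resume(saved_fingerprint: str, config: dict, library_version: str) -> bool:
    """True only if the checkpoint was written under the same config and version."""
    return saved_fingerprint == fingerprint(config, library_version)
```

A strict equality check like this is deliberately conservative: any config change blocks the resume, and the caller decides whether to override.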

Ordering

Pipeline components are applied in the exact order specified in config.nlp.pipeline - no dependency resolution or validation

If this fails: Incorrectly ordered components (e.g., NER before tagger) produce degraded results without errors, making debugging pipeline performance issues difficult

spacy/language.py:pipeline
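An ordering check can be sketched as a small function over component names. The `requires` mapping here is a hypothetical, user-supplied table of dependencies; it is not something spaCy defines, and the NER/tagger pairing simply mirrors the example above.

```python
from typing import Dict, List, Sequence


def find_order_violations(
    pipeline: Sequence[str], requires: Dict[str, Sequence[str]]
) -> List[str]:
    """Return human-readable problems; an empty list means the order is fine."""
    position = {name: index for index, name in enumerate(pipeline)}
    problems: List[str] = []
    for name, needed in requires.items():
        if name not in position:
            continue  # component not in this pipeline, nothing to check
        for dependency in needed:
            if dependency not in position:
                problems.append(f"{name} requires {dependency}, which is missing")
            elif position[dependency] > position[name]:
                problems.append(f"{name} runs before {dependency}")
    return problems
```

For a loaded pipeline, `nlp.pipe_names` gives the component order to pass in. spaCy's `nlp.analyze_pipes()` reports related information about what each component assigns and requires.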

Environment

Shell command execution inherits safe PATH and environment variables - no sanitization of subprocess execution context

If this fails: Malicious model packages or config files can execute arbitrary shell commands during spacy download or training setup

spacy/cli/_util.py:run_command
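A more defensive pattern is to run commands without a shell and with an explicit, minimal environment. The sketch below uses only the standard library; `minimal_env` and `run_isolated` are hypothetical helpers and do not describe what `run_command` does.

```python
import os
import subprocess
import sys
from typing import Dict, List


def minimal_env(path: str = os.defpath) -> Dict[str, str]:
    """Build an environment containing only a trusted PATH.

    SYSTEMROOT is carried over when present because Windows processes
    generally need it to start.
    """
    env = {"PATH": path}
    if "SYSTEMROOT" in os.environ:
        env["SYSTEMROOT"] = os.environ["SYSTEMROOT"]
    return env


def run_isolated(args: List[str], env: Dict[str, str]) -> str:
    """Run `args` as a list (no shell interpolation) and return its stdout."""
    result = subprocess.run(
        args,
        env=env,  # nothing is inherited beyond what the caller passes
        shell=False,
        check=True,  # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the command as a list avoids shell parsing of untrusted strings, and the explicit `env` keeps secrets and a tampered PATH out of the child process.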

Shape

All Examples in a training batch have compatible tensor shapes after padding - batch creation doesn't validate dimension consistency

If this fails: Mixed sequence lengths or incompatible feature dimensions in batches cause cryptic CUDA/PyTorch errors during forward pass

spacy/training/:batch_formation
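A pre-flight check can turn a cryptic backend error into a message that names the offending row. This sketch uses nested Python lists in place of tensors; `check_feature_width` is a hypothetical helper. Sequence lengths may differ, since padding handles that, but every token vector must have the same width.

```python
from typing import Optional, Sequence


def check_feature_width(batch: Sequence[Sequence[Sequence[float]]]) -> int:
    """Return the shared feature width, or raise ValueError naming the bad row."""
    width: Optional[int] = None
    for seq_index, sequence in enumerate(batch):
        for tok_index, vector in enumerate(sequence):
            if width is None:
                width = len(vector)  # first vector sets the expected width
            elif len(vector) != width:
                raise ValueError(
                    f"sequence {seq_index}, token {tok_index}: "
                    f"width {len(vector)} != expected {width}"
                )
    if width is None:
        raise ValueError("batch contains no token vectors")
    return width
```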

Resource

Serialized Doc objects won't exceed available disk space or memory during bulk operations - no size estimation or chunking

If this fails: Large corpus processing fills up disk space or exhausts memory during serialization, causing data loss if the process is interrupted

spacy/tokens/docbin.py:DocBin.to_bytes()

Contract

Matcher patterns reference valid token attributes and POS tags that exist in the loaded model's vocab - no validation against model capabilities

If this fails: Patterns using attributes not supported by the current model fail silently, returning empty matches instead of raising informative errors

spacy/matcher/:pattern_matching
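Pattern keys can be checked against an allow-list before the patterns are registered. The sketch below is illustrative: `KNOWN_KEYS` is a deliberately small, hypothetical subset of attribute names, not the full set spaCy accepts, and `unknown_pattern_keys` is not a spaCy function.

```python
from typing import Dict, FrozenSet, List, Sequence

# Hypothetical allow-list for illustration; extend it to match your usage.
KNOWN_KEYS: FrozenSet[str] = frozenset(
    {"ORTH", "TEXT", "LOWER", "LEMMA", "POS", "TAG", "DEP", "IS_ALPHA", "OP"}
)


def unknown_pattern_keys(
    pattern: Sequence[Dict[str, object]], known: FrozenSet[str] = KNOWN_KEYS
) -> List[str]:
    """Return pattern keys not in `known`, in order of first appearance."""
    unknown: List[str] = []
    for token_spec in pattern:
        for key in token_spec:
            if key.upper() not in known and key not in unknown:
                unknown.append(key)
    return unknown
```

spaCy's `Matcher` also accepts `validate=True` at construction time, which validates patterns against its own schema when they are added.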

Domain

Component factory functions registered in the global registry are thread-safe and stateless - no isolation between concurrent model loading

If this fails: Loading multiple models concurrently or using spaCy in multi-threaded applications causes race conditions in component initialization

spacy/util.py:registry
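If concurrent loading is a concern, a simple workaround is to serialize loads behind a lock and cache the result. This is a sketch under that assumption; `SerializedLoader` is a hypothetical wrapper, and the load function is injected so the example runs without spaCy. In real use you might pass `spacy.load`.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

M = TypeVar("M")


class SerializedLoader(Generic[M]):
    """Load each named model at most once, one load at a time."""

    def __init__(self, load: Callable[[str], M]) -> None:
        self._load = load
        self._lock = threading.Lock()
        self._cache: Dict[str, M] = {}

    def get(self, name: str) -> M:
        # Holding the lock for the whole load means two threads can never
        # run component initialization at the same time.
        with self._lock:
            if name not in self._cache:
                self._cache[name] = self._load(name)
            return self._cache[name]
```

This protects loading only. Whether a loaded pipeline can be shared safely across threads is a separate question that the lock does not answer.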

Temporal

Model packages are compatible with the currently installed spaCy version - compatibility checks use only major.minor version matching

If this fails: Patch version mismatches between models and spaCy installation can cause subtle behavior changes or performance regressions

spacy/about.py:__version__
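The difference between a major.minor check and a full version check can be made visible with a small comparison function. This is a sketch, not spaCy's compatibility logic; `parse_version` and `compatibility` are hypothetical and handle only plain numeric `x.y.z` strings, not pre-release suffixes.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'x.y.z' into a tuple of ints (numeric versions only)."""
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    return major, minor, patch


def compatibility(model_built_with: str, installed: str) -> str:
    """Classify the match as 'exact', 'patch-mismatch', or 'incompatible'."""
    built = parse_version(model_built_with)
    current = parse_version(installed)
    if built[:2] != current[:2]:
        return "incompatible"
    if built[2] != current[2]:
        return "patch-mismatch"  # passes a major.minor check, but worth logging
    return "exact"
```

Logging the `patch-mismatch` case gives a trail to follow if behavior changes after an upgrade.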

See the full structural analysis of spaCy: the pipeline, data models, and system behavior that put these assumptions in context.

Frequently Asked Questions

What does spaCy assume that could break in production?

The one most likely to cause trouble: spaCy assumes a GPU is available and functional when require_gpu() is called, and no fallback mechanism exists. When that assumption fails (broken GPU drivers, misconfigured CUDA, or exhausted GPU memory), CLI training commands crash with unhelpful errors.

How many hidden assumptions does spaCy have?

CodeSea found 13 assumptions spaCy relies on but never validates, 5 of them critical, spanning Environment, Domain, Resource, Contract, Scale, Temporal, Ordering, Shape. Most are routine; the analysis flags the three most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.