explosion/spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Industrial-strength NLP library with neural models, pipelines and training for 70+ languages
Text flows through tokenization, then sequential pipeline components (tagger, parser, NER, etc.), with training data flowing through alignment and loss computation.
Under the hood, the system uses 2 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
Structural Verdict
A 12-component ML training system with 22 connections, across 954 analyzed files. Highly interconnected — components depend on each other heavily.
How Data Flows Through the System
- Text Input — Raw text strings are input to the Language pipeline (config: tokenizer)
- Tokenization — Text is split into tokens using language-specific rules (config: tokenizer)
- Pipeline Processing — Tokens pass through configured pipeline components (tagger, parser, NER) (config: pipeline, components)
- Training Alignment — During training, predicted annotations are aligned with gold standard (config: training.batcher)
- Loss Computation — Component losses are computed from aligned predictions and targets (config: training.dropout, training.optimizer)
- Model Updates — Gradients are backpropagated to update model weights (config: training.optimizer, training.max_epochs)
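The six stages above can be sketched in plain Python. This is an illustrative stand-in (the function names and the mismatch-count "loss" are invented for the sketch); spaCy's real components are neural models, not lookup lambdas.

```python
# Illustrative sketch of the six stages above using plain Python.

def tokenize(text):
    # Stage 2: split raw text into tokens (spaCy uses language-specific rules)
    return text.split()

def run_pipeline(tokens, components):
    # Stage 3: each component adds its annotations to a shared doc
    doc = {"tokens": tokens}
    for name, component in components:
        doc[name] = component(tokens)
    return doc

def align(predicted, gold):
    # Stage 4: pair predicted annotations with gold-standard ones by position
    return list(zip(predicted, gold))

def loss(aligned):
    # Stage 5: count mismatches as a stand-in for a real loss function
    return sum(1 for pred, gold in aligned if pred != gold)

components = [("tagger", lambda toks: ["NOUN"] * len(toks))]
doc = run_pipeline(tokenize("spaCy processes text"), components)
print(loss(align(doc["tagger"], ["PROPN", "VERB", "NOUN"])))  # prints 2
```

Stage 6 (the gradient update) is omitted here; it is driven by the optimizer configured under `training.optimizer`.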
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Vocab Storage — Hash-to-string mappings and lexical attributes accumulate in the vocabulary
- Training Examples — Batches of aligned training examples buffer during training
- Component Registry — Registered components, architectures, and functions accumulate in the registry
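The vocabulary pool works by interning each string once and referring to it by a stable hash thereafter. A minimal sketch of that mechanism (spaCy's `StringStore` uses MurmurHash internally; the `StringPool` class and MD5-based hash here are illustrative only):

```python
import hashlib

# Minimal sketch of a hash-to-string pool: strings are interned once and
# referred to by stable 64-bit integer keys from then on.
class StringPool:
    def __init__(self):
        self._map = {}

    def add(self, s):
        # Derive a deterministic 64-bit key from the string's bytes.
        key = int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "little")
        self._map[key] = s  # accumulates for the lifetime of the vocab
        return key

    def __getitem__(self, key):
        return self._map[key]

pool = StringPool()
h = pool.add("tokenization")
assert pool[h] == "tokenization"      # hash resolves back to the string
assert pool.add("tokenization") == h  # re-adding is idempotent
```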
Feedback Loops
- Training Loop (training-loop, balancing) — Trigger: train_cli command execution. Action: Process batch, compute loss, update weights. Exit: max_epochs or patience exhausted.
- Gradient Descent (convergence, balancing) — Trigger: Loss computation. Action: Backpropagate gradients and update model parameters. Exit: Convergence or max steps.
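The two loops above nest: gradient-descent updates run per batch inside the outer training loop, which exits on `max_epochs` or exhausted patience. A toy sketch under stated assumptions (the single weight, quadratic loss, and 0.1 learning rate are illustrative stand-ins for spaCy's real component losses and optimizer):

```python
# Sketch of the training loop: inner gradient-descent updates per batch,
# outer exit on max_epochs reached or patience exhausted.
w = [0.0]                      # a single model "weight"

def step(batch):
    grad = 2 * (w[0] - 3.0)    # gradient of the toy loss (w - 3)^2
    w[0] -= 0.1 * grad         # gradient-descent update
    return (w[0] - 3.0) ** 2

def train(batches, max_epochs=10, patience=3):
    best, since_best = float("inf"), 0
    for epoch in range(max_epochs):            # exit: max_epochs reached
        epoch_loss = sum(step(batch) for batch in batches)
        if epoch_loss < best:
            best, since_best = epoch_loss, 0
        else:
            since_best += 1
        if since_best >= patience:             # exit: patience exhausted
            break
    return best

best = train(batches=[None, None])
assert abs(w[0] - 3.0) < 0.1   # converged near the optimum
```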
Delays & Async Processing
- Batch Processing (batch-window, ~configurable batch_size) — Examples accumulate before processing to optimize GPU utilization
- Evaluation Frequency (scheduled-job, ~eval_frequency steps) — Model evaluation is deferred to reduce training overhead
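The batch-window behaviour is simple to sketch: examples buffer until `batch_size` is reached, then the whole batch is emitted at once, which is what keeps the GPU busy. A minimal generator in the spirit of spaCy's batching utilities (simplified: spaCy's real batchers also support compounding and size-by-words schedules):

```python
# Examples accumulate in a buffer; a full batch is yielded as one unit.
def minibatch(examples, batch_size):
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the final partial batch
        yield batch

batches = list(minibatch(range(7), batch_size=3))
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```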
Control Points
- Pipeline Components (runtime-toggle) — Controls: Which NLP components are active in processing pipeline. Default: ['tok2vec', 'tagger', 'parser', 'ner']
- Training Dropout (threshold) — Controls: Regularization strength during training. Default: 0.1
- Batch Size (threshold) — Controls: Number of examples processed simultaneously. Default: 1000
- Max Epochs (threshold) — Controls: Maximum training iterations. Default: 0
- GPU Allocator (env-var) — Controls: GPU memory allocation strategy. Default: null
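These control points correspond to sections of spaCy's config file. A hedged excerpt showing where the listed defaults would live (section and key names follow spaCy's config conventions; the exact batcher block varies by recipe):

```ini
[nlp]
pipeline = ["tok2vec", "tagger", "parser", "ner"]

[system]
gpu_allocator = null

[training]
dropout = 0.1
max_epochs = 0

[training.batcher]
# controls how many examples are processed together
size = 1000
```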
Technology Stack
- Thinc — Neural network framework powering model training
- Cython — Performance optimization for critical paths
- Pydantic — Configuration validation and schema definition
- NumPy — Numerical computations and array operations
- Typer — CLI framework for command-line interface
- pytest — Testing framework
- Documentation website frontend
- Package building and distribution
Key Components
- Language (class, spacy/language.py) — Main NLP pipeline class that processes text through configurable components
- setup_cli (function, spacy/cli/_util.py) — Sets up the complete CLI interface with all spaCy commands
- ConfigSchemaTraining (class, spacy/schemas.py) — Pydantic schema defining training configuration parameters such as epochs, dropout, and patience
- Example (class, spacy/training/example.py) — Training example class holding predicted and gold-standard annotations
- train_cli (cli-command, spacy/cli/train.py) — CLI command for training spaCy models from configuration files
- TrainablePipe (class, spacy/pipeline/) — Base class for trainable pipeline components such as the NER, parser, and text classifier
- convert (cli-command, spacy/cli/convert.py) — CLI command to convert training data between formats (CoNLL, JSON, etc.)
- debug_data_cli (cli-command, spacy/cli/debug_data.py) — CLI command to analyze training data for issues and statistics
- Vocab (class, spacy/vocab.py) — Vocabulary storage managing string-to-hash mappings and lexical attributes
- DocBin (class, spacy/tokens/) — Efficient serialization container for storing multiple Doc objects
- registry (module, spacy/util.py) — Component registry system for registering custom architectures, losses, and optimizers
- Alignment (class, spacy/training/alignment.py) — Bidirectional token alignment between predicted and gold annotations
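The registry component above follows a decorator-based pattern: factories register under a name and are looked up when the pipeline is built from config. A self-contained sketch of that pattern (the `REGISTRY` dict, `register` decorator, and `"lowercase"` factory are invented for illustration, not spaCy's API):

```python
# Decorator-based component registry, in the spirit of spacy/util.py.
REGISTRY = {}

def register(name):
    def wrapper(factory):
        REGISTRY[name] = factory   # registered factories accumulate here
        return factory
    return wrapper

@register("lowercase")
def make_lowercaser():
    # Factory returns the actual component callable.
    return lambda tokens: [t.lower() for t in tokens]

component = REGISTRY["lowercase"]()   # looked up by name at build time
assert component(["SpaCy", "NLP"]) == ["spacy", "nlp"]
```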
Sub-Modules
React-based documentation site with interactive demos and API reference
Configuration
spacy/pipeline/_edit_tree_internals/schemas.py (python-pydantic)
- prefix_len (StrictInt) — Field(..., title="Prefix length")
- suffix_len (StrictInt) — Field(..., title="Suffix length")
- prefix_tree (StrictInt) — Field(..., title="Prefix tree")
- suffix_tree (StrictInt) — Field(..., title="Suffix tree")
spacy/pipeline/_edit_tree_internals/schemas.py (python-pydantic)
- orig (Union[int, StrictStr]) — Field(..., title="Original substring")
- subst (Union[int, StrictStr]) — Field(..., title="Replacement substring")
spacy/schemas.py (python-pydantic)
- IN (Optional[List[StrictStr]]) — Field(None, alias="in")
- NOT_IN (Optional[List[StrictStr]]) — Field(None, alias="not_in")
- IS_SUBSET (Optional[List[StrictStr]]) — Field(None, alias="is_subset")
- IS_SUPERSET (Optional[List[StrictStr]]) — Field(None, alias="is_superset")
- INTERSECTS (Optional[List[StrictStr]]) — Field(None, alias="intersects")
spacy/schemas.py (python-pydantic)
- REGEX (Optional[StrictStr]) — Field(None, alias="regex")
- IN (Optional[List[StrictInt]]) — Field(None, alias="in")
- NOT_IN (Optional[List[StrictInt]]) — Field(None, alias="not_in")
- IS_SUBSET (Optional[List[StrictInt]]) — Field(None, alias="is_subset")
- IS_SUPERSET (Optional[List[StrictInt]]) — Field(None, alias="is_superset")
- INTERSECTS (Optional[List[StrictInt]]) — Field(None, alias="intersects")
- EQ (Optional[Union[StrictInt, StrictFloat]]) — Field(None, alias="==")
- NEQ (Optional[Union[StrictInt, StrictFloat]]) — Field(None, alias="!=")
- +4 more parameters
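The `IN` / `NOT_IN` / `REGEX` fields above describe predicates that token-matching patterns apply to an attribute value. A sketch of how such predicates could be evaluated (spaCy's `Matcher` does this with compiled patterns; the `matches` function here is illustrative, not spaCy's API):

```python
import re

# Evaluate a predicate dict against one attribute value.
def matches(value, predicate):
    if "in" in predicate and value not in predicate["in"]:
        return False
    if "not_in" in predicate and value in predicate["not_in"]:
        return False
    if "regex" in predicate and not re.search(predicate["regex"], value):
        return False
    return True

assert matches("nlp", {"in": ["nlp", "ml"]})       # IN: value is in the list
assert not matches("nlp", {"not_in": ["nlp"]})     # NOT_IN: value excluded
assert matches("spacy3", {"regex": r"\d"})         # REGEX: pattern found
```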
Science Pipeline
- Load training data (spacy/cli/convert.py) — Convert various formats (CoNLL, JSON) to spaCy Doc objects [variable text lengths → (n_docs, variable_tokens)]
- Tokenization (spacy/lang/) — Split text into tokens using language-specific rules [raw text strings → (n_tokens, token_features)]
- Feature extraction (spacy/pipeline/) — Extract token embeddings through the tok2vec component [(n_tokens, vocab_size) → (n_tokens, embedding_dim)]
- Component processing (spacy/pipeline/) — Apply NER, POS tagging, and parsing through neural networks [(n_tokens, embedding_dim) → (n_tokens, n_classes)]
- Alignment (spacy/training/alignment.py) — Align predicted tokens with the gold standard for training [(pred_tokens, gold_tokens) → Alignment(x2y, y2x)]
- Loss computation (spacy/pipeline/) — Compute component-specific losses from aligned predictions [(n_tokens, n_classes) → scalar loss]
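The alignment stage's `x2y` / `y2x` output maps each predicted token to the gold tokens it overlaps, and vice versa, so losses can be computed even when tokenizations disagree. A simplified character-offset sketch in the spirit of spacy/training/alignment.py (the real implementation also normalizes text and handles whitespace):

```python
# Compute character spans for a token sequence (no whitespace, simplified).
def char_spans(tokens):
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans

# Bidirectional alignment: x2y[i] lists gold tokens overlapping predicted
# token i; y2x[j] lists predicted tokens overlapping gold token j.
def align(pred, gold):
    ps, gs = char_spans(pred), char_spans(gold)
    x2y = [[j for j, g in enumerate(gs) if g[0] < p[1] and p[0] < g[1]]
           for p in ps]
    y2x = [[i for i, p in enumerate(ps) if p[0] < g[1] and g[0] < p[1]]
           for g in gs]
    return x2y, y2x

# "dont" predicted as one token vs. gold "do" + "nt"
x2y, y2x = align(["i", "dont"], ["i", "do", "nt"])
assert x2y == [[0], [1, 2]]      # "dont" covers both gold tokens
assert y2x == [[0], [1], [1]]    # both gold pieces map back to "dont"
```

Note that, as the Assumptions section below warns, nothing here (or in the real arrays) enforces that `x2y` and `y2x` stay mutually consistent.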
Assumptions & Constraints
- [warning] Assumes bidirectional alignment arrays x2y and y2x have compatible lengths but no validation enforces this (shape)
- [critical] Pipeline components assume Doc objects have consistent token indexing but no assertion verifies this (format)
- [critical] Training examples assume predicted and gold Doc objects have alignable token structures (dependency)
Frequently Asked Questions
What is spaCy used for?
spaCy (explosion/spacy) is an industrial-strength NLP library with neural models, pipelines, and training for 70+ languages. It is a 12-component ML training system written in Python, spanning 954 files, and is highly interconnected — components depend on each other heavily.
How is spaCy architected?
spaCy is organized into 5 architecture layers: Core NLP, Training System, CLI Interface, Language Support, and 1 more. Highly interconnected — components depend on each other heavily. This layered structure enables tight integration between components.
How does data flow through spaCy?
Data moves through 6 stages: Text Input → Tokenization → Pipeline Processing → Training Alignment → Loss Computation → .... Text flows through tokenization, then sequential pipeline components (tagger, parser, NER, etc.), with training data flowing through alignment and loss computation. This pipeline design reflects a complex multi-stage processing system.
What technologies does spaCy use?
The core stack includes Thinc (Neural network framework powering model training), Cython (Performance optimization for critical paths), Pydantic (Configuration validation and schema definition), NumPy (Numerical computations and array operations), Typer (CLI framework for command-line interface), pytest (Testing framework), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does spaCy have?
spaCy exhibits 3 data pools (Vocab Storage, Training Examples, Component Registry), 2 feedback loops, 5 control points, and 2 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does spaCy use?
6 design patterns detected: Component Registry, Pipeline Architecture, Configuration Schema, CLI Command Pattern, Multi-language Support, and 1 more.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.