explosion/spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Industrial-strength NLP library that tokenizes, tags, parses, and extracts entities from text
Under the hood, the system uses 4 feedback loops, 4 data pools, and 7 control points to manage its runtime behavior.
A 9-component library. 954 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Text enters by calling the Language object (the conventional nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. Both paths are sketched in code after the stage list below.
- Text Tokenization — The Tokenizer splits raw text into Token objects using language-specific rules, exceptions, and Unicode segmentation, creating a Doc container with shared Vocab for string interning (config: tokenizer)
- Pipeline Processing — The Language orchestrator passes the Doc through each enabled pipeline component (tagger, parser, NER) in sequence, with each component adding annotations in-place [Doc → Doc] (config: nlp.pipeline, components)
- Training Data Loading — Corpus reader loads training files in .spacy/.json/.conllu formats, creates Example objects with predicted and reference Doc pairs for supervised learning (config: training.train_corpus, training.dev_corpus)
- Batch Formation — Training examples are grouped into batches of configurable size with proper alignment and padding for efficient neural network processing [Example → TrainingBatch] (config: training.batcher)
- Forward Pass — TrainablePipe components process batched examples through their neural networks, computing predictions and intermediate representations [TrainingBatch → Predictions] (config: components.*.model)
- Loss Computation — Each component compares predictions against gold annotations to compute task-specific losses (cross-entropy for classification, structured loss for parsing) [Predictions → Loss Values] (config: training.dropout)
- Gradient Update — The optimizer (Adam, SGD, etc.) computes gradients from losses and updates model parameters according to the configured learning rate and schedule [Loss Values → Updated Parameters] (config: training.optimizer, training.learning_rate)
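As a concrete illustration, here is a minimal sketch of both paths using spaCy's public v3 API; the model name en_core_web_sm is an assumption (any installed pipeline package works), and the example texts and labels are arbitrary:

```python
import spacy
from spacy.training import Example

# Inference path: text -> tokenizer -> pipeline components -> annotated Doc
nlp = spacy.load("en_core_web_sm")  # assumes this pipeline package is installed
doc = nlp("Apple is looking at buying a U.K. startup.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Training path: an Example pairs a predicted Doc with gold annotations
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")
example = Example.from_dict(
    nlp.make_doc("Apple is a company."),
    {"entities": [(0, 5, "ORG")]},
)
optimizer = nlp.initialize(lambda: [example])
losses = nlp.update([example], sgd=optimizer)  # forward pass, loss, gradient step
print(losses)
```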
Data Models
The data structures that flow between stages — the contracts that hold the system together.
Doc (spacy/tokens/doc.pyx) — Container with an array of Token objects, spans, entities, and linguistic annotations, sharing a Vocab for string interning
Created by the tokenizer from raw text, progressively annotated by pipeline components, and returned as the final processed document
Example (spacy/training/example.pyx) — Cython class pairing predicted: Doc, reference: Doc, and alignment data for training and evaluation
Created from training data with gold annotations, used to compute loss and gradients during training
Config (spacy/schemas.py) — Pydantic models defining the training config structure, with nested sections for nlp, training, components, and initialization parameters
Loaded from config files, validated against the schema, and used to instantiate training components
TrainingBatch (spacy/training/) — List[Example] with consistent batch size and alignment information for efficient neural network processing
Created by batching Examples during training, fed to the model for forward/backward passes
Vocab (spacy/vocab.pyx) — String store and lexeme database with vectors, morphology, and entity type mappings, shared across documents
Initialized once per Language instance, populated during model loading, shared across all processed documents
Model Meta (spacy/schemas.py) — Pydantic model with lang: str, name: str, version: str, pipeline: List[str], and compatibility metadata
Loaded from model package metadata, used for version checking and pipeline configuration
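A short sketch of how these contracts interlock, using spaCy's public API (the example strings are arbitrary): the Doc borrows its Vocab from the Language instance, strings are interned as 64-bit hashes, and an Example wraps a predicted/reference Doc pair.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("San Francisco is foggy")

# Doc and Language share one Vocab; tokens store 64-bit string hashes
assert doc.vocab is nlp.vocab
h = nlp.vocab.strings["foggy"]          # string -> hash
assert nlp.vocab.strings[h] == "foggy"  # hash -> string

# Example wraps a predicted Doc and a reference Doc built from gold data
example = Example.from_dict(
    nlp.make_doc("San Francisco is foggy"),
    {"entities": [(0, 13, "GPE")]},
)
print(example.predicted.text, [e.text for e in example.reference.ents])
```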
Hidden Assumptions
Things this code relies on but never validates: the assumptions that cause silent failures when the system changes.
GPU is available and functional when require_gpu() is called - no fallback mechanism exists
If this fails: CLI training commands crash with unhelpful errors if GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted
spacy/cli/_util.py:require_gpu
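One way to avoid the hard failure is spacy.prefer_gpu(), which falls back to CPU instead of raising; a minimal sketch:

```python
import spacy

# prefer_gpu() returns False and stays on CPU if no usable GPU is found,
# whereas require_gpu() raises when GPU allocation fails
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("No usable GPU, continuing on CPU")
```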
Input text contains valid Unicode that won't cause tokenizer segmentation faults - raw bytes or malformed encoding can crash the Cython tokenizer
If this fails: Processing user-generated content or files with unknown encoding causes segmentation faults that terminate the Python process
spacy/language.py:Language.__call__()
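A hedged guard for untrusted input is to normalize the bytes before they reach the tokenizer; decode_safely below is a hypothetical helper, not spaCy API:

```python
def decode_safely(raw: bytes) -> str:
    # Hypothetical helper: replace malformed byte sequences so the
    # Cython tokenizer only ever sees valid Unicode
    return raw.decode("utf-8", errors="replace")

text = decode_safely(b"caf\xe9")  # invalid UTF-8 byte becomes U+FFFD
```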
Training files (.spacy/.json/.conllu) fit in available RAM when loaded - no streaming or memory-mapped loading for large corpora
If this fails: Training crashes with OOM when dataset size approaches system memory limits, especially on cloud instances with limited RAM
spacy/training/corpus.py:Corpus
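One mitigation is to shard a large corpus into several .spacy files so no single DocBin has to fit in memory at once; write_shards is a hypothetical helper built on the real DocBin API:

```python
from spacy.tokens import DocBin

def write_shards(docs, shard_size=10_000, prefix="corpus"):
    # Hypothetical helper: flush every `shard_size` docs to its own file
    shard, n = DocBin(), 0
    for i, doc in enumerate(docs, 1):
        shard.add(doc)
        if i % shard_size == 0:
            shard.to_disk(f"{prefix}-{n}.spacy")
            shard, n = DocBin(), n + 1
    if len(shard) > 0:
        shard.to_disk(f"{prefix}-{n}.spacy")
```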
Example objects have aligned predicted and reference Doc objects with matching tokenization - misalignment causes silent gradient corruption
If this fails: Training produces models that perform worse than random on misaligned data, but loss curves look normal so the problem goes undetected
spacy/pipeline/trainable_pipe.pyx:TrainablePipe.update()
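Example.from_dict aligns gold annotations onto the predicted tokenization, and get_aligned_ner() shows what a component will actually train on, which is one way to surface alignment problems before they corrupt gradients:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# The gold span may not line up with the predicted tokenization;
# alignment happens inside Example
example = Example.from_dict(
    nlp.make_doc("The U.K. economy"),
    {"entities": [(4, 8, "GPE")]},
)
print(example.get_aligned_ner())  # BILUO tags after alignment; None = unaligned
```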
String hashes won't collide - the string store interns strings as 64-bit hash values with no collision handling
If this fails: Processing very large corpora with enormous numbers of unique tokens increases the chance of hash collisions, corrupting vocabulary mappings
spacy/vocab.py:Vocab
Model checkpoints remain valid across training restarts - no version checking for config schema changes or component updates
If this fails: Resuming training from old checkpoints after spaCy updates silently loads incompatible weights, causing training instability or wrong predictions
spacy/cli/train.py:training_loop
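A lightweight guard is to compare the checkpoint's recorded spaCy version with the running one before resuming; report_checkpoint_version is a hypothetical helper reading the standard meta.json:

```python
import json
from pathlib import Path

import spacy

def report_checkpoint_version(model_dir: str) -> None:
    # Hypothetical helper: meta.json records the spaCy version range the
    # checkpoint was built for; surface any drift before resuming training
    meta = json.loads(Path(model_dir, "meta.json").read_text())
    print(f"Checkpoint expects spaCy {meta.get('spacy_version')}, "
          f"running {spacy.__version__}")
```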
Pipeline components are applied in the exact order specified in config.nlp.pipeline - no dependency resolution or validation
If this fails: Incorrectly ordered components (e.g., NER before tagger) produce degraded results without errors, making debugging pipeline performance issues difficult
spacy/language.py:pipeline
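The order really is taken verbatim from the config, but spaCy does ship an opt-in check: nlp.analyze_pipes() reports components whose required attributes are not assigned by anything earlier in the pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
print(nlp.pipe_names)  # components run in exactly this order

# Opt-in dependency report: flags attributes a component requires
# that no earlier component assigns
nlp.analyze_pipes(pretty=True)
```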
Shell command execution inherits safe PATH and environment variables - no sanitization of subprocess execution context
If this fails: Malicious model packages or config files can execute arbitrary shell commands during spacy download or training setup
spacy/cli/_util.py:run_command
All Examples in a training batch have compatible tensor shapes after padding - batch creation doesn't validate dimension consistency
If this fails: Mixed sequence lengths or incompatible feature dimensions in batches cause cryptic CUDA/PyTorch errors during forward pass
spacy/training/:batch_formation
Serialized Doc objects won't exceed available disk space or memory during bulk operations - no size estimation or chunking
If this fails: Large corpus processing fills up disk space or exhausts memory during serialization, causing data loss if the process is interrupted
spacy/tokens/docbin.py:DocBin.to_bytes()
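Since DocBin.to_bytes() materializes the whole payload in memory, a cheap pre-flight is to measure it before committing to disk; a minimal sketch:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin(docs=[nlp(t) for t in ["First text.", "Second text."]])

payload = db.to_bytes()                # whole corpus serialized in memory
print(f"{len(payload) / 1e6:.2f} MB")  # check size before writing to disk
```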
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Vocab String Store — Global string interning table that maps strings to integer IDs, shared across all Doc objects for memory efficiency
- Model Checkpoints — Serialized model weights and configuration saved during training at configurable intervals for recovery and deployment
- Example Cache — Cached processed training examples and computed features to avoid recomputation across epochs
- Function Registry — Global registry mapping string names to component factories, optimizers, and other configurable objects
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Training command with configured max_epochs. Action: Process batch, compute loss, update weights, evaluate on dev set. Exit: Max epochs reached or early stopping criteria met.
- Gradient Accumulation (gradient-accumulation, balancing) — Trigger: Batch size smaller than accumulation steps. Action: Accumulate gradients across mini-batches before updating parameters. Exit: Accumulation steps reached.
- Learning Rate Schedule (convergence, balancing) — Trigger: Evaluation score plateau detection. Action: Reduce learning rate by configured factor. Exit: Minimum learning rate reached.
- Early Stopping (convergence, balancing) — Trigger: No improvement in dev score for patience steps. Action: Monitor evaluation metrics and count stagnant steps. Exit: Training terminates early. (Both the training loop and early stopping are sketched below.)
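A hand-rolled version of the training and early-stopping loops, normally driven by spacy train and config values; the epoch count, patience value, and ents_f metric are assumptions here (ents_f requires an NER component):

```python
import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
nlp.add_pipe("ner").add_label("ORG")
train_examples = [
    Example.from_dict(nlp.make_doc("Apple is a company."),
                      {"entities": [(0, 5, "ORG")]}),
]
dev_examples = train_examples  # placeholder: reuse train data as dev set
optimizer = nlp.initialize(lambda: train_examples)

best_score, stagnant, patience = 0.0, 0, 5   # assumed patience value
for epoch in range(20):                      # assumed max_epochs
    for batch in minibatch(train_examples, size=8):
        nlp.update(batch, sgd=optimizer)     # forward pass, loss, weight update
    score = nlp.evaluate(dev_examples)["ents_f"]  # NER F-score on dev set
    if score > best_score:
        best_score, stagnant = score, 0
    else:
        stagnant += 1
        if stagnant >= patience:
            break  # early stopping: no dev improvement for `patience` evals
```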
Delays
- Model Loading (warmup, ~1-10 seconds) — Initial pipeline instantiation and weight loading from disk before first prediction
- Training Evaluation (scheduled-job, varies with dev set size) — Periodic evaluation on dev set pauses training at configured intervals
- Checkpoint Saving (checkpoint-save, ~1-30 seconds) — Serializing model state to disk blocks training progress temporarily
- Data Loading (batch-window, ~milliseconds per batch) — Reading and parsing training files creates small delays between batches
Control Points
- Pipeline Components (architecture-switch) — Controls: Which NLP components are loaded and in what order. Config: nlp.pipeline
- Batch Size (hyperparameter) — Controls: Number of examples processed together, affecting memory usage and gradient stability. Config: training.batcher
- Learning Rate (hyperparameter) — Controls: Step size for gradient updates, critical for training convergence. Config: training.optimizer
- GPU Usage (device-selection) — Controls: Whether to use GPU acceleration for training and inference. Config: training.gpu_allocator
- Dropout Rate (hyperparameter) — Controls: Regularization strength during training to prevent overfitting. Config: training.dropout
- Model Architecture (architecture-switch) — Controls: Neural network structure and size for each pipeline component. Config: components.*.model
- Evaluation Frequency (threshold) — Controls: How often to evaluate on dev set during training (in steps). Config: training.eval_frequency
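All of these control points resolve to dotted config paths, which can be overridden without editing config.cfg; a sketch using the real override mechanisms (the specific values are arbitrary):

```python
import spacy

# Load-time override of a control point via its dotted config path
nlp = spacy.load("en_core_web_sm", config={"nlp": {"batch_size": 64}})

# The same paths are overridable on the training CLI, e.g.:
#   python -m spacy train config.cfg --training.dropout 0.2 \
#       --training.eval_frequency 200
```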
Technology Stack
- Thinc — Neural network framework providing model definition, training loops, and optimization algorithms
- Cython — Performance-critical components compiled to C extensions for fast token processing and data structures
- NumPy — Numerical computing for tensor operations and efficient array manipulation in neural networks
- Pydantic — Configuration schema validation and serialization for type-safe config parsing
- Typer — CLI framework providing command routing, argument parsing, and help generation
- YAML/JSON — Human-readable configuration format for training parameters and pipeline setup
- spacy-transformers — Integration with Hugging Face models for modern neural architectures like BERT and GPT
- pytest — Comprehensive test suite for unit, integration, and regression testing
Key Components
- Language (orchestrator, spacy/language.py) — Main orchestrator that loads models, manages the processing pipeline, and coordinates text processing through sequential component stages
- CLI App (dispatcher, spacy/cli/_util.py) — Typer-based command dispatcher that routes CLI commands to appropriate handlers for training, evaluation, debugging, and data management
- TrainablePipe (processor, spacy/pipeline/) — Base class for neural pipeline components that can be trained, updated, and serialized, with consistent interfaces for prediction and learning
- Corpus (loader, spacy/training/corpus.py) — Data loader that reads training files in various formats (.spacy, .json, .conllu) and yields Example objects for training
- Config System (registry, spacy/util.py) — Configuration registry that maps string identifiers to Python functions for components, optimizers, and other configurable objects using Thinc's registry
- Tokenizer (transformer, spacy/tokenizer.py) — Language-specific text segmentation that splits raw text into Token objects using rules, exceptions, and character patterns
- Training Loop (executor, spacy/cli/train.py) — Main training orchestrator that manages epochs, batching, gradient updates, evaluation, and checkpoint saving with configurable optimization
- DocBin (serializer, spacy/tokens/docbin.py) — Efficient binary serialization format for Doc objects that preserves annotations while enabling fast loading and minimal disk usage
- Matcher (processor, spacy/matcher/) — Rule-based pattern matching system that finds token sequences based on lexical attributes, POS tags, and dependency relations
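The plugin architecture behind these components is visible in how custom ones are registered: a decorated callable gets a string name, after which any pipeline can add it by that name (the component below is a toy example):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("debug_length")
def debug_length(doc: Doc) -> Doc:
    # Toy component: pipeline components are callables Doc -> Doc
    print(f"{len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("debug_length", last=True)  # referenced by registered name
nlp("Components are resolved through a string registry.")
```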
Frequently Asked Questions
What is spaCy used for?
explosion/spacy is a 9-component, industrial-strength NLP library written in Python that tokenizes, tags, parses, and extracts entities from text. Data flows through 7 distinct pipeline stages. The codebase contains 954 files.
How is spaCy architected?
spaCy is organized into 5 architecture layers: Core Library, Pipeline Components, Training System, CLI Tools, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through spaCy?
Data moves through 7 stages: Text Tokenization → Pipeline Processing → Training Data Loading → Batch Formation → Forward Pass → .... Text enters by calling the Language object (the conventional nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. This pipeline design reflects a complex multi-stage processing system.
What technologies does spaCy use?
The core stack includes Thinc (Neural network framework providing model definition, training loops, and optimization algorithms), Cython (Performance-critical components compiled to C extensions for fast token processing and data structures), NumPy (Numerical computing for tensor operations and efficient array manipulation in neural networks), Pydantic (Configuration schema validation and serialization for type-safe config parsing), Typer (CLI framework providing command routing, argument parsing, and help generation), YAML/JSON (Human-readable configuration format for training parameters and pipeline setup), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does spaCy have?
spaCy exhibits 4 data pools (Vocab String Store, Model Checkpoints), 4 feedback loops, 7 control points, 4 delays. The feedback loops handle training-loop and gradient-accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does spaCy use?
5 design patterns detected: Plugin Architecture, Configuration-Driven Design, Shared State Management, Command Pattern, Pipeline Pattern.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.