explosion/spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Industrial-strength NLP library that tokenizes, tags, parses, and extracts entities from text
Under the hood, the system uses 4 feedback loops, 4 data pools, and 7 control points to manage its runtime behavior.
A 9-component library. 954 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Text enters by calling the Language object (the conventional nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. Both paths are sketched in code after the stage list below.
- Text Tokenization — The Tokenizer splits raw text into Token objects using language-specific rules, exceptions, and Unicode segmentation, creating a Doc container with shared Vocab for string interning (config: tokenizer)
- Pipeline Processing — The Language orchestrator passes the Doc through each enabled pipeline component (tagger, parser, NER) in sequence, with each component adding annotations in-place [Doc → Doc] (config: nlp.pipeline, components)
- Training Data Loading — Corpus reader loads training files in .spacy/.json/.conllu formats, creates Example objects with predicted and reference Doc pairs for supervised learning (config: training.train_corpus, training.dev_corpus)
- Batch Formation — Training examples are grouped into batches of configurable size with proper alignment and padding for efficient neural network processing [Example → TrainingBatch] (config: training.batcher)
- Forward Pass — TrainablePipe components process batched examples through their neural networks, computing predictions and intermediate representations [TrainingBatch → Predictions] (config: components.*.model)
- Loss Computation — Each component compares predictions against gold annotations to compute task-specific losses (cross-entropy for classification, structured loss for parsing) [Predictions → Loss Values] (config: training.dropout)
- Gradient Update — The optimizer (Adam, SGD, etc.) computes gradients from losses and updates model parameters according to the configured learning rate and schedule [Loss Values → Updated Parameters] (config: training.optimizer, training.learning_rate)
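As a concrete illustration, here is a minimal sketch of both paths using spaCy's public v3 API; the model name en_core_web_sm is an assumption (any installed pipeline package works), and the example texts and labels are arbitrary:

```python
import spacy
from spacy.training import Example

# Inference path: text -> tokenizer -> pipeline components -> annotated Doc
nlp = spacy.load("en_core_web_sm")  # assumes this pipeline package is installed
doc = nlp("Apple is looking at buying a U.K. startup.")
for ent in doc.ents:
    print(ent.text, ent.label_)

# Training path: an Example pairs a predicted Doc with gold annotations
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")
example = Example.from_dict(
    nlp.make_doc("Apple is a company."),
    {"entities": [(0, 5, "ORG")]},
)
optimizer = nlp.initialize(lambda: [example])
losses = nlp.update([example], sgd=optimizer)  # forward pass, loss, gradient step
print(losses)
```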
Data Models
The data structures that flow between stages — the contracts that hold the system together.
Doc (spacy/tokens/doc.pyx) — Container with an array of Token objects, spans, entities, and linguistic annotations, sharing a Vocab for string interning
Created by the tokenizer from raw text, progressively annotated by pipeline components, and returned as the final processed document
Example (spacy/training/example.pyx) — Cython class pairing predicted: Doc, reference: Doc, and alignment data for training and evaluation
Created from training data with gold annotations, used to compute loss and gradients during training
Config (spacy/schemas.py) — Pydantic models defining the training config structure, with nested sections for nlp, training, components, and initialization parameters
Loaded from config files, validated against the schema, and used to instantiate training components
TrainingBatch (spacy/training/) — List[Example] with consistent batch size and alignment information for efficient neural network processing
Created by batching Examples during training, fed to the model for forward/backward passes
Vocab (spacy/vocab.pyx) — String store and lexeme database with vectors, morphology, and entity type mappings, shared across documents
Initialized once per Language instance, populated during model loading, shared across all processed documents
Model Meta (spacy/schemas.py) — Pydantic model with lang: str, name: str, version: str, pipeline: List[str], and compatibility metadata
Loaded from model package metadata, used for version checking and pipeline configuration
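A short sketch of how these contracts interlock, using spaCy's public API (the example strings are arbitrary): the Doc borrows its Vocab from the Language instance, strings are interned as 64-bit hashes, and an Example wraps a predicted/reference Doc pair.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("San Francisco is foggy")

# Doc and Language share one Vocab; tokens store 64-bit string hashes
assert doc.vocab is nlp.vocab
h = nlp.vocab.strings["foggy"]          # string -> hash
assert nlp.vocab.strings[h] == "foggy"  # hash -> string

# Example wraps a predicted Doc and a reference Doc built from gold data
example = Example.from_dict(
    nlp.make_doc("San Francisco is foggy"),
    {"entities": [(0, 13, "GPE")]},
)
print(example.predicted.text, [e.text for e in example.reference.ents])
```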
Hidden Assumptions
Things this code relies on but never validates: the assumptions that cause silent failures when the system changes.
GPU is available and functional when require_gpu() is called - no fallback mechanism exists
If this fails: CLI training commands crash with unhelpful errors if GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted
spacy/cli/_util.py:require_gpu
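One way to avoid the hard failure is spacy.prefer_gpu(), which falls back to CPU instead of raising; a minimal sketch:

```python
import spacy

# prefer_gpu() returns False and stays on CPU if no usable GPU is found,
# whereas require_gpu() raises when GPU allocation fails
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("No usable GPU, continuing on CPU")
```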
Input text contains valid Unicode that won't cause tokenizer segmentation faults - raw bytes or malformed encoding can crash the Cython tokenizer
If this fails: Processing user-generated content or files with unknown encoding causes segmentation faults that terminate the Python process
spacy/language.py:Language.__call__()
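A hedged guard for untrusted input is to normalize the bytes before they reach the tokenizer; decode_safely below is a hypothetical helper, not spaCy API:

```python
def decode_safely(raw: bytes) -> str:
    # Hypothetical helper: replace malformed byte sequences so the
    # Cython tokenizer only ever sees valid Unicode
    return raw.decode("utf-8", errors="replace")

text = decode_safely(b"caf\xe9")  # invalid UTF-8 byte becomes U+FFFD
```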
Training files (.spacy/.json/.conllu) fit in available RAM when loaded - no streaming or memory-mapped loading for large corpora
If this fails: Training crashes with OOM when dataset size approaches system memory limits, especially on cloud instances with limited RAM
spacy/training/corpus.py:Corpus
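One mitigation is to shard a large corpus into several .spacy files so no single DocBin has to fit in memory at once; write_shards is a hypothetical helper built on the real DocBin API:

```python
from spacy.tokens import DocBin

def write_shards(docs, shard_size=10_000, prefix="corpus"):
    # Hypothetical helper: flush every `shard_size` docs to its own file
    shard, n = DocBin(), 0
    for i, doc in enumerate(docs, 1):
        shard.add(doc)
        if i % shard_size == 0:
            shard.to_disk(f"{prefix}-{n}.spacy")
            shard, n = DocBin(), n + 1
    if len(shard) > 0:
        shard.to_disk(f"{prefix}-{n}.spacy")
```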
Example objects have aligned predicted and reference Doc objects with matching tokenization - misalignment causes silent gradient corruption
If this fails: Training produces models that perform worse than random on misaligned data, but loss curves look normal so the problem goes undetected
spacy/pipeline/trainable_pipe.pyx:TrainablePipe.update()
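Example.from_dict aligns gold annotations onto the predicted tokenization, and get_aligned_ner() shows what a component will actually train on, which is one way to surface alignment problems before they corrupt gradients:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# The gold span may not line up with the predicted tokenization;
# alignment happens inside Example
example = Example.from_dict(
    nlp.make_doc("The U.K. economy"),
    {"entities": [(4, 8, "GPE")]},
)
print(example.get_aligned_ner())  # BILUO tags after alignment; None = unaligned
```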
String hashes won't collide - the string store interns strings as 64-bit hash values with no collision handling
If this fails: Processing very large corpora with enormous numbers of unique tokens increases the chance of hash collisions, corrupting vocabulary mappings
spacy/vocab.py:Vocab
Model checkpoints remain valid across training restarts - no version checking for config schema changes or component updates
If this fails: Resuming training from old checkpoints after spaCy updates silently loads incompatible weights, causing training instability or wrong predictions
spacy/cli/train.py:training_loop
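A lightweight guard is to compare the checkpoint's recorded spaCy version with the running one before resuming; report_checkpoint_version is a hypothetical helper reading the standard meta.json:

```python
import json
from pathlib import Path

import spacy

def report_checkpoint_version(model_dir: str) -> None:
    # Hypothetical helper: meta.json records the spaCy version range the
    # checkpoint was built for; surface any drift before resuming training
    meta = json.loads(Path(model_dir, "meta.json").read_text())
    print(f"Checkpoint expects spaCy {meta.get('spacy_version')}, "
          f"running {spacy.__version__}")
```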
Pipeline components are applied in the exact order specified in config.nlp.pipeline - no dependency resolution or validation
If this fails: Incorrectly ordered components (e.g., NER before tagger) produce degraded results without errors, making debugging pipeline performance issues difficult
spacy/language.py:pipeline
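The order really is taken verbatim from the config, but spaCy does ship an opt-in check: nlp.analyze_pipes() reports components whose required attributes are not assigned by anything earlier in the pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed
print(nlp.pipe_names)  # components run in exactly this order

# Opt-in dependency report: flags attributes a component requires
# that no earlier component assigns
nlp.analyze_pipes(pretty=True)
```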
Shell command execution inherits safe PATH and environment variables - no sanitization of subprocess execution context
If this fails: Malicious model packages or config files can execute arbitrary shell commands during spacy download or training setup
spacy/cli/_util.py:run_command
All Examples in a training batch have compatible tensor shapes after padding - batch creation doesn't validate dimension consistency
If this fails: Mixed sequence lengths or incompatible feature dimensions in batches cause cryptic CUDA/PyTorch errors during forward pass
spacy/training/:batch_formation
Serialized Doc objects won't exceed available disk space or memory during bulk operations - no size estimation or chunking
If this fails: Large corpus processing fills up disk space or exhausts memory during serialization, causing data loss if the process is interrupted
spacy/tokens/docbin.py:DocBin.to_bytes()
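Since DocBin.to_bytes() materializes the whole payload in memory, a cheap pre-flight is to measure it before committing to disk; a minimal sketch:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin(docs=[nlp(t) for t in ["First text.", "Second text."]])

payload = db.to_bytes()                # whole corpus serialized in memory
print(f"{len(payload) / 1e6:.2f} MB")  # check size before writing to disk
```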
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Vocab String Store — Global string interning table that maps strings to integer IDs, shared across all Doc objects for memory efficiency
- Model Checkpoints — Serialized model weights and configuration saved during training at configurable intervals for recovery and deployment
- Example Cache — Cached processed training examples and computed features to avoid recomputation across epochs
- Function Registry — Global registry mapping string names to component factories, optimizers, and other configurable objects
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Training command with configured max_epochs. Action: Process batch, compute loss, update weights, evaluate on dev set. Exit: Max epochs reached or early stopping criteria met.
- Gradient Accumulation (gradient-accumulation, balancing) — Trigger: Batch size smaller than accumulation steps. Action: Accumulate gradients across mini-batches before updating parameters. Exit: Accumulation steps reached.
- Learning Rate Schedule (convergence, balancing) — Trigger: Evaluation score plateau detection. Action: Reduce learning rate by configured factor. Exit: Minimum learning rate reached.
- Early Stopping (convergence, balancing) — Trigger: No improvement in dev score for patience steps. Action: Monitor evaluation metrics and count stagnant steps. Exit: Training terminates early. (Both the training loop and early stopping are sketched below.)
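A hand-rolled version of the training and early-stopping loops, normally driven by spacy train and config values; the epoch count, patience value, and ents_f metric are assumptions here (ents_f requires an NER component):

```python
import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
nlp.add_pipe("ner").add_label("ORG")
train_examples = [
    Example.from_dict(nlp.make_doc("Apple is a company."),
                      {"entities": [(0, 5, "ORG")]}),
]
dev_examples = train_examples  # placeholder: reuse train data as dev set
optimizer = nlp.initialize(lambda: train_examples)

best_score, stagnant, patience = 0.0, 0, 5   # assumed patience value
for epoch in range(20):                      # assumed max_epochs
    for batch in minibatch(train_examples, size=8):
        nlp.update(batch, sgd=optimizer)     # forward pass, loss, weight update
    score = nlp.evaluate(dev_examples)["ents_f"]  # NER F-score on dev set
    if score > best_score:
        best_score, stagnant = score, 0
    else:
        stagnant += 1
        if stagnant >= patience:
            break  # early stopping: no dev improvement for `patience` evals
```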
Delays
- Model Loading (warmup, ~1-10 seconds) — Initial pipeline instantiation and weight loading from disk before first prediction
- Training Evaluation (scheduled-job, varies with dev set size) — Periodic evaluation on dev set pauses training at configured intervals
- Checkpoint Saving (checkpoint-save, ~1-30 seconds) — Serializing model state to disk blocks training progress temporarily
- Data Loading (batch-window, ~milliseconds per batch) — Reading and parsing training files creates small delays between batches
Control Points
- Pipeline Components (architecture-switch) — Controls: Which NLP components are loaded and in what order. Config: nlp.pipeline
- Batch Size (hyperparameter) — Controls: Number of examples processed together, affecting memory usage and gradient stability. Config: training.batcher
- Learning Rate (hyperparameter) — Controls: Step size for gradient updates, critical for training convergence. Config: training.optimizer
- GPU Usage (device-selection) — Controls: Whether to use GPU acceleration for training and inference. Config: training.gpu_allocator
- Dropout Rate (hyperparameter) — Controls: Regularization strength during training to prevent overfitting. Config: training.dropout
- Model Architecture (architecture-switch) — Controls: Neural network structure and size for each pipeline component. Config: components.*.model
- Evaluation Frequency (threshold) — Controls: How often to evaluate on dev set during training (in steps). Config: training.eval_frequency
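All of these control points resolve to dotted config paths, which can be overridden without editing config.cfg; a sketch using the real override mechanisms (the specific values are arbitrary):

```python
import spacy

# Load-time override of a control point via its dotted config path
nlp = spacy.load("en_core_web_sm", config={"nlp": {"batch_size": 64}})

# The same paths are overridable on the training CLI, e.g.:
#   python -m spacy train config.cfg --training.dropout 0.2 \
#       --training.eval_frequency 200
```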
Technology Stack
- Thinc — Neural network framework providing model definition, training loops, and optimization algorithms
- Cython — Performance-critical components compiled to C extensions for fast token processing and data structures
- NumPy — Numerical computing for tensor operations and efficient array manipulation in neural networks
- Pydantic — Configuration schema validation and serialization for type-safe config parsing
- Typer — CLI framework providing command routing, argument parsing, and help generation
- YAML/JSON — Human-readable configuration format for training parameters and pipeline setup
- spacy-transformers — Integration with Hugging Face models for modern neural architectures like BERT and GPT
- pytest — Comprehensive test suite for unit, integration, and regression testing
Key Components
- Language (orchestrator, spacy/language.py) — Main orchestrator that loads models, manages the processing pipeline, and coordinates text processing through sequential component stages
- CLI App (dispatcher, spacy/cli/_util.py) — Typer-based command dispatcher that routes CLI commands to appropriate handlers for training, evaluation, debugging, and data management
- TrainablePipe (processor, spacy/pipeline/) — Base class for neural pipeline components that can be trained, updated, and serialized, with consistent interfaces for prediction and learning
- Corpus (loader, spacy/training/corpus.py) — Data loader that reads training files in various formats (.spacy, .json, .conllu) and yields Example objects for training
- Config System (registry, spacy/util.py) — Configuration registry that maps string identifiers to Python functions for components, optimizers, and other configurable objects using Thinc's registry
- Tokenizer (transformer, spacy/tokenizer.py) — Language-specific text segmentation that splits raw text into Token objects using rules, exceptions, and character patterns
- Training Loop (executor, spacy/cli/train.py) — Main training orchestrator that manages epochs, batching, gradient updates, evaluation, and checkpoint saving with configurable optimization
- DocBin (serializer, spacy/tokens/docbin.py) — Efficient binary serialization format for Doc objects that preserves annotations while enabling fast loading and minimal disk usage
- Matcher (processor, spacy/matcher/) — Rule-based pattern matching system that finds token sequences based on lexical attributes, POS tags, and dependency relations
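The plugin architecture behind these components is visible in how custom ones are registered: a decorated callable gets a string name, after which any pipeline can add it by that name (the component below is a toy example):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("debug_length")
def debug_length(doc: Doc) -> Doc:
    # Toy component: pipeline components are callables Doc -> Doc
    print(f"{len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("debug_length", last=True)  # referenced by registered name
nlp("Components are resolved through a string registry.")
```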
Frequently Asked Questions
What is spaCy used for?
explosion/spacy is a 9-component, industrial-strength NLP library written in Python that tokenizes, tags, parses, and extracts entities from text. Data flows through 7 distinct pipeline stages. The codebase contains 954 files.
How is spaCy architected?
spaCy is organized into 5 architecture layers: Core Library, Pipeline Components, Training System, CLI Tools, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through spaCy?
Data moves through 7 stages: Text Tokenization → Pipeline Processing → Training Data Loading → Batch Formation → Forward Pass → .... Text enters by calling the Language object (the conventional nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. This pipeline design reflects a complex multi-stage processing system.
What technologies does spaCy use?
The core stack includes Thinc (Neural network framework providing model definition, training loops, and optimization algorithms), Cython (Performance-critical components compiled to C extensions for fast token processing and data structures), NumPy (Numerical computing for tensor operations and efficient array manipulation in neural networks), Pydantic (Configuration schema validation and serialization for type-safe config parsing), Typer (CLI framework providing command routing, argument parsing, and help generation), YAML/JSON (Human-readable configuration format for training parameters and pipeline setup), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does spaCy have?
spaCy exhibits 4 data pools (Vocab String Store, Model Checkpoints), 4 feedback loops, 7 control points, 4 delays. The feedback loops handle training-loop and gradient-accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does spaCy use?
5 design patterns detected: Plugin Architecture, Configuration-Driven Design, Shared State Management, Command Pattern, Pipeline Pattern.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.