explosion/spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

33,492 stars · Python · 9 components

Industrial-strength NLP library that tokenizes, tags, parses, and extracts entities from text

Under the hood, the system uses 4 feedback loops, 4 data pools, and 7 control points to manage its runtime behavior.

A 9-component library. 954 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Text enters through the Language object's __call__ method (by convention, nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. A short usage sketch follows the stage list below.

  1. Text Tokenization — The Tokenizer splits raw text into Token objects using language-specific rules, exceptions, and Unicode segmentation, creating a Doc container with shared Vocab for string interning (config: tokenizer)
  2. Pipeline Processing — The Language orchestrator passes the Doc through each enabled pipeline component (tagger, parser, NER) in sequence, with each component adding annotations in-place [Doc → Doc] (config: nlp.pipeline, components)
  3. Training Data Loading — Corpus reader loads training files in .spacy/.json/.conllu formats, creates Example objects with predicted and reference Doc pairs for supervised learning (config: training.train_corpus, training.dev_corpus)
  4. Batch Formation — Training examples are grouped into batches of configurable size with proper alignment and padding for efficient neural network processing [Example → TrainingBatch] (config: training.batcher)
  5. Forward Pass — TrainablePipe components process batched examples through their neural networks, computing predictions and intermediate representations [TrainingBatch → Predictions] (config: components.*.model)
  6. Loss Computation — Each component compares predictions against gold annotations to compute task-specific losses (cross-entropy for classification, structured loss for parsing) [Predictions → Loss Values] (config: training.dropout)
  7. Gradient Update — The optimizer (Adam, SGD, etc.) computes gradients from losses and updates model parameters according to the configured learning rate and schedule [Loss Values → Updated Parameters] (config: training.optimizer, training.learning_rate)
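
The processing half of this flow (stages 1–2) is what most applications touch directly. A minimal sketch, assuming the en_core_web_sm pipeline package is installed:

    import spacy

    # Load a trained pipeline package; this builds the Language object,
    # its shared Vocab, and the configured components.
    nlp = spacy.load("en_core_web_sm")

    # Calling the Language object tokenizes the text into a Doc (stage 1),
    # then runs each enabled component over it in order (stage 2).
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

    print(nlp.pipe_names)                          # component order
    print([(t.text, t.pos_) for t in doc])         # tagger annotations
    print([(e.text, e.label_) for e in doc.ents])  # NER entity spans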

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Doc spacy/tokens/doc.pyx
Container with array of Token objects, spans, entities, and linguistic annotations with shared Vocab for string interning
Created by tokenizer from raw text, progressively annotated by pipeline components, and returned as final processed document
Example spacy/training/example.pyx
Container class (Cython, not Pydantic) with predicted: Doc, reference: Doc, and alignment data for training and evaluation
Created from training data with gold annotations, used to compute loss and gradients during training
ConfigSchema spacy/schemas.py
Pydantic models defining training config structure with nested sections for nlp, training, components, and initialization parameters
Loaded from config.cfg files (thinc's INI-style format), validated against the schema, and used to instantiate training components
TrainingBatch spacy/training/
List[Example] with consistent batch size and alignment information for efficient neural network processing
Created by batching Examples during training, fed to model for forward/backward passes
Vocab spacy/vocab.py
String store and lexeme database with vectors, morphology, and entity type mappings shared across documents
Initialized once per Language instance, populated during model loading, shared across all processed documents
ModelMeta spacy/schemas.py
Pydantic model with lang: str, name: str, version: str, pipeline: List[str], and compatibility metadata
Loaded from model package metadata, used for version checking and pipeline configuration
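
To make the Example contract concrete, here is a minimal training-step sketch using the public spacy.training.Example API; the blank English pipeline and the toy annotation are illustrative only:

    import spacy
    from spacy.training import Example

    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")   # trainable component
    ner.add_label("ORG")

    # An Example pairs a predicted Doc (from this pipeline's tokenizer)
    # with a reference Doc built from the gold annotations.
    train_data = [("Apple hired ten people.", {"entities": [(0, 5, "ORG")]})]
    examples = [Example.from_dict(nlp.make_doc(text), annots)
                for text, annots in train_data]

    optimizer = nlp.initialize(lambda: examples)  # size models from the data
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)  # forward pass, loss, gradient step
    print(losses)   # e.g. {'ner': ...}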

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment weakly guarded

GPU is available and functional when require_gpu() is called - no fallback mechanism exists

If this fails: CLI training commands crash with unhelpful errors if GPU drivers are broken, CUDA is misconfigured, or GPU memory is exhausted

spacy/cli/_util.py:require_gpu
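
Application code can soften this assumption by preferring rather than requiring the GPU: spacy.prefer_gpu() returns False instead of raising when no usable GPU is found. A minimal sketch:

    import spacy

    # prefer_gpu() activates the GPU when one is usable but returns False
    # and stays on CPU otherwise; require_gpu() raises instead of falling back.
    if spacy.prefer_gpu():
        print("using GPU")
    else:
        print("falling back to CPU")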
critical Domain unguarded

Input text contains valid Unicode that won't cause tokenizer segmentation faults - raw bytes or malformed encoding can crash the Cython tokenizer

If this fails: Processing user-generated content or files with unknown encoding causes segmentation faults that terminate the Python process

spacy/language.py:Language.__call__()
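
Callers processing untrusted input can normalize it before it reaches the tokenizer. The sanitize() helper below is hypothetical, not part of spaCy; it guarantees valid Unicode by round-tripping through UTF-8 with replacement characters:

    import spacy

    def sanitize(raw):
        # Hypothetical guard: decode bytes with replacement characters and
        # round-trip str through UTF-8 so the tokenizer never sees
        # malformed input.
        if isinstance(raw, bytes):
            return raw.decode("utf-8", errors="replace")
        return raw.encode("utf-8", errors="replace").decode("utf-8")

    nlp = spacy.blank("en")
    doc = nlp(sanitize(b"caf\xe9 \xff"))   # invalid UTF-8 becomes U+FFFD
    print([t.text for t in doc])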
critical Resource unguarded

Training files (.spacy/.json/.conllu) fit in available RAM when loaded - no streaming or memory-mapped loading for large corpora

If this fails: Training crashes with OOM when dataset size approaches system memory limits, especially on cloud instances with limited RAM

spacy/training/corpus.py:Corpus
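
For corpora stored as .spacy files, peak Doc memory can at least be bounded at read time: DocBin.get_docs() yields documents lazily, though the serialized container itself is still read into RAM whole. A sketch, assuming a train.spacy file exists and handle() is a hypothetical per-document callback:

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")

    # from_disk() loads the serialized container into memory, but
    # get_docs() is a generator: Doc objects are reconstructed one at a
    # time, so downstream code can stream instead of holding them all.
    db = DocBin().from_disk("train.spacy")
    for doc in db.get_docs(nlp.vocab):
        handle(doc)   # hypothetical per-document callback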
critical Contract unguarded

Example objects have aligned predicted and reference Doc objects with matching tokenization - misalignment causes silent gradient corruption

If this fails: Training produces models that perform worse than random on misaligned data, but loss curves look normal so the problem goes undetected

spacy/pipeline/trainable_pipe.pyx:TrainablePipe.update()
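
A cheap pre-flight check is to compare the predicted and reference tokenization of each Example; differing tokenizations are legal (spaCy aligns them internally), but a high mismatch rate usually points at broken gold data. A sketch:

    def report_misaligned(examples):
        # Flag Examples whose predicted and reference Docs tokenize
        # differently; alignment copes with this, but a high rate often
        # signals bad gold annotations rather than honest tokenizer drift.
        for i, eg in enumerate(examples):
            pred = [t.text for t in eg.predicted]
            ref = [t.text for t in eg.reference]
            if pred != ref:
                print(f"example {i}: {len(pred)} predicted vs {len(ref)} reference tokens")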
critical Scale unguarded

Distinct strings never hash to the same key - strings are interned as 64-bit hash IDs with no collision handling

If this fails: Processing very large corpora with many millions of unique tokens makes a hash collision conceivable; two distinct strings would silently share one vocabulary entry, corrupting mappings with no error raised

spacy/vocab.py:Vocab
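
The interning contract is easy to observe from Python: adding a string returns its integer key, and the same key maps back to the string for every Doc sharing the Vocab. A small demonstration:

    import spacy

    nlp = spacy.blank("en")
    key = nlp.vocab.strings.add("tokenization")      # intern, get integer key
    assert nlp.vocab.strings[key] == "tokenization"  # key -> string
    assert nlp.vocab.strings["tokenization"] == key  # string -> key

    doc = nlp("tokenization matters")
    assert doc.vocab is nlp.vocab   # every Doc shares the Language's Vocab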
warning Temporal weakly guarded

Model checkpoints remain valid across training restarts - no version checking for config schema changes or component updates

If this fails: Resuming training from old checkpoints after spaCy updates silently loads incompatible weights, causing training instability or wrong predictions

spacy/cli/train.py:training_loop
warning Ordering unguarded

Pipeline components are applied in the exact order specified in config.nlp.pipeline - no dependency resolution or validation

If this fails: Incorrectly ordered components (e.g., NER before tagger) produce degraded results without errors, making debugging pipeline performance issues difficult

spacy/language.py:pipeline
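
Ordering problems can be surfaced on demand with nlp.analyze_pipes(), which compares each component's declared requires against what earlier components assigns; nothing runs this check automatically at call time. A sketch with a toy component:

    import spacy
    from spacy.language import Language

    @Language.component("needs_entities", requires=["doc.ents"])
    def needs_entities(doc):
        # Toy component that declares a dependency on doc.ents.
        return doc

    nlp = spacy.blank("en")
    nlp.add_pipe("needs_entities")   # nothing earlier assigns doc.ents

    # analyze_pipes() diffs each component's 'requires' against what
    # earlier components 'assigns' and reports unmet dependencies.
    print(nlp.analyze_pipes()["problems"])   # {'needs_entities': ['doc.ents']}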
warning Environment unguarded

Shell command execution inherits safe PATH and environment variables - no sanitization of subprocess execution context

If this fails: Malicious model packages or config files can execute arbitrary shell commands during spacy download or training setup

spacy/cli/_util.py:run_command
warning Shape weakly guarded

All Examples in a training batch have compatible tensor shapes after padding - batch creation doesn't validate dimension consistency

If this fails: Mixed sequence lengths or incompatible feature dimensions in batches cause cryptic CUDA/PyTorch errors during forward pass

spacy/training/batchers.py
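
User-land training loops typically form batches with spacy.util.minibatch, optionally driven by a compounding size schedule; each component pads its own tensors within a batch. A sketch (the epoch count and size schedule are arbitrary):

    import random
    from spacy.util import minibatch
    from thinc.api import compounding

    def train_epochs(nlp, examples, optimizer, epochs=10):
        # Batch size grows from 4 to 32 by a factor of 1.001 per batch;
        # each component pads its own tensors per batch internally.
        for _ in range(epochs):
            random.shuffle(examples)
            losses = {}
            for batch in minibatch(examples, size=compounding(4.0, 32.0, 1.001)):
                nlp.update(batch, sgd=optimizer, losses=losses)
            print(losses)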
warning Resource unguarded

Serialized Doc objects won't exceed available disk space or memory during bulk operations - no size estimation or chunking

If this fails: Large corpus processing fills up disk space or exhausts memory during serialization, causing data loss if the process is interrupted

spacy/tokens/docbin.py:DocBin.to_bytes()
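
One mitigation consistent with this assumption is to shard serialization into multiple .spacy files rather than one monolithic to_bytes() call. A sketch; the shard size of 2,000 docs is arbitrary:

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")
    docs = (nlp(f"document number {i}") for i in range(10_000))

    # Write shards of 2,000 docs instead of serializing everything at once,
    # bounding the memory any single to_disk() call needs.
    db, shard = DocBin(), 0
    for i, doc in enumerate(docs, start=1):
        db.add(doc)
        if i % 2_000 == 0:
            db.to_disk(f"corpus-{shard}.spacy")
            db, shard = DocBin(), shard + 1
    if len(db):
        db.to_disk(f"corpus-{shard}.spacy")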

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Vocab String Store (registry)
Global string interning table that maps strings to integer IDs, shared across all Doc objects for memory efficiency
Model Checkpoints (file-store)
Serialized model weights and configuration saved during training at configurable intervals for recovery and deployment
Training Cache (in-memory)
Cached processed training examples and computed features to avoid recomputation across epochs
Component Registry (registry)
Global registry mapping string names to component factories, optimizers, and other configurable objects
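
The registry is what lets strings in a config resolve to Python objects: a factory registered under a name becomes addressable from nlp.add_pipe and from the [components] section of a config. A minimal sketch with a toy factory:

    import spacy
    from spacy.language import Language

    @Language.factory("sentence_counter", default_config={"verbose": True})
    def create_sentence_counter(nlp, name, verbose: bool):
        # The factory receives the nlp object, the component's name, and
        # its config settings, and returns the component callable.
        def sentence_counter(doc):
            if verbose:
                print(f"{name}: {len(list(doc.sents))} sentences")
            return doc
        return sentence_counter

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")       # assigns sentence boundaries
    nlp.add_pipe("sentence_counter")  # resolved by name via the registry
    nlp("One sentence. Another sentence.")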

Technology Stack

Thinc (framework)
Neural network framework providing model definition, training loops, and optimization algorithms
Cython (runtime)
Performance-critical components compiled to C extensions for fast token processing and data structures
NumPy (compute)
Numerical computing for tensor operations and efficient array manipulation in neural networks
Pydantic (serialization)
Configuration schema validation and serialization for type-safe config parsing
Typer (framework)
CLI framework providing command routing, argument parsing, and help generation
thinc config / JSON (serialization)
Human-readable INI-style config.cfg format for training parameters and pipeline setup, plus JSON for package metadata (a parsing sketch follows this list)
Transformers (library)
Integration with Hugging Face models for modern neural architectures like BERT and GPT
pytest (testing)
Comprehensive test suite for unit, integration, and regression testing
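
Worth noting: the training config is not YAML. spaCy v3 uses thinc's INI-style format (config.cfg), with bracketed section headers and JSON-typed values. A sketch of parsing one from a string:

    from thinc.api import Config

    # spaCy v3 training configs use thinc's INI-style format, with section
    # headers in brackets and JSON-typed values.
    cfg_text = """
    [nlp]
    lang = "en"
    pipeline = ["tagger"]

    [training]
    dropout = 0.1
    """
    config = Config().from_str(cfg_text)
    print(config["nlp"]["pipeline"])   # ['tagger']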

Frequently Asked Questions

What is spaCy used for?

spaCy (explosion/spacy) is an industrial-strength NLP library that tokenizes, tags, parses, and extracts entities from text. It is a 9-component library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 954 files.

How is spaCy architected?

spaCy is organized into 5 architecture layers: Core Library, Pipeline Components, Training System, CLI Tools, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through spaCy?

Data moves through 7 stages: Text Tokenization → Pipeline Processing → Training Data Loading → Batch Formation → Forward Pass → .... Text enters through the Language object's __call__ method (by convention, nlp(text)), gets tokenized into a Doc object, then flows through a configurable pipeline of components (tagger, parser, NER, etc.) that progressively add linguistic annotations. During training, the system loads Example objects containing gold annotations, computes predictions through the same pipeline, calculates losses, and updates model weights through backpropagation. This pipeline design reflects a complex multi-stage processing system.

What technologies does spaCy use?

The core stack includes Thinc (neural network framework providing model definition, training loops, and optimization algorithms), Cython (performance-critical components compiled to C extensions for fast token processing and data structures), NumPy (numerical computing for tensor operations and efficient array manipulation in neural networks), Pydantic (configuration schema validation and serialization for type-safe config parsing), Typer (CLI framework providing command routing, argument parsing, and help generation), thinc config/JSON (human-readable INI-style configuration for training parameters and pipeline setup, plus JSON for package metadata), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does spaCy have?

spaCy exhibits 4 data pools (Vocab String Store, Model Checkpoints), 4 feedback loops, 7 control points, and 4 delays. The feedback loops include the training loop and gradient accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does spaCy use?

5 design patterns detected: Plugin Architecture, Configuration-Driven Design, Shared State Management, Command Pattern, Pipeline Pattern.

Analyzed on April 20, 2026 by CodeSea.