explosion/spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

33,403 stars · Python · 12 components · 22 connections

Industrial-strength NLP library with neural models, pipelines and training for 70+ languages

Text flows through tokenization, then sequential pipeline components (tagger, parser, NER, etc.), with training data flowing through alignment and loss computation

Under the hood, the system uses two feedback loops, three data pools, and five control points to manage its runtime behavior.

Structural Verdict

A 12-component ML training system with 22 connections. 954 files analyzed. Highly interconnected: components depend heavily on each other.

How Data Flows Through the System

  1. Text Input — Raw text strings are input to the Language pipeline (config: tokenizer)
  2. Tokenization — Text is split into tokens using language-specific rules (config: tokenizer)
  3. Pipeline Processing — Tokens pass through configured pipeline components (tagger, parser, NER) (config: pipeline, components)
  4. Training Alignment — During training, predicted annotations are aligned with gold standard (config: training.batcher)
  5. Loss Computation — Component losses are computed from aligned predictions and targets (config: training.dropout, training.optimizer)
  6. Model Updates — Gradients are backpropagated to update model weights (config: training.optimizer, training.max_epochs)
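The sequential-pipeline idea in steps 1–3 can be sketched in a few lines of plain Python. This is an illustrative sketch only, not spaCy's actual API: the `Doc` stand-in is a dict, and the `tagger`/`ner` components are toys.

```python
# Minimal sketch of the sequential pipeline described above.
# All names are illustrative, not spaCy's actual API.

def tokenize(text):
    """Stand-in for language-specific tokenization: whitespace split."""
    return {"text": text, "tokens": text.split(), "annotations": {}}

def tagger(doc):
    """Toy component: tag every token as 'X'."""
    doc["annotations"]["tags"] = ["X"] * len(doc["tokens"])
    return doc

def ner(doc):
    """Toy component: mark capitalized tokens as entities."""
    doc["annotations"]["ents"] = [t for t in doc["tokens"] if t.istitle()]
    return doc

def run_pipeline(text, components):
    """Tokenize, then pass the doc through each component in order."""
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("Apple is looking at startups", [tagger, ner])
print(doc["annotations"]["ents"])  # ['Apple']
```

The key property this sketch shares with the real pipeline is that each component receives the same document object, annotates it in place, and hands it to the next component.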

System Behavior

How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Vocab Storage (in-memory)
Hash-to-string mappings and lexical attributes accumulate in vocabulary
Training Examples (buffer)
Batches of aligned training examples buffer during training
Model Registry (in-memory)
Registered components, architectures, and functions accumulate in registry
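The "hash-to-string mappings" pool above can be illustrated with a toy string store. This sketch only mirrors the accumulation behavior; spaCy's own store uses MurmurHash, not MD5, and the class below is invented for illustration.

```python
import hashlib

class ToyStringStore:
    """Toy hash-to-string vocabulary illustrating the pool above.
    Not spaCy's StringStore: spaCy uses MurmurHash, this uses MD5."""

    def __init__(self):
        self._map = {}

    def add(self, string):
        # Derive a stable 64-bit key from the string's bytes.
        key = int.from_bytes(hashlib.md5(string.encode()).digest()[:8], "big")
        # Entries accumulate in memory, as the pool description notes.
        self._map[key] = string
        return key

    def __getitem__(self, key):
        return self._map[key]

store = ToyStringStore()
key = store.add("coffee")
print(store[key])  # coffee
```

Because the key is derived deterministically from the string, adding the same string twice returns the same key and the pool only grows with distinct strings.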

Feedback Loops

Two loops detected during analysis: the training loop itself (batch, loss, weight update) and convergence monitoring.

Delays & Async Processing

Two delays detected during analysis; no further detail was captured.

Control Points

Five configuration-driven control points, drawn from keys such as tokenizer, pipeline, training.batcher, training.dropout, and training.optimizer, steer runtime behavior.

Technology Stack

Thinc (framework)
Neural network framework powering model training
Cython (build)
Performance optimization for critical paths
Pydantic (library)
Configuration validation and schema definition
NumPy (library)
Numerical computations and array operations
Typer (library)
CLI framework for command-line interface
pytest (testing)
Testing framework
React (framework)
Documentation website frontend
setuptools (build)
Package building and distribution

Key Components

Sub-Modules

Website Documentation (independence: high)
React-based documentation site with interactive demos and API reference

Configuration

spacy/pipeline/_edit_tree_internals/schemas.py (python-pydantic)

spacy/schemas.py (python-pydantic)

Science Pipeline

  1. Load training data — Convert various formats (CoNLL, JSON) to spaCy Doc objects [variable text lengths → (n_docs, variable_tokens)] spacy/cli/convert.py
  2. Tokenization — Split text into tokens using language-specific rules [raw text strings → (n_tokens, token_features)] spacy/lang/
  3. Feature extraction — Extract token embeddings through tok2vec component [(n_tokens, vocab_size) → (n_tokens, embedding_dim)] spacy/pipeline/
  4. Component processing — Apply NER, POS tagging, parsing through neural networks [(n_tokens, embedding_dim) → (n_tokens, n_classes)] spacy/pipeline/
  5. Alignment — Align predicted tokens with gold standard for training [(pred_tokens, gold_tokens) → Alignment(x2y, y2x)] spacy/training/alignment.py
  6. Loss computation — Compute component-specific losses from aligned predictions [(n_tokens, n_classes) → scalar loss] spacy/pipeline/
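Step 5's alignment of predicted tokens against gold tokens (the x2y/y2x mappings) can be sketched by comparing character spans over the same underlying text. This is a simplified sketch of the idea, not spaCy's alignment implementation; the function names are invented.

```python
def offsets(tokens):
    """Cumulative character spans of tokens over their concatenation."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def align(pred, gold):
    """For each predicted token, the gold token indices it overlaps (x2y)."""
    gold_spans = offsets(gold)
    x2y = []
    for start, end in offsets(pred):
        x2y.append([j for j, (gs, ge) in enumerate(gold_spans)
                    if gs < end and ge > start])
    return x2y

# "don't" tokenized two different ways over the same characters:
print(align(["do", "n't"], ["don't"]))  # [[0], [0]]
print(align(["don't"], ["do", "n't"]))  # [[0, 1]]
```

Character offsets make the mapping robust to tokenization mismatches: losses computed on predicted tokens can still be attributed to the right gold annotations even when token boundaries disagree.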

Frequently Asked Questions

What is spaCy used for?

spaCy (explosion/spacy) is an industrial-strength NLP library with neural models, pipelines, and training for 70+ languages. It is a 12-component ML training system written in Python; the codebase contains 954 files, and its components depend heavily on each other.

How is spaCy architected?

spaCy is organized into 5 architecture layers: Core NLP, Training System, CLI Interface, Language Support, and 1 more. Highly interconnected — components depend on each other heavily. This layered structure enables tight integration between components.

How does data flow through spaCy?

Data moves through 6 stages: Text Input → Tokenization → Pipeline Processing → Training Alignment → Loss Computation → .... Text flows through tokenization, then sequential pipeline components (tagger, parser, NER, etc.), with training data flowing through alignment and loss computation. This pipeline design reflects a complex multi-stage processing system.

What technologies does spaCy use?

The core stack includes Thinc (Neural network framework powering model training), Cython (Performance optimization for critical paths), Pydantic (Configuration validation and schema definition), NumPy (Numerical computations and array operations), Typer (CLI framework for command-line interface), pytest (Testing framework), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does spaCy have?

spaCy exhibits 3 data pools (Vocab Storage, Training Examples), 2 feedback loops, 5 control points, 2 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does spaCy use?

6 design patterns detected: Component Registry, Pipeline Architecture, Configuration Schema, CLI Command Pattern, Multi-language Support, and 1 more.

Analyzed on March 31, 2026 by CodeSea.