eleutherai/gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

7,405 stars · Python · 10 components · 6 connections

Large-scale autoregressive transformer training framework using Megatron and DeepSpeed

Text data flows from preprocessed indexed datasets through distributed dataloaders to model parallel transformers, with checkpoints saved periodically and evaluation metrics logged to W&B.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

Structural Verdict

A 10-component ML training system with 6 connections. 121 files analyzed. Loosely coupled: components are relatively independent.

How Data Flows Through the System


  1. Data Preprocessing — Raw text tokenized and packed into indexed datasets using prepare_data.py (config: data_path, vocab_file, seq_length)
  2. Dataset Loading — GPT2Dataset loads token sequences with BlendableDataset for weighted sampling (config: data_path, train_data_paths, valid_data_paths +1)
  3. Distributed Batching — DistributedBatchSampler ensures consistent batches across data parallel workers (config: batch_size, micro_batch_size, data_parallel_size)
  4. Model Forward Pass — Transformer processes sequences with tensor/pipeline parallelism across GPUs (config: num_layers, hidden_size, num_attention_heads +2)
  5. Loss Computation — Cross-entropy loss calculated with gradient accumulation for large effective batch sizes (config: gradient_accumulation_steps, loss_scale)
  6. Optimizer Step — DeepSpeed optimizer updates model weights with ZeRO memory optimization (config: optimizer, lr, weight_decay +1)
  7. Checkpointing — Model state saved to distributed checkpoints at regular intervals (config: save, save_interval, checkpoint_factor)
  8. Evaluation — Model evaluated on validation tasks using lm-evaluation-harness (config: eval_tasks, eval_interval)
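
The distributed batching stage above can be sketched in plain Python. This is a hedged illustration of rank-consistent sharding in the spirit of `DistributedBatchSampler`; the function name `shard_batch` and the stride-based slicing are assumptions made for illustration, not the actual gpt-neox implementation.

```python
# Minimal sketch of rank-consistent batch sharding (illustrative only).
# Each data-parallel worker takes a disjoint, deterministic slice of the
# global batch, so all ranks stay in lockstep without communication.

def shard_batch(sample_indices, rank, data_parallel_size):
    """Return this rank's strided slice of the global batch."""
    return sample_indices[rank::data_parallel_size]

global_batch = list(range(8))      # sample indices for one global batch
shards = [shard_batch(global_batch, r, 4) for r in range(4)]

# Every sample appears exactly once across the 4 workers.
assert sorted(i for s in shards for i in s) == global_batch
```

Because the slice depends only on the rank and the shared index list, every worker computes its shard independently yet consistently, which is the property the real sampler must guarantee.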

System Behavior

How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Indexed Datasets (file-store)
Memory-mapped tokenized text sequences for training and validation
Model Checkpoints (file-store)
Distributed model state snapshots saved periodically during training
GPU Memory Cache (cache)
KV cache for attention layers during inference and generation
Gradient Buffers (buffer)
Accumulated gradients across microbatches before optimizer step
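
As a rough sketch of how the gradient-buffer pool behaves, the loop below accumulates stand-in gradients across `gradient_accumulation_steps` microbatches before applying a single averaged update. All names and numbers are illustrative; real training uses PyTorch tensors and the DeepSpeed optimizer, not this hand-rolled arithmetic.

```python
# Illustrative gradient accumulation: grad_buffer persists across
# microbatches and is only consumed (and cleared) at the optimizer step.

gradient_accumulation_steps = 4
weights = [1.0, -2.0]
grad_buffer = [0.0, 0.0]           # the "Gradient Buffers" pool
lr = 0.1

def microbatch_grads(step):
    # Stand-in for backward(): pretend each microbatch yields these grads.
    return [0.5, -0.25]

for step in range(gradient_accumulation_steps):
    for i, g in enumerate(microbatch_grads(step)):
        grad_buffer[i] += g        # accumulate; no optimizer step yet

# One optimizer step on the averaged gradient, then clear the buffer.
for i in range(len(weights)):
    weights[i] -= lr * grad_buffer[i] / gradient_accumulation_steps
    grad_buffer[i] = 0.0
```

The buffer is what makes a large effective batch size possible on limited GPU memory: only one microbatch of activations is live at a time, while gradients for the whole effective batch accumulate in place.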


Technology Stack

PyTorch (framework)
Deep learning framework
DeepSpeed (framework)
Distributed training optimization
Megatron-LM (framework)
Model parallel transformer implementation
CUDA (infra)
GPU acceleration and custom kernels
Weights & Biases (infra)
Experiment tracking and logging
lm-evaluation-harness (testing)
Standardized language model evaluation
NumPy (library)
Numerical computations and data handling
boto3 (infra)
S3 checkpoint storage

Key Components

Sub-Modules

Core Training Framework (independence: low)
Main distributed training loop with model parallel transformers
Evaluation Suite (independence: medium)
Model evaluation using lm-evaluation-harness tasks
Text Generation (independence: medium)
Autoregressive text generation and inference
Data Preprocessing (independence: high)
Convert raw text to tokenized indexed datasets

Configuration

configs/1-3B-transformer-engine.yml (yaml)

configs/1-3B.yml (yaml)

configs/125M-dmoe.yml (yaml)

configs/125M-json.yml (yaml)
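
The files above are the shipped presets. As a rough, hypothetical sketch of what such a config might contain, the fragment below uses key names drawn from the parameters cited elsewhere on this page; the values are invented and should not be treated as matching any of the listed files.

```yaml
# Illustrative fragment: key names follow the parameters mentioned above,
# but the values are made up, not copied from any shipped config.
num_layers: 24
hidden_size: 2048
num_attention_heads: 16
seq_length: 2048
micro_batch_size: 4
gradient_accumulation_steps: 8
optimizer: Adam
lr: 0.0002
weight_decay: 0.01
save: checkpoints
save_interval: 1000
eval_interval: 1000
```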

Science Pipeline

  1. Tokenization — Raw text converted to token IDs using vocab_file tokenizer [Variable length strings → (num_documents, variable_length)] prepare_data.py
  2. Sequence Packing — Token sequences packed into fixed-length training samples [(num_documents, variable_length) → (num_samples, seq_length)] megatron/data/gpt2_dataset.py
  3. Embedding Lookup — Token IDs converted to dense embeddings with positional encoding [(batch_size, seq_length) → (batch_size, seq_length, hidden_size)] megatron/model/language_model.py
  4. Transformer Layers — Self-attention and feedforward processing across num_layers [(batch_size, seq_length, hidden_size) → (batch_size, seq_length, hidden_size)] megatron/model/transformer.py
  5. Language Modeling Head — Hidden states projected to vocabulary logits for next token prediction [(batch_size, seq_length, hidden_size) → (batch_size, seq_length, vocab_size)] megatron/model/language_model.py
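
The shape transformations in steps 3 to 5 can be traced with a toy NumPy sketch. The tiny dimensions and random weights below are illustrative assumptions; only the shape algebra mirrors the pipeline above.

```python
import numpy as np

# Toy walk-through of the shape flow: tokens -> embeddings -> logits.
batch_size, seq_length, hidden_size, vocab_size = 2, 8, 16, 100

rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_length))

# Embedding lookup: (batch, seq) -> (batch, seq, hidden)
embedding_table = rng.normal(size=(vocab_size, hidden_size))
hidden = embedding_table[token_ids]
assert hidden.shape == (batch_size, seq_length, hidden_size)

# Transformer layers preserve this shape: (batch, seq, hidden) -> same.
# (The attention/feedforward math itself is elided here.)

# LM head: project hidden states back to vocabulary logits.
lm_head = rng.normal(size=(hidden_size, vocab_size))
logits = hidden @ lm_head
assert logits.shape == (batch_size, seq_length, vocab_size)
```

Note that the embedding lookup is just integer fancy indexing into the table, and the LM head is a single matrix multiply; the transformer stack in between changes values but never the tensor shape.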




Frequently Asked Questions

What is gpt-neox used for?

eleutherai/gpt-neox is a large-scale autoregressive transformer training framework built on Megatron and DeepSpeed. It is a 10-component ML training codebase written in Python, comprising 121 files. The components are loosely coupled and relatively independent.

How is gpt-neox architected?

gpt-neox is organized into 5 architecture layers: Entry Scripts, Configuration Layer, Core Engine, Data Processing, and 1 more. This layered structure keeps concerns separated and leaves the modules loosely coupled and relatively independent.

How does data flow through gpt-neox?

Data moves through 8 stages: Data Preprocessing → Dataset Loading → Distributed Batching → Model Forward Pass → Loss Computation → .... Text data flows from preprocessed indexed datasets through distributed dataloaders to model parallel transformers, with checkpoints saved periodically and evaluation metrics logged to W&B. This pipeline design reflects a complex multi-stage processing system.

What technologies does gpt-neox use?

The core stack includes PyTorch (Deep learning framework), DeepSpeed (Distributed training optimization), Megatron-LM (Model parallel transformer implementation), CUDA (GPU acceleration and custom kernels), Weights & Biases (Experiment tracking and logging), lm-evaluation-harness (Standardized language model evaluation), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does gpt-neox have?

gpt-neox exhibits 4 data pools (Indexed Datasets, Model Checkpoints), 3 feedback loops, 5 control points, and 4 delays. The feedback loops handle recursion and retries. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does gpt-neox use?

5 design patterns detected: Model Parallel Architecture, Configuration-Driven Design, Distributed Training Orchestration, Memory-Mapped Data Loading, CUDA Kernel Optimization.

Analyzed on March 31, 2026 by CodeSea.