eleutherai/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Trains billion-parameter autoregressive language models across multiple GPUs using model parallelism
Under the hood, the system relies on 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 10-component ML training system. 121 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling.
- Configuration and initialization — NeoXArgs.consume_deepy_args() parses YAML config files and validates all parameters, then initialize_model_parallel() creates process groups for tensor, pipeline, and data parallelism based on the pipe_parallel_size and model_parallel_size settings (config: pipe_parallel_size, model_parallel_size, num_layers +2)
- Data pipeline construction — get_train_data_loader() loads IndexedDatasets from data_path, creates BlendableDataset with train_data_paths and blend weights, then wraps with DistributedBatchSampler to ensure each worker gets non-overlapping batches of size micro_batch_size [DatasetDocument → TrainingBatch] (config: data_path, train_data_paths, train_blend +2)
- Model forward pass — GPTModelPipe.forward() processes input tokens through embedding layer, then each ParallelTransformerLayer applies self-attention and MLP in parallel across tensor parallel groups, with activations communicated between pipeline stages until final language model head produces logits [TrainingBatch → ModelOutputs] (config: num_layers, hidden_size, num_attention_heads +3)
- Loss computation and backward pass — forward_step() computes cross-entropy loss between model logits and shifted input labels using loss_mask to ignore padding tokens, then backward pass computes gradients while pipeline parallel stages communicate gradient tensors [ModelOutputs] (config: loss_scale, gradient_clipping)
- Gradient synchronization and optimization — DeepSpeedEngine handles gradient reduction across data parallel groups, applies gradient clipping with the clip_grad value, then optimizer.step() updates parameters using the configured learning rate schedule and ZeRO optimizer states (config: learning_rate, lr_decay_style, clip_grad +2); a simplified single-GPU sketch of stages 3 through 5 follows this list
- Checkpoint saving — save_checkpoint() periodically collects distributed model weights from all tensor parallel ranks, gathers optimizer states, and writes CheckpointState to save directory with iteration number and random states for exact resumption [CheckpointState] (config: save, save_interval, checkpoint_factor)
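Stripped of parallelism, stages 3 through 5 compress into a few lines. A minimal single-GPU sketch (the tiny stand-in model, shapes, and hyperparameters are illustrative, not GPT-NeoX's actual classes):

```python
# Illustrative single-GPU version of forward pass -> loss -> backward -> update.
# Real GPT-NeoX splits the model across pipeline/tensor parallel ranks and
# delegates the step to DeepSpeed; this only shows the shape of the loop.
import torch
import torch.nn.functional as F

vocab_size, seq_length, batch_size = 1000, 128, 4
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),   # stand-in embedding layer
    torch.nn.Linear(64, vocab_size),      # stand-in language model head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

tokens = torch.randint(vocab_size, (batch_size, seq_length + 1))
inputs, labels = tokens[:, :-1], tokens[:, 1:]        # labels shifted by one

logits = model(inputs)                                 # forward pass -> logits
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                                        # backward pass
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
optimizer.step()                                       # parameter update
optimizer.zero_grad()
```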
Data Models
The data structures that flow between stages — the contracts that hold the system together.
NeoXArgs (megatron/neox_arguments/neox_args.py): dataclass with 200+ fields including num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, batch_size: int, learning_rate: float, pipe_parallel_size: int, model_parallel_size: int, plus training hyperparameters, data paths, and distributed settings
Parsed from YAML configs at startup, validated and distributed to all processes, then accessed throughout training to control model architecture, parallelism strategy, and hyperparameters
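In practice several YAML files are layered and merged into this dataclass. A hedged sketch, assuming the NeoXArgs.from_ymls classmethod and using config paths that may differ in a given checkout:

```python
# Hedged sketch: load and inspect a layered config. from_ymls is assumed
# available on NeoXArgs; the YAML file names are illustrative.
from megatron.neox_arguments import NeoXArgs

neox_args = NeoXArgs.from_ymls(["configs/125M.yml", "configs/local_setup.yml"])
print(neox_args.num_layers, neox_args.hidden_size, neox_args.num_attention_heads)
print(neox_args.pipe_parallel_size, neox_args.model_parallel_size)
```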
TrainingBatch (megatron/data/gpt2_dataset.py): dict with tokens: Tensor[batch_size, seq_length] (input token IDs), labels: Tensor[batch_size, seq_length] (target tokens, offset by 1), attention_mask: Tensor[batch_size, seq_length] (padding mask), loss_mask: Tensor[batch_size, seq_length] (which positions to compute loss on)
Created by GPT2Dataset from indexed token sequences, batched by DataLoader with distributed sampling, then fed through model forward pass where tokens become logits and labels compute cross-entropy loss
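A hypothetical construction of this batch contract, with random token IDs standing in for real data:

```python
# Illustrative TrainingBatch: shapes follow the description above, values
# are random stand-ins rather than output of the real GPT2Dataset.
import torch

batch_size, seq_length = 4, 2048
raw = torch.randint(50257, (batch_size, seq_length + 1))  # one extra token
batch = {
    "tokens": raw[:, :-1],             # input token IDs
    "labels": raw[:, 1:],              # targets, offset by one position
    "attention_mask": torch.ones(batch_size, seq_length, dtype=torch.bool),
    "loss_mask": torch.ones(batch_size, seq_length),  # zero out padding here
}
```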
ModelOutputs (megatron/model/transformer.py): tuple containing logits: Tensor[batch_size, seq_length, vocab_size] (unnormalized next-token scores) and optionally hidden_states: List[Tensor] (intermediate layer outputs) when requested
Produced by transformer forward pass through embedding, attention layers, and final projection — consumed by loss function during training or by sampling logic during generation
CheckpointState (megatron/checkpointing.py): dict with model: OrderedDict (model weights), optimizer: dict (optimizer states), lr_scheduler: dict (learning rate schedule state), iteration: int (global step count), args: NeoXArgs (full config), rng_states: dict (random number generator states for reproducibility)
Assembled during save_checkpoint() by collecting distributed model shards and optimizer states, serialized to disk, then restored during load_checkpoint() to resume training from exact state
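A minimal sketch of assembling and writing this state dict, assuming a stand-in model and illustrative field and file names (the real shard layout differs, with one file per tensor parallel rank):

```python
# Hedged sketch of the CheckpointState layout described above.
import os
import random
import numpy as np
import torch

model = torch.nn.Linear(8, 8)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())

state = {
    "model": model.state_dict(),          # weights (one shard per TP rank)
    "optimizer": optimizer.state_dict(),  # ZeRO-partitioned optimizer states
    "lr_scheduler": {"last_iter": 1000},  # schedule position
    "iteration": 1000,                    # global step count for resumption
    "rng_states": {                       # replay exact data order / dropout
        "torch": torch.get_rng_state(),
        "numpy": np.random.get_state(),
        "python": random.getstate(),
    },
}
os.makedirs("checkpoints/iter_0001000", exist_ok=True)
torch.save(state, "checkpoints/iter_0001000/rank_00_states.pt")
```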
DatasetDocument (megatron/data/indexed_dataset.py): numpy array of token IDs of variable length representing one document, accessed via IndexedDataset with document boundaries marked in a separate index
Pre-tokenized documents stored in binary format, loaded by IndexedDataset and sampled by GPT2Dataset to create training sequences of seq_length tokens
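The underlying idea is a flat token stream plus an offset index. A toy reconstruction (the real binary format in megatron/data/indexed_dataset.py is more involved):

```python
import numpy as np

# Build a toy "indexed dataset": flat token stream + document offsets.
docs = [np.array([5, 8, 13], np.uint16), np.array([21, 34], np.uint16)]
np.concatenate(docs).tofile("toy_document.bin")
offsets = np.cumsum([0] + [len(d) for d in docs])

# Random access via memmap: no tokens are read until a document is sliced.
tokens = np.memmap("toy_document.bin", dtype=np.uint16, mode="r")

def get_document(i: int) -> np.ndarray:
    return np.asarray(tokens[offsets[i]:offsets[i + 1]])

print(get_document(1))  # -> [21 34]
```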
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
DeepSpeed launcher can successfully spawn workers and all workers can communicate over the network topology — assumes network interfaces, hostnames, and port ranges are available
If this fails: Training hangs indefinitely during distributed initialization if workers can't establish communication, with no clear error message about network connectivity issues
deepy.py:main
Dataset blend weights are semantically meaningful proportions that sum to reasonable values — accepts any positive float array without validating they represent valid sampling probabilities
If this fails: Weights like [1000000, 0.001] create extreme sampling bias where one dataset dominates training, silently producing models trained on unintended data distributions
megatron/data/blendable_dataset.py:BlendableDataset.__init__
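A hedged sketch of the check that is skipped here, normalizing weights to probabilities and rejecting degenerate mixtures; the max_ratio threshold is an illustrative choice, not from the repo:

```python
# Validate and normalize dataset blend weights before sampling.
import numpy as np

def normalize_blend(weights: list[float], max_ratio: float = 1e4) -> np.ndarray:
    w = np.asarray(weights, dtype=np.float64)
    if (w <= 0).any():
        raise ValueError("blend weights must be positive")
    if w.max() / w.min() > max_ratio:
        raise ValueError(f"extreme blend ratio {w.max() / w.min():.1e}: "
                         "one dataset would dominate sampling")
    return w / w.sum()

print(normalize_blend([7, 2, 1]))      # -> [0.7 0.2 0.1]
# normalize_blend([1_000_000, 0.001])  # raises instead of silently biasing
```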
Global batch size (batch_size * world_size) fits within system memory limits and doesn't exceed dataset size — computes without checking memory constraints or dataset bounds
If this fails: Out-of-memory crashes during batch creation or infinite data loader loops when global_batch_size exceeds total available samples
megatron/data/data_utils.py:make_data_loader
Distributed checkpoint saving completes atomically across all workers — if one worker fails during save, assumes others will detect and handle cleanup
If this fails: Partial checkpoint corruption where some workers write state but others fail, leaving training in unrecoverable state requiring manual cleanup and rollback
megatron/checkpointing.py
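One common mitigation, shown as a sketch rather than the repo's actual save path: write to a temporary file, fsync, then rename, which is atomic on a local POSIX filesystem (though not on NFS or S3):

```python
# Hedged sketch: readers never observe a half-written checkpoint file.
import os
import torch

def atomic_save(state: dict, path: str) -> None:
    tmp = path + ".tmp"
    torch.save(state, tmp)
    with open(tmp, "rb") as f:
        os.fsync(f.fileno())   # flush file contents to disk before the rename
    os.replace(tmp, path)      # atomic rename on POSIX filesystems

atomic_save({"iteration": 1000}, "checkpoint.pt")
```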
GPU memory during evaluation is sufficient for model inference plus evaluation task data structures — doesn't account for peak memory during forward passes with long sequences
If this fails: OOM crashes during evaluation with sequences longer than the training data, even though the model loaded successfully, killing evaluation jobs partway through the run
eval_tasks/eval_adapter.py:EvalHarnessAdapter
Environment variable injection (WANDB_API_KEY) occurs before DeepSpeed worker spawn — timing assumes synchronous environment setup
If this fails: Workers start without WandB credentials when environment setup races with process creation, causing silent logging failures that are only discovered after long training runs
deepy.py:main
setup_for_inference_or_eval() returns model in correct inference mode state — trusts returned model is ready for generation without verifying eval mode or gradient settings
If this fails: Generation produces inconsistent results when model remains in training mode with dropout enabled, or consumes unnecessary memory with gradient computation
generate.py:main
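A cheap guard for this assumption, sketched with a stand-in model: verify eval mode explicitly and generate under no_grad:

```python
# Hedged sketch: assert the model really is in inference mode before
# generation, instead of trusting the setup call.
import torch

def prepare_for_inference(model: torch.nn.Module) -> torch.nn.Module:
    model.eval()                     # disables dropout / running-stat updates
    assert not model.training, "model still in training mode"
    return model

model = prepare_for_inference(torch.nn.Sequential(torch.nn.Dropout(0.1)))
with torch.no_grad():                # skip gradient bookkeeping during generation
    out = model(torch.randn(2, 8))
```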
Storage backend (local filesystem or S3) has consistent write semantics — assumes atomic write operations and immediate read-after-write consistency
If this fails: Checkpoint corruption on distributed filesystems with eventual consistency, where workers read partially written checkpoint data during restoration
megatron/checkpointing.py
Number of data loading workers (num_workers) scales appropriately with available CPU cores and I/O bandwidth — uses configured value without system resource validation
If this fails: Performance degradation when num_workers exceeds CPU cores or saturates disk I/O, creating data loading bottlenecks that throttle GPU utilization
megatron/data/data_utils.py:make_data_loader
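A hedged sketch of the missing validation, clamping the configured value to the cores actually present:

```python
# Cap num_workers at the machine's core count instead of trusting the config.
import os

def safe_num_workers(configured: int) -> int:
    available = os.cpu_count() or 1                # None on exotic platforms
    return max(0, min(configured, available - 1))  # keep one core for the trainer

print(safe_num_workers(32))  # on an 8-core box -> 7
```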
Evaluation harness tasks expect model outputs in specific tensor formats with consistent dtype and device placement — no validation of output compatibility
If this fails: Evaluation failures with cryptic errors when model produces outputs in unexpected formats (e.g., different precision, wrong device), making debugging difficult
eval.py:main
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- IndexedDataset cache: Memory-mapped binary files containing pre-tokenized documents, with separate index files for fast random access to document boundaries and token sequences
- Distributed model parameters: Transformer weights sharded across tensor parallel groups — attention heads and MLP weights split across GPUs within each pipeline stage
- ZeRO optimizer states: Optimizer momentum and variance states partitioned across data parallel workers to reduce memory usage, gathered during checkpoint saving
- Checkpoint snapshots: Periodic snapshots of complete model state including weights, optimizer states, learning rate schedule, and random number generator states for training resumption
Feedback Loops
- Training iteration loop (training-loop, reinforcing) — Trigger: train_iters configuration parameter. Action: Each iteration loads a batch, runs forward pass, computes loss, backpropagates gradients, and updates parameters via DeepSpeed optimizer. Exit: When iteration count reaches train_iters or manual termination.
- Gradient accumulation cycle (gradient-accumulation, balancing) — Trigger: gradient_accumulation_steps > 1. Action: Accumulates gradients from multiple micro-batches before calling optimizer.step(), allowing larger effective batch sizes than GPU memory permits (see the sketch after this list). Exit: After gradient_accumulation_steps micro-batches processed.
- Learning rate scheduling (convergence, balancing) — Trigger: lr_decay_style configuration (cosine, linear, etc.). Action: Adjusts learning rate each step according to schedule — typically starting high and decaying to improve convergence. Exit: Training completion.
- Checkpoint retry mechanism (retry, balancing) — Trigger: Checkpoint save/load failures. Action: Retries checkpoint operations with exponential backoff, especially important for distributed filesystems and S3 storage. Exit: Successful checkpoint operation or max retries exceeded.
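A minimal sketch of the accumulation cycle referenced above; in the real system DeepSpeed manages this internally, and the model here is a stand-in:

```python
# Gradients from several micro-batches sum in .grad before one optimizer
# step, giving a larger effective batch than memory allows.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
gradient_accumulation_steps = 4

optimizer.zero_grad()
for step in range(gradient_accumulation_steps):
    x = torch.randn(8, 16)                            # one micro-batch
    loss = model(x).pow(2).mean()                     # stand-in loss
    (loss / gradient_accumulation_steps).backward()   # scale so grads average
optimizer.step()                                      # one update per cycle
optimizer.zero_grad()
```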
Delays
- Pipeline bubble delay (async-processing, ~depends on pipe_parallel_size) — Pipeline stages wait for previous stage outputs — first and last stages have idle time during pipeline fill/drain phases (a quick bubble-fraction estimate follows this list)
- Gradient synchronization delay (eventual-consistency, ~network latency dependent) — Data parallel workers wait for gradient reduction before optimizer step — scales with number of workers and network topology
- Checkpoint save intervals (scheduled-job, ~save_interval steps) — Training pauses periodically to save distributed model state — duration depends on model size and storage speed
- Dataset loading warmup (warmup, ~varies by dataset size) — Initial data loader creation maps indexed datasets into memory and builds sampling indices before first training step
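The fill/drain cost has a standard back-of-envelope estimate: with p pipeline stages and m micro-batches per step, the idle fraction is roughly (p - 1) / (m + p - 1):

```python
# Rough bubble fraction for the pipeline fill/drain delay above.
def bubble_fraction(pipe_parallel_size: int, num_microbatches: int) -> float:
    p, m = pipe_parallel_size, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # 75%: pipeline barely filled
print(f"{bubble_fraction(4, 32):.0%}")  # ~9%: micro-batches amortize fill/drain
```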
Control Points
- Model architecture switches (architecture-switch) — Controls: Fundamental model shape via num_layers, hidden_size, num_attention_heads — determines memory usage and computational requirements. Default: varies per config (125M to 175B parameter ranges)
- Parallelism strategy (runtime-toggle) — Controls: How model is distributed via pipe_parallel_size (layers across GPUs) and model_parallel_size (tensors across GPUs). Default: typically 1-8 for each dimension
- Precision mode selection (precision-mode) — Controls: Training precision via fp16, bf16, or fp32 settings — affects memory usage, speed, and numerical stability. Default: fp16 enabled in most configs
- Learning rate schedule (hyperparameter) — Controls: Convergence behavior via learning_rate, lr_decay_style, warmup_iters — critical for training stability (a shape sketch follows this list). Default: varies by model size, typically 1e-4 to 6e-4
- Data blending weights (sampling-strategy) — Controls: Training data mixture via train_blend — determines what proportion of each dataset appears in training. Default: configured per training run
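A sketch of the typical shape of this schedule, linear warmup into cosine decay; the constants are illustrative defaults, not values from any shipped config:

```python
# Linear warmup to max_lr, then cosine decay to min_lr over train_iters.
import math

def lr_at(step: int, max_lr: float = 6e-4, min_lr: float = 6e-5,
          warmup_iters: int = 1000, train_iters: int = 100_000) -> float:
    if step < warmup_iters:
        return max_lr * step / warmup_iters          # linear warmup
    progress = (step - warmup_iters) / (train_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 1000, 50_000, 100_000):
    print(s, f"{lr_at(s):.2e}")   # ramps up to 6e-4, decays to 6e-5
```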
Technology Stack
- PyTorch: Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation
- DeepSpeed: Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination
- CUDA/cuDNN: GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware
- NCCL: GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication
- Weights & Biases: Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs
- Flash Attention: Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts
- GPT-2 BPE tokenizer: Provides GPT-2 BPE tokenization and vocab handling — converts text to token IDs that the model processes
- lm-evaluation-harness: Standardized evaluation on downstream tasks like LAMBADA, HellaSwag, and PIQA — enables consistent model comparison across the field
Key Components
- NeoXArgs (registry) — Central configuration registry that parses YAML configs into validated dataclass, handles argument inheritance, and provides typed access to all training hyperparameters and system settings
megatron/neox_arguments/neox_args.py
- GPTModelPipe (orchestrator) — Main transformer model that orchestrates embedding layer, transformer blocks, and language model head — implements pipeline parallelism by splitting layers across pipeline stages
megatron/model/gpt2_model.py
- BlendableDataset (adapter) — Combines multiple datasets with configurable sampling weights, ensuring each training step draws from the specified mixture of data sources according to blend ratios
megatron/data/blendable_dataset.py
- DistributedDataParallel (orchestrator) — Manages gradient synchronization across data parallel groups, implements gradient clipping and accumulation, coordinates with tensor/pipeline parallel communication
megatron/model/distributed.py
- DeepSpeedEngine (optimizer) — Integrates DeepSpeed's ZeRO optimizer, gradient scaling, learning rate scheduling, and memory optimization — handles the actual parameter updates and mixed precision training
megatron/training.py
- ParallelTransformerLayer (processor) — Implements one transformer block with parallel attention and MLP computation, handles tensor parallel communication for weight matrices, applies layer norm and residual connections
megatron/model/transformer.py
- initialize_model_parallel (allocator) — Sets up process groups for tensor parallelism (splitting attention heads/MLP) and pipeline parallelism (splitting layers), creates communication topology for multi-dimensional parallelism
megatron/mpu/initialize.py
- get_train_data_loader (factory) — Constructs the training data pipeline by loading indexed datasets, applying blending weights, creating distributed samplers, and returning DataLoader with correct batch size for each worker
megatron/data/data_utils.py
- save_checkpoint (serializer) — Coordinates distributed checkpoint saving by gathering model shards from tensor parallel groups, collecting optimizer states, and writing checkpoint files with metadata for resumption
megatron/checkpointing.py
- EvalHarnessAdapter (adapter) — Bridges GPT-NeoX models with the lm-evaluation-harness framework by implementing the required interface for logit computation and text generation on evaluation tasks
eval_tasks/eval_adapter.py
Frequently Asked Questions
What is gpt-neox used for?
Trains billion-parameter autoregressive language models across multiple GPUs using model parallelism. eleutherai/gpt-neox is a 10-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 121 files.
How is gpt-neox architected?
gpt-neox is organized into 4 architecture layers: Script Entry Points, Core Training Engine, Data Pipeline, Distributed Infrastructure. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through gpt-neox?
Data moves through 6 stages: Configuration and initialization → Data pipeline construction → Model forward pass → Loss computation and backward pass → Gradient synchronization and optimization → Checkpoint saving. Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-neox use?
The core stack includes PyTorch (Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation), DeepSpeed (Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination), CUDA/cuDNN (GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware), NCCL (GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication), Weights & Biases (Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs), Flash Attention (Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts), plus a GPT-2 BPE tokenizer and the lm-evaluation-harness. A focused set of dependencies that keeps the build manageable.
What system dynamics does gpt-neox have?
gpt-neox exhibits 4 data pools (IndexedDataset cache, Distributed model parameters), 4 feedback loops, 5 control points, and 4 delays. The feedback loops include the training iteration loop and the gradient accumulation cycle. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-neox use?
5 design patterns detected: 3D Model Parallelism, Mixed Precision Training, Gradient Accumulation, Indexed Dataset Abstraction, Configuration-Driven Architecture.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.