eleutherai/gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

7,421 stars · Python · 10 components

Trains billion-parameter autoregressive language models across multiple GPUs using model parallelism

Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling.

Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 10-component ML training system spanning 121 analyzed files. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

  1. Configuration and initialization — NeoXArgs.consume_deepy_args() parses YAML config files, validates all parameters, and initialize_model_parallel() creates process groups for tensor, pipeline, and data parallelism based on pipe_parallel_size and model_parallel_size settings (config: pipe_parallel_size, model_parallel_size, num_layers +2)
  2. Data pipeline construction — get_train_data_loader() loads IndexedDatasets from data_path, creates BlendableDataset with train_data_paths and blend weights, then wraps with DistributedBatchSampler to ensure each worker gets non-overlapping batches of size micro_batch_size [DatasetDocument → TrainingBatch] (config: data_path, train_data_paths, train_blend +2)
  3. Model forward pass — GPTModelPipe.forward() processes input tokens through embedding layer, then each ParallelTransformerLayer applies self-attention and MLP in parallel across tensor parallel groups, with activations communicated between pipeline stages until final language model head produces logits [TrainingBatch → ModelOutputs] (config: num_layers, hidden_size, num_attention_heads +3)
  4. Loss computation and backward pass — forward_step() computes cross-entropy loss between model logits and shifted input labels using loss_mask to ignore padding tokens, then backward pass computes gradients while pipeline parallel stages communicate gradient tensors [ModelOutputs] (config: loss_scale, gradient_clipping)
  5. Gradient synchronization and optimization — DeepSpeedEngine handles gradient reduction across data parallel groups, applies gradient clipping with clip_grad value, then optimizer.step() updates parameters using configured learning rate schedule and ZeRO optimizer states (config: learning_rate, lr_decay_style, clip_grad +2)
  6. Checkpoint saving — save_checkpoint() periodically collects distributed model weights from all tensor parallel ranks, gathers optimizer states, and writes CheckpointState to save directory with iteration number and random states for exact resumption [CheckpointState] (config: save, save_interval, checkpoint_factor)
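The ordering of stages 3-5 can be sketched as a single-process toy training step. This is a NumPy stand-in for illustration only — real gpt-neox splits these stages across pipeline stages and data-parallel workers via DeepSpeed, and none of the names below are the repo's actual API:

```python
import numpy as np

# Toy analogue of one training step: forward -> logits (stage 3),
# cross-entropy loss and backward (stage 4), gradient clipping and
# parameter update (stage 5). A tiny softmax classifier stands in
# for the transformer.
rng = np.random.default_rng(0)
vocab, hidden, batch = 8, 4, 16
W = rng.normal(0, 0.1, (hidden, vocab))        # stand-in for model parameters
h = rng.normal(size=(batch, hidden))           # stand-in for hidden states
labels = rng.integers(0, vocab, batch)

def step(W, h, labels, lr=0.1, clip=1.0):
    logits = h @ W                                            # stage 3: forward pass -> logits
    z = logits - logits.max(-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(-1, keepdims=True)
    loss = -np.log(probs[np.arange(len(labels)), labels]).mean()  # stage 4: cross-entropy
    g = probs
    g[np.arange(len(labels)), labels] -= 1.0
    grad = h.T @ g / len(labels)                              # stage 4: backward pass
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad *= clip / norm                                   # stage 5: gradient clipping
    return loss, W - lr * grad                                # stage 5: optimizer update

losses = []
for _ in range(20):
    loss, W = step(W, h, labels)
    losses.append(loss)
```

Because the objective is convex for this toy model, repeated steps drive the loss down — the same forward/loss/backward/clip/update cycle that the distributed pipeline performs at scale.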

Data Models

The data structures that flow between stages — the contracts that hold the system together.

NeoXArgs megatron/neox_arguments/neox_args.py
dataclass with 200+ fields including num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, batch_size: int, learning_rate: float, pipe_parallel_size: int, model_parallel_size: int, plus training hyperparameters, data paths, and distributed settings
Parsed from YAML configs at startup, validated and distributed to all processes, then accessed throughout training to control model architecture, parallelism strategy, and hyperparameters
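The parse-then-validate pattern can be sketched with a minimal dataclass. Field names mirror a few of the documented NeoXArgs fields; the class name, `from_config` helper, and validation rules below are illustrative, not the repo's actual code (a plain dict stands in for parsed YAML):

```python
from dataclasses import dataclass, fields

@dataclass
class MiniNeoXArgs:
    # A handful of the 200+ documented fields, with plausible defaults.
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    seq_length: int = 2048
    learning_rate: float = 6e-4
    pipe_parallel_size: int = 1
    model_parallel_size: int = 1

    @classmethod
    def from_config(cls, cfg: dict) -> "MiniNeoXArgs":
        # Reject keys that no field declares, then run cross-field checks.
        known = {f.name for f in fields(cls)}
        unknown = set(cfg) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        args = cls(**cfg)
        if args.hidden_size % args.num_attention_heads != 0:
            raise ValueError("hidden_size must divide evenly among attention heads")
        return args

args = MiniNeoXArgs.from_config(
    {"num_layers": 24, "hidden_size": 1024, "num_attention_heads": 16}
)
```

Typed defaults plus an explicit validation pass is what lets one config object safely drive architecture, parallelism, and hyperparameters across all processes.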
TrainingBatch megatron/data/gpt2_dataset.py
dict with tokens: Tensor[batch_size, seq_length] (input token IDs), labels: Tensor[batch_size, seq_length] (target tokens, offset by 1), attention_mask: Tensor[batch_size, seq_length] (padding mask), loss_mask: Tensor[batch_size, seq_length] (which positions to compute loss on)
Created by GPT2Dataset from indexed token sequences, batched by DataLoader with distributed sampling, then fed through model forward pass where tokens become logits and labels compute cross-entropy loss
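The shift-by-one contract between tokens and labels can be shown concretely. This is an illustrative construction, not the repo's code — `PAD_ID` is a made-up sentinel — but the shapes and masks match the documented contract:

```python
import numpy as np

PAD_ID = 0
seq_length = 8
doc = np.array([5, 9, 3, 7, 2, 4])             # a short pre-tokenized document
padded = np.full(seq_length + 1, PAD_ID)       # seq_length + 1 so we can shift
padded[: len(doc)] = doc

batch = {
    "tokens": padded[np.newaxis, :-1],         # [1, seq_length] model inputs
    "labels": padded[np.newaxis, 1:],          # [1, seq_length] targets, offset by 1
    "attention_mask": (padded[np.newaxis, :-1] != PAD_ID).astype(np.int64),
    "loss_mask": (padded[np.newaxis, 1:] != PAD_ID).astype(np.float32),
}
```

At every position i, `labels[i]` is the token that follows `tokens[i]`, and `loss_mask` zeroes out positions whose target is padding so they contribute nothing to the cross-entropy loss.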
ModelOutputs megatron/model/transformer.py
tuple containing logits: Tensor[batch_size, seq_length, vocab_size] (predicted token probabilities) and optionally hidden_states: List[Tensor] (intermediate layer outputs) when specified
Produced by transformer forward pass through embedding, attention layers, and final projection — consumed by loss function during training or by sampling logic during generation
CheckpointState megatron/checkpointing.py
dict with model: OrderedDict (model weights), optimizer: dict (optimizer states), lr_scheduler: dict (learning rate schedule state), iteration: int (global step count), args: NeoXArgs (full config), rng_states: dict (random number generator states for reproducibility)
Assembled during save_checkpoint() by collecting distributed model shards and optimizer states, serialized to disk, then restored during load_checkpoint() to resume training from exact state
DatasetDocument megatron/data/indexed_dataset.py
numpy array of token IDs with variable length representing one document, accessed via IndexedDataset with document boundaries marked in separate index
Pre-tokenized documents stored in binary format, loaded by IndexedDataset and sampled by GPT2Dataset to create training sequences of seq_length tokens
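The indexed-dataset idea — one flat binary file of token IDs plus a separate index of document boundaries, read via memory mapping — can be sketched in a few lines. The on-disk format here is deliberately simplified and does not match the real file format:

```python
import numpy as np, tempfile, os

# Three variable-length "documents" of token IDs.
docs = [
    np.array([1, 2, 3], np.uint16),
    np.array([7, 8], np.uint16),
    np.array([4, 5, 6, 9], np.uint16),
]

tmp = tempfile.mkdtemp()
bin_path = os.path.join(tmp, "tokens.bin")

# The "index": cumulative offsets marking where each document starts.
offsets = np.cumsum([0] + [len(d) for d in docs])
np.concatenate(docs).tofile(bin_path)          # the flat token file

# Memory-map the token file: random access without loading it all.
tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")

def get_document(i):
    return tokens[offsets[i]: offsets[i + 1]]

doc1 = np.asarray(get_document(1))
```

Because the data is memory-mapped, sampling arbitrary documents for sequence construction costs a slice, not a file read — which is what makes blending and shuffling over huge corpora cheap.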

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Environment unguarded

DeepSpeed launcher can successfully spawn workers and all workers can communicate over the network topology — assumes network interfaces, hostnames, and port ranges are available

If this fails: Training hangs indefinitely during distributed initialization if workers can't establish communication, with no clear error message about network connectivity issues

deepy.py:main
critical Domain unguarded

Dataset blend weights are semantically meaningful proportions that sum to reasonable values — accepts any positive float array without validating they represent valid sampling probabilities

If this fails: Weights like [1000000, 0.001] create extreme sampling bias where one dataset dominates training, silently producing models trained on unintended data distributions

megatron/data/blendable_dataset.py:BlendableDataset.__init__
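A defensive check of the kind this entry says is missing might normalize the weights to sampling probabilities and reject degenerate ratios. The function name and the `max_ratio` threshold are illustrative choices, not anything in the repo:

```python
import numpy as np

def validate_blend_weights(weights, max_ratio=1e4):
    # Reject non-positive weights and wildly imbalanced blends, then
    # normalize so the result is a valid sampling distribution.
    w = np.asarray(weights, dtype=np.float64)
    if w.size == 0 or (w <= 0).any():
        raise ValueError("blend weights must be a non-empty positive array")
    if w.max() / w.min() > max_ratio:
        raise ValueError("blend weights are extremely imbalanced; check the config")
    return w / w.sum()

probs = validate_blend_weights([3.0, 1.0])
```

With this guard, a config like `[1000000, 0.001]` fails loudly at startup instead of silently training the model on one dataset.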
critical Scale unguarded

Global batch size (batch_size * world_size) fits within system memory limits and doesn't exceed dataset size — computes without checking memory constraints or dataset bounds

If this fails: Out-of-memory crashes during batch creation or infinite data loader loops when global_batch_size exceeds total available samples

megatron/data/data_utils.py:make_data_loader
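The missing bound could be checked with a few lines at loader-construction time. The function and parameter names below are illustrative (the repo derives its global batch size from its own config fields):

```python
def check_batch_config(micro_batch_size, world_size,
                       gradient_accumulation_steps, num_samples):
    # Global batch = per-GPU micro batch x data-parallel workers x accumulation.
    global_batch = micro_batch_size * world_size * gradient_accumulation_steps
    if global_batch > num_samples:
        raise ValueError(
            f"global batch size {global_batch} exceeds dataset size "
            f"{num_samples}; the data loader would loop or starve"
        )
    return global_batch

gb = check_batch_config(micro_batch_size=4, world_size=8,
                        gradient_accumulation_steps=2, num_samples=1_000)
```

Failing fast here turns an eventual hang or OOM deep inside training into a one-line config error at startup.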
critical Temporal weakly guarded

Distributed checkpoint saving completes atomically across all workers — if one worker fails during save, assumes others will detect and handle cleanup

If this fails: Partial checkpoint corruption where some workers write state but others fail, leaving training in unrecoverable state requiring manual cleanup and rollback

megatron/checkpointing.py
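One common mitigation for the partial-write hazard is to write each worker's checkpoint to a temporary file in the same directory and then atomically rename it into place, so readers only ever see a complete file. This is a generic sketch (JSON stands in for the real serialized state; a cross-worker barrier before declaring the checkpoint valid is still needed and not shown):

```python
import json, os, tempfile

def save_checkpoint_atomic(state: dict, path: str):
    # Write to a temp file on the same filesystem, fsync, then rename.
    # os.replace is atomic on POSIX, so a crash mid-write leaves the
    # previous checkpoint intact rather than a truncated new one.
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())          # force bytes to disk before rename
        os.replace(tmp, path)             # atomic publish
    except BaseException:
        os.unlink(tmp)                    # never leave temp debris behind
        raise

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint_atomic({"iteration": 100}, ckpt)
```

Note this only makes each file atomic; coordinating "all ranks finished" across workers (e.g. writing a final marker file after a barrier) is a separate, equally necessary step.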
critical Resource unguarded

GPU memory during evaluation is sufficient for model inference plus evaluation task data structures — doesn't account for peak memory during forward passes with long sequences

If this fails: OOM crashes during evaluation with sequences longer than the training data, even though the model loaded successfully, causing evaluation jobs to abort partway through

eval_tasks/eval_adapter.py:EvalHarnessAdapter
warning Ordering unguarded

Environment variable injection (WANDB_API_KEY) occurs before DeepSpeed worker spawn — timing assumes synchronous environment setup

If this fails: Workers start without WandB credentials when environment setup races with process creation, causing silent logging failures that are only discovered after long training runs

deepy.py:main
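The safe ordering is easy to make explicit: mutate the environment (or build an explicit `env` dict) before the spawn call, so the child process can never race the credential setup. The key value below is a dummy for illustration:

```python
import os, subprocess, sys

# Build the child's environment up front, then spawn. Passing env=
# explicitly removes any dependence on when os.environ was mutated
# relative to process creation.
env = dict(os.environ, WANDB_API_KEY="dummy-key-for-illustration")

out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['WANDB_API_KEY'])"],
    env=env, capture_output=True, text=True,
)
token = out.stdout.strip()
```

The child demonstrably sees the credential because it was part of the environment handed to the spawn call, not injected afterward.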
warning Contract weakly guarded

setup_for_inference_or_eval() returns model in correct inference mode state — trusts returned model is ready for generation without verifying eval mode or gradient settings

If this fails: Generation produces inconsistent results when model remains in training mode with dropout enabled, or consumes unnecessary memory with gradient computation

generate.py:main
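The missing verification has a simple shape, sketched here with a stub object rather than a real model: after fetching a model for inference, assert it is actually in eval mode before generating. With PyTorch this would check `model.training` and wrap generation in `torch.no_grad()`; the stub below only demonstrates the guard:

```python
class StubModel:
    # Minimal stand-in mimicking PyTorch's training-mode flag.
    def __init__(self):
        self.training = True

    def eval(self):
        self.training = False
        return self

def setup_for_inference(model):
    # Hypothetical setup helper: switch to eval mode before returning.
    return model.eval()

model = setup_for_inference(StubModel())

# The guard: fail loudly if the model is still in training mode,
# instead of silently generating with dropout active.
assert not model.training, "model must be in eval mode before generation"
```

A one-line assertion at the call site converts a subtle nondeterminism bug (dropout active during generation) into an immediate, diagnosable failure.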
warning Environment unguarded

Storage backend (local filesystem or S3) has consistent write semantics — assumes atomic write operations and immediate read-after-write consistency

If this fails: Checkpoint corruption on distributed filesystems with eventual consistency, where workers read partially written checkpoint data during restoration

megatron/checkpointing.py
warning Scale unguarded

Number of data loading workers (num_workers) scales appropriately with available CPU cores and I/O bandwidth — uses configured value without system resource validation

If this fails: Performance degradation when num_workers exceeds CPU cores or saturates disk I/O, creating data loading bottlenecks that throttle GPU utilization

megatron/data/data_utils.py:make_data_loader
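A guard for this assumption can clamp the configured worker count to the host's CPU count. The function name and clamp policy are illustrative:

```python
import os

def safe_num_workers(configured: int) -> int:
    # Cap data-loading workers at the number of usable CPU cores so a
    # copy-pasted config cannot oversubscribe the host.
    cores = os.cpu_count() or 1
    if configured > cores:
        return cores          # clamp rather than fail
    return max(configured, 0) # also sanitize negative values

nw = safe_num_workers(configured=10_000)
```

Clamping (instead of erroring) keeps shared configs portable across machines with different core counts, at the cost of silently changing the effective value — logging the adjustment would be a reasonable addition.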
warning Domain unguarded

Evaluation harness tasks expect model outputs in specific tensor formats with consistent dtype and device placement — no validation of output compatibility

If this fails: Evaluation failures with cryptic errors when model produces outputs in unexpected formats (e.g., different precision, wrong device), making debugging difficult

eval.py:main

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

IndexedDataset cache (file-store)
Memory-mapped binary files containing pre-tokenized documents with separate index files for fast random access to document boundaries and token sequences
Distributed model parameters (in-memory)
Transformer weights sharded across tensor parallel groups — attention heads and MLP weights split across GPUs within each pipeline stage
ZeRO optimizer states (state-store)
Optimizer momentum and variance states partitioned across data parallel workers to reduce memory usage, gathered during checkpoint saving
Checkpoint storage (file-store)
Periodic snapshots of complete model state including weights, optimizer states, learning rate schedule, and random number generator states for training resumption

Technology Stack

PyTorch (framework)
Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation
DeepSpeed (framework)
Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination
CUDA/cuDNN (compute)
GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware
NCCL (infra)
GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication
Weights & Biases (library)
Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs
Flash Attention (library)
Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts
Megatron Tokenizer (library)
Provides GPT-2 BPE tokenization and vocab handling — converts text to token IDs that the model processes
lm-evaluation-harness (library)
Standardized evaluation on downstream tasks like LAMBADA, HellaSwag, and PIQA — enables consistent model comparison across the field

Frequently Asked Questions

What is gpt-neox used for?

gpt-neox trains billion-parameter autoregressive language models across multiple GPUs using model parallelism. eleutherai/gpt-neox is a 10-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 121 files.

How is gpt-neox architected?

gpt-neox is organized into 4 architecture layers: Script Entry Points, Core Training Engine, Data Pipeline, Distributed Infrastructure. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through gpt-neox?

Data moves through 6 stages: Configuration and initialization → Data pipeline construction → Model forward pass → Loss computation and backward pass → Gradient synchronization and optimization → Checkpoint saving. Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling. This pipeline design reflects a complex multi-stage processing system.

What technologies does gpt-neox use?

The core stack includes PyTorch (Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation), DeepSpeed (Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination), CUDA/cuDNN (GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware), NCCL (GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication), Weights & Biases (Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs), Flash Attention (Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does gpt-neox have?

gpt-neox exhibits 4 data pools (including the IndexedDataset cache and distributed model parameters), 4 feedback loops, 5 control points, and 4 delays. The feedback loops cover the training loop and gradient accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does gpt-neox use?

5 design patterns detected: 3D Model Parallelism, Mixed Precision Training, Gradient Accumulation, Indexed Dataset Abstraction, Configuration-Driven Architecture.

Analyzed on April 20, 2026 by CodeSea.