eleutherai/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Trains billion-parameter autoregressive language models across multiple GPUs using model parallelism
Under the hood, the system relies on 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 10-component ML training system. 121 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling.
- Configuration and initialization — NeoXArgs.consume_deepy_args() parses YAML config files and validates all parameters, then initialize_model_parallel() creates process groups for tensor, pipeline, and data parallelism based on the pipe_parallel_size and model_parallel_size settings (config: pipe_parallel_size, model_parallel_size, num_layers +2)
- Data pipeline construction — get_train_data_loader() loads IndexedDatasets from data_path, creates BlendableDataset with train_data_paths and blend weights, then wraps with DistributedBatchSampler to ensure each worker gets non-overlapping batches of size micro_batch_size [DatasetDocument → TrainingBatch] (config: data_path, train_data_paths, train_blend +2)
- Model forward pass — GPTModelPipe.forward() processes input tokens through embedding layer, then each ParallelTransformerLayer applies self-attention and MLP in parallel across tensor parallel groups, with activations communicated between pipeline stages until final language model head produces logits [TrainingBatch → ModelOutputs] (config: num_layers, hidden_size, num_attention_heads +3)
- Loss computation and backward pass — forward_step() computes cross-entropy loss between model logits and shifted input labels using loss_mask to ignore padding tokens, then backward pass computes gradients while pipeline parallel stages communicate gradient tensors [ModelOutputs] (config: loss_scale, gradient_clipping)
- Gradient synchronization and optimization — DeepSpeedEngine handles gradient reduction across data parallel groups, applies gradient clipping with the clip_grad value, then optimizer.step() updates parameters using the configured learning rate schedule and ZeRO optimizer states (config: learning_rate, lr_decay_style, clip_grad +2); a simplified single-GPU sketch of stages 3 through 5 follows this list
- Checkpoint saving — save_checkpoint() periodically collects distributed model weights from all tensor parallel ranks, gathers optimizer states, and writes CheckpointState to save directory with iteration number and random states for exact resumption [CheckpointState] (config: save, save_interval, checkpoint_factor)
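Stripped of parallelism, stages 3 through 5 compress into a few lines. A minimal single-GPU sketch (the tiny stand-in model, shapes, and hyperparameters are illustrative, not GPT-NeoX's actual classes):

```python
# Illustrative single-GPU version of forward pass -> loss -> backward -> update.
# Real GPT-NeoX splits the model across pipeline/tensor parallel ranks and
# delegates the step to DeepSpeed; this only shows the shape of the loop.
import torch
import torch.nn.functional as F

vocab_size, seq_length, batch_size = 1000, 128, 4
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),   # stand-in embedding layer
    torch.nn.Linear(64, vocab_size),      # stand-in language model head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

tokens = torch.randint(vocab_size, (batch_size, seq_length + 1))
inputs, labels = tokens[:, :-1], tokens[:, 1:]        # labels shifted by one

logits = model(inputs)                                 # forward pass -> logits
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()                                        # backward pass
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
optimizer.step()                                       # parameter update
optimizer.zero_grad()
```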
Data Models
The data structures that flow between stages — the contracts that hold the system together.
NeoXArgs (megatron/neox_arguments/neox_args.py): dataclass with 200+ fields including num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, batch_size: int, learning_rate: float, pipe_parallel_size: int, model_parallel_size: int, plus training hyperparameters, data paths, and distributed settings
Parsed from YAML configs at startup, validated and distributed to all processes, then accessed throughout training to control model architecture, parallelism strategy, and hyperparameters
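In practice several YAML files are layered and merged into this dataclass. A hedged sketch, assuming the NeoXArgs.from_ymls classmethod and using config paths that may differ in a given checkout:

```python
# Hedged sketch: load and inspect a layered config. from_ymls is assumed
# available on NeoXArgs; the YAML file names are illustrative.
from megatron.neox_arguments import NeoXArgs

neox_args = NeoXArgs.from_ymls(["configs/125M.yml", "configs/local_setup.yml"])
print(neox_args.num_layers, neox_args.hidden_size, neox_args.num_attention_heads)
print(neox_args.pipe_parallel_size, neox_args.model_parallel_size)
```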
TrainingBatch (megatron/data/gpt2_dataset.py): dict with tokens: Tensor[batch_size, seq_length] (input token IDs), labels: Tensor[batch_size, seq_length] (target tokens, offset by 1), attention_mask: Tensor[batch_size, seq_length] (padding mask), loss_mask: Tensor[batch_size, seq_length] (which positions to compute loss on)
Created by GPT2Dataset from indexed token sequences, batched by DataLoader with distributed sampling, then fed through model forward pass where tokens become logits and labels compute cross-entropy loss
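A hypothetical construction of this batch contract, with random token IDs standing in for real data:

```python
# Illustrative TrainingBatch: shapes follow the description above, values
# are random stand-ins rather than output of the real GPT2Dataset.
import torch

batch_size, seq_length = 4, 2048
raw = torch.randint(50257, (batch_size, seq_length + 1))  # one extra token
batch = {
    "tokens": raw[:, :-1],             # input token IDs
    "labels": raw[:, 1:],              # targets, offset by one position
    "attention_mask": torch.ones(batch_size, seq_length, dtype=torch.bool),
    "loss_mask": torch.ones(batch_size, seq_length),  # zero out padding here
}
```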
ModelOutputs (megatron/model/transformer.py): tuple containing logits: Tensor[batch_size, seq_length, vocab_size] (unnormalized next-token scores) and optionally hidden_states: List[Tensor] (intermediate layer outputs) when requested
Produced by transformer forward pass through embedding, attention layers, and final projection — consumed by loss function during training or by sampling logic during generation
CheckpointState (megatron/checkpointing.py): dict with model: OrderedDict (model weights), optimizer: dict (optimizer states), lr_scheduler: dict (learning rate schedule state), iteration: int (global step count), args: NeoXArgs (full config), rng_states: dict (random number generator states for reproducibility)
Assembled during save_checkpoint() by collecting distributed model shards and optimizer states, serialized to disk, then restored during load_checkpoint() to resume training from exact state
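A minimal sketch of assembling and writing this state dict, assuming a stand-in model and illustrative field and file names (the real shard layout differs, with one file per tensor parallel rank):

```python
# Hedged sketch of the CheckpointState layout described above.
import os
import random
import numpy as np
import torch

model = torch.nn.Linear(8, 8)                    # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())

state = {
    "model": model.state_dict(),          # weights (one shard per TP rank)
    "optimizer": optimizer.state_dict(),  # ZeRO-partitioned optimizer states
    "lr_scheduler": {"last_iter": 1000},  # schedule position
    "iteration": 1000,                    # global step count for resumption
    "rng_states": {                       # replay exact data order / dropout
        "torch": torch.get_rng_state(),
        "numpy": np.random.get_state(),
        "python": random.getstate(),
    },
}
os.makedirs("checkpoints/iter_0001000", exist_ok=True)
torch.save(state, "checkpoints/iter_0001000/rank_00_states.pt")
```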
DatasetDocument (megatron/data/indexed_dataset.py): numpy array of token IDs of variable length representing one document, accessed via IndexedDataset with document boundaries marked in a separate index
Pre-tokenized documents stored in binary format, loaded by IndexedDataset and sampled by GPT2Dataset to create training sequences of seq_length tokens
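The underlying idea is a flat token stream plus an offset index. A toy reconstruction (the real binary format in megatron/data/indexed_dataset.py is more involved):

```python
import numpy as np

# Build a toy "indexed dataset": flat token stream + document offsets.
docs = [np.array([5, 8, 13], np.uint16), np.array([21, 34], np.uint16)]
np.concatenate(docs).tofile("toy_document.bin")
offsets = np.cumsum([0] + [len(d) for d in docs])

# Random access via memmap: no tokens are read until a document is sliced.
tokens = np.memmap("toy_document.bin", dtype=np.uint16, mode="r")

def get_document(i: int) -> np.ndarray:
    return np.asarray(tokens[offsets[i]:offsets[i + 1]])

print(get_document(1))  # -> [21 34]
```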
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
DeepSpeed launcher can successfully spawn workers and all workers can communicate over the network topology — assumes network interfaces, hostnames, and port ranges are available
If this fails: Training hangs indefinitely during distributed initialization if workers can't establish communication, with no clear error message about network connectivity issues
deepy.py:main
Dataset blend weights are semantically meaningful proportions that sum to reasonable values — accepts any positive float array without validating they represent valid sampling probabilities
If this fails: Weights like [1000000, 0.001] create extreme sampling bias where one dataset dominates training, silently producing models trained on unintended data distributions
megatron/data/blendable_dataset.py:BlendableDataset.__init__
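A hedged sketch of the check that is skipped here, normalizing weights to probabilities and rejecting degenerate mixtures; the max_ratio threshold is an illustrative choice, not from the repo:

```python
# Validate and normalize dataset blend weights before sampling.
import numpy as np

def normalize_blend(weights: list[float], max_ratio: float = 1e4) -> np.ndarray:
    w = np.asarray(weights, dtype=np.float64)
    if (w <= 0).any():
        raise ValueError("blend weights must be positive")
    if w.max() / w.min() > max_ratio:
        raise ValueError(f"extreme blend ratio {w.max() / w.min():.1e}: "
                         "one dataset would dominate sampling")
    return w / w.sum()

print(normalize_blend([7, 2, 1]))      # -> [0.7 0.2 0.1]
# normalize_blend([1_000_000, 0.001])  # raises instead of silently biasing
```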
Global batch size (batch_size * world_size) fits within system memory limits and doesn't exceed dataset size — computes without checking memory constraints or dataset bounds
If this fails: Out-of-memory crashes during batch creation or infinite data loader loops when global_batch_size exceeds total available samples
megatron/data/data_utils.py:make_data_loader
Distributed checkpoint saving completes atomically across all workers — if one worker fails during save, assumes others will detect and handle cleanup
If this fails: Partial checkpoint corruption where some workers write state but others fail, leaving training in unrecoverable state requiring manual cleanup and rollback
megatron/checkpointing.py
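One common mitigation, shown as a sketch rather than the repo's actual save path: write to a temporary file, fsync, then rename, which is atomic on a local POSIX filesystem (though not on NFS or S3):

```python
# Hedged sketch: readers never observe a half-written checkpoint file.
import os
import torch

def atomic_save(state: dict, path: str) -> None:
    tmp = path + ".tmp"
    torch.save(state, tmp)
    with open(tmp, "rb") as f:
        os.fsync(f.fileno())   # flush file contents to disk before the rename
    os.replace(tmp, path)      # atomic rename on POSIX filesystems

atomic_save({"iteration": 1000}, "checkpoint.pt")
```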
GPU memory during evaluation is sufficient for model inference plus evaluation task data structures — doesn't account for peak memory during forward passes with long sequences
If this fails: OOM crashes during evaluation with sequences longer than the training data, even though the model loaded successfully, killing evaluation jobs partway through the run
eval_tasks/eval_adapter.py:EvalHarnessAdapter
Environment variable injection (WANDB_API_KEY) occurs before DeepSpeed worker spawn — timing assumes synchronous environment setup
If this fails: Workers start without WandB credentials when environment setup races with process creation, causing silent logging failures that are only discovered after long training runs
deepy.py:main
setup_for_inference_or_eval() returns model in correct inference mode state — trusts returned model is ready for generation without verifying eval mode or gradient settings
If this fails: Generation produces inconsistent results when model remains in training mode with dropout enabled, or consumes unnecessary memory with gradient computation
generate.py:main
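A cheap guard for this assumption, sketched with a stand-in model: verify eval mode explicitly and generate under no_grad:

```python
# Hedged sketch: assert the model really is in inference mode before
# generation, instead of trusting the setup call.
import torch

def prepare_for_inference(model: torch.nn.Module) -> torch.nn.Module:
    model.eval()                     # disables dropout / running-stat updates
    assert not model.training, "model still in training mode"
    return model

model = prepare_for_inference(torch.nn.Sequential(torch.nn.Dropout(0.1)))
with torch.no_grad():                # skip gradient bookkeeping during generation
    out = model(torch.randn(2, 8))
```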
Storage backend (local filesystem or S3) has consistent write semantics — assumes atomic write operations and immediate read-after-write consistency
If this fails: Checkpoint corruption on distributed filesystems with eventual consistency, where workers read partially written checkpoint data during restoration
megatron/checkpointing.py
Number of data loading workers (num_workers) scales appropriately with available CPU cores and I/O bandwidth — uses configured value without system resource validation
If this fails: Performance degradation when num_workers exceeds CPU cores or saturates disk I/O, creating data loading bottlenecks that throttle GPU utilization
megatron/data/data_utils.py:make_data_loader
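A hedged sketch of the missing validation, clamping the configured value to the cores actually present:

```python
# Cap num_workers at the machine's core count instead of trusting the config.
import os

def safe_num_workers(configured: int) -> int:
    available = os.cpu_count() or 1                # None on exotic platforms
    return max(0, min(configured, available - 1))  # keep one core for the trainer

print(safe_num_workers(32))  # on an 8-core box -> 7
```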
Evaluation harness tasks expect model outputs in specific tensor formats with consistent dtype and device placement — no validation of output compatibility
If this fails: Evaluation failures with cryptic errors when model produces outputs in unexpected formats (e.g., different precision, wrong device), making debugging difficult
eval.py:main
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- IndexedDataset cache: Memory-mapped binary files containing pre-tokenized documents, with separate index files for fast random access to document boundaries and token sequences
- Distributed model parameters: Transformer weights sharded across tensor parallel groups — attention heads and MLP weights split across GPUs within each pipeline stage
- ZeRO optimizer states: Optimizer momentum and variance states partitioned across data parallel workers to reduce memory usage, gathered during checkpoint saving
- Checkpoint snapshots: Periodic snapshots of complete model state including weights, optimizer states, learning rate schedule, and random number generator states for training resumption
Feedback Loops
- Training iteration loop (training-loop, reinforcing) — Trigger: train_iters configuration parameter. Action: Each iteration loads a batch, runs forward pass, computes loss, backpropagates gradients, and updates parameters via DeepSpeed optimizer. Exit: When iteration count reaches train_iters or manual termination.
- Gradient accumulation cycle (gradient-accumulation, balancing) — Trigger: gradient_accumulation_steps > 1. Action: Accumulates gradients from multiple micro-batches before calling optimizer.step(), allowing larger effective batch sizes than GPU memory permits (see the sketch after this list). Exit: After gradient_accumulation_steps micro-batches processed.
- Learning rate scheduling (convergence, balancing) — Trigger: lr_decay_style configuration (cosine, linear, etc.). Action: Adjusts learning rate each step according to schedule — typically starting high and decaying to improve convergence. Exit: Training completion.
- Checkpoint retry mechanism (retry, balancing) — Trigger: Checkpoint save/load failures. Action: Retries checkpoint operations with exponential backoff, especially important for distributed filesystems and S3 storage. Exit: Successful checkpoint operation or max retries exceeded.
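A minimal sketch of the accumulation cycle referenced above; in the real system DeepSpeed manages this internally, and the model here is a stand-in:

```python
# Gradients from several micro-batches sum in .grad before one optimizer
# step, giving a larger effective batch than memory allows.
import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
gradient_accumulation_steps = 4

optimizer.zero_grad()
for step in range(gradient_accumulation_steps):
    x = torch.randn(8, 16)                            # one micro-batch
    loss = model(x).pow(2).mean()                     # stand-in loss
    (loss / gradient_accumulation_steps).backward()   # scale so grads average
optimizer.step()                                      # one update per cycle
optimizer.zero_grad()
```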
Delays
- Pipeline bubble delay (async-processing, ~depends on pipe_parallel_size) — Pipeline stages wait for previous stage outputs — first and last stages have idle time during pipeline fill/drain phases (a quick bubble-fraction estimate follows this list)
- Gradient synchronization delay (eventual-consistency, ~network latency dependent) — Data parallel workers wait for gradient reduction before optimizer step — scales with number of workers and network topology
- Checkpoint save intervals (scheduled-job, ~save_interval steps) — Training pauses periodically to save distributed model state — duration depends on model size and storage speed
- Dataset loading warmup (warmup, ~varies by dataset size) — Initial data loader creation maps indexed datasets into memory and builds sampling indices before first training step
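The fill/drain cost has a standard back-of-envelope estimate: with p pipeline stages and m micro-batches per step, the idle fraction is roughly (p - 1) / (m + p - 1):

```python
# Rough bubble fraction for the pipeline fill/drain delay above.
def bubble_fraction(pipe_parallel_size: int, num_microbatches: int) -> float:
    p, m = pipe_parallel_size, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 1):.0%}")   # 75%: pipeline barely filled
print(f"{bubble_fraction(4, 32):.0%}")  # ~9%: micro-batches amortize fill/drain
```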
Control Points
- Model architecture switches (architecture-switch) — Controls: Fundamental model shape via num_layers, hidden_size, num_attention_heads — determines memory usage and computational requirements. Default: varies per config (125M to 175B parameter ranges)
- Parallelism strategy (runtime-toggle) — Controls: How model is distributed via pipe_parallel_size (layers across GPUs) and model_parallel_size (tensors across GPUs). Default: typically 1-8 for each dimension
- Precision mode selection (precision-mode) — Controls: Training precision via fp16, bf16, or fp32 settings — affects memory usage, speed, and numerical stability. Default: fp16 enabled in most configs
- Learning rate schedule (hyperparameter) — Controls: Convergence behavior via learning_rate, lr_decay_style, warmup_iters — critical for training stability (a shape sketch follows this list). Default: varies by model size, typically 1e-4 to 6e-4
- Data blending weights (sampling-strategy) — Controls: Training data mixture via train_blend — determines what proportion of each dataset appears in training. Default: configured per training run
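A sketch of the typical shape of this schedule, linear warmup into cosine decay; the constants are illustrative defaults, not values from any shipped config:

```python
# Linear warmup to max_lr, then cosine decay to min_lr over train_iters.
import math

def lr_at(step: int, max_lr: float = 6e-4, min_lr: float = 6e-5,
          warmup_iters: int = 1000, train_iters: int = 100_000) -> float:
    if step < warmup_iters:
        return max_lr * step / warmup_iters          # linear warmup
    progress = (step - warmup_iters) / (train_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 1000, 50_000, 100_000):
    print(s, f"{lr_at(s):.2e}")   # ramps up to 6e-4, decays to 6e-5
```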
Technology Stack
- PyTorch: Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation
- DeepSpeed: Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination
- CUDA/cuDNN: GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware
- NCCL: GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication
- Weights & Biases: Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs
- Flash Attention: Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts
- GPT-2 BPE tokenizer: Provides GPT-2 BPE tokenization and vocab handling — converts text to token IDs that the model processes
- lm-evaluation-harness: Standardized evaluation on downstream tasks like LAMBADA, HellaSwag, and PIQA — enables consistent model comparison across the field
Key Components
- NeoXArgs (registry) — Central configuration registry that parses YAML configs into validated dataclass, handles argument inheritance, and provides typed access to all training hyperparameters and system settings
megatron/neox_arguments/neox_args.py
- GPTModelPipe (orchestrator) — Main transformer model that orchestrates embedding layer, transformer blocks, and language model head — implements pipeline parallelism by splitting layers across pipeline stages
megatron/model/gpt2_model.py
- BlendableDataset (adapter) — Combines multiple datasets with configurable sampling weights, ensuring each training step draws from the specified mixture of data sources according to blend ratios
megatron/data/blendable_dataset.py
- DistributedDataParallel (orchestrator) — Manages gradient synchronization across data parallel groups, implements gradient clipping and accumulation, coordinates with tensor/pipeline parallel communication
megatron/model/distributed.py
- DeepSpeedEngine (optimizer) — Integrates DeepSpeed's ZeRO optimizer, gradient scaling, learning rate scheduling, and memory optimization — handles the actual parameter updates and mixed precision training
megatron/training.py
- ParallelTransformerLayer (processor) — Implements one transformer block with parallel attention and MLP computation, handles tensor parallel communication for weight matrices, applies layer norm and residual connections
megatron/model/transformer.py
- initialize_model_parallel (allocator) — Sets up process groups for tensor parallelism (splitting attention heads/MLP) and pipeline parallelism (splitting layers), creates communication topology for multi-dimensional parallelism
megatron/mpu/initialize.py
- get_train_data_loader (factory) — Constructs the training data pipeline by loading indexed datasets, applying blending weights, creating distributed samplers, and returning DataLoader with correct batch size for each worker
megatron/data/data_utils.py
- save_checkpoint (serializer) — Coordinates distributed checkpoint saving by gathering model shards from tensor parallel groups, collecting optimizer states, and writing checkpoint files with metadata for resumption
megatron/checkpointing.py
- EvalHarnessAdapter (adapter) — Bridges GPT-NeoX models with the lm-evaluation-harness framework by implementing the required interface for logit computation and text generation on evaluation tasks
eval_tasks/eval_adapter.py
Frequently Asked Questions
What is gpt-neox used for?
Trains billion-parameter autoregressive language models across multiple GPUs using model parallelism. eleutherai/gpt-neox is a 10-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 121 files.
How is gpt-neox architected?
gpt-neox is organized into 4 architecture layers: Script Entry Points, Core Training Engine, Data Pipeline, Distributed Infrastructure. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through gpt-neox?
Data moves through 6 stages: Configuration and initialization → Data pipeline construction → Model forward pass → Loss computation and backward pass → Gradient synchronization and optimization → Checkpoint saving. Training starts by loading pre-tokenized datasets as indexed token arrays, blending them according to configured weights, and distributing batches across data parallel workers. Each training batch flows through the transformer model split across pipeline stages, producing logits that generate cross-entropy losses. Gradients backpropagate through the pipeline while being synchronized across data parallel groups, then DeepSpeed's optimizer updates the distributed model parameters with gradient clipping and learning rate scheduling. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-neox use?
The core stack includes PyTorch (Provides core tensor operations, autograd, and distributed primitives — the foundation for model implementation and gradient computation), DeepSpeed (Handles ZeRO optimizer state partitioning, mixed precision training, gradient scaling, and provides the launcher for multi-node training coordination), CUDA/cuDNN (GPU acceleration for matrix operations, attention kernels, and custom fused operations — critical for training performance on NVIDIA hardware), NCCL (GPU-to-GPU communication for gradient synchronization, tensor parallel all-reduce operations, and pipeline stage communication), Weights & Biases (Training metrics logging, hyperparameter tracking, and experiment management — provides real-time monitoring of distributed training runs), Flash Attention (Memory-efficient attention computation that reduces O(n²) memory usage to O(n) for long sequences — enables training on longer contexts), plus a GPT-2 BPE tokenizer and the lm-evaluation-harness. A focused set of dependencies that keeps the build manageable.
What system dynamics does gpt-neox have?
gpt-neox exhibits 4 data pools (IndexedDataset cache, Distributed model parameters), 4 feedback loops, 5 control points, and 4 delays. The feedback loops include the training iteration loop and the gradient accumulation cycle. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-neox use?
5 design patterns detected: 3D Model Parallelism, Mixed Precision Training, Gradient Accumulation, Indexed Dataset Abstraction, Configuration-Driven Architecture.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.