eleutherai/gpt-neox
An implementation of model-parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries.
A large-scale autoregressive transformer training framework using Megatron and DeepSpeed.
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
Structural Verdict
A 10-component ML training system with 6 connections. 121 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
Text data flows from preprocessed indexed datasets through distributed dataloaders to model parallel transformers, with checkpoints saved periodically and evaluation metrics logged to W&B
- Data Preprocessing — Raw text tokenized and packed into indexed datasets using prepare_data.py (config: data_path, vocab_file, seq_length)
- Dataset Loading — GPT2Dataset loads token sequences with BlendableDataset for weighted sampling (config: data_path, train_data_paths, valid_data_paths +1)
- Distributed Batching — DistributedBatchSampler ensures consistent batches across data parallel workers (config: batch_size, micro_batch_size, data_parallel_size)
- Model Forward Pass — Transformer processes sequences with tensor/pipeline parallelism across GPUs (config: num_layers, hidden_size, num_attention_heads +2)
- Loss Computation — Cross-entropy loss calculated with gradient accumulation for large effective batch sizes (config: gradient_accumulation_steps, loss_scale)
- Optimizer Step — DeepSpeed optimizer updates model weights with ZeRO memory optimization (config: optimizer, lr, weight_decay +1)
- Checkpointing — Model state saved to distributed checkpoints at regular intervals (config: save, save_interval, checkpoint_factor)
- Evaluation — Model evaluated on validation tasks using lm-evaluation-harness (config: eval_tasks, eval_interval)
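The "Distributed Batching" stage above can be sketched in plain Python. `shard_batches` is a hypothetical stand-in for the real `DistributedBatchSampler` in megatron/data/samplers.py: every rank walks the same global batches but keeps only its own contiguous slice, so batches stay consistent across data parallel workers.

```python
def shard_batches(indices, batch_size, data_parallel_size, rank):
    """Yield this rank's slice of each global batch (illustrative sketch,
    not the actual DistributedBatchSampler implementation)."""
    assert batch_size % data_parallel_size == 0
    per_rank = batch_size // data_parallel_size
    for start in range(0, len(indices) - batch_size + 1, batch_size):
        batch = indices[start:start + batch_size]
        # Every rank sees the same global batch; each keeps a disjoint slice.
        yield batch[rank * per_rank:(rank + 1) * per_rank]

# Two ranks split each global batch of 4 into disjoint halves:
r0 = list(shard_batches(list(range(8)), batch_size=4, data_parallel_size=2, rank=0))
r1 = list(shard_batches(list(range(8)), batch_size=4, data_parallel_size=2, rank=1))
# r0 == [[0, 1], [4, 5]], r1 == [[2, 3], [6, 7]]
```

Because every rank derives its slice from the same index stream, no inter-worker communication is needed to agree on batch membership.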
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Indexed Datasets — Memory-mapped tokenized text sequences for training and validation
- Model Checkpoints — Distributed model state snapshots saved periodically during training
- KV Cache — Key/value cache for attention layers during inference and generation
- Gradient Buffers — Accumulated gradients across microbatches before an optimizer step
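The KV cache pool above grows by one position per generation step. A toy version (all names hypothetical; the real cache lives inside the attention layers in megatron/model/transformer.py) shows why it counts as an accumulating data pool:

```python
class KVCache:
    """Toy per-layer key/value cache for autoregressive decoding."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        # Each decoding step appends one position's K/V per layer, so
        # earlier tokens' projections are never recomputed.
        self.keys[layer].append(k)
        self.values[layer].append(v)
        return self.keys[layer], self.values[layer]

cache = KVCache(num_layers=2)
cache.append(0, k=[0.1], v=[0.2])
cache.append(0, k=[0.3], v=[0.4])
# Layer 0 now holds two cached positions; layer 1 holds none.
```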
Feedback Loops
- Training Loop (recursive, balancing) — Trigger: Start of training epoch. Action: Forward pass, backward pass, optimizer step. Exit: Max steps or convergence reached.
- Checkpoint Recovery (retry, balancing) — Trigger: Training failure or restart. Action: Load latest checkpoint and resume from saved step. Exit: Successful model restoration.
- Gradient Accumulation (recursive, reinforcing) — Trigger: Start of training step. Action: Process microbatches and accumulate gradients. Exit: Gradient accumulation steps completed.
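The gradient-accumulation loop above can be reduced to a few lines. This sketch uses scalar losses as stand-ins for per-microbatch gradients (`accumulate_gradients` is a hypothetical helper, not gpt-neox API): contributions are scaled by the accumulation step count so the emitted value is a mean, giving a large effective batch from small microbatches.

```python
def accumulate_gradients(microbatch_losses, accumulation_steps):
    """Sum scaled per-microbatch contributions; emit one 'gradient'
    per optimizer step (illustrative sketch)."""
    grad = 0.0
    for step, loss in enumerate(microbatch_losses, start=1):
        grad += loss / accumulation_steps   # scale so the sum is a mean
        if step % accumulation_steps == 0:
            yield grad                      # one optimizer step's gradient
            grad = 0.0                      # reset buffer for the next step

steps = list(accumulate_gradients([1.0, 3.0, 2.0, 6.0], accumulation_steps=2))
# steps == [2.0, 4.0]: two optimizer steps, each averaging two microbatches
```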
Delays & Async Processing
- Checkpoint Saving (async-processing, ~Varies by model size) — Training pauses while model state is written to disk
- Data Loading (batch-window, ~Per batch) — Workers wait for next batch of tokenized sequences
- Cross-GPU Communication (async-processing, ~Network latency) — Synchronization points in model parallel operations
- Evaluation Interval (scheduled-job, ~eval_interval steps) — Periodic evaluation pauses training for validation metrics
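The checkpoint-saving delay above can be illustrated with a toy writer (all names here are hypothetical; gpt-neox's actual checkpointing lives in megatron/checkpointing.py). The synchronous call is where training pauses; a background thread is one way such a pause could be overlapped with compute.

```python
import threading
import time

def save_checkpoint(state, done):
    """Stand-in for a slow checkpoint write; training blocks here unless
    the save is handed to a background writer."""
    time.sleep(0.01)              # pretend this is disk / S3 I/O
    done["saved"] = dict(state)

# Synchronous save: the training loop waits for the write to finish.
sync_result = {}
save_checkpoint({"step": 10}, sync_result)

# Async variant: training could overlap compute with the write,
# joining only before the next checkpoint is taken.
async_result = {}
writer = threading.Thread(target=save_checkpoint, args=({"step": 20}, async_result))
writer.start()
writer.join()
```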
Control Points
- Learning Rate Schedule (runtime-toggle) — Controls: Optimizer learning rate decay during training. Default: lr scheduler config
- Parallelism Strategy (env-var) — Controls: Distribution of model across tensor/pipeline/data parallel dimensions. Default: model_parallel_size, pipe_parallel_size
- Gradient Clipping (threshold) — Controls: Maximum gradient norm before clipping. Default: clip_grad config
- Checkpoint Frequency (env-var) — Controls: How often model state is saved during training. Default: save_interval config
- Mixed Precision (feature-flag) — Controls: Enable FP16/BF16 training for memory efficiency. Default: fp16/bf16 config flags
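The "Learning Rate Schedule" control point above is typically a warmup-then-decay curve. This sketch assumes linear warmup followed by cosine decay, a common configuration; `lr_at` is a hypothetical helper, and the actual decay style is chosen in the lr scheduler config.

```python
import math

def lr_at(step, max_steps, base_lr, warmup_steps):
    """Learning rate at a given step: linear warmup, then cosine decay
    to zero (illustrative sketch of one common schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# LR ramps up through warmup, peaks at base_lr, then decays toward 0:
early, peak, final = lr_at(0, 100, 1e-3, 10), lr_at(10, 100, 1e-3, 10), lr_at(100, 100, 1e-3, 10)
```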
Technology Stack
- PyTorch — Deep learning framework
- DeepSpeed — Distributed training optimization
- Megatron-LM — Model parallel transformer implementation
- CUDA — GPU acceleration and custom kernels
- Weights & Biases — Experiment tracking and logging
- lm-evaluation-harness — Standardized language model evaluation
- NumPy — Numerical computations and data handling
- AWS S3 — Checkpoint storage
Key Components
- NeoXArgs (config, megatron/neox_arguments/__init__.py) — Central configuration management that parses YAML configs and command-line arguments
- setup_for_inference_or_eval (function, megatron/utils.py) — Initializes the model and distributed environment for evaluation and generation
- GPT2Dataset (class, megatron/data/gpt2_dataset.py) — Handles efficient loading and batching of tokenized text sequences for training
- BlendableDataset (class, megatron/data/blendable_dataset.py) — Combines multiple datasets with weighted sampling for mixed training data
- EvalHarnessAdapter (class, eval_tasks/eval_adapter.py) — Bridges GPT-NeoX models to the lm-evaluation-harness framework for standardized evaluation
- forward_step (function, megatron/training.py) — Executes the forward pass through the transformer model with gradient accumulation
- IndexedDataset (class, megatron/data/indexed_dataset.py) — Memory-mapped dataset implementation for efficient random access to large text corpora
- load_checkpoint (function, megatron/checkpointing.py) — Handles distributed checkpoint loading with support for model parallel state reconstruction
- generate_samples_from_prompt (function, megatron/text_generation_utils.py) — Autoregressive text generation with various decoding strategies and sampling methods
- DistributedBatchSampler (class, megatron/data/samplers.py) — Ensures consistent batch distribution across data parallel workers
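The layout idea behind IndexedDataset — a flat binary token file plus an offset index enabling random access — can be sketched with the standard library alone. `write_indexed` and `read_document` are hypothetical helpers, far simpler than the real megatron/data/indexed_dataset.py:

```python
import os
import struct
import tempfile

def write_indexed(path, documents):
    """Write token documents as one flat uint32 binary file plus a
    cumulative-offset index (simplified sketch of the on-disk layout)."""
    offsets = [0]
    with open(path + ".bin", "wb") as f:
        for doc in documents:
            f.write(struct.pack(f"{len(doc)}I", *doc))
            offsets.append(offsets[-1] + len(doc))
    return offsets

def read_document(path, offsets, i):
    """Random access to document i: seek straight to its token span,
    without reading anything before it."""
    start, end = offsets[i], offsets[i + 1]
    with open(path + ".bin", "rb") as f:
        f.seek(start * 4)  # 4 bytes per uint32 token
        return list(struct.unpack(f"{end - start}I", f.read((end - start) * 4)))

base = os.path.join(tempfile.mkdtemp(), "toy")
offs = write_indexed(base, [[1, 2, 3], [40, 50]])
# read_document(base, offs, 1) recovers [40, 50] with a single seek
```

The real implementation memory-maps the `.bin` file instead of seeking per read, which is what makes random access over terabyte-scale corpora cheap.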
Sub-Modules
- Main distributed training loop with model parallel transformers
- Model evaluation using lm-evaluation-harness tasks
- Autoregressive text generation and inference
- Conversion of raw text to tokenized indexed datasets
Configuration
configs/1-3B-transformer-engine.yml (yaml)
pipe_parallel_size (number) — default: 1; model_parallel_size (number) — default: 1; num_layers (number) — default: 24; hidden_size (number) — default: 2048; num_attention_heads (number) — default: 16; seq_length (number) — default: 2048; max_position_embeddings (number) — default: 2048; norm (string) — default: layernorm; +60 more parameters
configs/1-3B.yml (yaml)
pipe_parallel_size (number) — default: 1; model_parallel_size (number) — default: 1; num_layers (number) — default: 24; hidden_size (number) — default: 2048; num_attention_heads (number) — default: 16; seq_length (number) — default: 2048; max_position_embeddings (number) — default: 2048; norm (string) — default: layernorm; +50 more parameters
configs/125M-dmoe.yml (yaml)
pipe_parallel_size (number) — default: 1; model_parallel_size (number) — default: 1; num_layers (number) — default: 12; hidden_size (number) — default: 768; num_attention_heads (number) — default: 12; seq_length (number) — default: 2048; max_position_embeddings (number) — default: 2048; norm (string) — default: layernorm; +48 more parameters
configs/125M-json.yml (yaml)
pipe_parallel_size (number) — default: 1; model_parallel_size (number) — default: 1; num_layers (number) — default: 12; hidden_size (number) — default: 768; num_attention_heads (number) — default: 12; seq_length (number) — default: 2048; max_position_embeddings (number) — default: 2048; norm (string) — default: layernorm; +50 more parameters
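NeoXArgs builds one flat configuration from multiple YAML files plus command-line overrides, with later sources winning. A minimal sketch of that merge idea, assuming plain dicts and flat keys (`merge_configs` is hypothetical; the real NeoXArgs also validates types and derives dependent values):

```python
def merge_configs(*configs):
    """Merge configuration dicts left to right; later sources override
    earlier ones (simplified sketch of NeoXArgs-style merging)."""
    merged = {}
    for cfg in configs:
        merged.update(cfg)
    return merged

defaults = {"num_layers": 24, "hidden_size": 2048, "norm": "layernorm"}
override = {"hidden_size": 768, "num_layers": 12}   # e.g. a smaller model config
cfg = merge_configs(defaults, override)
# cfg keeps "norm" from defaults but takes sizes from the override
```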
Science Pipeline
- Tokenization (prepare_data.py) — Raw text converted to token IDs using the vocab_file tokenizer [variable-length strings → (num_documents, variable_length)]
- Sequence Packing (megatron/data/gpt2_dataset.py) — Token sequences packed into fixed-length training samples [(num_documents, variable_length) → (num_samples, seq_length)]
- Embedding Lookup (megatron/model/language_model.py) — Token IDs converted to dense embeddings with positional encoding [(batch_size, seq_length) → (batch_size, seq_length, hidden_size)]
- Transformer Layers (megatron/model/transformer.py) — Self-attention and feedforward processing across num_layers [(batch_size, seq_length, hidden_size) → (batch_size, seq_length, hidden_size)]
- Language Modeling Head (megatron/model/language_model.py) — Hidden states projected to vocabulary logits for next-token prediction [(batch_size, seq_length, hidden_size) → (batch_size, seq_length, vocab_size)]
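The sequence-packing stage above turns variable-length documents into fixed seq_length samples. A sketch of the idea, assuming documents are concatenated with an end-of-document token between them (`pack_sequences` and the `eod_token` value are illustrative, not the exact GPT2Dataset logic):

```python
def pack_sequences(documents, seq_length, eod_token=0):
    """Concatenate token documents with an EOD separator, then cut the
    stream into fixed-length training samples (simplified sketch)."""
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eod_token)   # marks document boundaries in the stream
    # Non-overlapping windows of seq_length; a ragged tail is dropped here.
    return [stream[i:i + seq_length]
            for i in range(0, len(stream) - seq_length + 1, seq_length)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
samples = pack_sequences(docs, seq_length=4)
# stream = [5,6,7,0,8,9,0,10,11,12,13,0]
# samples == [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Packing wastes no compute on padding: every position in every sample is a real training token, at the cost of samples that can straddle document boundaries.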
Assumptions & Constraints
- [warning] Assumes input sequences are exactly seq_length tokens but truncation/padding logic varies by dataset (shape)
- [critical] Assumes tensor parallel groups are on same physical GPU devices for efficient communication (device)
- [critical] Assumes specific CUDA compute capability and version for kernel compilation (dependency)
- [warning] Dataset weights assumed to be positive numbers but no explicit validation enforced (value-range)
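The dataset-weight warning above is the kind of gap a small explicit guard would close. A sketch of such a check (hypothetical helper, not present in the codebase per the finding): reject non-positive weights, then normalize so they sum to 1 for sampling.

```python
def validate_weights(weights):
    """Reject empty or non-positive dataset mixing weights, then
    normalize them to a probability distribution (illustrative guard)."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError(f"dataset weights must be positive, got {weights!r}")
    total = sum(weights)
    return [w / total for w in weights]

# [1.0, 3.0] normalizes to [0.25, 0.75]; a negative weight raises ValueError.
normalized = validate_weights([1.0, 3.0])
```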
Frequently Asked Questions
What is gpt-neox used for?
eleutherai/gpt-neox is a large-scale autoregressive transformer training framework using Megatron and DeepSpeed. It is a 10-component ML training system written in Python, loosely coupled, with components that are relatively independent. The codebase contains 121 files.
How is gpt-neox architected?
gpt-neox is organized into 5 architecture layers: Entry Scripts, Configuration Layer, Core Engine, Data Processing, and 1 more. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through gpt-neox?
Data moves through 8 stages: Data Preprocessing → Dataset Loading → Distributed Batching → Model Forward Pass → Loss Computation → .... Text data flows from preprocessed indexed datasets through distributed dataloaders to model parallel transformers, with checkpoints saved periodically and evaluation metrics logged to W&B. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-neox use?
The core stack includes PyTorch (Deep learning framework), DeepSpeed (Distributed training optimization), Megatron-LM (Model parallel transformer implementation), CUDA (GPU acceleration and custom kernels), Weights & Biases (Experiment tracking and logging), lm-evaluation-harness (Standardized language model evaluation), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does gpt-neox have?
gpt-neox exhibits 4 data pools (including Indexed Datasets and Model Checkpoints), 3 feedback loops, 5 control points, and 4 delays. The feedback loops follow recursive and retry patterns. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-neox use?
5 design patterns detected: Model Parallel Architecture, Configuration-Driven Design, Distributed Training Orchestration, Memory-Mapped Data Loading, CUDA Kernel Optimization.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.