bigscience-workshop/megatron-deepspeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Trains transformer language models at scale using distributed computing with pipeline and tensor parallelism
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 170 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training data enters as preprocessed binary files that are memory-mapped and sampled in distributed fashion across data-parallel ranks. Text tokens flow through transformer layers that are split across pipeline stages, with activations passed between stages and gradients accumulated across microbatches. The system alternates between forward passes (computing loss), backward passes (computing gradients), and optimizer steps (updating parameters), with periodic checkpointing of the complete distributed state.
- Initialize Distributed Environment — The initialize_megatron function parses command line arguments to get model architecture and training hyperparameters, initializes process groups for tensor parallelism and pipeline parallelism using torch.distributed, and sets up DeepSpeed integration with ZeRO optimizer state partitioning (config: tensor_model_parallel_size, pipeline_model_parallel_size, deepspeed_config)
- Load and Prepare Training Data — The build_train_valid_test_datasets function opens binary indexed dataset files using memory mapping, splits data according to train/validation/test ratios, and creates MegatronPretrainingSampler objects that ensure each data-parallel rank gets unique batches while tracking consumed samples [TrainingArgs → Dataset Objects] (config: data_path, split, seq_length)
- Sample Training Batches — MegatronPretrainingSampler selects batch indices ensuring no overlap across data-parallel ranks, then MMapIndexedDataset.__getitem__ retrieves token sequences from binary files and pads/truncates them to the configured sequence length to create uniform batches [Dataset Objects → TokenizedBatch] (config: micro_batch_size, seq_length)
- Execute Forward Pass Through Pipeline — TokenizedBatch enters the first pipeline stage of GPTModelPipe, activations flow forward through transformer layers distributed across pipeline stages with tensor-parallel computation within each stage, and the final stage computes loss against target tokens using cross-entropy [TokenizedBatch → Loss Values] (config: num_layers, hidden_size, num_attention_heads)
- Accumulate Gradients Across Microbatches — The training loop in the pretrain function processes multiple microbatches before each optimizer step, the backward pass computes gradients that flow backward through pipeline stages, and gradients are accumulated across microbatches while being synchronized across tensor-parallel ranks (see the sketch after this list) [Loss Values → Accumulated Gradients] (config: gradient_accumulation_steps)
- Update Model Parameters — DistributedOptimizer synchronizes gradients across all parallelism dimensions, applies gradient clipping and learning rate schedule, performs optimizer step to update parameters, then broadcasts updated parameters to ensure consistency across distributed ranks [Accumulated Gradients → Updated Parameters] (config: lr, clip_grad, lr_decay_style)
- Checkpoint Training State — At configured intervals, save_checkpoint collects model state dict from all pipeline stages, gathers optimizer state from all ranks, assembles training metadata including iteration count and consumed samples, then saves complete checkpoint to filesystem for fault tolerance [Updated Parameters → CheckpointState] (config: save_interval, checkpoint_dir)
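The heart of stages 5 and 6 is an accumulate-then-step loop. Below is a minimal single-process sketch of that pattern, assuming illustrative stand-ins for `model`, `optimizer`, and `micro_batches`; the real loop also synchronizes gradients across tensor-, pipeline-, and data-parallel ranks and runs under DeepSpeed.

```python
import torch

def train_step(model, optimizer, micro_batches, clip_grad=1.0):
    """One optimizer step over several microbatches (illustrative sketch)."""
    optimizer.zero_grad()
    for tokens, labels in micro_batches:
        logits = model(tokens)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1))
        # Scale so the accumulated gradient equals the mean over microbatches.
        (loss / len(micro_batches)).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    optimizer.step()
```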
Data Models
The data structures that flow between stages — the contracts that hold the system together.
megatron/data/gpt_dataset.py — dict with 'text': LongTensor[batch_size, seq_len] containing token IDs from vocabulary, where seq_len is the configured sequence length and token IDs are integers in range [0, vocab_size)
Created by dataset __getitem__ from indexed text files, consumed by model forward pass for next-token prediction
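As a quick contract check, that GPT sample shape can be verified directly; the sizes and GPT-2-style vocabulary below are illustrative, not values from the repo:

```python
import torch

batch_size, seq_len, vocab_size = 4, 2048, 50257  # illustrative sizes
sample = {"text": torch.randint(0, vocab_size, (batch_size, seq_len))}
assert sample["text"].dtype == torch.int64             # LongTensor
assert sample["text"].shape == (batch_size, seq_len)
assert int(sample["text"].max()) < vocab_size          # IDs in [0, vocab_size)
```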
megatron/data/bert_dataset.py — dict with 'text': LongTensor[batch_size, seq_len], 'types': LongTensor[batch_size, seq_len], 'is_random': LongTensor[batch_size], 'loss_mask': FloatTensor[batch_size, seq_len], 'lm_labels': LongTensor[batch_size, seq_len], 'padding_mask': BoolTensor[batch_size, seq_len]
Generated by BertDataset with masked token prediction targets, consumed by BERT model for masked language modeling loss calculation
megatron/mpu/initialize.py — global state containing tensor_model_parallel_group: ProcessGroup, pipeline_model_parallel_group: ProcessGroup, data_parallel_group: ProcessGroup, and rank mappings for each parallelism dimension
Initialized once at startup based on world size and parallelism configuration, accessed throughout training for communication group routing
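To make the three-way split concrete, here is a hedged sketch of forming such groups with torch.distributed; the rank layout follows the standard Megatron convention (tensor-parallel ranks contiguous, pipeline stages in blocks, data-parallel peers strided), but the function name and structure are illustrative, not a copy of megatron/mpu/initialize.py.

```python
import torch.distributed as dist

def build_model_parallel_groups(world_size: int, tp: int, pp: int):
    """Partition ranks into tensor-, pipeline-, and data-parallel groups.

    Every rank must execute every dist.new_group call, since group
    creation is a collective operation.
    """
    dp = world_size // (tp * pp)
    tp_groups = [dist.new_group(range(i * tp, (i + 1) * tp))
                 for i in range(world_size // tp)]
    pp_stride = tp * dp  # == world_size // pp
    pp_groups = [dist.new_group(range(i, world_size, pp_stride))
                 for i in range(pp_stride)]
    dp_groups = [dist.new_group(range(start + j, start + tp * dp, tp))
                 for start in range(0, world_size, tp * dp)
                 for j in range(tp)]
    return tp_groups, pp_groups, dp_groups
```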
megatron/checkpointing.py — dict with 'model': state_dict, 'optimizer': optimizer_state, 'lr_scheduler': scheduler_state, 'iteration': int, 'consumed_train_samples': int, 'consumed_valid_samples': int, 'args': Namespace, 'checkpoint_version': float
Assembled during training checkpointing from distributed model/optimizer states, persisted to disk, loaded during training resume or evaluation
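A single-process sketch of assembling that dict follows; the real save_checkpoint gathers these pieces from all pipeline and data-parallel ranks, and the version number here is a placeholder:

```python
import torch

def save_checkpoint_sketch(path, model, optimizer, lr_scheduler,
                           iteration, consumed_train, consumed_valid, args):
    # Single-process stand-in for save_checkpoint (illustrative).
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "lr_scheduler": lr_scheduler.state_dict(),
        "iteration": iteration,
        "consumed_train_samples": consumed_train,
        "consumed_valid_samples": consumed_valid,
        "args": args,
        "checkpoint_version": 3.0,  # placeholder version number
    }
    torch.save(state, path)
```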
megatron/arguments.py — argparse.Namespace with fields like num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, tensor_model_parallel_size: int, pipeline_model_parallel_size: int, micro_batch_size: int, global_batch_size: int, lr: float, deepspeed_config: str
Parsed from command line arguments at startup, stored globally and accessed by all components to configure model architecture, training hyperparameters, and distributed settings
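A reduced sketch of that argument surface; the real parser in megatron/arguments.py defines many more flags and different defaults:

```python
import argparse

# Illustrative subset; argparse maps "--foo-bar" to args.foo_bar.
parser = argparse.ArgumentParser(description="Megatron args (sketch)")
parser.add_argument("--num-layers", type=int)
parser.add_argument("--hidden-size", type=int)
parser.add_argument("--num-attention-heads", type=int)
parser.add_argument("--seq-length", type=int)
parser.add_argument("--tensor-model-parallel-size", type=int, default=1)
parser.add_argument("--pipeline-model-parallel-size", type=int, default=1)
parser.add_argument("--micro-batch-size", type=int)
parser.add_argument("--global-batch-size", type=int)
parser.add_argument("--lr", type=float)
parser.add_argument("--deepspeed_config", type=str)
args = parser.parse_args()
```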
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The memory-mapped dataset files (.bin and .idx) remain stable and uncorrupted throughout training, with no external processes modifying them while they are mapped
If this fails: If files are modified, truncated, or corrupted during training, the memory mapping returns invalid data causing silent training corruption or segmentation faults without error detection
megatron/data/indexed_dataset.py:MMapIndexedDataset.__init__
All data-parallel ranks advance through training iterations in lockstep, with consumed_samples counter staying synchronized across ranks
If this fails: If ranks get out of sync due to failures or restarts, different ranks will sample overlapping or duplicate batches, leading to biased training and incorrect gradient aggregation
megatron/data/data_samplers.py:MegatronPretrainingSampler.__iter__
The filesystem has sufficient space for the complete checkpoint and supports atomic write operations for large files (potentially hundreds of GB)
If this fails: If the disk runs out of space during a checkpoint save, the checkpoint becomes corrupted while training continues, leading to unrecoverable state loss when the checkpoint is needed for recovery
megatron/checkpointing.py:save_checkpoint
The seq_length parameter matches the sequence length used during dataset preprocessing, and all sequences in the binary files are exactly seq_length tokens
If this fails: If preprocessing used different sequence length, the dataset returns incorrectly shaped tensors causing model dimension mismatches that fail silently or produce wrong results
megatron/data/gpt_dataset.py:build_train_valid_test_datasets
The parent directory structure exists and the megatron package is located exactly one directory up from tasks/main.py
If this fails: If directory structure changes or megatron is installed elsewhere, imports fail with cryptic ModuleNotFoundError, making the tasks module completely unusable
tasks/main.py:sys.path.append
Checkpoint version is set exactly once before any checkpoint operations, and all ranks agree on the same version number
If this fails: If ranks have different checkpoint versions or the version is changed mid-training, checkpoint save/load operations fail with assertion errors, causing training to crash
megatron/checkpointing.py:set_checkpoint_version
Vocabulary size stays below 65500 tokens for uint16 optimization, and vocab_size parameter accurately reflects the actual vocabulary used
If this fails: If vocabulary exceeds 65500 tokens but uint16 is used, token IDs get silently truncated causing wrong token mappings and corrupted model inputs
megatron/data/indexed_dataset.py:best_fitting_dtype
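The dtype selection this assumption hinges on reduces to a few lines; this is a paraphrased sketch of the best_fitting_dtype logic, not the file's exact code:

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    # uint16 holds IDs up to 65535; 65500 leaves headroom for special tokens.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32
```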
The total_samples parameter exactly matches the number of samples in the actual dataset, with no samples added or removed after sampler initialization
If this fails: If dataset size differs from total_samples, the sampler produces out-of-bounds indices causing IndexError during data loading, or skips valid samples
megatron/data/data_samplers.py:MegatronPretrainingSampler.__init__
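A hedged sketch of the rank-disjoint slicing idea (class and attribute names are illustrative): each data-parallel rank owns a fixed contiguous slice of every global batch, so batches never overlap and consumed_samples advances identically on all ranks.

```python
class RankSlicedSampler:
    """Each rank takes a fixed contiguous slice of every global batch."""

    def __init__(self, total_samples, consumed_samples,
                 micro_batch_size, dp_rank, dp_size):
        self.total = total_samples
        self.consumed = consumed_samples
        self.micro = micro_batch_size
        self.offset = dp_rank * micro_batch_size      # this rank's slice start
        self.global_batch = micro_batch_size * dp_size

    def __iter__(self):
        idx = self.consumed
        while idx + self.global_batch <= self.total:  # drop the ragged tail
            yield list(range(idx + self.offset, idx + self.offset + self.micro))
            idx += self.global_batch
```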
The vision tasks module expects to import a 'classification' module that exists in the same directory and has a main() function
If this fails: If classification.py is missing or main() function doesn't exist, the vision task fails with ImportError or AttributeError at runtime
tasks/vision/main.py:sys.path.append
The data_prefix[0] path exists and contains valid indexed dataset files (.bin and .idx) with consistent formatting
If this fails: If files are missing, corrupted, or have wrong format, dataset construction fails during training startup with unclear error messages about file access
megatron/data/gpt_dataset.py:_build_train_valid_test_datasets
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Indexed Training Dataset — Binary files containing preprocessed and tokenized training text, with separate index files for efficient random access during distributed sampling
- Distributed Model Parameters — Model weights distributed across tensor-parallel and pipeline-parallel ranks, with each rank holding a subset of the total parameters
- Checkpoint Snapshots — Periodic snapshots of complete distributed training state including model weights, optimizer states, and training progress metadata
- Global Configuration State — Shared training configuration and runtime state accessible across all components, including parsed arguments and distributed group handles
Feedback Loops
- Training Loop with Gradient Accumulation (training-loop, reinforcing) — Trigger: Start of each training iteration. Action: Process microbatches with forward/backward passes, accumulate gradients, then perform optimizer step when accumulation is complete. Exit: Reaches configured number of training iterations or manual termination.
- Learning Rate Decay Schedule (convergence, balancing) — Trigger: Each optimizer step. Action: Adjusts learning rate according to schedule (cosine, linear, or polynomial decay) based on current training progress. Exit: Training completes or reaches minimum learning rate.
- Loss Scaling for Mixed Precision (auto-scale, balancing) — Trigger: Detection of gradient overflow or underflow. Action: Dynamically adjusts loss scaling factor to prevent gradient underflow in fp16 training while avoiding overflow. Exit: Finds stable scaling factor or training terminates.
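The loss-scaling loop can be reduced to a small policy object: back off on overflow, grow again after a run of clean steps. The constants below are illustrative, not the repo's defaults:

```python
class DynamicLossScaler:
    """Grow the scale after a run of clean steps; back off on overflow."""

    def __init__(self, scale=2.0 ** 16, growth_interval=1000, backoff=0.5):
        self.scale = scale
        self.growth_interval = growth_interval
        self.backoff = backoff
        self.good_steps = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            self.scale = max(self.scale * self.backoff, 1.0)  # halve, floor at 1
            self.good_steps = 0  # the step is skipped; gradients are discarded
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0
```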
Delays
- Pipeline Bubble Overhead (async-processing, ~(pipeline_parallel_size - 1) * microbatch_forward_time) — Pipeline stages wait on activation dependencies, creating idle time during the warmup and cooldown phases of each batch (see the worked example after this list)
- Gradient Synchronization (async-processing, duration depends on network bandwidth and model size) — All ranks must synchronize gradients before the optimizer step, creating synchronization barriers
- Checkpoint Save Operations (checkpoint-save, ~Minutes depending on model size) — Training pauses while model state is gathered from all ranks and written to disk
- Data Loading from Memory-Mapped Files (async-processing, ~Microseconds per sample) — Memory-mapped file access introduces small delays compared to in-memory data
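For the pipeline bubble, the standard back-of-envelope estimate for a GPipe-style schedule with p stages and m microbatches is bubble_fraction = (p - 1) / (m + p - 1); the numbers below are illustrative:

```python
p, m = 8, 32                             # illustrative: 8 stages, 32 microbatches
bubble_fraction = (p - 1) / (m + p - 1)  # 7/39 ≈ 0.18, i.e. ~18% idle time
```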
Control Points
- Model Parallelism Configuration (architecture-switch) — Controls: How model layers are distributed across GPUs via tensor_model_parallel_size and pipeline_model_parallel_size parameters. Default: Specified via command line
- Mixed Precision Training Mode (precision-mode) — Controls: Whether training uses fp16, bf16, or fp32 precision, affecting memory usage and training speed. Default: fp16 or bf16
- DeepSpeed ZeRO Optimization Stage (runtime-toggle) — Controls: Level of optimizer state and gradient partitioning (stage 1, 2, or 3) affecting memory distribution. Default: stage3
- Gradient Accumulation Steps (hyperparameter) — Controls: Number of microbatches processed before each optimizer step, trading memory for batch size. Default: Calculated as global_batch_size / (micro_batch_size * data_parallel_size); see the arithmetic after this list
- Sequence Length Configuration (architecture-switch) — Controls: Maximum input sequence length, directly impacting memory usage and computation time. Default: typically 1024, 2048, or 4096
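The Gradient Accumulation Steps default above is pure arithmetic; with illustrative values:

```python
global_batch_size, micro_batch_size, data_parallel_size = 2048, 4, 64
grad_acc_steps = global_batch_size // (micro_batch_size * data_parallel_size)
assert grad_acc_steps == 8  # 8 microbatches per rank between optimizer steps
```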
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, and distributed communication primitives for model training
- DeepSpeed — Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies
- NVIDIA Apex — Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations
- CUDA — GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers
- MPI/NCCL — Handles low-level inter-GPU and inter-node communication for distributed training coordination
- NumPy — Supports dataset preprocessing, indexing operations, and array manipulations in the data pipeline
Key Components
- initialize_megatron (orchestrator) — Coordinates the complete distributed training setup by parsing arguments, initializing model parallelism groups, setting up DeepSpeed integration, and preparing the global training state (megatron/initialize.py)
- GPTModelPipe (transformer) — Implements the pipeline-parallel GPT transformer model with layers distributed across pipeline stages, handling forward/backward pass coordination and gradient synchronization (megatron/model/gpt_model.py)
- MegatronPretrainingSampler (scheduler) — Manages distributed sampling of training data across data-parallel ranks, ensuring each rank gets unique non-overlapping batches while tracking consumed samples for checkpointing (megatron/data/data_samplers.py)
- build_train_valid_test_datasets (factory) — Constructs dataset objects from indexed binary files, handling data splitting across train/validation/test, blending multiple datasets with weights, and configuring sequence lengths (megatron/data/gpt_dataset.py)
- save_checkpoint (serializer) — Collects model and optimizer state from all distributed ranks, assembles checkpoint metadata including training progress, and persists complete training state to the filesystem (megatron/checkpointing.py)
- pretrain (orchestrator) — Executes the main training loop with forward/backward passes, gradient accumulation across microbatches, optimizer steps, and learning rate scheduling while coordinating distributed communication (megatron/training.py)
- DistributedOptimizer (optimizer) — Wraps PyTorch optimizers to handle gradient synchronization across tensor-parallel and pipeline-parallel ranks, ensuring consistent parameter updates in a distributed setting (megatron/optimizer/optimizer.py)
- indexed_dataset.MMapIndexedDataset (loader) — Provides memory-mapped access to large preprocessed datasets stored as binary files with separate index files, enabling efficient random access without loading the entire dataset into memory (megatron/data/indexed_dataset.py)
Frequently Asked Questions
What is Megatron-DeepSpeed used for?
Megatron-DeepSpeed trains transformer language models at scale using distributed computing with pipeline and tensor parallelism. bigscience-workshop/megatron-deepspeed is an 8-component ML training system written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 170 files.
How is Megatron-DeepSpeed architected?
Megatron-DeepSpeed is organized into 5 architecture layers: Training Entry Points, Model Architecture Layer, Distributed Training Infrastructure, Data Processing Pipeline, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through Megatron-DeepSpeed?
Data moves through 7 stages: Initialize Distributed Environment → Load and Prepare Training Data → Sample Training Batches → Execute Forward Pass Through Pipeline → Accumulate Gradients Across Microbatches → Update Model Parameters → Checkpoint Training State. Preprocessed binary files are memory-mapped and sampled in distributed fashion across data-parallel ranks, tokens flow through transformer layers split across pipeline stages, and the system alternates forward passes, backward passes, and optimizer steps with periodic checkpointing of the complete distributed state.
What technologies does Megatron-DeepSpeed use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and distributed communication primitives for model training), DeepSpeed (Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies), NVIDIA Apex (Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations), CUDA (GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers), MPI/NCCL (Handles low-level inter-GPU and inter-node communication for distributed training coordination), NumPy (Supports dataset preprocessing, indexing operations, and array manipulations in data pipeline). A focused set of dependencies that keeps the build manageable.
What system dynamics does Megatron-DeepSpeed have?
Megatron-DeepSpeed exhibits 4 data pools (including Indexed Training Dataset and Distributed Model Parameters), 3 feedback loops, 5 control points, and 4 delays. The feedback loops handle the training loop, learning rate convergence, and loss-scale auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does Megatron-DeepSpeed use?
5 design patterns detected: Model Parallelism Strategy, Gradient Accumulation with Microbatching, Memory-Mapped Dataset Access, Distributed Checkpoint State Management, Mixed Precision Training with Loss Scaling.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.