bigscience-workshop/megatron-deepspeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Trains transformer language models at scale using distributed computing with pipeline and tensor parallelism
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 170 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training data enters as preprocessed binary files that are memory-mapped and sampled in distributed fashion across data-parallel ranks. Text tokens flow through transformer layers that are split across pipeline stages, with activations passed between stages and gradients accumulated across microbatches. The system alternates between forward passes (computing loss), backward passes (computing gradients), and optimizer steps (updating parameters), with periodic checkpointing of the complete distributed state.
- Initialize Distributed Environment — The initialize_megatron function parses command line arguments to get model architecture and training hyperparameters, initializes process groups for tensor parallelism and pipeline parallelism using torch.distributed, and sets up DeepSpeed integration with ZeRO optimizer state partitioning (config: tensor_model_parallel_size, pipeline_model_parallel_size, deepspeed_config)
- Load and Prepare Training Data — The build_train_valid_test_datasets function opens binary indexed dataset files using memory mapping, splits data according to train/validation/test ratios, and creates MegatronPretrainingSampler objects that ensure each data-parallel rank gets unique batches while tracking consumed samples [TrainingArgs → Dataset Objects] (config: data_path, split, seq_length)
- Sample Training Batches — MegatronPretrainingSampler selects batch indices ensuring no overlap across data-parallel ranks, then MMapIndexedDataset.__getitem__ retrieves token sequences from binary files and pads/truncates them to the configured sequence length to create uniform batches [Dataset Objects → TokenizedBatch] (config: micro_batch_size, seq_length)
- Execute Forward Pass Through Pipeline — TokenizedBatch enters the first pipeline stage of GPTModelPipe, activations flow forward through transformer layers distributed across pipeline stages with tensor-parallel computation within each stage, and the final stage computes loss against target tokens using cross-entropy [TokenizedBatch → Loss Values] (config: num_layers, hidden_size, num_attention_heads)
- Accumulate Gradients Across Microbatches — The training loop in the pretrain function processes multiple microbatches before each optimizer step, the backward pass computes gradients that flow backward through pipeline stages, and gradients are accumulated across microbatches while being synchronized across tensor-parallel ranks (see the sketch after this list) [Loss Values → Accumulated Gradients] (config: gradient_accumulation_steps)
- Update Model Parameters — DistributedOptimizer synchronizes gradients across all parallelism dimensions, applies gradient clipping and learning rate schedule, performs optimizer step to update parameters, then broadcasts updated parameters to ensure consistency across distributed ranks [Accumulated Gradients → Updated Parameters] (config: lr, clip_grad, lr_decay_style)
- Checkpoint Training State — At configured intervals, save_checkpoint collects model state dict from all pipeline stages, gathers optimizer state from all ranks, assembles training metadata including iteration count and consumed samples, then saves complete checkpoint to filesystem for fault tolerance [Updated Parameters → CheckpointState] (config: save_interval, checkpoint_dir)
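The heart of stages 5 and 6 is an accumulate-then-step loop. Below is a minimal single-process sketch of that pattern, assuming illustrative stand-ins for `model`, `optimizer`, and `micro_batches`; the real loop also synchronizes gradients across tensor-, pipeline-, and data-parallel ranks and runs under DeepSpeed.

```python
import torch

def train_step(model, optimizer, micro_batches, clip_grad=1.0):
    """One optimizer step over several microbatches (illustrative sketch)."""
    optimizer.zero_grad()
    for tokens, labels in micro_batches:
        logits = model(tokens)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1))
        # Scale so the accumulated gradient equals the mean over microbatches.
        (loss / len(micro_batches)).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    optimizer.step()
```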
Data Models
The data structures that flow between stages — the contracts that hold the system together.
megatron/data/gpt_dataset.py — dict with 'text': LongTensor[batch_size, seq_len] containing token IDs from vocabulary, where seq_len is the configured sequence length and token IDs are integers in range [0, vocab_size)
Created by dataset __getitem__ from indexed text files, consumed by model forward pass for next-token prediction
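As a quick contract check, that GPT sample shape can be verified directly; the sizes and GPT-2-style vocabulary below are illustrative, not values from the repo:

```python
import torch

batch_size, seq_len, vocab_size = 4, 2048, 50257  # illustrative sizes
sample = {"text": torch.randint(0, vocab_size, (batch_size, seq_len))}
assert sample["text"].dtype == torch.int64             # LongTensor
assert sample["text"].shape == (batch_size, seq_len)
assert int(sample["text"].max()) < vocab_size          # IDs in [0, vocab_size)
```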
megatron/data/bert_dataset.py — dict with 'text': LongTensor[batch_size, seq_len], 'types': LongTensor[batch_size, seq_len], 'is_random': LongTensor[batch_size], 'loss_mask': FloatTensor[batch_size, seq_len], 'lm_labels': LongTensor[batch_size, seq_len], 'padding_mask': BoolTensor[batch_size, seq_len]
Generated by BertDataset with masked token prediction targets, consumed by BERT model for masked language modeling loss calculation
megatron/mpu/initialize.py — global state containing tensor_model_parallel_group: ProcessGroup, pipeline_model_parallel_group: ProcessGroup, data_parallel_group: ProcessGroup, and rank mappings for each parallelism dimension
Initialized once at startup based on world size and parallelism configuration, accessed throughout training for communication group routing
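To make the three-way split concrete, here is a hedged sketch of forming such groups with torch.distributed; the rank layout follows the standard Megatron convention (tensor-parallel ranks contiguous, pipeline stages in blocks, data-parallel peers strided), but the function name and structure are illustrative, not a copy of megatron/mpu/initialize.py.

```python
import torch.distributed as dist

def build_model_parallel_groups(world_size: int, tp: int, pp: int):
    """Partition ranks into tensor-, pipeline-, and data-parallel groups.

    Every rank must execute every dist.new_group call, since group
    creation is a collective operation.
    """
    dp = world_size // (tp * pp)
    tp_groups = [dist.new_group(range(i * tp, (i + 1) * tp))
                 for i in range(world_size // tp)]
    pp_stride = tp * dp  # == world_size // pp
    pp_groups = [dist.new_group(range(i, world_size, pp_stride))
                 for i in range(pp_stride)]
    dp_groups = [dist.new_group(range(start + j, start + tp * dp, tp))
                 for start in range(0, world_size, tp * dp)
                 for j in range(tp)]
    return tp_groups, pp_groups, dp_groups
```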
megatron/checkpointing.py — dict with 'model': state_dict, 'optimizer': optimizer_state, 'lr_scheduler': scheduler_state, 'iteration': int, 'consumed_train_samples': int, 'consumed_valid_samples': int, 'args': Namespace, 'checkpoint_version': float
Assembled during training checkpointing from distributed model/optimizer states, persisted to disk, loaded during training resume or evaluation
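A single-process sketch of assembling that dict follows; the real save_checkpoint gathers these pieces from all pipeline and data-parallel ranks, and the version number here is a placeholder:

```python
import torch

def save_checkpoint_sketch(path, model, optimizer, lr_scheduler,
                           iteration, consumed_train, consumed_valid, args):
    # Single-process stand-in for save_checkpoint (illustrative).
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "lr_scheduler": lr_scheduler.state_dict(),
        "iteration": iteration,
        "consumed_train_samples": consumed_train,
        "consumed_valid_samples": consumed_valid,
        "args": args,
        "checkpoint_version": 3.0,  # placeholder version number
    }
    torch.save(state, path)
```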
megatron/arguments.py — argparse.Namespace with fields like num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, tensor_model_parallel_size: int, pipeline_model_parallel_size: int, micro_batch_size: int, global_batch_size: int, lr: float, deepspeed_config: str
Parsed from command line arguments at startup, stored globally and accessed by all components to configure model architecture, training hyperparameters, and distributed settings
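A reduced sketch of that argument surface; the real parser in megatron/arguments.py defines many more flags and different defaults:

```python
import argparse

# Illustrative subset; argparse maps "--foo-bar" to args.foo_bar.
parser = argparse.ArgumentParser(description="Megatron args (sketch)")
parser.add_argument("--num-layers", type=int)
parser.add_argument("--hidden-size", type=int)
parser.add_argument("--num-attention-heads", type=int)
parser.add_argument("--seq-length", type=int)
parser.add_argument("--tensor-model-parallel-size", type=int, default=1)
parser.add_argument("--pipeline-model-parallel-size", type=int, default=1)
parser.add_argument("--micro-batch-size", type=int)
parser.add_argument("--global-batch-size", type=int)
parser.add_argument("--lr", type=float)
parser.add_argument("--deepspeed_config", type=str)
args = parser.parse_args()
```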
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The memory-mapped dataset files (.bin and .idx) remain stable and uncorrupted throughout training, with no external processes modifying them while they are mapped
If this fails: If files are modified, truncated, or corrupted during training, the memory mapping returns invalid data causing silent training corruption or segmentation faults without error detection
megatron/data/indexed_dataset.py:MMapIndexedDataset.__init__
All data-parallel ranks advance through training iterations in lockstep, with consumed_samples counter staying synchronized across ranks
If this fails: If ranks get out of sync due to failures or restarts, different ranks will sample overlapping or duplicate batches, leading to biased training and incorrect gradient aggregation
megatron/data/data_samplers.py:MegatronPretrainingSampler.__iter__
The filesystem has sufficient space for the complete checkpoint and supports atomic write operations for large files (potentially hundreds of GB)
If this fails: If the disk runs out of space during a checkpoint save, the checkpoint becomes corrupted while training continues, leading to unrecoverable state loss when the checkpoint is needed for recovery
megatron/checkpointing.py:save_checkpoint
The seq_length parameter matches the sequence length used during dataset preprocessing, and all sequences in the binary files are exactly seq_length tokens
If this fails: If preprocessing used different sequence length, the dataset returns incorrectly shaped tensors causing model dimension mismatches that fail silently or produce wrong results
megatron/data/gpt_dataset.py:build_train_valid_test_datasets
The parent directory structure exists and the megatron package is located exactly one directory up from tasks/main.py
If this fails: If directory structure changes or megatron is installed elsewhere, imports fail with cryptic ModuleNotFoundError, making the tasks module completely unusable
tasks/main.py:sys.path.append
Checkpoint version is set exactly once before any checkpoint operations, and all ranks agree on the same version number
If this fails: If ranks have different checkpoint versions or the version is changed mid-training, checkpoint save/load operations fail with assertion errors, causing training to crash
megatron/checkpointing.py:set_checkpoint_version
Vocabulary size stays below 65500 tokens for uint16 optimization, and vocab_size parameter accurately reflects the actual vocabulary used
If this fails: If vocabulary exceeds 65500 tokens but uint16 is used, token IDs get silently truncated causing wrong token mappings and corrupted model inputs
megatron/data/indexed_dataset.py:best_fitting_dtype
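The dtype selection this assumption hinges on reduces to a few lines; this is a paraphrased sketch of the best_fitting_dtype logic, not the file's exact code:

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    # uint16 holds IDs up to 65535; 65500 leaves headroom for special tokens.
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    return np.int32
```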
The total_samples parameter exactly matches the number of samples in the actual dataset, with no samples added or removed after sampler initialization
If this fails: If dataset size differs from total_samples, the sampler produces out-of-bounds indices causing IndexError during data loading, or skips valid samples
megatron/data/data_samplers.py:MegatronPretrainingSampler.__init__
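A hedged sketch of the rank-disjoint slicing idea (class and attribute names are illustrative): each data-parallel rank owns a fixed contiguous slice of every global batch, so batches never overlap and consumed_samples advances identically on all ranks.

```python
class RankSlicedSampler:
    """Each rank takes a fixed contiguous slice of every global batch."""

    def __init__(self, total_samples, consumed_samples,
                 micro_batch_size, dp_rank, dp_size):
        self.total = total_samples
        self.consumed = consumed_samples
        self.micro = micro_batch_size
        self.offset = dp_rank * micro_batch_size      # this rank's slice start
        self.global_batch = micro_batch_size * dp_size

    def __iter__(self):
        idx = self.consumed
        while idx + self.global_batch <= self.total:  # drop the ragged tail
            yield list(range(idx + self.offset, idx + self.offset + self.micro))
            idx += self.global_batch
```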
The vision tasks module expects to import a 'classification' module that exists in the same directory and has a main() function
If this fails: If classification.py is missing or main() function doesn't exist, the vision task fails with ImportError or AttributeError at runtime
tasks/vision/main.py:sys.path.append
The data_prefix[0] path exists and contains valid indexed dataset files (.bin and .idx) with consistent formatting
If this fails: If files are missing, corrupted, or have wrong format, dataset construction fails during training startup with unclear error messages about file access
megatron/data/gpt_dataset.py:_build_train_valid_test_datasets
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Indexed Training Dataset — Binary files containing preprocessed and tokenized training text, with separate index files for efficient random access during distributed sampling
- Distributed Model Parameters — Model weights distributed across tensor-parallel and pipeline-parallel ranks, with each rank holding a subset of the total parameters
- Checkpoint Snapshots — Periodic snapshots of complete distributed training state including model weights, optimizer states, and training progress metadata
- Global Configuration State — Shared training configuration and runtime state accessible across all components, including parsed arguments and distributed group handles
Feedback Loops
- Training Loop with Gradient Accumulation (training-loop, reinforcing) — Trigger: Start of each training iteration. Action: Process microbatches with forward/backward passes, accumulate gradients, then perform optimizer step when accumulation is complete. Exit: Reaches configured number of training iterations or manual termination.
- Learning Rate Decay Schedule (convergence, balancing) — Trigger: Each optimizer step. Action: Adjusts learning rate according to schedule (cosine, linear, or polynomial decay) based on current training progress. Exit: Training completes or reaches minimum learning rate.
- Loss Scaling for Mixed Precision (auto-scale, balancing) — Trigger: Detection of gradient overflow or underflow. Action: Dynamically adjusts loss scaling factor to prevent gradient underflow in fp16 training while avoiding overflow. Exit: Finds stable scaling factor or training terminates.
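The loss-scaling loop can be reduced to a small policy object: back off on overflow, grow again after a run of clean steps. The constants below are illustrative, not the repo's defaults:

```python
class DynamicLossScaler:
    """Grow the scale after a run of clean steps; back off on overflow."""

    def __init__(self, scale=2.0 ** 16, growth_interval=1000, backoff=0.5):
        self.scale = scale
        self.growth_interval = growth_interval
        self.backoff = backoff
        self.good_steps = 0

    def update(self, found_overflow: bool):
        if found_overflow:
            self.scale = max(self.scale * self.backoff, 1.0)  # halve, floor at 1
            self.good_steps = 0  # the step is skipped; gradients are discarded
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0
```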
Delays
- Pipeline Bubble Overhead (async-processing, ~(pipeline_parallel_size - 1) * microbatch_forward_time) — Pipeline stages wait on activation dependencies, creating idle time during the warmup and cooldown phases of each batch (see the worked example after this list)
- Gradient Synchronization (async-processing, duration depends on network bandwidth and model size) — All ranks must synchronize gradients before the optimizer step, creating synchronization barriers
- Checkpoint Save Operations (checkpoint-save, ~Minutes depending on model size) — Training pauses while model state is gathered from all ranks and written to disk
- Data Loading from Memory-Mapped Files (async-processing, ~Microseconds per sample) — Memory-mapped file access introduces small delays compared to in-memory data
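For the pipeline bubble, the standard back-of-envelope estimate for a GPipe-style schedule with p stages and m microbatches is bubble_fraction = (p - 1) / (m + p - 1); the numbers below are illustrative:

```python
p, m = 8, 32                             # illustrative: 8 stages, 32 microbatches
bubble_fraction = (p - 1) / (m + p - 1)  # 7/39 ≈ 0.18, i.e. ~18% idle time
```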
Control Points
- Model Parallelism Configuration (architecture-switch) — Controls: How model layers are distributed across GPUs via tensor_model_parallel_size and pipeline_model_parallel_size parameters. Default: Specified via command line
- Mixed Precision Training Mode (precision-mode) — Controls: Whether training uses fp16, bf16, or fp32 precision, affecting memory usage and training speed. Default: fp16 or bf16
- DeepSpeed ZeRO Optimization Stage (runtime-toggle) — Controls: Level of optimizer state and gradient partitioning (stage 1, 2, or 3) affecting memory distribution. Default: stage3
- Gradient Accumulation Steps (hyperparameter) — Controls: Number of microbatches processed before each optimizer step, trading memory for batch size. Default: Calculated as global_batch_size / (micro_batch_size * data_parallel_size); see the arithmetic after this list
- Sequence Length Configuration (architecture-switch) — Controls: Maximum input sequence length, directly impacting memory usage and computation time. Default: typically 1024, 2048, or 4096
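The Gradient Accumulation Steps default above is pure arithmetic; with illustrative values:

```python
global_batch_size, micro_batch_size, data_parallel_size = 2048, 4, 64
grad_acc_steps = global_batch_size // (micro_batch_size * data_parallel_size)
assert grad_acc_steps == 8  # 8 microbatches per rank between optimizer steps
```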
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, and distributed communication primitives for model training
- DeepSpeed — Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies
- NVIDIA Apex — Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations
- CUDA — GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers
- MPI/NCCL — Handles low-level inter-GPU and inter-node communication for distributed training coordination
- NumPy — Supports dataset preprocessing, indexing operations, and array manipulations in the data pipeline
Key Components
- initialize_megatron (orchestrator) — Coordinates the complete distributed training setup by parsing arguments, initializing model parallelism groups, setting up DeepSpeed integration, and preparing the global training state (megatron/initialize.py)
- GPTModelPipe (transformer) — Implements the pipeline-parallel GPT transformer model with layers distributed across pipeline stages, handling forward/backward pass coordination and gradient synchronization (megatron/model/gpt_model.py)
- MegatronPretrainingSampler (scheduler) — Manages distributed sampling of training data across data-parallel ranks, ensuring each rank gets unique non-overlapping batches while tracking consumed samples for checkpointing (megatron/data/data_samplers.py)
- build_train_valid_test_datasets (factory) — Constructs dataset objects from indexed binary files, handling data splitting across train/validation/test, blending multiple datasets with weights, and configuring sequence lengths (megatron/data/gpt_dataset.py)
- save_checkpoint (serializer) — Collects model and optimizer state from all distributed ranks, assembles checkpoint metadata including training progress, and persists complete training state to the filesystem (megatron/checkpointing.py)
- pretrain (orchestrator) — Executes the main training loop with forward/backward passes, gradient accumulation across microbatches, optimizer steps, and learning rate scheduling while coordinating distributed communication (megatron/training.py)
- DistributedOptimizer (optimizer) — Wraps PyTorch optimizers to handle gradient synchronization across tensor-parallel and pipeline-parallel ranks, ensuring consistent parameter updates in a distributed setting (megatron/optimizer/optimizer.py)
- indexed_dataset.MMapIndexedDataset (loader) — Provides memory-mapped access to large preprocessed datasets stored as binary files with separate index files, enabling efficient random access without loading the entire dataset into memory (megatron/data/indexed_dataset.py)
Frequently Asked Questions
What is Megatron-DeepSpeed used for?
Megatron-DeepSpeed trains transformer language models at scale using distributed computing with pipeline and tensor parallelism. bigscience-workshop/megatron-deepspeed is an 8-component ML training system written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 170 files.
How is Megatron-DeepSpeed architected?
Megatron-DeepSpeed is organized into 5 architecture layers: Training Entry Points, Model Architecture Layer, Distributed Training Infrastructure, Data Processing Pipeline, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through Megatron-DeepSpeed?
Data moves through 7 stages: Initialize Distributed Environment → Load and Prepare Training Data → Sample Training Batches → Execute Forward Pass Through Pipeline → Accumulate Gradients Across Microbatches → Update Model Parameters → Checkpoint Training State. Preprocessed binary files are memory-mapped and sampled in distributed fashion across data-parallel ranks, tokens flow through transformer layers split across pipeline stages, and the system alternates forward passes, backward passes, and optimizer steps with periodic checkpointing of the complete distributed state.
What technologies does Megatron-DeepSpeed use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and distributed communication primitives for model training), DeepSpeed (Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies), NVIDIA Apex (Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations), CUDA (GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers), MPI/NCCL (Handles low-level inter-GPU and inter-node communication for distributed training coordination), NumPy (Supports dataset preprocessing, indexing operations, and array manipulations in data pipeline). A focused set of dependencies that keeps the build manageable.
What system dynamics does Megatron-DeepSpeed have?
Megatron-DeepSpeed exhibits 4 data pools (including Indexed Training Dataset and Distributed Model Parameters), 3 feedback loops, 5 control points, and 4 delays. The feedback loops handle the training loop, learning rate convergence, and loss-scale auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does Megatron-DeepSpeed use?
5 design patterns detected: Model Parallelism Strategy, Gradient Accumulation with Microbatching, Memory-Mapped Dataset Access, Distributed Checkpoint State Management, Mixed Precision Training with Loss Scaling.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.