bigscience-workshop/megatron-deepspeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

1,437 stars · Python · 8 components

Trains transformer language models at scale using distributed computing with pipeline and tensor parallelism

Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

An 8-component ML training system. 170 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Training data enters as preprocessed binary files that are memory-mapped and sampled in distributed fashion across data-parallel ranks. Text tokens flow through transformer layers that are split across pipeline stages, with activations passed between stages and gradients accumulated across microbatches. The system alternates between forward passes (computing loss), backward passes (computing gradients), and optimizer steps (updating parameters), with periodic checkpointing of the complete distributed state.

  1. Initialize Distributed Environment — The initialize_megatron function parses command line arguments to get model architecture and training hyperparameters, initializes process groups for tensor parallelism and pipeline parallelism using torch.distributed, and sets up DeepSpeed integration with ZeRO optimizer state partitioning (config: tensor_model_parallel_size, pipeline_model_parallel_size, deepspeed_config)
  2. Load and Prepare Training Data — The build_train_valid_test_datasets function opens binary indexed dataset files using memory mapping, splits data according to train/validation/test ratios, and creates MegatronPretrainingSampler objects that ensure each data-parallel rank gets unique batches while tracking consumed samples [TrainingArgs → Dataset Objects] (config: data_path, split, seq_length)
  3. Sample Training Batches — MegatronPretrainingSampler selects batch indices ensuring no overlap across data-parallel ranks, then MMapIndexedDataset.__getitem__ retrieves token sequences from binary files and pads/truncates them to the configured sequence length to create uniform batches [Dataset Objects → TokenizedBatch] (config: micro_batch_size, seq_length)
  4. Execute Forward Pass Through Pipeline — TokenizedBatch enters the first pipeline stage of GPTModelPipe, activations flow forward through transformer layers distributed across pipeline stages with tensor-parallel computation within each stage, and the final stage computes loss against target tokens using cross-entropy [TokenizedBatch → Loss Values] (config: num_layers, hidden_size, num_attention_heads)
  5. Accumulate Gradients Across Microbatches — The training loop in pretrain function processes multiple microbatches before optimizer step, backward pass computes gradients that flow backward through pipeline stages, and gradients are accumulated across microbatches while being synchronized across tensor-parallel ranks [Loss Values → Accumulated Gradients] (config: gradient_accumulation_steps)
  6. Update Model Parameters — DistributedOptimizer synchronizes gradients across all parallelism dimensions, applies gradient clipping and learning rate schedule, performs optimizer step to update parameters, then broadcasts updated parameters to ensure consistency across distributed ranks [Accumulated Gradients → Updated Parameters] (config: lr, clip_grad, lr_decay_style)
  7. Checkpoint Training State — At configured intervals, save_checkpoint collects model state dict from all pipeline stages, gathers optimizer state from all ranks, assembles training metadata including iteration count and consumed samples, then saves complete checkpoint to filesystem for fault tolerance [Updated Parameters → CheckpointState] (config: save_interval, checkpoint_dir)
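The gradient-accumulation rhythm of steps 4–6 can be sketched with a toy scalar model. Everything here (the quadratic loss, the hand-derived gradient, the learning rate) is illustrative, standing in for GPTModelPipe, the pipeline schedule, and the DistributedOptimizer:

```python
# Toy sketch of steps 4-6: forward passes compute losses, gradients accumulate
# across microbatches, and a single optimizer step follows. The "model" here is
# one scalar weight with a squared-error loss, not a transformer.

def forward(w, x, y):
    """Step 4 stand-in: loss of the prediction w * x against target y."""
    return (w * x - y) ** 2

def backward(w, x, y):
    """Step 5 stand-in: analytic gradient of the loss with respect to w."""
    return 2 * (w * x - y) * x

def train_step(w, microbatches, lr=0.01):
    """Process every microbatch of one global batch before updating w,
    mirroring how the pretrain loop delays the optimizer step."""
    total_loss, grad = 0.0, 0.0
    for x, y in microbatches:
        total_loss += forward(w, x, y)
        grad += backward(w, x, y)         # accumulate, do not overwrite
    grad /= len(microbatches)             # average across microbatches
    return w - lr * grad, total_loss / len(microbatches)  # step 6: one update

w, avg_loss = train_step(0.0, [(1.0, 2.0)] * 4)
print(w, avg_loss)  # w nudged toward the target slope 2.0; avg_loss is 4.0
```

In the real system the accumulation happens per pipeline stage and the averaged gradients are additionally all-reduced across data-parallel ranks before the update.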

Data Models

The data structures that flow between stages — the contracts that hold the system together.

TokenizedBatch megatron/data/gpt_dataset.py
dict with 'text': LongTensor[batch_size, seq_len] containing token IDs from vocabulary, where seq_len is the configured sequence length and token IDs are integers in range [0, vocab_size)
Created by dataset __getitem__ from indexed text files, consumed by model forward pass for next-token prediction
MaskedLMBatch megatron/data/bert_dataset.py
dict with 'text': LongTensor[batch_size, seq_len], 'types': LongTensor[batch_size, seq_len], 'is_random': LongTensor[batch_size], 'loss_mask': FloatTensor[batch_size, seq_len], 'lm_labels': LongTensor[batch_size, seq_len], 'padding_mask': BoolTensor[batch_size, seq_len]
Generated by BertDataset with masked token prediction targets, consumed by BERT model for masked language modeling loss calculation
ModelParallelState megatron/mpu/initialize.py
global state containing tensor_model_parallel_group: ProcessGroup, pipeline_model_parallel_group: ProcessGroup, data_parallel_group: ProcessGroup, and rank mappings for each parallelism dimension
Initialized once at startup based on world size and parallelism configuration, accessed throughout training for communication group routing
CheckpointState megatron/checkpointing.py
dict with 'model': state_dict, 'optimizer': optimizer_state, 'lr_scheduler': scheduler_state, 'iteration': int, 'consumed_train_samples': int, 'consumed_valid_samples': int, 'args': Namespace, 'checkpoint_version': float
Assembled during training checkpointing from distributed model/optimizer states, persisted to disk, loaded during training resume or evaluation
TrainingArgs megatron/arguments.py
argparse.Namespace with fields like num_layers: int, hidden_size: int, num_attention_heads: int, seq_length: int, tensor_model_parallel_size: int, pipeline_model_parallel_size: int, micro_batch_size: int, global_batch_size: int, lr: float, deepspeed_config: str
Parsed from command line arguments at startup, stored globally and accessed by all components to configure model architecture, training hyperparameters, and distributed settings
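The TokenizedBatch contract above can be made concrete with a small validator. Plain lists stand in for torch.LongTensor, and the function name is hypothetical, not part of Megatron:

```python
# Hedged sketch of the TokenizedBatch contract: a dict whose 'text' entry is a
# [batch_size, seq_len] integer array with every ID in [0, vocab_size).

def validate_tokenized_batch(batch, seq_length, vocab_size):
    text = batch["text"]
    if any(len(row) != seq_length for row in text):
        raise ValueError("sequence length mismatch")
    if any(not (0 <= t < vocab_size) for row in text for t in row):
        raise ValueError("token ID outside [0, vocab_size)")
    return True

# batch_size=2, seq_len=3, all IDs within a 10-token vocabulary:
ok = validate_tokenized_batch({"text": [[0, 5, 9], [3, 1, 8]]},
                              seq_length=3, vocab_size=10)
print(ok)  # True
```

A ragged row or an out-of-range ID would raise ValueError, the kind of check that surfaces the shape and vocabulary mismatches discussed under Hidden Assumptions below.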

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Resource unguarded

The memory-mapped dataset files (.bin and .idx) remain stable and uncorrupted throughout training, with no external processes modifying them while they are mapped

If this fails: If files are modified, truncated, or corrupted during training, the memory mapping returns invalid data causing silent training corruption or segmentation faults without error detection

megatron/data/indexed_dataset.py:MMapIndexedDataset.__init__
critical Ordering unguarded

All data-parallel ranks advance through training iterations in lockstep, with consumed_samples counter staying synchronized across ranks

If this fails: If ranks get out of sync due to failures or restarts, different ranks will sample overlapping or duplicate batches, leading to biased training and incorrect gradient aggregation

megatron/data/data_samplers.py:MegatronPretrainingSampler.__iter__
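The lockstep property this assumption depends on can be sketched as rank-strided slicing of each global batch. The function is illustrative; the real MegatronPretrainingSampler also tracks consumed_samples internally for resume:

```python
# Each data-parallel rank takes a disjoint, contiguous slice of the global
# batch, offset by its rank. The slices stay disjoint only while every rank's
# consumed_samples counter advances identically.

def batch_indices(consumed_samples, micro_batch_size, rank):
    """This rank's sample indices for the next global batch."""
    start = consumed_samples + rank * micro_batch_size
    return list(range(start, start + micro_batch_size))

consumed = 0
r0 = batch_indices(consumed, 3, rank=0)
r1 = batch_indices(consumed, 3, rank=1)
print(r0, r1)  # [0, 1, 2] [3, 4, 5] -- disjoint slices of one global batch
consumed += 3 * 2  # every rank must advance by micro_batch_size * world_size
```

If one rank's counter drifts (say, after a partial restart), its slice overlaps a neighbor's, which is exactly the duplicate-sampling failure described above.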
critical Contract unguarded

The filesystem has sufficient space for the complete checkpoint and supports atomic write operations for large files (potentially hundreds of GB)

If this fails: If disk runs out of space during checkpoint save, the checkpoint becomes corrupted but training continues, leading to unrecoverable state loss when checkpoint is needed for recovery

megatron/checkpointing.py:save_checkpoint
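One common guard against the half-written-checkpoint failure above is write-to-temp-then-rename. This is a generic sketch, not what megatron/checkpointing.py actually does, and JSON stands in for the real tensor serialization:

```python
# Write the checkpoint to a temporary file on the same filesystem, fsync it,
# then rename into place. A crash or full disk mid-write leaves the old
# checkpoint untouched instead of a corrupt file at the final path.

import json, os, tempfile

def atomic_save(state, path):
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)   # same filesystem as `path`
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())                # force bytes to disk
        os.replace(tmp, path)                   # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)
        raise

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "ckpt.json")
atomic_save({"iteration": 100, "consumed_train_samples": 4096}, path)
with open(path) as f:
    print(json.load(f)["iteration"])  # 100
```

Even with atomic renames, multi-rank checkpoints need all ranks to finish before the checkpoint is usable, so real systems typically also write a completion marker last.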
critical Scale unguarded

The seq_length parameter matches the sequence length used during dataset preprocessing, and all sequences in the binary files are exactly seq_length tokens

If this fails: If preprocessing used different sequence length, the dataset returns incorrectly shaped tensors causing model dimension mismatches that fail silently or produce wrong results

megatron/data/gpt_dataset.py:build_train_valid_test_datasets
critical Environment unguarded

The parent directory structure exists and the megatron package is located exactly one directory up from tasks/main.py

If this fails: If directory structure changes or megatron is installed elsewhere, imports fail with cryptic ModuleNotFoundError, making the tasks module completely unusable

tasks/main.py:sys.path.append
warning Temporal weakly guarded

Checkpoint version is set exactly once before any checkpoint operations, and all ranks agree on the same version number

If this fails: If ranks have different checkpoint versions or version is changed mid-training, checkpoint save/load operations fail with assertion errors causing training to crash

megatron/checkpointing.py:set_checkpoint_version
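The set-once discipline described above can be sketched as a guarded module-level global. The assertion style mirrors megatron/checkpointing.py, but the exact behavior there may differ:

```python
# A version value that may be assigned at most once; a second, conflicting
# assignment trips an assertion instead of silently changing mid-training.

_CHECKPOINT_VERSION = None

def set_checkpoint_version(value):
    global _CHECKPOINT_VERSION
    assert _CHECKPOINT_VERSION is None, "checkpoint version already set"
    _CHECKPOINT_VERSION = value

def get_checkpoint_version():
    return _CHECKPOINT_VERSION

set_checkpoint_version(3.0)
print(get_checkpoint_version())  # 3.0
```

A second call to set_checkpoint_version raises AssertionError; note this guard is per-process, so rank agreement on the value still has to come from loading the same checkpoint or arguments.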
warning Scale weakly guarded

Vocabulary size stays below 65500 tokens for uint16 optimization, and vocab_size parameter accurately reflects the actual vocabulary used

If this fails: If vocabulary exceeds 65500 tokens but uint16 is used, token IDs get silently truncated causing wrong token mappings and corrupted model inputs

megatron/data/indexed_dataset.py:best_fitting_dtype
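The dtype choice this assumption rides on can be sketched without numpy. The threshold mirrors the description above; the real best_fitting_dtype returns numpy dtypes rather than strings:

```python
# Token IDs are stored as uint16 only while the vocabulary fits below 65500;
# anything larger needs a 32-bit type, or IDs would wrap around silently.

def best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return "uint16"   # 2 bytes per token halves the on-disk dataset size
    return "int32"

print(best_fitting_dtype(50257))  # uint16 (a GPT-2-sized vocabulary fits)
print(best_fitting_dtype(70000))  # int32  (would overflow uint16)
```

The failure mode above arises when preprocessing picked uint16 for one vocabulary and training later assumes a larger one, so checking the stored dtype against vocab_size at load time closes the gap.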
warning Resource unguarded

The total_samples parameter exactly matches the number of samples in the actual dataset, with no samples added or removed after sampler initialization

If this fails: If dataset size differs from total_samples, the sampler produces out-of-bounds indices causing IndexError during data loading, or skips valid samples

megatron/data/data_samplers.py:MegatronPretrainingSampler.__init__
warning Domain unguarded

The vision tasks module expects to import a 'classification' module that exists in the same directory and has a main() function

If this fails: If classification.py is missing or main() function doesn't exist, the vision task fails with ImportError or AttributeError at runtime

tasks/vision/main.py:sys.path.append
warning Contract unguarded

The data_prefix[0] path exists and contains valid indexed dataset files (.bin and .idx) with consistent formatting

If this fails: If files are missing, corrupted, or have wrong format, dataset construction fails during training startup with unclear error messages about file access

megatron/data/gpt_dataset.py:_build_train_valid_test_datasets
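A hedged guard for this last assumption: verify that both halves of an indexed dataset exist before handing the prefix to dataset construction. The helper name is hypothetical; Megatron itself fails later, with less clear errors:

```python
# An indexed dataset is a pair of files sharing one prefix: <prefix>.bin
# (token data) and <prefix>.idx (offsets). Check both up front.

import os, tempfile

def check_indexed_dataset(prefix):
    missing = [p for p in (prefix + ".bin", prefix + ".idx")
               if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"indexed dataset incomplete: missing {missing}")
    return True

# Demo: create an empty but complete .bin/.idx pair and validate it.
workdir = tempfile.mkdtemp()
prefix = os.path.join(workdir, "corpus")
open(prefix + ".bin", "w").close()
open(prefix + ".idx", "w").close()
print(check_indexed_dataset(prefix))  # True
```

Running such a check on every entry of data_path at startup turns a mid-training crash into an immediate, well-labeled configuration error.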

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Indexed Training Dataset (file-store)
Binary files containing preprocessed and tokenized training text, with separate index files for efficient random access during distributed sampling
Distributed Model Parameters (in-memory)
Model weights distributed across tensor-parallel and pipeline-parallel ranks, with each rank holding a subset of the total parameters
Checkpoint Storage (file-store)
Periodic snapshots of complete distributed training state including model weights, optimizer states, and training progress metadata
Global Training State (in-memory)
Shared training configuration and runtime state accessible across all components, including parsed arguments and distributed group handles

Feedback Loops

Delays

Control Points

Technology Stack

PyTorch (framework)
Provides tensor operations, automatic differentiation, and distributed communication primitives for model training
DeepSpeed (library)
Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies
NVIDIA Apex (library)
Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations
CUDA (runtime)
GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers
MPI/NCCL (infra)
Handles low-level inter-GPU and inter-node communication for distributed training coordination
NumPy (library)
Supports dataset preprocessing, indexing operations, and array manipulations in data pipeline


Frequently Asked Questions

What is Megatron-DeepSpeed used for?

bigscience-workshop/megatron-deepspeed trains transformer language models at scale using distributed computing with pipeline and tensor parallelism. It is an 8-component ML training codebase written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 170 files.

How is Megatron-DeepSpeed architected?

Megatron-DeepSpeed is organized into 5 architecture layers: Training Entry Points, Model Architecture Layer, Distributed Training Infrastructure, Data Processing Pipeline, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through Megatron-DeepSpeed?

Data moves through 7 stages: Initialize Distributed Environment → Load and Prepare Training Data → Sample Training Batches → Execute Forward Pass Through Pipeline → Accumulate Gradients Across Microbatches → .... Training data enters as preprocessed binary files that are memory-mapped and sampled in distributed fashion across data-parallel ranks. Text tokens flow through transformer layers that are split across pipeline stages, with activations passed between stages and gradients accumulated across microbatches. The system alternates between forward passes (computing loss), backward passes (computing gradients), and optimizer steps (updating parameters), with periodic checkpointing of the complete distributed state. This pipeline design reflects a complex multi-stage processing system.

What technologies does Megatron-DeepSpeed use?

The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and distributed communication primitives for model training), DeepSpeed (Enables ZeRO optimizer state partitioning, mixed precision training, and advanced distributed optimization strategies), NVIDIA Apex (Supplies optimized CUDA kernels for mixed precision training and fused operations to accelerate transformer computations), CUDA (GPU computing platform for accelerated tensor operations and custom kernel execution in transformer layers), MPI/NCCL (Handles low-level inter-GPU and inter-node communication for distributed training coordination), NumPy (Supports dataset preprocessing, indexing operations, and array manipulations in data pipeline). A focused set of dependencies that keeps the build manageable.

What system dynamics does Megatron-DeepSpeed have?

Megatron-DeepSpeed exhibits 4 data pools (Indexed Training Dataset, Distributed Model Parameters), 3 feedback loops, 5 control points, 4 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does Megatron-DeepSpeed use?

5 design patterns detected: Model Parallelism Strategy, Gradient Accumulation with Microbatching, Memory-Mapped Dataset Access, Distributed Checkpoint State Management, Mixed Precision Training with Loss Scaling.

Analyzed on April 20, 2026 by CodeSea.