nvidia/megatron-lm
Ongoing research training transformer models at scale
Trains massive transformer models across hundreds of GPUs using optimized parallelism strategies
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 932 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training begins by loading raw text data through BlendedMegatronDatasetBuilder which tokenizes and creates batches. These batches flow through the distributed GPT model where TransformerLayers apply attention and MLP operations in parallel across GPUs. The forward pass produces logits that are compared against labels to compute cross-entropy loss. Gradients are backpropagated and synchronized across data parallel ranks, then the optimizer updates model parameters. Checkpoints are periodically saved using distributed checkpointing to resume training.
- Load and tokenize training data — BlendedMegatronDatasetBuilder reads text files, applies tokenization using the specified tokenizer (GPT2, SentencePiece, etc.), and creates batches with proper padding and attention masks (config: data.data_path, data.seq_length, training.micro_batch_size)
- Distribute batch across model parallel ranks — The input batch is split across tensor parallel groups, with each GPU receiving a subset of the attention heads and MLP weights according to tensor_model_parallel_size [Batch → Batch] (config: model.tensor_model_parallel_size, model.sequence_parallel)
- Forward pass through transformer layers — Each TransformerLayer applies multi-head attention and MLP operations in parallel, with activations flowing through pipeline parallel stages if enabled [Batch → Batch] (config: model.num_layers, model.hidden_size, model.num_attention_heads, +1 more)
- Compute language modeling loss — The final layer outputs logits which are compared against shifted input tokens using cross-entropy loss, with loss values averaged across data parallel groups [Batch] (config: training.loss_scale, model.vocab_size)
- Backward pass and gradient synchronization — Gradients are computed via automatic differentiation and synchronized across data parallel ranks using all-reduce operations, with gradient clipping applied (config: training.clip_grad, training.data_parallel_size)
- Parameter updates and learning rate scheduling — OptimizerParamScheduler updates learning rates based on step count and schedule, then the optimizer (Adam, SGD, etc.) updates model parameters using computed gradients (config: training.lr, training.lr_decay_style, training.min_lr, +1 more)
- Checkpoint saving and validation — Every save_interval steps, DistCheckpointing saves model state, optimizer state, and training metadata across multiple files sharded by tensor parallelism [CheckpointableState → CheckpointableState] (config: training.save_interval, training.save)
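The seven stages above map onto a single training step. The sketch below is a minimal, hedged illustration in plain PyTorch rather than Megatron's actual pretrain() loop; `model`, `optimizer`, `scheduler`, and `get_batch` are placeholders, and data parallel gradient synchronization is assumed to happen inside backward() as it would under DistributedDataParallel.

```python
# Minimal sketch of one training step covering the stages above.
# Not Megatron-LM's actual implementation; all objects are placeholders.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, scheduler, get_batch, clip_grad=1.0):
    batch = get_batch()                                       # stage 1: tokenized batch
    logits = model(batch["text"],                             # stages 2-3: parallel forward pass
                   batch["position_ids"],
                   batch["attention_mask"])
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),  # stage 4: language modeling loss
                           batch["labels"].view(-1))
    loss.backward()                                           # stage 5: backward pass; DDP syncs grads here
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    optimizer.step()                                          # stage 6: parameter update
    scheduler.step()                                          # stage 6: learning rate schedule
    optimizer.zero_grad()
    return loss.item()
```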
Data Models
The data structures that flow between stages — the contracts that hold the system together.
megatron/core/datasets/ — dict with 'text': Tensor[B, seq_len] (token IDs), 'attention_mask': Tensor[B, seq_len] (padding mask), 'position_ids': Tensor[B, seq_len] (positional encodings), 'labels': Tensor[B, seq_len] (target tokens for loss)
Created by dataset loaders from raw text, tokenized and padded to fixed sequence length, consumed by model forward pass for gradient computation
megatron/core/transformer/transformer_config.py — dataclass with num_layers: int, hidden_size: int, num_attention_heads: int, vocab_size: int, sequence_parallel: bool, tensor_model_parallel_size: int, plus precision and optimization settings
Built from command line arguments and config files, used to construct transformer layers with proper parallelism configuration
megatron/core/parallel_state.py — global state tracking tensor_model_parallel_rank: int, pipeline_model_parallel_rank: int, data_parallel_rank: int, and process groups for each parallelism dimension
Initialized once during startup based on world size and parallelism configuration, accessed throughout training for collective operations
megatron/core/inference/inference_request.py — dataclass with prompt_tokens: List[int], sampling_params: SamplingParams (temperature, top_k, max_tokens), request_id: str, arrival_time: float
Created from user prompts, queued in inference engine, batched with other requests, processed to generate completions
megatron/core/dist_checkpointing/ — nested dict with 'model': model parameters, 'optimizer': optimizer state, 'lr_scheduler': learning rate state, 'args': training arguments, 'iteration': current step
Periodically saved during training with distributed sharding, loaded when resuming or converting to other formats
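For concreteness, here is a hedged sketch of the training batch contract described above. The shapes follow the field descriptions; the actual collation code in megatron/core/datasets differs in detail, and the values are placeholders.

```python
# Illustrative construction of the batch dict described above (placeholder values).
import torch

B, seq_len, vocab_size = 4, 1024, 50257
tokens = torch.randint(0, vocab_size, (B, seq_len))

batch = {
    "text": tokens,                                              # input token IDs
    "labels": torch.roll(tokens, shifts=-1, dims=1),             # next-token targets for the loss
    "attention_mask": torch.ones(B, seq_len, dtype=torch.bool),  # all positions valid (no padding here)
    "position_ids": torch.arange(seq_len).unsqueeze(0).expand(B, -1),
}
```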
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Function assumes all ranks have synchronized clocks when broadcasting current time across ranks; if ranks have clock skew >100ms, time.time_ns() values will differ significantly before broadcast
If this fails: Request timing, latency measurements, and timeout calculations become inconsistent across the cluster, leading to premature timeouts or incorrect performance metrics
examples/inference/gpt/utils.py:get_curr_time
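A hedged sketch of what such a time broadcast looks like; this is not the exact code in examples/inference/gpt/utils.py, but it shows why every rank inherits rank 0's clock reading, skew included.

```python
# Sketch: broadcast rank 0's wall-clock reading so all ranks agree on "now".
# If rank 0's clock is skewed, every rank inherits that skew.
# (The example script uses a CUDA tensor here; see the related assumption below.)
import time
import torch
import torch.distributed as dist

def get_curr_time() -> float:
    t = torch.tensor([time.time_ns()], dtype=torch.int64)
    if dist.is_initialized():
        dist.broadcast(t, src=0)    # all ranks adopt rank 0's value
    return t.item() / 10**9         # nanoseconds -> seconds
```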
Function assumes all in-flight requests will complete within a reasonable time when calling await asyncio.gather(*futures) - no timeout is set for the gather operation
If this fails: If any request hangs or takes extremely long, the suspend/resume cycle blocks indefinitely, preventing training cycles and making the system unresponsive
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
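One hedged way to bound that wait is to wrap the gather in asyncio.wait_for. The function below is a sketch rather than the project's code, and the timeout value and cancellation policy are assumptions.

```python
# Sketch: cap how long the suspend path waits for in-flight requests.
import asyncio

async def drain_in_flight(futures, timeout_s: float = 60.0):
    try:
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout_s)
    except asyncio.TimeoutError:
        for f in futures:
            f.cancel()              # policy choice: cancel stragglers and surface the timeout
        raise
```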
Function assumes the engine state transitions PAUSED -> SUSPENDED happen atomically and that client.pause_engines() always succeeds before wait_until(EngineState.PAUSED)
If this fails: If pause_engines() fails or engine doesn't reach PAUSED state, wait_until() hangs forever and subsequent suspend_engines() operates on wrong engine state, corrupting the training/inference cycle
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
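A defensive version of that transition might look like the sketch below. The `client` methods and `EngineState` values mirror the names used in the script, but the signatures, return behavior, and timeout handling here are hypothetical.

```python
# Sketch: fail loudly instead of hanging if the engines never reach PAUSED.
import asyncio
from enum import Enum, auto

class EngineState(Enum):   # placeholder mirroring the states named above
    PAUSED = auto()
    SUSPENDED = auto()

async def pause_then_suspend(client, timeout_s: float = 30.0):
    await client.pause_engines()    # assumed to raise if pausing fails
    try:
        await asyncio.wait_for(client.wait_until(EngineState.PAUSED), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("engines never reached PAUSED; refusing to suspend")
    await client.suspend_engines()
```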
Function assumes torch.distributed.is_initialized() accurately reflects whether the process is part of a distributed setup, but doesn't verify CUDA devices are available when using torch.cuda.LongTensor
If this fails: On CPU-only nodes or when CUDA is unavailable, torch.cuda.LongTensor() raises RuntimeError even if distributed training is properly initialized, breaking time synchronization
examples/inference/gpt/utils.py:get_curr_time
Function assumes termination_id parameter corresponds to a valid token ID in the model's vocabulary when not None, but doesn't validate this against any tokenizer
If this fails: Invalid termination_id causes generation to never terminate properly, leading to maximum token generation and wasted compute resources for every request using these params
examples/inference/gpt/utils.py:get_default_sampling_params
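A hedged guard for this case is sketched below; `tokenizer` and its vocab_size attribute are placeholders, since the real sampling-params helper takes no tokenizer at all.

```python
# Sketch: reject termination IDs that fall outside the tokenizer's vocabulary.
def validate_termination_id(termination_id, tokenizer):
    if termination_id is None:
        return None
    vocab_size = getattr(tokenizer, "vocab_size", None)
    if vocab_size is not None and not 0 <= termination_id < vocab_size:
        raise ValueError(
            f"termination_id {termination_id} is outside the vocabulary "
            f"(size {vocab_size}); generation would never terminate cleanly")
    return termination_id
```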
The suspend_resume_cycle assumes engine state transitions are immediate - no delays are accounted for between pause_engines(), wait_until(PAUSED), and suspend_engines() calls
If this fails: Race conditions where training attempts to start before engine is fully suspended, or inference requests arrive during state transitions, leading to mixed training/inference workloads and corrupted model state
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py
Static inference assumes sufficient GPU memory is available for the entire batch of requests without any memory planning or validation
If this fails: Out-of-memory errors occur silently during forward pass when batch size * sequence length exceeds available GPU memory, causing inference jobs to crash without helpful error messages
examples/inference/gpt/gpt_static_inference.py
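A rough, hedged pre-flight check is sketched below; the overhead factor is illustrative rather than measured, and the real memory footprint depends on precision, KV cache layout, and kernel choices.

```python
# Sketch: crude activation-memory estimate before launching a static inference batch.
import torch

def estimate_activation_bytes(batch_size, seq_len, hidden_size, num_layers,
                              bytes_per_elem=2, overhead=4.0):
    # upper bound ~ B * S * H per layer, scaled by an empirical overhead factor
    return int(batch_size * seq_len * hidden_size * num_layers * bytes_per_elem * overhead)

def check_fits_on_gpu(batch_size, seq_len, hidden_size, num_layers):
    if not torch.cuda.is_available():
        return True
    free_bytes, _ = torch.cuda.mem_get_info()
    needed = estimate_activation_bytes(batch_size, seq_len, hidden_size, num_layers)
    if needed > free_bytes:
        raise RuntimeError(f"batch {batch_size}x{seq_len} needs ~{needed / 2**30:.1f} GiB "
                           f"but only {free_bytes / 2**30:.1f} GiB of GPU memory is free")
    return True
```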
Default num_tokens_to_generate=30 is hardcoded without considering model size, available memory, or request context length
If this fails: Large models or long input sequences may exceed memory limits when generating 30 tokens, while short responses waste computation by generating unnecessary padding tokens
examples/inference/gpt/utils.py:get_default_sampling_params
Time conversion from nanoseconds to seconds using / 10**9 assumes high precision is needed, but doesn't account for floating point precision limits in downstream time calculations
If this fails: Sub-microsecond timing differences get lost in floating point operations, making fine-grained performance profiling less accurate than expected
examples/inference/gpt/utils.py:get_curr_time
The import of Request, build_dynamic_engine_setup_prefix, build_requests assumes these utilities exist and have stable interfaces, but no version checking or graceful fallbacks exist
If this fails: Breaking changes in utility functions cause import errors that crash the entire inference service, with no indication of which specific utility caused the failure
examples/inference/gpt/gpt_dynamic_inference.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Parameters — Distributed transformer weights and biases sharded across tensor parallel ranks, with each GPU storing a subset of attention heads and MLP weights
- Optimizer State — Adam momentum and variance estimates for each parameter, distributed according to model parallelism to minimize memory overhead
- Tokenized Batch Cache — Preprocessed and tokenized training batches cached to avoid repeated tokenization during multi-epoch training
- Inference KV Cache — Cached key-value pairs from attention layers during inference to avoid recomputation for autoregressive generation
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Training script execution with specified number of iterations. Action: Loads batch, runs forward/backward pass, updates parameters, saves checkpoints. Exit: Reaches max_train_iters or manual termination.
- Gradient Accumulation (gradient-accumulation, balancing) — Trigger: When global_batch_size > micro_batch_size * data_parallel_size. Action: Accumulates gradients across multiple micro-batches before parameter update. Exit: After gradient_accumulation_fusion_steps micro-batches processed.
- Dynamic Batching Loop (auto-scale, balancing) — Trigger: New inference requests arrive at DynamicInferenceEngine. Action: Batches variable-length requests together, adjusts batch size based on memory constraints. Exit: When engine is suspended or no pending requests.
- Learning Rate Schedule (convergence, balancing) — Trigger: Each training step based on current iteration count. Action: Adjusts learning rate according to warmup, decay, and minimum LR settings. Exit: When training completes or manual override.
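The gradient-accumulation loop is the easiest of these to show in isolation. The sketch below is plain PyTorch rather than Megatron's fused implementation; `compute_loss` is a placeholder for the language-modeling loss described earlier.

```python
# Sketch: accumulate gradients over several micro-batches, then update once.
def accumulate_and_step(model, optimizer, micro_batches, compute_loss, accumulation_steps):
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        loss = compute_loss(model, batch) / accumulation_steps  # scale so summed grads match one large batch
        loss.backward()                                          # gradients add up in param.grad
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()                                     # one parameter update per accumulation window
            optimizer.zero_grad()
```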
Delays
- Checkpoint Saving (checkpoint-save, ~Minutes for large models) — Training pauses while model state is serialized and written to disk across multiple files
- Data Loading (batch-window, ~Milliseconds per batch) — GPUs wait for next batch to be tokenized and moved to device memory
- Pipeline Bubble (async-processing, ~Proportional to pipeline_model_parallel_size) — Some pipeline stages idle while others process their portion of the forward/backward pass
- Distributed Synchronization (eventual-consistency, ~Microseconds to milliseconds) — All-reduce operations block until all ranks contribute their gradients
Control Points
- tensor_model_parallel_size (architecture-switch) — Controls: How model weights are sharded across GPUs - affects memory usage and communication patterns. Default: 1
- sequence_parallel (feature-flag) — Controls: Whether to split sequence dimension across tensor parallel ranks to reduce memory usage. Default: False
- precision_mode (precision-mode) — Controls: Training precision (fp32, fp16, bf16) affecting memory usage and numerical stability. Default: bf16
- recompute_granularity (runtime-toggle) — Controls: Whether to recompute activations during backward pass to trade compute for memory
- use_distributed_optimizer (feature-flag) — Controls: Whether to shard optimizer states across data parallel ranks. Default: False
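Most of these control points surface as fields on the TransformerConfig dataclass described in the Data Models section. The sketch below is illustrative only; exact field names and availability vary between Megatron-LM releases, and the distributed-optimizer flag is typically configured on the optimizer setup rather than here.

```python
# Hedged example of expressing several control points via TransformerConfig.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    tensor_model_parallel_size=4,        # shard attention heads and MLP weights across 4 GPUs
    sequence_parallel=True,              # split the sequence dimension across TP ranks
    bf16=True,                           # precision mode
    recompute_granularity="selective",   # recompute some activations to trade compute for memory
)
```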
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training
- NCCL — Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups
- Transformer Engine — Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance
- Flash Attention — Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences
- Apex — Mixed precision training utilities and optimized Adam implementation for reduced memory usage
- Weights & Biases — Tracks training metrics, hyperparameters, and system performance across distributed training runs
Key Components
- parallel_state (orchestrator) — Manages distributed training topology by tracking which GPU handles which model shards and coordinating communication between ranks (megatron/core/parallel_state.py)
- MegatronDataset (loader) — Loads and preprocesses training data from various formats (JSON, parquet, raw text) into tokenized batches with proper padding and attention masks (megatron/core/datasets/)
- TransformerLayer (processor) — Implements the core transformer computation (attention + MLP) with tensor parallelism, sequence parallelism, and optimized CUDA kernels (megatron/core/transformer/transformer_layer.py)
- OptimizerParamScheduler (scheduler) — Manages learning rate scheduling, gradient clipping, and optimizer state across distributed training with warmup and decay phases (megatron/training/training.py)
- DistCheckpointing (store) — Saves and loads model checkpoints across multiple GPUs by sharding state according to tensor parallelism and pipeline parallelism configuration (megatron/core/dist_checkpointing/)
- DynamicInferenceEngine (executor) — Manages inference workload by batching variable-length requests, optimizing memory allocation, and coordinating generation across model parallel ranks (megatron/core/inference/engines/dynamic_engine.py)
- ModelParallelGPT (processor) — Implements the complete GPT model architecture with embedding, transformer layers, and language modeling head distributed across tensor parallel ranks (megatron/legacy/model/gpt_model.py)
- BlendedMegatronDatasetBuilder (factory) — Constructs datasets from multiple data sources with configurable blending weights, handles sampling strategies and data validation (megatron/core/datasets/blended_megatron_dataset_builder.py)
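To see how these components meet at startup, a hedged initialization sketch follows; the calls are the ones commonly used from megatron.core, but exact signatures may differ across releases, and the parallelism sizes are placeholders.

```python
# Sketch: initialize the distributed topology that parallel_state tracks.
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=4,     # how weights are sharded within each layer
    pipeline_model_parallel_size=2,   # how layers are split into pipeline stages
)
tp_rank = parallel_state.get_tensor_model_parallel_rank()
dp_rank = parallel_state.get_data_parallel_rank()
```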
Frequently Asked Questions
What is Megatron-LM used for?
Megatron-LM trains massive transformer models across hundreds of GPUs using optimized parallelism strategies. nvidia/megatron-lm is an 8-component ML training system written in Python. Data flows through 7 distinct pipeline stages. The codebase contains 932 files.
How is Megatron-LM architected?
Megatron-LM is organized into 3 architecture layers: Core Infrastructure, Model Architectures, Applications & Examples. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through Megatron-LM?
Data moves through 7 stages: Load and tokenize training data → Distribute batch across model parallel ranks → Forward pass through transformer layers → Compute language modeling loss → Backward pass and gradient synchronization → .... Training begins by loading raw text data through BlendedMegatronDatasetBuilder which tokenizes and creates batches. These batches flow through the distributed GPT model where TransformerLayers apply attention and MLP operations in parallel across GPUs. The forward pass produces logits that are compared against labels to compute cross-entropy loss. Gradients are backpropagated and synchronized across data parallel ranks, then the optimizer updates model parameters. Checkpoints are periodically saved using distributed checkpointing to resume training. This pipeline design reflects a complex multi-stage processing system.
What technologies does Megatron-LM use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training), NCCL (Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups), Transformer Engine (Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance), Flash Attention (Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences), Apex (Mixed precision training utilities and optimized Adam implementation for reduced memory usage), Weights & Biases (Tracks training metrics, hyperparameters, and system performance across distributed training runs). A focused set of dependencies that keeps the build manageable.
What system dynamics does Megatron-LM have?
Megatron-LM exhibits 4 data pools (including Model Parameters and Optimizer State), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle training-loop and gradient-accumulation dynamics. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does Megatron-LM use?
5 design patterns detected: Tensor Parallelism, Pipeline Parallelism, Distributed Checkpointing, Mixed Precision Training, Dynamic Batching.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.