nvidia/megatron-lm

Ongoing research training transformer models at scale

16,102 stars · Python · 8 components

Trains massive transformer models across hundreds of GPUs using optimized parallelism strategies

Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

An 8-component ML training system. 932 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Training begins by loading raw text data through BlendedMegatronDatasetBuilder which tokenizes and creates batches. These batches flow through the distributed GPT model where TransformerLayers apply attention and MLP operations in parallel across GPUs. The forward pass produces logits that are compared against labels to compute cross-entropy loss. Gradients are backpropagated and synchronized across data parallel ranks, then the optimizer updates model parameters. Checkpoints are periodically saved using distributed checkpointing to resume training.

  1. Load and tokenize training data — BlendedMegatronDatasetBuilder reads text files, applies tokenization using the specified tokenizer (GPT2, SentencePiece, etc.), and creates batches with proper padding and attention masks (config: data.data_path, data.seq_length, training.micro_batch_size)
  2. Distribute batch across model parallel ranks — The input batch is split across tensor parallel groups, with each GPU receiving a subset of the attention heads and MLP weights according to tensor_model_parallel_size [Batch → Batch] (config: model.tensor_model_parallel_size, model.sequence_parallel)
  3. Forward pass through transformer layers — Each TransformerLayer applies multi-head attention and MLP operations in parallel, with activations flowing through pipeline parallel stages if enabled [Batch → Batch] (config: model.num_layers, model.hidden_size, model.num_attention_heads, +1 more)
  4. Compute language modeling loss — The final layer outputs logits which are compared against shifted input tokens using cross-entropy loss, with loss values averaged across data parallel groups [Batch] (config: training.loss_scale, model.vocab_size)
  5. Backward pass and gradient synchronization — Gradients are computed via automatic differentiation and synchronized across data parallel ranks using all-reduce operations, with gradient clipping applied (config: training.clip_grad, training.data_parallel_size)
  6. Parameter updates and learning rate scheduling — OptimizerParamScheduler updates learning rates based on step count and schedule, then the optimizer (Adam, SGD, etc.) updates model parameters using computed gradients (config: training.lr, training.lr_decay_style, training.min_lr, +1 more)
  7. Checkpoint saving and validation — Every save_interval steps, DistCheckpointing saves model state, optimizer state, and training metadata across multiple files sharded by tensor parallelism [CheckpointableState → CheckpointableState] (config: training.save_interval, training.save)
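
A minimal PyTorch sketch of one training step covering stages 3–6 above. The model(batch["text"]) signature is hypothetical; Megatron-LM's actual loop adds pipeline schedules, mixed precision, gradient accumulation, and overlapped communication, so treat this as illustrative only.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def train_step(model, optimizer, scheduler, batch, clip_grad=1.0):
    # Stage 3: forward pass (signature is hypothetical, not Megatron's API).
    logits = model(batch["text"])
    # Stage 4: cross-entropy loss against the target tokens.
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    # Stage 5: backward pass, then average gradients across data-parallel ranks.
    loss.backward()
    if dist.is_initialized():
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    # Stage 6: optimizer step and learning-rate schedule update.
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.detach()
```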

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Batch megatron/core/datasets/
dict with 'text': Tensor[B, seq_len] (token IDs), 'attention_mask': Tensor[B, seq_len] (padding mask), 'position_ids': Tensor[B, seq_len] (positional encodings), 'labels': Tensor[B, seq_len] (target tokens for loss)
Created by dataset loaders from raw text, tokenized and padded to fixed sequence length, consumed by model forward pass for gradient computation
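
For illustration, a dummy Batch matching the shape contract above (field names from this section; values are random placeholders):

```python
import torch

B, seq_len, vocab_size = 4, 2048, 50257
batch = {
    "text": torch.randint(0, vocab_size, (B, seq_len)),           # input token IDs
    "attention_mask": torch.ones(B, seq_len, dtype=torch.bool),   # True = real token, False = padding
    "position_ids": torch.arange(seq_len).expand(B, seq_len),     # positional indices
    "labels": torch.randint(0, vocab_size, (B, seq_len)),         # target tokens (inputs shifted by one)
}
```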
TransformerConfig megatron/core/transformer/transformer_config.py
dataclass with num_layers: int, hidden_size: int, num_attention_heads: int, vocab_size: int, sequence_parallel: bool, tensor_model_parallel_size: int, plus precision and optimization settings
Built from command line arguments and config files, used to construct transformer layers with proper parallelism configuration
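
A hedged sketch of constructing the config: the fields shown follow the description above, but the exact set of required arguments varies between Megatron-LM versions.

```python
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    tensor_model_parallel_size=2,  # shard attention heads and MLP weights across 2 GPUs
    sequence_parallel=True,        # additionally split activations along the sequence dimension
)
```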
DistributedState megatron/core/parallel_state.py
global state tracking tensor_model_parallel_rank: int, pipeline_model_parallel_rank: int, data_parallel_rank: int, process groups for each parallelism dimension
Initialized once during startup based on world size and parallelism configuration, accessed throughout training for collective operations
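
A hedged sketch of initializing and querying this global state via megatron.core.parallel_state; it assumes one process per GPU launched with torchrun, and defaults may differ by version.

```python
import torch
from megatron.core import parallel_state

# Launch with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set; one process per GPU.
torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
)

tp_rank = parallel_state.get_tensor_model_parallel_rank()
pp_rank = parallel_state.get_pipeline_model_parallel_rank()
dp_rank = parallel_state.get_data_parallel_rank()
print(f"tp={tp_rank} pp={pp_rank} dp={dp_rank}")
```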
InferenceRequest megatron/core/inference/inference_request.py
dataclass with prompt_tokens: List[int], sampling_params: SamplingParams (temperature, top_k, max_tokens), request_id: str, arrival_time: float
Created from user prompts, queued in inference engine, batched with other requests, processed to generate completions
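
A hypothetical stand-in dataclass mirroring the fields described above, purely to show the shape of a request; the real class (and its SamplingParams) lives in the inference package and may differ.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:          # hypothetical stand-in for the real class
    temperature: float = 1.0
    top_k: int = 0
    max_tokens: int = 30

@dataclass
class InferenceRequest:        # mirrors the fields described above
    prompt_tokens: List[int]
    sampling_params: SamplingParams
    request_id: str
    arrival_time: float = field(default_factory=time.time)

req = InferenceRequest(
    prompt_tokens=[5195, 318, 257],            # token IDs of an already-tokenized prompt
    sampling_params=SamplingParams(top_k=40),
    request_id="req-0001",
)
```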
CheckpointableState megatron/core/dist_checkpointing/
nested dict with 'model': model parameters, 'optimizer': optimizer state, 'lr_scheduler': learning rate state, 'args': training arguments, 'iteration': current step
Periodically saved during training with distributed sharding, loaded when resuming or converting to other formats
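
A toy, single-file sketch of the nested state dict described above; the real implementation shards each entry across ranks via dist_checkpointing rather than calling torch.save.

```python
import torch

# Toy model/optimizer/scheduler so the example is self-contained.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: 1.0)

state = {
    "model": model.state_dict(),              # sharded transformer weights in the real system
    "optimizer": optimizer.state_dict(),      # Adam moments, step counts
    "lr_scheduler": scheduler.state_dict(),   # learning-rate schedule progress
    "args": {"seq_length": 2048, "micro_batch_size": 4},  # training arguments
    "iteration": 1000,                        # current step
}
torch.save(state, "ckpt_iter_001000.pt")      # single-file stand-in for sharded saving
```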

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Scale unguarded

Function assumes all ranks have synchronized clocks when broadcasting current time across ranks - if ranks have clock skew >100ms, time.time_ns() values will differ significantly before broadcast

If this fails: Request timing, latency measurements, and timeout calculations become inconsistent across the cluster, leading to premature timeouts or incorrect performance metrics

examples/inference/gpt/utils.py:get_curr_time
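
One hypothetical mitigation is to treat rank 0's clock as the single source of truth rather than trusting every rank's local wall clock; the helper below is a sketch, not the repository's implementation.

```python
import time
import torch
import torch.distributed as dist

def get_synced_time_s() -> float:
    """Hypothetical variant of get_curr_time: every rank adopts rank 0's
    timestamp, so clock skew between nodes no longer matters."""
    t = torch.tensor([time.time_ns()], dtype=torch.int64)
    if dist.is_initialized():
        # NCCL requires a CUDA tensor; fall back to CPU (e.g. gloo) otherwise.
        if torch.cuda.is_available():
            t = t.cuda()
        dist.broadcast(t, src=0)
    return t.item() / 10**9
```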
critical Resource unguarded

Function assumes all in-flight requests will complete within a reasonable time when calling await asyncio.gather(*futures) - no timeout is set for the gather operation

If this fails: If any request hangs or takes extremely long, the suspend/resume cycle blocks indefinitely, preventing training cycles and making the system unresponsive

examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
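
A hypothetical mitigation is to bound the wait with asyncio.wait_for so a single hung request cannot block the suspend/resume cycle indefinitely; the helper name and timeout below are illustrative.

```python
import asyncio

async def drain_in_flight(futures, timeout_s: float = 300.0):
    """Hypothetical helper: wait for outstanding requests, but give up after
    timeout_s instead of blocking the suspend/resume cycle forever."""
    try:
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner gather on timeout; surface the stall so
        # the caller can abort the suspend attempt or retry.
        raise RuntimeError(f"in-flight requests still pending after {timeout_s}s")
```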
critical Ordering unguarded

Function assumes the engine state transitions PAUSED -> SUSPENDED happen atomically and that client.pause_engines() always succeeds before wait_until(EngineState.PAUSED)

If this fails: If pause_engines() fails or engine doesn't reach PAUSED state, wait_until() hangs forever and subsequent suspend_engines() operates on wrong engine state, corrupting the training/inference cycle

examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
warning Environment weakly guarded

Function assumes torch.distributed.is_initialized() accurately reflects whether the process is part of a distributed setup, but doesn't verify CUDA devices are available when using torch.cuda.LongTensor

If this fails: On CPU-only nodes or when CUDA is unavailable, torch.cuda.LongTensor() raises RuntimeError even if distributed training is properly initialized, breaking time synchronization

examples/inference/gpt/utils.py:get_curr_time
warning Contract unguarded

Function assumes termination_id parameter corresponds to a valid token ID in the model's vocabulary when not None, but doesn't validate this against any tokenizer

If this fails: Invalid termination_id causes generation to never terminate properly, leading to maximum token generation and wasted compute resources for every request using these params

examples/inference/gpt/utils.py:get_default_sampling_params
warning Temporal weakly guarded

The suspend_resume_cycle assumes engine state transitions are immediate - no delays are accounted for between pause_engines(), wait_until(PAUSED), and suspend_engines() calls

If this fails: Race conditions where training attempts to start before engine is fully suspended, or inference requests arrive during state transitions, leading to mixed training/inference workloads and corrupted model state

examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py
warning Resource unguarded

Static inference assumes sufficient GPU memory is available for the entire batch of requests without any memory planning or validation

If this fails: Out-of-memory errors occur silently during forward pass when batch size * sequence length exceeds available GPU memory, causing inference jobs to crash without helpful error messages

examples/inference/gpt/gpt_static_inference.py
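
A hypothetical pre-flight check could compare a rough activation-memory estimate against free GPU memory before launching the batch; the estimate below is deliberately crude and the function name is illustrative, not a Megatron API.

```python
import torch

def check_batch_fits(batch_size: int, seq_len: int, hidden_size: int,
                     num_layers: int, bytes_per_element: int = 2) -> None:
    """Crude activation-memory estimate checked against free GPU memory."""
    approx_activation_bytes = batch_size * seq_len * hidden_size * num_layers * bytes_per_element
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    if approx_activation_bytes > 0.9 * free_bytes:
        raise RuntimeError(
            f"Estimated activations ~{approx_activation_bytes / 2**30:.1f} GiB exceed "
            f"90% of free GPU memory ({free_bytes / 2**30:.1f} GiB); reduce batch size."
        )
```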
info Scale unguarded

Default num_tokens_to_generate=30 is hardcoded without considering model size, available memory, or request context length

If this fails: Large models or long input sequences may exceed memory limits when generating 30 tokens, while short responses waste computation by generating unnecessary padding tokens

examples/inference/gpt/utils.py:get_default_sampling_params
info Domain unguarded

Time conversion from nanoseconds to seconds using / 10**9 assumes high precision is needed, but doesn't account for floating point precision limits in downstream time calculations

If this fails: Sub-microsecond timing differences get lost in floating point operations, making fine-grained performance profiling less accurate than expected

examples/inference/gpt/utils.py:get_curr_time
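
A quick illustration of the precision limit: at current epoch values a float64 second count resolves to roughly a few hundred nanoseconds, so nanosecond-level differences are rounded away by the division.

```python
import time

t_ns = time.time_ns()
t_s = t_ns / 10**9                    # same conversion as get_curr_time
roundtrip_error_ns = abs(t_ns - int(t_s * 10**9))
print(roundtrip_error_ns)             # typically up to a few hundred nanoseconds
```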
info Contract unguarded

The import of Request, build_dynamic_engine_setup_prefix, build_requests assumes these utilities exist and have stable interfaces, but no version checking or graceful fallbacks exist

If this fails: Breaking changes in utility functions cause import errors that crash the entire inference service, with no indication of which specific utility caused the failure

examples/inference/gpt/gpt_dynamic_inference.py

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model Parameters (state-store)
Distributed transformer weights and biases sharded across tensor parallel ranks, with each GPU storing a subset of attention heads and MLP weights
Optimizer State (state-store)
Adam momentum and variance estimates for each parameter, distributed according to model parallelism to minimize memory overhead
Data Cache (cache)
Preprocessed and tokenized training batches cached to avoid repeated tokenization during multi-epoch training
KV Cache (buffer)
Cached key-value pairs from attention layers during inference to avoid recomputation for autoregressive generation
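
A minimal sketch of the KV-cache idea (not Megatron's implementation): keys and values from earlier tokens are stored per layer so each new token attends over cached history instead of recomputing it.

```python
import torch
from typing import Dict, Tuple

# kv_cache maps layer index -> (keys, values), each shaped [batch, heads, tokens_so_far, head_dim]
kv_cache: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}

def append_kv(layer_idx: int, k_new: torch.Tensor, v_new: torch.Tensor):
    """Append this step's keys/values and return the full cached history."""
    if layer_idx in kv_cache:
        k, v = kv_cache[layer_idx]
        kv_cache[layer_idx] = (torch.cat([k, k_new], dim=2), torch.cat([v, v_new], dim=2))
    else:
        kv_cache[layer_idx] = (k_new, v_new)
    return kv_cache[layer_idx]
```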

Technology Stack

PyTorch (framework)
Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training
NCCL (infra)
Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups
Transformer Engine (library)
Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance
Flash Attention (compute)
Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences
Apex (library)
Mixed precision training utilities and optimized Adam implementation for reduced memory usage
Weights & Biases (infra)
Tracks training metrics, hyperparameters, and system performance across distributed training runs

Frequently Asked Questions

What is Megatron-LM used for?

nvidia/megatron-lm trains massive transformer models across hundreds of GPUs using optimized parallelism strategies. It is an 8-component ML training system written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 932 files.

How is Megatron-LM architected?

Megatron-LM is organized into 3 architecture layers: Core Infrastructure, Model Architectures, Applications & Examples. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through Megatron-LM?

Data moves through 7 stages: Load and tokenize training data → Distribute batch across model parallel ranks → Forward pass through transformer layers → Compute language modeling loss → Backward pass and gradient synchronization → Parameter updates and learning rate scheduling → Checkpoint saving and validation. Raw text is tokenized and batched by BlendedMegatronDatasetBuilder, the distributed GPT model applies attention and MLP operations in parallel across GPUs, logits are compared against labels to compute cross-entropy loss, gradients are synchronized across data parallel ranks before the optimizer updates parameters, and distributed checkpoints are saved periodically so training can resume. This pipeline design reflects a complex multi-stage processing system.

What technologies does Megatron-LM use?

The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training), NCCL (Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups), Transformer Engine (Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance), Flash Attention (Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences), Apex (Mixed precision training utilities and optimized Adam implementation for reduced memory usage), Weights & Biases (Tracks training metrics, hyperparameters, and system performance across distributed training runs). A focused set of dependencies that keeps the build manageable.

What system dynamics does Megatron-LM have?

Megatron-LM exhibits 4 data pools (including Model Parameters and Optimizer State), 4 feedback loops, 5 control points, and 4 delays. The feedback loops cover the training loop and gradient accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does Megatron-LM use?

5 design patterns detected: Tensor Parallelism, Pipeline Parallelism, Distributed Checkpointing, Mixed Precision Training, Dynamic Batching.

Analyzed on April 20, 2026 by CodeSea.