nvidia/megatron-lm
Ongoing research training transformer models at scale
Trains massive transformer models across hundreds of GPUs using optimized parallelism strategies
Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 932 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training begins by loading raw text data through BlendedMegatronDatasetBuilder which tokenizes and creates batches. These batches flow through the distributed GPT model where TransformerLayers apply attention and MLP operations in parallel across GPUs. The forward pass produces logits that are compared against labels to compute cross-entropy loss. Gradients are backpropagated and synchronized across data parallel ranks, then the optimizer updates model parameters. Checkpoints are periodically saved using distributed checkpointing to resume training.
- Load and tokenize training data — BlendedMegatronDatasetBuilder reads text files, applies tokenization using the specified tokenizer (GPT2, SentencePiece, etc.), and creates batches with proper padding and attention masks (config: data.data_path, data.seq_length, training.micro_batch_size)
- Distribute batch across model parallel ranks — The input batch is split across tensor parallel groups, with each GPU receiving a subset of the attention heads and MLP weights according to tensor_model_parallel_size [Batch → Batch] (config: model.tensor_model_parallel_size, model.sequence_parallel)
- Forward pass through transformer layers — Each TransformerLayer applies multi-head attention and MLP operations in parallel, with activations flowing through pipeline parallel stages if enabled [Batch → Batch] (config: model.num_layers, model.hidden_size, model.num_attention_heads, +1 more)
- Compute language modeling loss — The final layer outputs logits which are compared against shifted input tokens using cross-entropy loss, with loss values averaged across data parallel groups [Batch] (config: training.loss_scale, model.vocab_size)
- Backward pass and gradient synchronization — Gradients are computed via automatic differentiation and synchronized across data parallel ranks using all-reduce operations, with gradient clipping applied (config: training.clip_grad, training.data_parallel_size)
- Parameter updates and learning rate scheduling — OptimizerParamScheduler updates learning rates based on step count and schedule, then the optimizer (Adam, SGD, etc.) updates model parameters using computed gradients (config: training.lr, training.lr_decay_style, training.min_lr, +1 more)
- Checkpoint saving and validation — Every save_interval steps, DistCheckpointing saves model state, optimizer state, and training metadata across multiple files sharded by tensor parallelism [CheckpointableState → CheckpointableState] (config: training.save_interval, training.save)
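The seven stages above map onto a single training step. The sketch below is a minimal, hedged illustration in plain PyTorch rather than Megatron's actual pretrain() loop; `model`, `optimizer`, `scheduler`, and `get_batch` are placeholders, and data parallel gradient synchronization is assumed to happen inside backward() as it would under DistributedDataParallel.

```python
# Minimal sketch of one training step covering the stages above.
# Not Megatron-LM's actual implementation; all objects are placeholders.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, scheduler, get_batch, clip_grad=1.0):
    batch = get_batch()                                       # stage 1: tokenized batch
    logits = model(batch["text"],                             # stages 2-3: parallel forward pass
                   batch["position_ids"],
                   batch["attention_mask"])
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),  # stage 4: language modeling loss
                           batch["labels"].view(-1))
    loss.backward()                                           # stage 5: backward pass; DDP syncs grads here
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad)
    optimizer.step()                                          # stage 6: parameter update
    scheduler.step()                                          # stage 6: learning rate schedule
    optimizer.zero_grad()
    return loss.item()
```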
Data Models
The data structures that flow between stages — the contracts that hold the system together.
megatron/core/datasets/ — dict with 'text': Tensor[B, seq_len] (token IDs), 'attention_mask': Tensor[B, seq_len] (padding mask), 'position_ids': Tensor[B, seq_len] (positional encodings), 'labels': Tensor[B, seq_len] (target tokens for loss)
Created by dataset loaders from raw text, tokenized and padded to fixed sequence length, consumed by model forward pass for gradient computation
megatron/core/transformer/transformer_config.py — dataclass with num_layers: int, hidden_size: int, num_attention_heads: int, vocab_size: int, sequence_parallel: bool, tensor_model_parallel_size: int, plus precision and optimization settings
Built from command line arguments and config files, used to construct transformer layers with proper parallelism configuration
megatron/core/parallel_state.py — global state tracking tensor_model_parallel_rank: int, pipeline_model_parallel_rank: int, data_parallel_rank: int, and process groups for each parallelism dimension
Initialized once during startup based on world size and parallelism configuration, accessed throughout training for collective operations
megatron/core/inference/inference_request.py — dataclass with prompt_tokens: List[int], sampling_params: SamplingParams (temperature, top_k, max_tokens), request_id: str, arrival_time: float
Created from user prompts, queued in inference engine, batched with other requests, processed to generate completions
megatron/core/dist_checkpointing/ — nested dict with 'model': model parameters, 'optimizer': optimizer state, 'lr_scheduler': learning rate state, 'args': training arguments, 'iteration': current step
Periodically saved during training with distributed sharding, loaded when resuming or converting to other formats
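For concreteness, here is a hedged sketch of the training batch contract described above. The shapes follow the field descriptions; the actual collation code in megatron/core/datasets differs in detail, and the values are placeholders.

```python
# Illustrative construction of the batch dict described above (placeholder values).
import torch

B, seq_len, vocab_size = 4, 1024, 50257
tokens = torch.randint(0, vocab_size, (B, seq_len))

batch = {
    "text": tokens,                                              # input token IDs
    "labels": torch.roll(tokens, shifts=-1, dims=1),             # next-token targets for the loss
    "attention_mask": torch.ones(B, seq_len, dtype=torch.bool),  # all positions valid (no padding here)
    "position_ids": torch.arange(seq_len).unsqueeze(0).expand(B, -1),
}
```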
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Function assumes all ranks have synchronized clocks when broadcasting current time across ranks; if ranks have clock skew >100ms, time.time_ns() values will differ significantly before broadcast
If this fails: Request timing, latency measurements, and timeout calculations become inconsistent across the cluster, leading to premature timeouts or incorrect performance metrics
examples/inference/gpt/utils.py:get_curr_time
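A hedged sketch of what such a time broadcast looks like; this is not the exact code in examples/inference/gpt/utils.py, but it shows why every rank inherits rank 0's clock reading, skew included.

```python
# Sketch: broadcast rank 0's wall-clock reading so all ranks agree on "now".
# If rank 0's clock is skewed, every rank inherits that skew.
# (The example script uses a CUDA tensor here; see the related assumption below.)
import time
import torch
import torch.distributed as dist

def get_curr_time() -> float:
    t = torch.tensor([time.time_ns()], dtype=torch.int64)
    if dist.is_initialized():
        dist.broadcast(t, src=0)    # all ranks adopt rank 0's value
    return t.item() / 10**9         # nanoseconds -> seconds
```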
Function assumes all in-flight requests will complete within a reasonable time when calling await asyncio.gather(*futures) - no timeout is set for the gather operation
If this fails: If any request hangs or takes extremely long, the suspend/resume cycle blocks indefinitely, preventing training cycles and making the system unresponsive
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
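One hedged way to bound that wait is to wrap the gather in asyncio.wait_for. The function below is a sketch rather than the project's code, and the timeout value and cancellation policy are assumptions.

```python
# Sketch: cap how long the suspend path waits for in-flight requests.
import asyncio

async def drain_in_flight(futures, timeout_s: float = 60.0):
    try:
        return await asyncio.wait_for(asyncio.gather(*futures), timeout=timeout_s)
    except asyncio.TimeoutError:
        for f in futures:
            f.cancel()              # policy choice: cancel stragglers and surface the timeout
        raise
```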
Function assumes the engine state transitions PAUSED -> SUSPENDED happen atomically and that client.pause_engines() always succeeds before wait_until(EngineState.PAUSED)
If this fails: If pause_engines() fails or engine doesn't reach PAUSED state, wait_until() hangs forever and subsequent suspend_engines() operates on wrong engine state, corrupting the training/inference cycle
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py:suspend_resume_cycle
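A defensive version of that transition might look like the sketch below. The `client` methods and `EngineState` values mirror the names used in the script, but the signatures, return behavior, and timeout handling here are hypothetical.

```python
# Sketch: fail loudly instead of hanging if the engines never reach PAUSED.
import asyncio
from enum import Enum, auto

class EngineState(Enum):   # placeholder mirroring the states named above
    PAUSED = auto()
    SUSPENDED = auto()

async def pause_then_suspend(client, timeout_s: float = 30.0):
    await client.pause_engines()    # assumed to raise if pausing fails
    try:
        await asyncio.wait_for(client.wait_until(EngineState.PAUSED), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise RuntimeError("engines never reached PAUSED; refusing to suspend")
    await client.suspend_engines()
```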
Function assumes torch.distributed.is_initialized() accurately reflects whether the process is part of a distributed setup, but doesn't verify CUDA devices are available when using torch.cuda.LongTensor
If this fails: On CPU-only nodes or when CUDA is unavailable, torch.cuda.LongTensor() raises RuntimeError even if distributed training is properly initialized, breaking time synchronization
examples/inference/gpt/utils.py:get_curr_time
Function assumes termination_id parameter corresponds to a valid token ID in the model's vocabulary when not None, but doesn't validate this against any tokenizer
If this fails: Invalid termination_id causes generation to never terminate properly, leading to maximum token generation and wasted compute resources for every request using these params
examples/inference/gpt/utils.py:get_default_sampling_params
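A hedged guard for this case is sketched below; `tokenizer` and its vocab_size attribute are placeholders, since the real sampling-params helper takes no tokenizer at all.

```python
# Sketch: reject termination IDs that fall outside the tokenizer's vocabulary.
def validate_termination_id(termination_id, tokenizer):
    if termination_id is None:
        return None
    vocab_size = getattr(tokenizer, "vocab_size", None)
    if vocab_size is not None and not 0 <= termination_id < vocab_size:
        raise ValueError(
            f"termination_id {termination_id} is outside the vocabulary "
            f"(size {vocab_size}); generation would never terminate cleanly")
    return termination_id
```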
The suspend_resume_cycle assumes engine state transitions are immediate - no delays are accounted for between pause_engines(), wait_until(PAUSED), and suspend_engines() calls
If this fails: Race conditions where training attempts to start before engine is fully suspended, or inference requests arrive during state transitions, leading to mixed training/inference workloads and corrupted model state
examples/inference/gpt/gpt_dynamic_inference_with_coordinator.py
Static inference assumes sufficient GPU memory is available for the entire batch of requests without any memory planning or validation
If this fails: Out-of-memory errors occur silently during forward pass when batch size * sequence length exceeds available GPU memory, causing inference jobs to crash without helpful error messages
examples/inference/gpt/gpt_static_inference.py
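A rough, hedged pre-flight check is sketched below; the overhead factor is illustrative rather than measured, and the real memory footprint depends on precision, KV cache layout, and kernel choices.

```python
# Sketch: crude activation-memory estimate before launching a static inference batch.
import torch

def estimate_activation_bytes(batch_size, seq_len, hidden_size, num_layers,
                              bytes_per_elem=2, overhead=4.0):
    # upper bound ~ B * S * H per layer, scaled by an empirical overhead factor
    return int(batch_size * seq_len * hidden_size * num_layers * bytes_per_elem * overhead)

def check_fits_on_gpu(batch_size, seq_len, hidden_size, num_layers):
    if not torch.cuda.is_available():
        return True
    free_bytes, _ = torch.cuda.mem_get_info()
    needed = estimate_activation_bytes(batch_size, seq_len, hidden_size, num_layers)
    if needed > free_bytes:
        raise RuntimeError(f"batch {batch_size}x{seq_len} needs ~{needed / 2**30:.1f} GiB "
                           f"but only {free_bytes / 2**30:.1f} GiB of GPU memory is free")
    return True
```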
Default num_tokens_to_generate=30 is hardcoded without considering model size, available memory, or request context length
If this fails: Large models or long input sequences may exceed memory limits when generating 30 tokens, while short responses waste computation by generating unnecessary padding tokens
examples/inference/gpt/utils.py:get_default_sampling_params
Time conversion from nanoseconds to seconds using / 10**9 assumes high precision is needed, but doesn't account for floating point precision limits in downstream time calculations
If this fails: Sub-microsecond timing differences get lost in floating point operations, making fine-grained performance profiling less accurate than expected
examples/inference/gpt/utils.py:get_curr_time
The import of Request, build_dynamic_engine_setup_prefix, build_requests assumes these utilities exist and have stable interfaces, but no version checking or graceful fallbacks exist
If this fails: Breaking changes in utility functions cause import errors that crash the entire inference service, with no indication of which specific utility caused the failure
examples/inference/gpt/gpt_dynamic_inference.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Parameters — Distributed transformer weights and biases sharded across tensor parallel ranks, with each GPU storing a subset of attention heads and MLP weights
- Optimizer State — Adam momentum and variance estimates for each parameter, distributed according to model parallelism to minimize memory overhead
- Tokenized Batch Cache — Preprocessed and tokenized training batches cached to avoid repeated tokenization during multi-epoch training
- Inference KV Cache — Cached key-value pairs from attention layers during inference to avoid recomputation for autoregressive generation
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Training script execution with specified number of iterations. Action: Loads batch, runs forward/backward pass, updates parameters, saves checkpoints. Exit: Reaches max_train_iters or manual termination.
- Gradient Accumulation (gradient-accumulation, balancing) — Trigger: When global_batch_size > micro_batch_size * data_parallel_size. Action: Accumulates gradients across multiple micro-batches before parameter update. Exit: After gradient_accumulation_fusion_steps micro-batches processed.
- Dynamic Batching Loop (auto-scale, balancing) — Trigger: New inference requests arrive at DynamicInferenceEngine. Action: Batches variable-length requests together, adjusts batch size based on memory constraints. Exit: When engine is suspended or no pending requests.
- Learning Rate Schedule (convergence, balancing) — Trigger: Each training step based on current iteration count. Action: Adjusts learning rate according to warmup, decay, and minimum LR settings. Exit: When training completes or manual override.
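The gradient-accumulation loop is the easiest of these to show in isolation. The sketch below is plain PyTorch rather than Megatron's fused implementation; `compute_loss` is a placeholder for the language-modeling loss described earlier.

```python
# Sketch: accumulate gradients over several micro-batches, then update once.
def accumulate_and_step(model, optimizer, micro_batches, compute_loss, accumulation_steps):
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        loss = compute_loss(model, batch) / accumulation_steps  # scale so summed grads match one large batch
        loss.backward()                                          # gradients add up in param.grad
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()                                     # one parameter update per accumulation window
            optimizer.zero_grad()
```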
Delays
- Checkpoint Saving (checkpoint-save, ~Minutes for large models) — Training pauses while model state is serialized and written to disk across multiple files
- Data Loading (batch-window, ~Milliseconds per batch) — GPUs wait for next batch to be tokenized and moved to device memory
- Pipeline Bubble (async-processing, ~Proportional to pipeline_model_parallel_size) — Some pipeline stages idle while others process their portion of the forward/backward pass
- Distributed Synchronization (eventual-consistency, ~Microseconds to milliseconds) — All-reduce operations block until all ranks contribute their gradients
Control Points
- tensor_model_parallel_size (architecture-switch) — Controls: How model weights are sharded across GPUs - affects memory usage and communication patterns. Default: 1
- sequence_parallel (feature-flag) — Controls: Whether to split sequence dimension across tensor parallel ranks to reduce memory usage. Default: False
- precision_mode (precision-mode) — Controls: Training precision (fp32, fp16, bf16) affecting memory usage and numerical stability. Default: bf16
- recompute_granularity (runtime-toggle) — Controls: Whether to recompute activations during backward pass to trade compute for memory
- use_distributed_optimizer (feature-flag) — Controls: Whether to shard optimizer states across data parallel ranks. Default: False
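Most of these control points surface as fields on the TransformerConfig dataclass described in the Data Models section. The sketch below is illustrative only; exact field names and availability vary between Megatron-LM releases, and the distributed-optimizer flag is typically configured on the optimizer setup rather than here.

```python
# Hedged example of expressing several control points via TransformerConfig.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    tensor_model_parallel_size=4,        # shard attention heads and MLP weights across 4 GPUs
    sequence_parallel=True,              # split the sequence dimension across TP ranks
    bf16=True,                           # precision mode
    recompute_granularity="selective",   # recompute some activations to trade compute for memory
)
```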
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training
- NCCL — Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups
- Transformer Engine — Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance
- Flash Attention — Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences
- Apex — Mixed precision training utilities and optimized Adam implementation for reduced memory usage
- Weights & Biases — Tracks training metrics, hyperparameters, and system performance across distributed training runs
Key Components
- parallel_state (orchestrator) — Manages distributed training topology by tracking which GPU handles which model shards and coordinating communication between ranks (megatron/core/parallel_state.py)
- MegatronDataset (loader) — Loads and preprocesses training data from various formats (JSON, parquet, raw text) into tokenized batches with proper padding and attention masks (megatron/core/datasets/)
- TransformerLayer (processor) — Implements the core transformer computation (attention + MLP) with tensor parallelism, sequence parallelism, and optimized CUDA kernels (megatron/core/transformer/transformer_layer.py)
- OptimizerParamScheduler (scheduler) — Manages learning rate scheduling, gradient clipping, and optimizer state across distributed training with warmup and decay phases (megatron/training/training.py)
- DistCheckpointing (store) — Saves and loads model checkpoints across multiple GPUs by sharding state according to tensor parallelism and pipeline parallelism configuration (megatron/core/dist_checkpointing/)
- DynamicInferenceEngine (executor) — Manages inference workload by batching variable-length requests, optimizing memory allocation, and coordinating generation across model parallel ranks (megatron/core/inference/engines/dynamic_engine.py)
- ModelParallelGPT (processor) — Implements the complete GPT model architecture with embedding, transformer layers, and language modeling head distributed across tensor parallel ranks (megatron/legacy/model/gpt_model.py)
- BlendedMegatronDatasetBuilder (factory) — Constructs datasets from multiple data sources with configurable blending weights, handles sampling strategies and data validation (megatron/core/datasets/blended_megatron_dataset_builder.py)
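To see how these components meet at startup, a hedged initialization sketch follows; the calls are the ones commonly used from megatron.core, but exact signatures may differ across releases, and the parallelism sizes are placeholders.

```python
# Sketch: initialize the distributed topology that parallel_state tracks.
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=4,     # how weights are sharded within each layer
    pipeline_model_parallel_size=2,   # how layers are split into pipeline stages
)
tp_rank = parallel_state.get_tensor_model_parallel_rank()
dp_rank = parallel_state.get_data_parallel_rank()
```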
Frequently Asked Questions
What is Megatron-LM used for?
Megatron-LM trains massive transformer models across hundreds of GPUs using optimized parallelism strategies. nvidia/megatron-lm is an 8-component ML training system written in Python. Data flows through 7 distinct pipeline stages. The codebase contains 932 files.
How is Megatron-LM architected?
Megatron-LM is organized into 3 architecture layers: Core Infrastructure, Model Architectures, Applications & Examples. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through Megatron-LM?
Data moves through 7 stages: Load and tokenize training data → Distribute batch across model parallel ranks → Forward pass through transformer layers → Compute language modeling loss → Backward pass and gradient synchronization → .... Training begins by loading raw text data through BlendedMegatronDatasetBuilder which tokenizes and creates batches. These batches flow through the distributed GPT model where TransformerLayers apply attention and MLP operations in parallel across GPUs. The forward pass produces logits that are compared against labels to compute cross-entropy loss. Gradients are backpropagated and synchronized across data parallel ranks, then the optimizer updates model parameters. Checkpoints are periodically saved using distributed checkpointing to resume training. This pipeline design reflects a complex multi-stage processing system.
What technologies does Megatron-LM use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, and CUDA kernels for distributed transformer training), NCCL (Handles inter-GPU communication for gradient synchronization and parameter updates across data parallel groups), Transformer Engine (Optimized CUDA implementations of attention and MLP layers with FP8 precision support for improved performance), Flash Attention (Memory-efficient attention computation that reduces VRAM usage and speeds up training of long sequences), Apex (Mixed precision training utilities and optimized Adam implementation for reduced memory usage), Weights & Biases (Tracks training metrics, hyperparameters, and system performance across distributed training runs). A focused set of dependencies that keeps the build manageable.
What system dynamics does Megatron-LM have?
Megatron-LM exhibits 4 data pools (including Model Parameters and Optimizer State), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle training-loop and gradient-accumulation dynamics. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does Megatron-LM use?
5 design patterns detected: Tensor Parallelism, Pipeline Parallelism, Distributed Checkpointing, Mixed Precision Training, Dynamic Batching.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.