lightning-ai/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows
Data enters through text datasets or user prompts, gets tokenized into sequences, flows through transformer blocks for training or generation, and outputs either model checkpoints or generated text. Training involves forward passes computing cross-entropy loss, backward passes for gradient computation, and optimizer updates. Generation uses cached key-value states for efficient autoregressive decoding with configurable sampling strategies.
Under the hood, the system relies on 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
An 8-component ML training system. 137 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
- Dataset tokenization — Raw text files are loaded by DataModule, split into chunks matching model block_size, and converted to token IDs using model-specific vocabulary via Tokenizer.encode() (config: block_size, vocab_size)
- Model forward pass — GPT.forward() processes input_ids through embedding layers, n_layer transformer blocks with attention and MLP, applying layer normalization and computing logits over vocabulary [TokenizedBatch → Raw logits tensor] (config: n_layer, n_embd, n_head)
- Loss computation — chunked_cross_entropy() computes cross-entropy loss between predicted logits and target labels, with ignore_index masking for padding tokens [Raw logits tensor → TrainingMetrics]
- Gradient computation and update — TrainingLoop calls loss.backward() to compute gradients, applies gradient clipping and accumulation, then optimizer.step() updates model parameters with learning rate scheduling [TrainingMetrics → Updated ModelState] (config: learning_rate, batch_size)
- Autoregressive generation — GenerationEngine.generate() performs iterative token prediction, sampling next tokens using temperature and top_k/top_p filtering, maintaining KV cache for efficiency [GenerationConfig → Generated text tokens] (config: max_new_tokens, temperature, top_k)
- Checkpoint persistence — CheckpointIO serializes ModelState including transformer weights and optimizer states to disk, with optional sharded saving for large models [ModelState]
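In PyTorch terms, the training stages above amount to a forward pass, a loss computation, a backward pass, and an optimizer step. Below is a minimal sketch with a toy two-layer stand-in for the GPT model; litgpt's actual loop additionally handles gradient accumulation and uses its chunked_cross_entropy helper rather than plain cross-entropy.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-in for litgpt's GPT: embedding -> linear head over a tiny vocab.
vocab_size, n_embd, block_size = 32, 16, 8
model = nn.Sequential(nn.Embedding(vocab_size, n_embd), nn.Linear(n_embd, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

input_ids = torch.randint(0, vocab_size, (4, block_size))  # [B, seq_len]
targets = torch.randint(0, vocab_size, (4, block_size))    # next-token labels

logits = model(input_ids)                                  # forward pass -> [B, seq_len, vocab]
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                            # gradient computation
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping
optimizer.step()                                           # parameter update
optimizer.zero_grad()
```

Clipping the gradient norm before `optimizer.step()` mirrors the clipping mentioned in the gradient-update stage above.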
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Config (litgpt/config.py) — Dataclass with block_size: int (4096), n_layer: int (16), n_embd: int (4096), vocab_size: int (50254), attention params, normalization settings, and architecture-specific parameters. Created from model config files or defaults, passed to model constructors, serialized with checkpoints.
- TokenizedBatch (litgpt/data/__init__.py) — Dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], where B is batch size and seq_len is sequence length. Created by tokenizing text samples, batched by DataLoader, consumed by the model forward pass.
- ModelState (litgpt/model.py) — OrderedDict containing transformer.wte.weight: Tensor[vocab_size, n_embd], transformer.blocks.*.attn.weight: Tensor[...], lm_head.weight: Tensor[vocab_size, n_embd], and optimizer states. Extracted from trained models, saved to disk, loaded for inference or continued training.
- GenerationConfig (litgpt/generate/base.py) — Dict with max_new_tokens: int, temperature: float, top_k: int, top_p: float, eos_token_id: int controlling generation behavior. Configured from user parameters, used during autoregressive generation to control token sampling.
- TrainingMetrics (litgpt/pretrain.py) — Dict with loss: float, learning_rate: float, throughput: float, memory_usage: float, gradient_norm: float tracked during training. Computed each training step, aggregated for logging, used for training diagnostics.
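As a sketch, the TokenizedBatch contract can be written as a dataclass. The field names follow the documented dict keys; the dataclass form itself, and the use of -100 as an ignore label, are illustrative conventions rather than litgpt's actual types.

```python
from dataclasses import dataclass
import torch

@dataclass
class TokenizedBatch:
    """Hypothetical typed view of the batch contract described above."""
    input_ids: torch.Tensor       # [B, seq_len] token IDs
    attention_mask: torch.Tensor  # [B, seq_len] 1 = real token, 0 = padding
    labels: torch.Tensor          # [B, seq_len] next-token targets

B, seq_len = 2, 8
batch = TokenizedBatch(
    input_ids=torch.randint(0, 50254, (B, seq_len)),
    attention_mask=torch.ones(B, seq_len, dtype=torch.long),
    labels=torch.randint(0, 50254, (B, seq_len)),
)
assert batch.input_ids.shape == batch.labels.shape == (B, seq_len)
```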
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- extensions/thunder/unsloth/kernels/utils.py:calculate_settings — Triton kernels can be launched with block sizes up to 65536 (MAX_FUSED_SIZE), assuming the CUDA hardware supports this limit. If this fails: On older CUDA hardware with smaller maximum block sizes, kernel launches fail with cryptic CUDA errors rather than the documented RuntimeError.
- extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta — The input logits tensor has at least one dimension (logits.shape[0] exists); the meta-function returns a loss tensor of shape (logits.shape[0],) without validating tensor rank. If this fails: A scalar (0-dimensional) logits tensor makes logits.shape[0] raise IndexError, crashing the meta-function during Thunder compilation.
- extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta — The cross-entropy kernel supports only float32 and hardcodes the output dtype as thunder.dtypes.float32 regardless of the input logits dtype. If this fails: With bf16 or fp16 models, a silent dtype conversion occurs during loss computation, potentially causing gradient-scaling issues or numerical instability.
- extensions/thunder/__init__.py:sys.path modification — The parent directory structure exists (Path(__file__).parent.parent resolves successfully), and the sys.path modification affects global Python import resolution. If this fails: Moving the extension or changing the filesystem layout breaks imports with ModuleNotFoundError, and sys.path pollution can cause unexpected import behavior elsewhere in the system.
- extensions/thunder/unsloth/kernels/rope_embedding.py:_rope_embedding — The RoPE embedding kernel expects Q, cos, and sin tensors with compatible shapes (head_dim matching the last dimension), validated only implicitly through Triton's memory access patterns. If this fails: Shape mismatches between the query tensor and position embeddings cause silent memory access violations or incorrect rotary position computations, with no explicit error message.
- extensions/thunder/unsloth/kernels/rope_embedding.py:ROPE_GROUP_SIZE — RoPE operations process attention heads in groups of exactly 4 (ROPE_GROUP_SIZE = 4), assuming head dimensions divisible by 4. If this fails: Models whose head dimensions are not divisible by 4 get incomplete rotary position encoding on the remaining dimensions, degrading attention quality.
- extensions/thunder/unsloth/kernels/utils.py:calculate_settings — The warp allocation logic assumes CUDA GPU execution and derives num_warps from BLOCK_SIZE using hardcoded thresholds (32768, 8192, 2048). If this fails: On non-CUDA devices or GPUs with different warp architectures, suboptimal warp allocation leads to poor kernel performance or launch failures.
- extensions/thunder/unsloth/kernels/cross_entropy_loss.py:_cross_entropy_forward — The Triton kernel expects logits_ptr, labels_ptr, loss_ptr, and logsumexp_ptr to point to valid memory regions with correct stride patterns, and performs no bounds checking. If this fails: Invalid pointers or stride mismatches cause segmentation faults or silent memory corruption during kernel execution, which is difficult to debug in distributed settings.
- extensions/thunder/unsloth/kernels/swiglu.py:_fg_kernel — The SwiGLU kernel processes elements in BLOCK_SIZE chunks assuming a contiguous memory layout, with mask logic that depends on elements being processed in ascending offset order. If this fails: Non-contiguous tensors or unexpected memory layouts break the masking behavior, producing wrong gradient computations in the backward pass.
- extensions/thunder/strategies/thunder_ddp.py:_THUNDER_AVAILABLE import — The Thunder framework is available and properly installed when ThunderDDPStrategy is imported, with no graceful degradation if Thunder is missing. If this fails: The import failure cascades through the strategy system, crashing training scripts with ImportError instead of falling back to standard DDP.
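Many of these failure modes could be surfaced early with cheap input validation. The helper below is hypothetical (validate_kernel_inputs is not part of the codebase); each check mirrors one of the assumptions listed above.

```python
import torch

def validate_kernel_inputs(logits: torch.Tensor, head_dim: int, rope_group_size: int = 4) -> None:
    """Hypothetical pre-flight checks for custom-kernel inputs.

    Raises early instead of letting a Triton kernel fail with a cryptic
    CUDA error or silently mis-compute.
    """
    if logits.dim() < 1:
        raise ValueError("logits must have at least 1 dimension, got a scalar")
    if logits.dtype != torch.float32:
        raise TypeError(f"kernel expects float32 logits, got {logits.dtype}")
    if not logits.is_contiguous():
        raise ValueError("kernel assumes a contiguous memory layout")
    if head_dim % rope_group_size != 0:
        raise ValueError(f"head_dim {head_dim} not divisible by ROPE_GROUP_SIZE {rope_group_size}")

# Valid inputs pass silently; a bf16 tensor or scalar logits would raise.
validate_kernel_inputs(torch.zeros(4, 128), head_dim=64)
```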
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model checkpoint storage — Persisted model weights, optimizer states, and training metadata organized by model architecture and training run
- KV cache buffers — Cached key-value tensors from previous tokens during generation, avoiding recomputation of attention for efficiency
- Training state — Current training step, epoch, learning-rate schedule state, and optimizer momentum buffers maintained across training iterations
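The KV cache pool can be illustrated with a minimal sketch in which cached keys and values grow along the sequence axis at each decoding step. The shapes and the concatenation-based growth are illustrative; real implementations typically preallocate the cache buffers instead.

```python
import torch

# Minimal sketch of a KV cache: keys/values from past tokens are retained so
# each new token attends over cached states instead of recomputing them.
n_head, head_dim = 2, 4
k_cache = torch.empty(1, n_head, 0, head_dim)  # grows along the sequence axis
v_cache = torch.empty(1, n_head, 0, head_dim)

for step in range(3):  # pretend we decode 3 tokens
    k_new = torch.randn(1, n_head, 1, head_dim)  # projection for the new token
    v_new = torch.randn(1, n_head, 1, head_dim)
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)

assert k_cache.shape == (1, n_head, 3, head_dim)
```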
Feedback Loops
- Training optimization loop (training-loop, reinforcing) — Trigger: Next batch from DataLoader. Action: Forward pass → loss computation → backward pass → parameter update. Exit: Configured max_steps or max_epochs reached.
- Generation sampling loop (recursive, balancing) — Trigger: Current sequence shorter than max_new_tokens and not EOS token. Action: Model predicts next token logits → sampling → append to sequence → update KV cache. Exit: EOS token generated or max_new_tokens reached.
- Learning rate scheduling (convergence, balancing) — Trigger: Training step completion. Action: Update learning rate based on schedule (cosine, linear decay) and current step. Exit: Training completion.
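The learning-rate scheduling loop commonly follows cosine decay with linear warmup. The sketch below shows that shape; the function and hyperparameters are illustrative, not litgpt's exact schedule.

```python
import math

def cosine_lr(step: int, max_steps: int, max_lr: float, min_lr: float, warmup: int) -> float:
    """Cosine decay with linear warmup (illustrative schedule)."""
    if step < warmup:
        return max_lr * step / warmup          # linear warmup toward max_lr
    progress = (step - warmup) / max(1, max_steps - warmup)
    # Cosine anneal from max_lr down to min_lr over the remaining steps.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate ramps linearly to max_lr during warmup, then anneals smoothly to min_lr by max_steps.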
Delays
- Model compilation (compilation, ~First forward pass) — Initial latency spike when using torch.compile or JIT compilation
- Distributed synchronization (async-processing, ~Per gradient sync) — Gradient synchronization across ranks during distributed training
- Checkpoint saving (checkpoint-save, ~Configurable save_interval) — Training pauses briefly to serialize model state to disk
Control Points
- Model architecture selection (architecture-switch) — Controls: Which transformer variant (GPT, Llama, Falcon) and layer configuration to use. Default: Determined by the config.name field
- Precision mode (precision-mode) — Controls: Floating-point precision (fp32, fp16, bf16), affecting memory usage and computation speed. Default: bf16-mixed
- Device strategy (device-selection) — Controls: Whether to use single GPU, DDP, FSDP, or TPU acceleration. Default: Auto-detected based on available hardware
- Generation sampling strategy (sampling-strategy) — Controls: Token sampling behavior via temperature, top_k, top_p parameters. Default: temperature=0.8, top_k=50
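The default sampling strategy (temperature=0.8, top_k=50) can be sketched as: scale the logits by temperature, mask everything outside the top k, and sample from the resulting distribution. The function name and structure are illustrative, not litgpt's actual API.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Illustrative temperature + top-k sampling over a 1-D logits vector."""
    logits = logits / max(temperature, 1e-5)  # temperature scaling
    if top_k is not None:
        # Keep only the k largest logits; mask the rest to -inf.
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

next_id = sample_next_token(torch.randn(100))
assert 0 <= next_id < 100
```

Lower temperatures sharpen the distribution toward greedy decoding; larger top_k values admit more of the tail.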
Technology Stack
- PyTorch Lightning — Provides distributed training orchestration, device management, and training loop abstractions via Fabric
- PyTorch — Core tensor operations, autograd, and neural network primitives for model implementation and training
- Hugging Face Hub — Downloads pretrained model checkpoints and handles model metadata and configuration files
- Tokenizers — Fast tokenization using Rust-based implementations for various model vocabularies (BPE, SentencePiece)
- SafeTensors — Secure and efficient serialization format for model weights with memory-mapping support
- Thunder — PyTorch compiler for optimizing model execution with fusion, memory optimization, and hardware acceleration
- Triton — GPU kernel language for custom CUDA operations such as optimized attention and cross-entropy implementations
- LitServe — High-performance inference serving with batching, streaming, and API endpoint management
Key Components
- GPT (transformer, litgpt/model.py) — Core transformer model that processes input token sequences through stacked attention blocks to predict next tokens
- DataModule (loader, litgpt/data/__init__.py) — Orchestrates dataset loading, tokenization, and batching with configurable data sources and preprocessing pipelines
- TrainingLoop (orchestrator, litgpt/pretrain.py) — Coordinates model training with gradient computation, parameter updates, checkpointing, and distributed synchronization
- GenerationEngine (processor, litgpt/generate/base.py) — Implements autoregressive text generation with sampling strategies, caching, and various decoding algorithms
- Tokenizer (encoder, litgpt/tokenizer.py) — Converts between raw text and token sequences using model-specific vocabularies and special-token handling
- CheckpointIO (serializer, litgpt/utils.py) — Handles saving and loading model states with validation, format conversion, and distributed checkpointing support
- AdapterWrapper (adapter, litgpt/adapter.py) — Wraps base models with parameter-efficient fine-tuning layers that add learnable prompts to attention blocks
- LitServeDeployment (gateway, litgpt/deploy/serve.py) — Provides HTTP API endpoints for model inference with request queuing, batching, and response streaming
Frequently Asked Questions
What is litgpt used for?
lightning-ai/litgpt provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows. It is an 8-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 137 files.
How is litgpt architected?
litgpt is organized into 5 architecture layers: Core Models, Training Workflows, Data Processing, Generation & Deployment, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through litgpt?
Data moves through 6 stages: Dataset tokenization → Model forward pass → Loss computation → Gradient computation and update → Autoregressive generation → Checkpoint persistence. Text enters as datasets or prompts, is tokenized, passes through transformer blocks for training or generation, and exits as model checkpoints or generated text. This pipeline design reflects a complex multi-stage processing system.
What technologies does litgpt use?
The core stack includes PyTorch Lightning (Provides distributed training orchestration, device management, and training loop abstractions via Fabric), PyTorch (Core tensor operations, autograd, and neural network primitives for model implementation and training), Hugging Face Hub (Downloads pretrained model checkpoints and handles model metadata and configuration files), Tokenizers (Fast tokenization using Rust-based implementations for various model vocabularies (BPE, SentencePiece)), SafeTensors (Secure and efficient serialization format for model weights with memory mapping support), Thunder (PyTorch compiler for optimizing model execution with fusion, memory optimization, and hardware acceleration), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does litgpt have?
litgpt exhibits 3 data pools (model checkpoint storage, KV cache buffers, and training state), 3 feedback loops, 4 control points, and 3 delays. The feedback loops cover training optimization, recursive generation sampling, and learning-rate scheduling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does litgpt use?
5 design patterns detected: Parameter-efficient fine-tuning, Modular workflow dispatch, Lazy model initialization, Chunked cross-entropy, Extension acceleration.
How does litgpt compare to alternatives?
CodeSea offers side-by-side architecture comparisons of litgpt with nanogpt, covering tech-stack differences, pipeline design, system behavior, and code patterns.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.