lightning-ai/litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.

13,308 stars · Python · 8 components

Provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows.


Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.

An 8-component ML training system. 137 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

Data enters through text datasets or user prompts, gets tokenized into sequences, flows through transformer blocks for training or generation, and outputs either model checkpoints or generated text. Training involves forward passes computing cross-entropy loss, backward passes for gradient computation, and optimizer updates. Generation uses cached key-value states for efficient autoregressive decoding with configurable sampling strategies.

  1. Dataset tokenization — Raw text files are loaded by DataModule, split into chunks matching model block_size, and converted to token IDs using model-specific vocabulary via Tokenizer.encode() (config: block_size, vocab_size)
  2. Model forward pass — GPT.forward() processes input_ids through embedding layers, n_layer transformer blocks with attention and MLP, applying layer normalization and computing logits over vocabulary [TokenizedBatch → Raw logits tensor] (config: n_layer, n_embd, n_head)
  3. Loss computation — chunked_cross_entropy() computes cross-entropy loss between predicted logits and target labels, with ignore_index masking for padding tokens [Raw logits tensor → TrainingMetrics]
  4. Gradient computation and update — TrainingLoop calls loss.backward() to compute gradients, applies gradient clipping and accumulation, then optimizer.step() updates model parameters with learning rate scheduling [TrainingMetrics → Updated ModelState] (config: learning_rate, batch_size; see the sketch after this list)
  5. Autoregressive generation — GenerationEngine.generate() performs iterative token prediction, sampling next tokens using temperature and top_k/top_p filtering, maintaining KV cache for efficiency [GenerationConfig → Generated text tokens] (config: max_new_tokens, temperature, top_k)
  6. Checkpoint persistence — CheckpointIO serializes ModelState including transformer weights and optimizer states to disk, with optional sharded saving for large models [ModelState]
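
The heart of this pipeline is stages 2 through 4. Below is a minimal sketch of one training step, assuming a generic PyTorch model and optimizer; the hypothetical training_step helper is not litgpt's API, and the real loop in litgpt/pretrain.py adds Fabric orchestration, gradient accumulation, and learning-rate scheduling.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch, max_norm=1.0):
    # Stage 2: forward pass produces logits over the vocabulary.
    logits = model(batch["input_ids"])
    # Stage 3: cross-entropy against labels; litgpt uses
    # chunked_cross_entropy() here to bound peak memory.
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,  # mask padding positions
    )
    # Stage 4: backward pass, gradient clipping, optimizer update.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```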

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Config litgpt/config.py
Dataclass with block_size: int (4096), n_layer: int (16), n_embd: int (4096), vocab_size: int (50254), attention params, normalization settings, and architecture-specific parameters
Created from model config files or defaults, passed to model constructors, serialized with checkpoints
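
A condensed sketch of such a dataclass, limited to the fields quoted above; the real Config in litgpt/config.py carries many more attention, normalization, and architecture parameters, and the n_head default below is made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Defaults match the values quoted in the description above.
    block_size: int = 4096
    n_layer: int = 16
    n_embd: int = 4096
    vocab_size: int = 50254
    n_head: int = 32  # hypothetical default, for illustration only
```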
TokenizedBatch litgpt/data/__init__.py
Dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len] where B is batch size and seq_len is sequence length
Created by tokenizing text samples, batched by DataLoader, consumed by model forward pass
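
A sketch of a batch that satisfies this contract, with made-up sizes (B=2, seq_len=8); real batches come out of the DataLoader at the configured batch_size and block_size.

```python
import torch

B, seq_len, vocab_size = 2, 8, 50254  # illustrative sizes only

batch = {
    "input_ids": torch.randint(0, vocab_size, (B, seq_len)),
    "attention_mask": torch.ones(B, seq_len, dtype=torch.long),
    # -100 marks positions (e.g. padding) the loss should ignore.
    "labels": torch.randint(0, vocab_size, (B, seq_len)),
}
```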
ModelState litgpt/model.py
OrderedDict containing transformer.wte.weight: Tensor[vocab_size, n_embd], transformer.blocks.*.attn.weight: Tensor[...], lm_head.weight: Tensor[vocab_size, n_embd], and optimizer states
Extracted from trained models, saved to disk, loaded for inference or continued training
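
A minimal sketch of the persistence round trip using plain torch.save/torch.load on hypothetical model and optimizer objects; litgpt's CheckpointIO layers sharded saving for large models on top of this idea.

```python
import torch

# Persist weights and optimizer state together (pipeline stage 6).
state = {
    "model": model.state_dict(),          # transformer.wte.weight, lm_head.weight, ...
    "optimizer": optimizer.state_dict(),  # momentum buffers, step counts
}
torch.save(state, "checkpoint.pt")

# Later: restore for inference or continued training.
state = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```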
GenerationConfig litgpt/generate/base.py
Dict with max_new_tokens: int, temperature: float, top_k: int, top_p: float, eos_token_id: int controlling generation behavior
Configured by user parameters, used during autoregressive generation to control token sampling
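
A sketch of how these fields typically steer a single decoding step, combining temperature scaling with top-k filtering; the sample_next_token helper is hypothetical, but litgpt/generate/base.py implements the same ideas.

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    # logits: [vocab_size] scores for the next token position.
    if temperature <= 0:
        return torch.argmax(logits)  # greedy decoding
    logits = logits / temperature    # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")  # drop everything below rank k
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```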
TrainingMetrics litgpt/pretrain.py
Dict with loss: float, learning_rate: float, throughput: float, memory_usage: float, gradient_norm: float tracked during training
Computed each training step, aggregated for logging, used for training diagnostics
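
A sketch of assembling such a dict each step, reusing names from the training-step sketch above (model, loss, optimizer, and tokens_in_batch are assumed to be in scope); note that clip_grad_norm_ already returns the total gradient norm.

```python
import time
import torch

step_start = time.perf_counter()
# ... forward and backward passes as in the training-step sketch ...
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

metrics = {
    "loss": loss.item(),
    "learning_rate": optimizer.param_groups[0]["lr"],
    "throughput": tokens_in_batch / (time.perf_counter() - step_start),  # tokens/s
    "memory_usage": torch.cuda.max_memory_allocated() / 2**30,           # GiB
    "gradient_norm": grad_norm.item(),
}
```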

Hidden Assumptions

Things this code relies on but never validates; the kind that cause silent failures when the system changes.

critical Environment weakly guarded

Triton kernels can be launched with block sizes up to 65536 (MAX_FUSED_SIZE), assuming the CUDA hardware supports this block-size limit

If this fails: On older CUDA hardware with smaller maximum block sizes, the kernel launch fails with cryptic CUDA errors rather than the documented RuntimeError

extensions/thunder/unsloth/kernels/utils.py:calculate_settings
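
A sketch of what this settings logic looks like, folding in the hardcoded warp thresholds from the warp-allocation entry further down; treat it as an approximation of calculate_settings, not a verbatim copy.

```python
MAX_FUSED_SIZE = 65536  # assumed CUDA block-size ceiling

def next_power_of_2(n: int) -> int:
    return 1 << (n - 1).bit_length()

def calculate_settings(n: int):
    # Round the row width up to a power of two for the Triton kernel.
    block_size = next_power_of_2(n)
    if block_size > MAX_FUSED_SIZE:
        # The documented failure mode; older hardware may fail earlier
        # with opaque CUDA launch errors instead of reaching this check.
        raise RuntimeError(f"{n} exceeds maximum fused size {MAX_FUSED_SIZE}")
    # Hardcoded thresholds (32768, 8192, 2048) from the entry below.
    num_warps = 4
    if block_size >= 32768:
        num_warps = 32
    elif block_size >= 8192:
        num_warps = 16
    elif block_size >= 2048:
        num_warps = 8
    return block_size, num_warps
```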
critical Shape unguarded

The meta-function assumes the input logits tensor has at least 1 dimension (logits.shape[0] exists) and returns a loss tensor of shape (logits.shape[0],) without validating the logits tensor's rank

If this fails: A scalar (0-dimensional) logits tensor makes logits.shape[0] raise IndexError, crashing the meta-function during Thunder compilation

extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta
warning Domain unguarded

The cross-entropy kernel supports only float32 and hardcodes the output dtype as thunder.dtypes.float32 regardless of the input logits dtype

If this fails: If the model runs in bf16 or fp16 precision, a silent dtype conversion occurs during loss computation, potentially causing gradient-scaling issues or numerical instability

extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta
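
Both meta-function entries above could be closed with explicit pre-flight guards; a hypothetical sketch (not the actual function) that turns the silent failure modes into loud errors:

```python
def check_cross_entropy_inputs(logits):
    # Guard the rank assumption: a 0-d logits tensor would make
    # logits.shape[0] raise IndexError deep inside Thunder compilation.
    if len(logits.shape) < 1:
        raise ValueError(f"logits needs at least 1 dim, got shape {tuple(logits.shape)}")
    # Guard the dtype assumption instead of silently emitting float32.
    if "float32" not in str(logits.dtype):
        raise TypeError(f"kernel supports float32 only, got {logits.dtype}")
```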
warning Environment unguarded

Assumes the parent directory structure exists (Path(__file__).parent.parent resolves successfully) and that modifying sys.path is acceptable, even though it affects global Python import resolution

If this fails: If the extension is moved or the filesystem structure changes, imports fail with ModuleNotFoundError, and the sys.path pollution can cause unexpected import behavior elsewhere in the system

extensions/thunder/__init__.py:sys.path modification
critical Contract weakly guarded

The RoPE embedding kernel expects the Q, cos, and sin tensors to have compatible shapes, with head_dim matching the last dimension, but the kernel validates this only through Triton's memory-access patterns

If this fails: Shape mismatches between the query tensor and the position embeddings cause silent memory-access violations or incorrect rotary position computations, with no explicit error message

extensions/thunder/unsloth/kernels/rope_embedding.py:_rope_embedding
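
A sketch of the host-side shape check the kernel skips, assuming a hypothetical [batch, heads, seq_len, head_dim] layout for Q and matching trailing dimensions for cos/sin:

```python
def check_rope_shapes(q, cos, sin):
    # Hypothetical guard; the Triton kernel itself relies purely on
    # memory-access patterns instead of explicit validation.
    head_dim = q.shape[-1]
    if cos.shape[-1] != head_dim or sin.shape[-1] != head_dim:
        raise ValueError(
            f"head_dim mismatch: q has {head_dim}, "
            f"cos/sin have {cos.shape[-1]}/{sin.shape[-1]}"
        )
    if cos.shape != sin.shape:
        raise ValueError(f"cos {tuple(cos.shape)} and sin {tuple(sin.shape)} must match")
```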
warning Scale unguarded

RoPE operations process attention heads in groups of exactly 4 (ROPE_GROUP_SIZE = 4), assuming head dimensions are divisible by 4

If this fails: Models with head dimensions not divisible by 4 will have incomplete rotary position encoding applied to the remaining dimensions, leading to degraded attention quality

extensions/thunder/unsloth/kernels/rope_embedding.py:ROPE_GROUP_SIZE
info Resource unguarded

Triton warp allocation logic assumes CUDA GPU execution and calculates num_warps based on BLOCK_SIZE using hardcoded thresholds (32768, 8192, 2048)

If this fails: On non-CUDA devices or GPUs with different warp architectures, suboptimal warp allocation leads to poor kernel performance or launch failures

extensions/thunder/unsloth/kernels/utils.py:calculate_settings
critical Contract unguarded

Triton kernel expects logits_ptr, labels_ptr, loss_ptr, and logsumexp_ptr to point to valid memory regions with correct stride patterns, but performs no bounds checking

If this fails: Invalid pointers or stride mismatches cause segmentation faults or silent memory corruption during kernel execution, difficult to debug in distributed settings

extensions/thunder/unsloth/kernels/cross_entropy_loss.py:_cross_entropy_forward
warning Ordering weakly guarded

SwiGLU kernel processes elements in BLOCK_SIZE chunks assuming contiguous memory layout, with mask logic depending on elements being processed in ascending offset order

If this fails: Non-contiguous tensors or unexpected memory layouts cause incorrect masking behavior, leading to wrong gradient computations during backward pass

extensions/thunder/unsloth/kernels/swiglu.py:_fg_kernel
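
One cheap host-side mitigation is to normalize layout before the launch; a hypothetical wrapper (the variable names e and g are illustrative, not a quote of the kernel's call site):

```python
def ensure_contiguous(*tensors):
    # The kernel's masking assumes a flat, contiguous layout, so make
    # the layout explicit rather than debugging wrong gradients later.
    return tuple(t if t.is_contiguous() else t.contiguous() for t in tensors)

# e, g = ensure_contiguous(e, g)  # then launch _fg_kernel on the results
```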
warning Environment weakly guarded

Thunder framework is available and properly installed when ThunderDDPStrategy is imported, with no graceful degradation if Thunder is missing

If this fails: Import failures cascade through the strategy system, causing training scripts to crash with ImportError rather than falling back to standard DDP

extensions/thunder/strategies/thunder_ddp.py:_THUNDER_AVAILABLE import
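
A sketch of the graceful degradation the strategy system could adopt; the make_strategy helper and the fallback policy are hypothetical, and the actual module imports Thunder unconditionally.

```python
try:
    import thunder  # noqa: F401
    _THUNDER_AVAILABLE = True
except ImportError:
    _THUNDER_AVAILABLE = False

def make_strategy(prefer_thunder: bool = True):
    # Fall back to standard DDP instead of crashing the training script.
    if prefer_thunder and _THUNDER_AVAILABLE:
        # Assumes the extension package is importable from this location.
        from extensions.thunder.strategies.thunder_ddp import ThunderDDPStrategy
        return ThunderDDPStrategy()
    from lightning.fabric.strategies import DDPStrategy
    return DDPStrategy()
```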

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model checkpoint storage (checkpoint)
Persisted model weights, optimizer states, and training metadata organized by model architecture and training run
KV cache buffers (cache)
Cached key-value tensors from previous tokens, kept during generation so attention over earlier positions is never recomputed
Training state registry (state-store)
Current training step, epoch, learning rate schedule state, and optimizer momentum buffers maintained across training iterations
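
For the KV cache pool, a minimal sketch of the pattern: preallocate per-layer key/value buffers once, then write each generated token's K/V at its position. litgpt's actual cache lives inside the attention blocks; the class below is illustrative only.

```python
import torch

class KVCache:
    def __init__(self, batch, n_head, max_seq_len, head_dim, device="cpu"):
        shape = (batch, n_head, max_seq_len, head_dim)
        self.k = torch.zeros(shape, device=device)
        self.v = torch.zeros(shape, device=device)

    def update(self, pos, k_new, v_new):
        # k_new/v_new: [batch, n_head, 1, head_dim] for one new token.
        self.k[:, :, pos : pos + 1] = k_new
        self.v[:, :, pos : pos + 1] = v_new
        # Attend over everything cached so far.
        return self.k[:, :, : pos + 1], self.v[:, :, : pos + 1]
```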

Technology Stack

PyTorch Lightning (framework)
Provides distributed training orchestration, device management, and training loop abstractions via Fabric
PyTorch (framework)
Core tensor operations, autograd, and neural network primitives for model implementation and training
Hugging Face Hub (library)
Downloads pretrained model checkpoints and handles model metadata and configuration files
Tokenizers (library)
Fast tokenization using Rust-based implementations for various model vocabularies (BPE, SentencePiece)
SafeTensors (serialization)
Secure and efficient serialization format for model weights with memory mapping support
Thunder (compute)
PyTorch compiler for optimizing model execution with fusion, memory optimization, and hardware acceleration
Triton (compute)
GPU kernel language for custom CUDA operations like optimized attention and cross-entropy implementations
LitServe (infra)
High-performance inference serving with batching, streaming, and API endpoint management

Frequently Asked Questions

What is litgpt used for?

litgpt provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows. lightning-ai/litgpt is an 8-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 137 files.

How is litgpt architected?

litgpt is organized into 5 architecture layers: Core Models, Training Workflows, Data Processing, Generation & Deployment, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through litgpt?

Data moves through 6 stages: Dataset tokenization → Model forward pass → Loss computation → Gradient computation and update → Autoregressive generation → Checkpoint persistence. Data enters through text datasets or user prompts, gets tokenized into sequences, flows through transformer blocks for training or generation, and outputs either model checkpoints or generated text. Training involves forward passes computing cross-entropy loss, backward passes for gradient computation, and optimizer updates. Generation uses cached key-value states for efficient autoregressive decoding with configurable sampling strategies. This pipeline design reflects a complex multi-stage processing system.

What technologies does litgpt use?

The core stack includes PyTorch Lightning (distributed training orchestration, device management, and training loop abstractions via Fabric), PyTorch (core tensor operations, autograd, and neural network primitives), Hugging Face Hub (pretrained checkpoint downloads and model metadata), Tokenizers (fast Rust-based tokenization for various model vocabularies such as BPE and SentencePiece), SafeTensors (secure, efficient weight serialization with memory mapping), Thunder (PyTorch compiler for fusion, memory optimization, and hardware acceleration), Triton (GPU kernel language for custom CUDA operations like optimized attention and cross-entropy), and LitServe (high-performance inference serving with batching, streaming, and API endpoint management). A focused set of dependencies that keeps the build manageable.

What system dynamics does litgpt have?

litgpt exhibits 3 data pools (Model checkpoint storage, KV cache buffers, Training state registry), 3 feedback loops, 4 control points, and 3 delays. The feedback loops include the training loop and recursive generation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does litgpt use?

5 design patterns detected: Parameter-efficient fine-tuning, Modular workflow dispatch, Lazy model initialization, Chunked cross-entropy, Extension acceleration.

How does litgpt compare to alternatives?

CodeSea has side-by-side architecture comparisons of litgpt with nanogpt, covering tech stack differences, pipeline design, system behavior, and code patterns.

Analyzed on April 20, 2026 by CodeSea.