lightning-ai/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows
Data enters through text datasets or user prompts, gets tokenized into sequences, flows through transformer blocks for training or generation, and outputs either model checkpoints or generated text. Training involves forward passes computing cross-entropy loss, backward passes for gradient computation, and optimizer updates. Generation uses cached key-value states for efficient autoregressive decoding with configurable sampling strategies.
Under the hood, the system relies on 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
An 8-component ML training system. 137 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
- Dataset tokenization — Raw text files are loaded by DataModule, split into chunks matching model block_size, and converted to token IDs using model-specific vocabulary via Tokenizer.encode() (config: block_size, vocab_size)
- Model forward pass — GPT.forward() processes input_ids through embedding layers, n_layer transformer blocks with attention and MLP, applying layer normalization and computing logits over vocabulary [TokenizedBatch → Raw logits tensor] (config: n_layer, n_embd, n_head)
- Loss computation — chunked_cross_entropy() computes cross-entropy loss between predicted logits and target labels, with ignore_index masking for padding tokens [Raw logits tensor → TrainingMetrics]
- Gradient computation and update — TrainingLoop calls loss.backward() to compute gradients, applies gradient clipping and accumulation, then optimizer.step() updates model parameters with learning rate scheduling [TrainingMetrics → Updated ModelState] (config: learning_rate, batch_size)
- Autoregressive generation — GenerationEngine.generate() performs iterative token prediction, sampling next tokens using temperature and top_k/top_p filtering, maintaining KV cache for efficiency [GenerationConfig → Generated text tokens] (config: max_new_tokens, temperature, top_k)
- Checkpoint persistence — CheckpointIO serializes ModelState including transformer weights and optimizer states to disk, with optional sharded saving for large models [ModelState]
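In PyTorch terms, the training stages above amount to a forward pass, a loss computation, a backward pass, and an optimizer step. Below is a minimal sketch with a toy two-layer stand-in for the GPT model; litgpt's actual loop additionally handles gradient accumulation and uses its chunked_cross_entropy helper rather than plain cross-entropy.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-in for litgpt's GPT: embedding -> linear head over a tiny vocab.
vocab_size, n_embd, block_size = 32, 16, 8
model = nn.Sequential(nn.Embedding(vocab_size, n_embd), nn.Linear(n_embd, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

input_ids = torch.randint(0, vocab_size, (4, block_size))  # [B, seq_len]
targets = torch.randint(0, vocab_size, (4, block_size))    # next-token labels

logits = model(input_ids)                                  # forward pass -> [B, seq_len, vocab]
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                            # gradient computation
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping
optimizer.step()                                           # parameter update
optimizer.zero_grad()
```

Clipping the gradient norm before `optimizer.step()` mirrors the clipping mentioned in the gradient-update stage above.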
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- Config (litgpt/config.py) — Dataclass with block_size: int (4096), n_layer: int (16), n_embd: int (4096), vocab_size: int (50254), attention params, normalization settings, and architecture-specific parameters. Created from model config files or defaults, passed to model constructors, serialized with checkpoints.
- TokenizedBatch (litgpt/data/__init__.py) — Dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], where B is batch size and seq_len is sequence length. Created by tokenizing text samples, batched by DataLoader, consumed by the model forward pass.
- ModelState (litgpt/model.py) — OrderedDict containing transformer.wte.weight: Tensor[vocab_size, n_embd], transformer.blocks.*.attn.weight: Tensor[...], lm_head.weight: Tensor[vocab_size, n_embd], and optimizer states. Extracted from trained models, saved to disk, loaded for inference or continued training.
- GenerationConfig (litgpt/generate/base.py) — Dict with max_new_tokens: int, temperature: float, top_k: int, top_p: float, eos_token_id: int controlling generation behavior. Configured from user parameters, used during autoregressive generation to control token sampling.
- TrainingMetrics (litgpt/pretrain.py) — Dict with loss: float, learning_rate: float, throughput: float, memory_usage: float, gradient_norm: float tracked during training. Computed each training step, aggregated for logging, used for training diagnostics.
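As a sketch, the TokenizedBatch contract can be written as a dataclass. The field names follow the documented dict keys; the dataclass form itself, and the use of -100 as an ignore label, are illustrative conventions rather than litgpt's actual types.

```python
from dataclasses import dataclass
import torch

@dataclass
class TokenizedBatch:
    """Hypothetical typed view of the batch contract described above."""
    input_ids: torch.Tensor       # [B, seq_len] token IDs
    attention_mask: torch.Tensor  # [B, seq_len] 1 = real token, 0 = padding
    labels: torch.Tensor          # [B, seq_len] next-token targets

B, seq_len = 2, 8
batch = TokenizedBatch(
    input_ids=torch.randint(0, 50254, (B, seq_len)),
    attention_mask=torch.ones(B, seq_len, dtype=torch.long),
    labels=torch.randint(0, 50254, (B, seq_len)),
)
assert batch.input_ids.shape == batch.labels.shape == (B, seq_len)
```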
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
- extensions/thunder/unsloth/kernels/utils.py:calculate_settings — Triton kernels can be launched with block sizes up to 65536 (MAX_FUSED_SIZE), assuming the CUDA hardware supports this limit. If this fails: On older CUDA hardware with smaller maximum block sizes, kernel launches fail with cryptic CUDA errors rather than the documented RuntimeError.
- extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta — The input logits tensor has at least one dimension (logits.shape[0] exists); the meta-function returns a loss tensor of shape (logits.shape[0],) without validating tensor rank. If this fails: A scalar (0-dimensional) logits tensor makes logits.shape[0] raise IndexError, crashing the meta-function during Thunder compilation.
- extensions/thunder/unsloth/executor.py:unsloth_cross_entropy_meta — The cross-entropy kernel supports only float32 and hardcodes the output dtype as thunder.dtypes.float32 regardless of the input logits dtype. If this fails: With bf16 or fp16 models, a silent dtype conversion occurs during loss computation, potentially causing gradient-scaling issues or numerical instability.
- extensions/thunder/__init__.py:sys.path modification — The parent directory structure exists (Path(__file__).parent.parent resolves successfully), and the sys.path modification affects global Python import resolution. If this fails: Moving the extension or changing the filesystem layout breaks imports with ModuleNotFoundError, and sys.path pollution can cause unexpected import behavior elsewhere in the system.
- extensions/thunder/unsloth/kernels/rope_embedding.py:_rope_embedding — The RoPE embedding kernel expects Q, cos, and sin tensors with compatible shapes (head_dim matching the last dimension), validated only implicitly through Triton's memory access patterns. If this fails: Shape mismatches between the query tensor and position embeddings cause silent memory access violations or incorrect rotary position computations, with no explicit error message.
- extensions/thunder/unsloth/kernels/rope_embedding.py:ROPE_GROUP_SIZE — RoPE operations process attention heads in groups of exactly 4 (ROPE_GROUP_SIZE = 4), assuming head dimensions divisible by 4. If this fails: Models whose head dimensions are not divisible by 4 get incomplete rotary position encoding on the remaining dimensions, degrading attention quality.
- extensions/thunder/unsloth/kernels/utils.py:calculate_settings — The warp allocation logic assumes CUDA GPU execution and derives num_warps from BLOCK_SIZE using hardcoded thresholds (32768, 8192, 2048). If this fails: On non-CUDA devices or GPUs with different warp architectures, suboptimal warp allocation leads to poor kernel performance or launch failures.
- extensions/thunder/unsloth/kernels/cross_entropy_loss.py:_cross_entropy_forward — The Triton kernel expects logits_ptr, labels_ptr, loss_ptr, and logsumexp_ptr to point to valid memory regions with correct stride patterns, and performs no bounds checking. If this fails: Invalid pointers or stride mismatches cause segmentation faults or silent memory corruption during kernel execution, which is difficult to debug in distributed settings.
- extensions/thunder/unsloth/kernels/swiglu.py:_fg_kernel — The SwiGLU kernel processes elements in BLOCK_SIZE chunks assuming a contiguous memory layout, with mask logic that depends on elements being processed in ascending offset order. If this fails: Non-contiguous tensors or unexpected memory layouts break the masking behavior, producing wrong gradient computations in the backward pass.
- extensions/thunder/strategies/thunder_ddp.py:_THUNDER_AVAILABLE import — The Thunder framework is available and properly installed when ThunderDDPStrategy is imported, with no graceful degradation if Thunder is missing. If this fails: The import failure cascades through the strategy system, crashing training scripts with ImportError instead of falling back to standard DDP.
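Many of these failure modes could be surfaced early with cheap input validation. The helper below is hypothetical (validate_kernel_inputs is not part of the codebase); each check mirrors one of the assumptions listed above.

```python
import torch

def validate_kernel_inputs(logits: torch.Tensor, head_dim: int, rope_group_size: int = 4) -> None:
    """Hypothetical pre-flight checks for custom-kernel inputs.

    Raises early instead of letting a Triton kernel fail with a cryptic
    CUDA error or silently mis-compute.
    """
    if logits.dim() < 1:
        raise ValueError("logits must have at least 1 dimension, got a scalar")
    if logits.dtype != torch.float32:
        raise TypeError(f"kernel expects float32 logits, got {logits.dtype}")
    if not logits.is_contiguous():
        raise ValueError("kernel assumes a contiguous memory layout")
    if head_dim % rope_group_size != 0:
        raise ValueError(f"head_dim {head_dim} not divisible by ROPE_GROUP_SIZE {rope_group_size}")

# Valid inputs pass silently; a bf16 tensor or scalar logits would raise.
validate_kernel_inputs(torch.zeros(4, 128), head_dim=64)
```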
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model checkpoint storage — Persisted model weights, optimizer states, and training metadata organized by model architecture and training run
- KV cache buffers — Cached key-value tensors from previous tokens during generation, avoiding recomputation of attention for efficiency
- Training state — Current training step, epoch, learning-rate schedule state, and optimizer momentum buffers maintained across training iterations
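The KV cache pool can be illustrated with a minimal sketch in which cached keys and values grow along the sequence axis at each decoding step. The shapes and the concatenation-based growth are illustrative; real implementations typically preallocate the cache buffers instead.

```python
import torch

# Minimal sketch of a KV cache: keys/values from past tokens are retained so
# each new token attends over cached states instead of recomputing them.
n_head, head_dim = 2, 4
k_cache = torch.empty(1, n_head, 0, head_dim)  # grows along the sequence axis
v_cache = torch.empty(1, n_head, 0, head_dim)

for step in range(3):  # pretend we decode 3 tokens
    k_new = torch.randn(1, n_head, 1, head_dim)  # projection for the new token
    v_new = torch.randn(1, n_head, 1, head_dim)
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)

assert k_cache.shape == (1, n_head, 3, head_dim)
```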
Feedback Loops
- Training optimization loop (training-loop, reinforcing) — Trigger: Next batch from DataLoader. Action: Forward pass → loss computation → backward pass → parameter update. Exit: Configured max_steps or max_epochs reached.
- Generation sampling loop (recursive, balancing) — Trigger: Current sequence shorter than max_new_tokens and not EOS token. Action: Model predicts next token logits → sampling → append to sequence → update KV cache. Exit: EOS token generated or max_new_tokens reached.
- Learning rate scheduling (convergence, balancing) — Trigger: Training step completion. Action: Update learning rate based on schedule (cosine, linear decay) and current step. Exit: Training completion.
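The learning-rate scheduling loop commonly follows cosine decay with linear warmup. The sketch below shows that shape; the function and hyperparameters are illustrative, not litgpt's exact schedule.

```python
import math

def cosine_lr(step: int, max_steps: int, max_lr: float, min_lr: float, warmup: int) -> float:
    """Cosine decay with linear warmup (illustrative schedule)."""
    if step < warmup:
        return max_lr * step / warmup          # linear warmup toward max_lr
    progress = (step - warmup) / max(1, max_steps - warmup)
    # Cosine anneal from max_lr down to min_lr over the remaining steps.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate ramps linearly to max_lr during warmup, then anneals smoothly to min_lr by max_steps.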
Delays
- Model compilation (compilation, ~First forward pass) — Initial latency spike when using torch.compile or JIT compilation
- Distributed synchronization (async-processing, ~Per gradient sync) — Gradient synchronization across ranks during distributed training
- Checkpoint saving (checkpoint-save, ~Configurable save_interval) — Training pauses briefly to serialize model state to disk
Control Points
- Model architecture selection (architecture-switch) — Controls: Which transformer variant (GPT, Llama, Falcon) and layer configuration to use. Default: Determined by the config.name field
- Precision mode (precision-mode) — Controls: Floating-point precision (fp32, fp16, bf16), affecting memory usage and computation speed. Default: bf16-mixed
- Device strategy (device-selection) — Controls: Whether to use single GPU, DDP, FSDP, or TPU acceleration. Default: Auto-detected based on available hardware
- Generation sampling strategy (sampling-strategy) — Controls: Token sampling behavior via temperature, top_k, top_p parameters. Default: temperature=0.8, top_k=50
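The default sampling strategy (temperature=0.8, top_k=50) can be sketched as: scale the logits by temperature, mask everything outside the top k, and sample from the resulting distribution. The function name and structure are illustrative, not litgpt's actual API.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Illustrative temperature + top-k sampling over a 1-D logits vector."""
    logits = logits / max(temperature, 1e-5)  # temperature scaling
    if top_k is not None:
        # Keep only the k largest logits; mask the rest to -inf.
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = torch.where(logits < kth, torch.full_like(logits, float("-inf")), logits)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

next_id = sample_next_token(torch.randn(100))
assert 0 <= next_id < 100
```

Lower temperatures sharpen the distribution toward greedy decoding; larger top_k values admit more of the tail.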
Technology Stack
- PyTorch Lightning — Provides distributed training orchestration, device management, and training loop abstractions via Fabric
- PyTorch — Core tensor operations, autograd, and neural network primitives for model implementation and training
- Hugging Face Hub — Downloads pretrained model checkpoints and handles model metadata and configuration files
- Tokenizers — Fast tokenization using Rust-based implementations for various model vocabularies (BPE, SentencePiece)
- SafeTensors — Secure and efficient serialization format for model weights with memory-mapping support
- Thunder — PyTorch compiler for optimizing model execution with fusion, memory optimization, and hardware acceleration
- Triton — GPU kernel language for custom CUDA operations such as optimized attention and cross-entropy implementations
- LitServe — High-performance inference serving with batching, streaming, and API endpoint management
Key Components
- GPT (transformer, litgpt/model.py) — Core transformer model that processes input token sequences through stacked attention blocks to predict next tokens
- DataModule (loader, litgpt/data/__init__.py) — Orchestrates dataset loading, tokenization, and batching with configurable data sources and preprocessing pipelines
- TrainingLoop (orchestrator, litgpt/pretrain.py) — Coordinates model training with gradient computation, parameter updates, checkpointing, and distributed synchronization
- GenerationEngine (processor, litgpt/generate/base.py) — Implements autoregressive text generation with sampling strategies, caching, and various decoding algorithms
- Tokenizer (encoder, litgpt/tokenizer.py) — Converts between raw text and token sequences using model-specific vocabularies and special-token handling
- CheckpointIO (serializer, litgpt/utils.py) — Handles saving and loading model states with validation, format conversion, and distributed checkpointing support
- AdapterWrapper (adapter, litgpt/adapter.py) — Wraps base models with parameter-efficient fine-tuning layers that add learnable prompts to attention blocks
- LitServeDeployment (gateway, litgpt/deploy/serve.py) — Provides HTTP API endpoints for model inference with request queuing, batching, and response streaming
Frequently Asked Questions
What is litgpt used for?
lightning-ai/litgpt provides high-performance implementations of 20+ LLMs with training, fine-tuning, and deployment workflows. It is an 8-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 137 files.
How is litgpt architected?
litgpt is organized into 5 architecture layers: Core Models, Training Workflows, Data Processing, Generation & Deployment, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through litgpt?
Data moves through 6 stages: Dataset tokenization → Model forward pass → Loss computation → Gradient computation and update → Autoregressive generation → Checkpoint persistence. Text enters as datasets or prompts, is tokenized, passes through transformer blocks for training or generation, and exits as model checkpoints or generated text. This pipeline design reflects a complex multi-stage processing system.
What technologies does litgpt use?
The core stack includes PyTorch Lightning (Provides distributed training orchestration, device management, and training loop abstractions via Fabric), PyTorch (Core tensor operations, autograd, and neural network primitives for model implementation and training), Hugging Face Hub (Downloads pretrained model checkpoints and handles model metadata and configuration files), Tokenizers (Fast tokenization using Rust-based implementations for various model vocabularies (BPE, SentencePiece)), SafeTensors (Secure and efficient serialization format for model weights with memory mapping support), Thunder (PyTorch compiler for optimizing model execution with fusion, memory optimization, and hardware acceleration), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does litgpt have?
litgpt exhibits 3 data pools (model checkpoint storage, KV cache buffers, and training state), 3 feedback loops, 4 control points, and 3 delays. The feedback loops cover training optimization, recursive generation sampling, and learning-rate scheduling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does litgpt use?
5 design patterns detected: Parameter-efficient fine-tuning, Modular workflow dispatch, Lazy model initialization, Chunked cross-entropy, Extension acceleration.
How does litgpt compare to alternatives?
CodeSea offers side-by-side architecture comparisons of litgpt with nanogpt, covering tech-stack differences, pipeline design, system behavior, and code patterns.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.