karpathy/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Trains GPT transformer models from scratch or finetunes pretrained ones.
Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML training system spanning 15 analyzed files. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Raw text datasets are preprocessed into tokenized binary files using either GPT-2 BPE encoding or character mappings. During training, random sequences are sampled from these files to create input-target pairs. The GPT model processes input tokens through transformer layers to produce logits, which are used to compute cross-entropy loss against targets. Gradients flow backward to update parameters via AdamW optimization with learning rate scheduling and gradient accumulation.
- Preprocess text data into tokens — Data preparation scripts download raw text, tokenize it using tiktoken BPE or character mapping, and save token IDs as binary files for efficient loading (config: dataset)
- Sample training batches — get_batch() randomly selects starting positions in the tokenized data and extracts sequences of block_size tokens, creating input-target pairs shifted by one position [TokenizedDataset → TokenBatch] (config: batch_size, block_size)
- Forward pass through transformer — GPT model embeds input tokens, processes them through n_layer transformer blocks with causal self-attention and feedforward layers, then projects to vocabulary logits [TokenBatch → LogitTensor] (config: n_layer, n_head, n_embd, and 2 more)
- Compute cross-entropy loss — PyTorch cross_entropy function compares model logits against target tokens, computing the negative log-likelihood averaged across batch and sequence dimensions [LogitTensor]
- Backward pass and optimization — loss.backward() computes gradients, which accumulate across gradient_accumulation_steps micro-batches before the AdamW optimizer updates parameters with learning rate scheduling and weight decay (config: learning_rate, weight_decay, gradient_accumulation_steps, and 2 more); see the sketch after this list
- Evaluate and checkpoint — estimate_loss() periodically evaluates on train/val splits, and checkpoint_manager saves model state when validation loss improves or at regular intervals (config: eval_interval, always_save_checkpoint)
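To make the loop concrete, here is a minimal sketch of one training iteration in the style described above. It assumes the model's forward returns (logits, loss) when targets are given; get_lr and grad_clip are illustrative names, not necessarily those used in train.py.

```python
import torch

# One training iteration, sketched: sample a batch, accumulate gradients over
# several micro-batches, then take a single AdamW step at the scheduled LR.
def train_step(model, optimizer, get_batch, get_lr, iter_num,
               gradient_accumulation_steps, grad_clip=1.0):
    # set the learning rate for this iteration (cosine schedule, sketched later)
    lr = get_lr(iter_num)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    for micro_step in range(gradient_accumulation_steps):
        x, y = get_batch('train')                  # (B, T) token ids and shifted targets
        logits, loss = model(x, y)                 # forward pass returns logits and CE loss
        loss = loss / gradient_accumulation_steps  # scale so accumulated grads average out
        loss.backward()                            # gradients accumulate across micro-steps

    if grad_clip > 0.0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()                               # single AdamW update
    optimizer.zero_grad(set_to_none=True)
    return loss.item() * gradient_accumulation_steps  # unscaled last micro-batch loss, for logging
```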
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- TokenBatch (train.py) — tuple of (x: Tensor[B, T], y: Tensor[B, T]) where x is input token ids and y is target token ids shifted by one position. Created by get_batch() from memory-mapped binary files, consumed by the model forward pass to compute loss.
- GPTConfig (model.py) — dataclass with block_size: int, vocab_size: int, n_layer: int, n_head: int, n_embd: int, dropout: float, bias: bool. Configured via config files or command line, used to instantiate the GPT model, and stored in checkpoints for reproducibility (see the sketch at the end of this section).
- Checkpoint (train.py) — dict with keys: model (state_dict), optimizer (state_dict), model_args (GPTConfig), iter_num (int), best_val_loss (float), config (dict). Saved periodically during training to enable resumption, loaded at startup to continue from previous state.
- TokenizedDataset (data/*/prepare.py) — numpy memmap of uint16 token ids stored in train.bin and val.bin files. Created once by preprocessing scripts that tokenize raw text, then memory-mapped during training for efficient random access.
- LogitTensor (model.py) — Tensor[B, T, vocab_size] of unnormalized next-token scores (logits) for each position. Generated by the model forward pass, used to compute cross-entropy loss against target tokens or converted to probabilities for sampling.
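The two most load-bearing contracts are the model config and the checkpoint dict. A minimal sketch of both follows the field lists above; the default values and the save_checkpoint helper name are illustrative assumptions, not taken from the repository.

```python
from dataclasses import dataclass
import torch

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length
    vocab_size: int = 50304  # GPT-2 BPE vocab size, padded (assumed default)
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

def save_checkpoint(path, model, optimizer, model_args, iter_num, best_val_loss, config):
    # the checkpoint dict described above: weights, optimizer state, and enough
    # metadata (model_args, iter_num, best_val_loss, config) to resume training
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'model_args': model_args,
        'iter_num': iter_num,
        'best_val_loss': best_val_loss,
        'config': config,
    }
    torch.save(checkpoint, path)
```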
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Memory-mapped data files contain enough tokens that random sampling with ix = torch.randint(len(data) - block_size, (batch_size,)) never goes out of bounds, specifically that len(data) > block_size
If this fails: If a dataset has fewer tokens than block_size (1024 by default), torch.randint receives a non-positive upper bound, causing crashes or invalid indices that corrupt training batches
train.py:get_batch
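A sketch of the sampling logic with the guard it is missing. The function signature is simplified for illustration; only the randint expression is taken from the description above.

```python
import numpy as np
import torch

def get_batch(data_path, batch_size, block_size):
    # memory-map the preprocessed uint16 token file (train.bin / val.bin)
    data = np.memmap(data_path, dtype=np.uint16, mode='r')

    # the hidden assumption: this must be strictly positive, i.e. len(data) > block_size
    high = len(data) - block_size
    assert high > 0, f"dataset has {len(data)} tokens, need more than block_size={block_size}"

    ix = torch.randint(high, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```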
An assertion checks that config.n_embd is divisible by config.n_head, but nothing validates that both are positive integers
If this fails: If n_embd or n_head is zero, negative, or non-integer (e.g. from a corrupted config file), the assertion can still pass while the linear layers receive invalid dimensions, leading to silent NaN gradients
model.py:CausalSelfAttention.__init__
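A stricter check is a few lines. The assertion below mirrors the one described, plus the positivity validation said to be missing; validate_attention_config is a hypothetical helper, not code from model.py.

```python
def validate_attention_config(n_embd: int, n_head: int) -> None:
    # the existing guard: head dimension must divide evenly
    assert n_embd % n_head == 0, f"n_embd={n_embd} not divisible by n_head={n_head}"
    # the missing guards: both values must be positive integers
    assert isinstance(n_embd, int) and n_embd > 0, f"invalid n_embd={n_embd!r}"
    assert isinstance(n_head, int) and n_head > 0, f"invalid n_head={n_head!r}"
```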
Config files and command-line arguments contain only trusted Python code, since configurator.py passes them to exec() without sandboxing
If this fails: Malicious config files can execute arbitrary code with full Python access (deleting files, making network calls, exfiltrating data), creating a code execution vulnerability
configurator.py:exec
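The override mechanism boils down to evaluating user-supplied text as Python, which is why it cannot be sandboxed after the fact. A minimal sketch of the pattern (configurator.py's exact details may differ):

```python
import sys
from ast import literal_eval

# config files are executed wholesale: whatever Python they contain runs here
for arg in sys.argv[1:]:
    if not arg.startswith('--'):
        exec(open(arg).read())          # e.g. a file under config/
    else:
        # --key=value overrides an existing global of the same name
        key, val = arg[2:].split('=', 1)
        try:
            val = literal_eval(val)     # parse numbers/bools/strings safely
        except (SyntaxError, ValueError):
            pass                        # keep as a plain string
        assert key in globals(), f"unknown config key: {key}"
        globals()[key] = val
```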
GPU memory can hold activation tensors for batch_size * block_size tokens (12 * 1024 = 12,288 by default) plus model parameters, without any check of available VRAM
If this fails: Large models or small GPUs cause CUDA out-of-memory errors that crash training without graceful degradation or helpful error messages about required vs available memory
bench.py:get_batch
Checkpoint files contain a 'model_args' key with GPTConfig-compatible parameters and that model architecture hasn't changed between save and load
If this fails: Loading checkpoints from different model versions or corrupted files fails silently or loads incompatible weights into wrong layers, leading to degraded performance that's hard to diagnose
train.py:checkpoint loading
Input sequence length never exceeds the block_size limit set at model initialization; the code relies on data preprocessing and batch sampling to ensure this
If this fails: Sequences longer than block_size cause attention mask mismatches and index errors in the causal mask, breaking the transformer's autoregressive property
model.py:forward
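A hedged sketch of the guard a forward pass can apply before position embeddings and the causal mask are indexed; check_sequence_length is a hypothetical helper, and whether model.py already enforces something equivalent is not verified here.

```python
import torch

def check_sequence_length(idx: torch.Tensor, block_size: int) -> None:
    # idx has shape (B, T); T must not exceed the block_size the model was built with,
    # otherwise position embeddings and the (block_size x block_size) causal mask
    # would be indexed out of range
    b, t = idx.size()
    assert t <= block_size, (
        f"cannot forward sequence of length {t}, block size is only {block_size}"
    )
```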
The model is put into eval() mode during validation and switched back to train() mode afterwards, but this mode handling isn't enforced if estimate_loss() is called outside the normal training loop
If this fails: Evaluating in training mode leaves dropout active on validation data, giving noisy and inaccurate loss estimates that mislead checkpoint selection
train.py:estimate_loss
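A sketch of the evaluation helper and its mode toggling, following the description above. It assumes the model's forward returns (logits, loss) and that get_batch and eval_iters exist as named.

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters):
    # average the loss over eval_iters batches for each split, with the model in
    # eval mode (disables dropout), then restore training mode before returning
    out = {}
    model.eval()
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```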
The tiktoken tokenizer used for sampling exactly matches the tokenizer used during training, but this isn't verified
If this fails: Tokenizer version mismatches cause vocabulary differences where tokens get encoded/decoded incorrectly, producing garbled text output that's hard to trace back to the encoding mismatch
sample.py:tiktoken encoding
PyTorch 2.0's torch.compile() works correctly with the specific model architecture and CUDA version; compile defaults to True
If this fails: Compilation failures on unsupported hardware or PyTorch versions cause cryptic errors or silent performance degradation, making it unclear whether to disable compilation
train.py:torch.compile
The temperature parameter is assumed positive (0.8 by default), but negative values aren't prevented, and they would invert the probability distribution
If this fails: A negative temperature flips the sign of the logits, making the model prefer its lowest-probability tokens and generate nonsensical text that still looks like a successful sampling run
sample.py:temperature scaling
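To see why the sign matters, here is a minimal sketch of temperature scaling and top-k filtering for one sampling step; sample_next_token is an illustrative name, not sample.py's verbatim code.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=None):
    # logits: (B, vocab_size) scores for the next token
    assert temperature > 0, "temperature must be positive; negative values invert preferences"
    logits = logits / temperature            # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = -float('inf')  # mask everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (B, 1) sampled token ids
```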
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- tokenized_datasets — Memory-mapped binary files containing preprocessed token sequences for efficient random access during training (see the sketch after this list)
- model_checkpoints — Saved training state including model weights, optimizer state, configuration, and training progress for resumption
- accumulated gradients — Gradients accumulated across multiple micro-batches before the optimizer step is applied, simulating larger batch sizes
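A sketch of how the tokenized_datasets pool is produced and consumed: tiktoken GPT-2 BPE on the write side, np.memmap on the read side. The prepare and load helpers are illustrative; the repository's per-dataset prepare.py scripts differ in detail.

```python
import numpy as np
import tiktoken

def prepare(text: str, out_path: str) -> None:
    # encode raw text with the GPT-2 BPE and store token ids as uint16,
    # which is enough for the ~50k-entry GPT-2 vocabulary
    enc = tiktoken.get_encoding("gpt2")
    ids = np.array(enc.encode_ordinary(text), dtype=np.uint16)
    ids.tofile(out_path)                      # e.g. train.bin or val.bin

def load(bin_path: str) -> np.memmap:
    # training never reads the whole file into RAM; it memory-maps it and
    # slices random windows out of it (see the get_batch sketch above)
    return np.memmap(bin_path, dtype=np.uint16, mode='r')
```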
Feedback Loops
- training_loop (training-loop, reinforcing) — Trigger: iter_num < max_iters. Action: Sample batch, forward pass, compute loss, backward pass, accumulate gradients, optimizer step. Exit: max_iters reached or manual termination.
- learning_rate_decay (convergence, balancing) — Trigger: iter_num progress. Action: Cosine annealing reduces the learning rate from its initial value to min_lr over lr_decay_iters (see the sketch after this list). Exit: min_lr reached.
- gradient_accumulation (gradient-accumulation, reinforcing) — Trigger: micro_step < gradient_accumulation_steps. Action: Accumulate gradients without optimizer step. Exit: accumulation steps reached, then optimizer.step().
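A sketch of the cosine schedule the learning_rate_decay loop describes. The default values shown (and the linear warmup) are illustrative, not taken verbatim from train.py.

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=600000):
    # 1) linear warmup from 0 to learning_rate over warmup_iters
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after lr_decay_iters, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine-anneal from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```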
Delays
- checkpoint_saving (checkpoint-save, ~seconds) — Training pauses while model state is serialized to disk
- validation_evaluation (async-processing, ~eval_iters * forward_time) — Training loop pauses to compute validation loss over eval_iters batches
- model_compilation (compilation, ~first forward pass) — PyTorch 2.0 torch.compile() optimizes model on first use, causing initial slowdown
Control Points
- learning_rate (hyperparameter) — Controls: Optimizer step size affecting convergence speed and stability. Default: varies by config (1e-3 to 6e-4)
- gradient_accumulation_steps (architecture-switch) — Controls: Effective batch size by accumulating gradients across multiple micro-batches. Default: varies (1 to 40)
- compile (runtime-toggle) — Controls: Whether to use PyTorch 2.0 compilation for performance optimization. Default: True
- dtype (precision-mode) — Controls: Numeric precision (float32, bfloat16, float16), affecting memory usage and performance (see the sketch after this list). Default: bfloat16 if supported
- init_from (architecture-switch) — Controls: Whether to train from scratch or finetune from pretrained checkpoints (gpt2, gpt2-medium, etc.). Default: scratch or gpt2 variants
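These toggles typically interact with a small amount of setup code. A hedged sketch of how dtype selection and the compile switch can be wired in a PyTorch 2.x training script (not a verbatim excerpt from train.py):

```python
from contextlib import nullcontext
import torch
import torch.nn as nn

# runtime toggles (defaults follow the control-point descriptions above)
compile_model = True
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if device_type == 'cuda' and torch.cuda.is_bf16_supported() else 'float16'

# map the string setting to a torch dtype; on CPU just run in default precision
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

model = nn.Linear(8, 8)            # stand-in for the GPT model
if compile_model:
    model = torch.compile(model)   # PyTorch 2.0+; disable if compilation misbehaves

with ctx:                          # forward passes run under mixed precision on GPU
    out = model(torch.randn(4, 8))
```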
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, neural network layers, and distributed training support
- tiktoken — Fast BPE tokenization compatible with GPT-2 for converting text to token sequences
- transformers — Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning
- datasets — Downloads and processes large text corpora like OpenWebText
- wandb — Logs training metrics, loss curves, and hyperparameters for experiment tracking
- numpy — Handles binary data serialization and memory-mapped file access for efficient dataset loading
Key Components
- GPT (transformer, model.py) — Implements the decoder-only transformer architecture with causal self-attention, computing logits for next-token prediction
- get_batch (loader, train.py) — Samples random sequences from tokenized dataset files, creating input-target pairs for training
- estimate_loss (validator, train.py) — Evaluates model performance on train/validation splits by computing average loss over multiple batches
- CausalSelfAttention (processor, model.py) — Computes self-attention with causal masking to prevent information leakage from future tokens
- configurator (resolver, configurator.py) — Dynamically loads configuration files and command-line overrides into the global namespace
- AdamW optimizer (optimizer, train.py) — Updates model parameters using gradient descent with weight decay and learning rate scheduling
- tokenizer (encoder, data/*/prepare.py) — Converts raw text to token IDs using either tiktoken GPT-2 BPE encoding or character-level mapping
- checkpoint_manager (store, train.py) — Saves and loads training state including model weights, optimizer state, and training progress
- sample_generator (decoder, sample.py) — Generates text by iteratively sampling from model logits using temperature scaling and top-k filtering
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Analyze on CodeSea · Compare nanoGPT
Related ML Training Repositories
Frequently Asked Questions
What is nanoGPT used for?
nanoGPT trains GPT transformer models from scratch or finetunes pretrained ones. karpathy/nanoGPT is a 9-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 15 analyzed files.
How is nanoGPT architected?
nanoGPT is organized into 5 architecture layers: Training orchestration, Model architecture, Data pipeline, Configuration system, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through nanoGPT?
Data moves through 6 stages: Preprocess text data into tokens → Sample training batches → Forward pass through transformer → Compute cross-entropy loss → Backward pass and optimization → Evaluate and checkpoint. Raw text is tokenized into binary files, random sequences are sampled from them into input-target pairs, the GPT model produces logits, cross-entropy loss is computed against the targets, and AdamW with learning rate scheduling and gradient accumulation updates the parameters.
What technologies does nanoGPT use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, neural network layers, and distributed training support), tiktoken (Fast BPE tokenization compatible with GPT-2 for converting text to token sequences), transformers (Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning), datasets (Downloads and processes large text corpora like OpenWebText), wandb (Logs training metrics, loss curves, and hyperparameters for experiment tracking), numpy (Handles binary data serialization and memory-mapped file access for efficient dataset loading). A focused set of dependencies that keeps the build manageable.
What system dynamics does nanoGPT have?
nanoGPT exhibits 3 data pools (tokenized_datasets, model_checkpoints), 3 feedback loops, 5 control points, 3 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does nanoGPT use?
4 design patterns detected: Configuration by execution, Memory-mapped data loading, Gradient accumulation, Mixed precision training.
How does nanoGPT compare to alternatives?
CodeSea has side-by-side architecture comparisons of nanoGPT with mingpt and litgpt. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages above for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.