karpathy/nanogpt

The simplest, fastest repository for training/finetuning medium-sized GPTs.

56,903 stars · Python · 9 components

Trains GPT transformer models from scratch or finetunes pretrained ones

Raw text datasets are preprocessed into tokenized binary files using either GPT-2 BPE encoding or character mappings. During training, random sequences are sampled from these files to create input-target pairs. The GPT model processes input tokens through transformer layers to produce logits, which are used to compute cross-entropy loss against targets. Gradients flow backward to update parameters via AdamW optimization with learning rate scheduling and gradient accumulation.

Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.

A 9-component ML training system, analyzed across 15 files. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

  1. Preprocess text data into tokens — Data preparation scripts download raw text, tokenize it using tiktoken BPE or character mapping, and save token IDs as binary files for efficient loading (config: dataset)
  2. Sample training batches — get_batch() randomly selects starting positions in the tokenized data and extracts sequences of block_size tokens, creating input-target pairs shifted by one position [TokenizedDataset → TokenBatch] (config: batch_size, block_size)
  3. Forward pass through transformer — GPT model embeds input tokens, processes them through n_layer transformer blocks with causal self-attention and feedforward layers, then projects to vocabulary logits [TokenBatch → LogitTensor] (config: n_layer, n_head, n_embd, +2 more)
  4. Compute cross-entropy loss — PyTorch cross_entropy function compares model logits against target tokens, computing the negative log-likelihood averaged across batch and sequence dimensions [LogitTensor]
  5. Backward pass and optimization — Loss.backward() computes gradients, which accumulate across gradient_accumulation_steps before AdamW optimizer updates parameters with learning rate scheduling and weight decay (config: learning_rate, weight_decay, gradient_accumulation_steps, +2 more)
  6. Evaluate and checkpoint — estimate_loss() periodically evaluates on train/val splits, and checkpoint_manager saves model state when validation loss improves or at regular intervals (config: eval_interval, always_save_checkpoint)
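The sampling step (stage 2) can be sketched in plain Python, with lists standing in for the memory-mapped token array and torch tensors. The function name mirrors train.py's get_batch(), but this is an illustrative sketch, not the repository's code.

```python
import random

def get_batch(data, batch_size, block_size, rng=random.Random(0)):
    """Sample input/target pairs shifted by one position (sketch)."""
    assert len(data) > block_size, "dataset shorter than block_size"
    # pick random starting positions, leaving room for a full block
    ix = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in ix]          # inputs
    y = [data[i + 1 : i + 1 + block_size] for i in ix]  # targets, shifted by 1
    return x, y

tokens = list(range(100))  # stand-in for the uint16 memmap
x, y = get_batch(tokens, batch_size=4, block_size=8)
```

Because the stand-in data is just 0..99, each target row is visibly the input row shifted forward by one token, which is exactly the next-token-prediction contract the training loop relies on.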

Data Models

The data structures that flow between stages — the contracts that hold the system together.

TokenBatch train.py
tuple of (x: Tensor[B, T], y: Tensor[B, T]) where x is input token ids and y is target token ids shifted by one position
Created by get_batch() from memory-mapped binary files, consumed by model forward pass to compute loss
GPTConfig model.py
dataclass with block_size: int, vocab_size: int, n_layer: int, n_head: int, n_embd: int, dropout: float, bias: bool
Configured via config files or command line, used to instantiate GPT model, stored in checkpoints for reproducibility
ModelCheckpoint train.py
dict with keys: model (state_dict), optimizer (state_dict), model_args (GPTConfig), iter_num (int), best_val_loss (float), config (dict)
Saved periodically during training to enable resumption, loaded at startup to continue from previous state
TokenizedDataset data/*/prepare.py
numpy memmap of uint16 token ids stored in train.bin and val.bin files
Created once by preprocessing scripts that tokenize raw text, then memory-mapped during training for efficient random access
LogitTensor model.py
Tensor[B, T, vocab_size] representing unnormalized token probabilities for each position
Generated by model forward pass, used to compute cross-entropy loss against target tokens or converted to probabilities for sampling
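The GPTConfig contract above can be sketched with stdlib dataclasses. The field names come from the table; the default values shown are illustrative GPT-2-sized numbers, and asdict() stands in for how model_args might be serialized into a checkpoint.

```python
from dataclasses import dataclass, asdict

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

cfg = GPTConfig(n_layer=6, n_head=6, n_embd=384)  # a smaller model
model_args = asdict(cfg)  # plain dict, checkpoint-friendly
```

Storing the config as a plain dict inside the checkpoint is what makes resumption reproducible: the saved architecture parameters travel with the weights.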

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Shape unguarded

Memory-mapped data files contain enough tokens that random sampling with ix = torch.randint(len(data) - block_size, (batch_size,)) never goes out of bounds, specifically that len(data) > block_size

If this fails: If a dataset has fewer tokens than block_size (1024 by default), torch.randint is called with a non-positive upper bound, crashing at sampling time or producing invalid indices that corrupt training batches

train.py:get_batch
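A guard for this assumption could look like the following sketch; the helper name is ours, not the repo's.

```python
def check_dataset_size(num_tokens, block_size):
    """Fail loudly before sampling if the dataset cannot fill one block."""
    if num_tokens <= block_size:
        raise ValueError(
            f"dataset has {num_tokens} tokens but block_size is "
            f"{block_size}; need at least block_size + 1 tokens"
        )
    return num_tokens - block_size  # the valid upper bound for randint

bound = check_dataset_size(1_000_000, 1024)
```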
critical Domain weakly guarded

An assertion checks that config.n_embd is divisible by config.n_head, but nothing validates that both are positive integers

If this fails: If n_embd or n_head are zero, negative, or non-integer (from config file corruption), the assertion passes but linear layers get invalid dimensions causing silent NaN gradients

model.py:CausalSelfAttention.__init__
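A stricter check than the divisibility assert might look like this sketch; the helper name is ours.

```python
def validate_attention_config(n_embd, n_head):
    """Reject non-positive or non-integer dimensions before layer construction."""
    for name, v in (("n_embd", n_embd), ("n_head", n_head)):
        if not isinstance(v, int) or v <= 0:
            raise ValueError(f"{name} must be a positive int, got {v!r}")
    if n_embd % n_head != 0:
        raise ValueError(f"n_embd={n_embd} not divisible by n_head={n_head}")
    return n_embd // n_head  # per-head dimension

head_dim = validate_attention_config(768, 12)
```

Note that n_embd=0 passes a bare divisibility assert (0 % 12 == 0), which is exactly the silent case the positivity check above catches.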
critical Environment unguarded

Config files and command-line arguments contain only safe Python code since configurator.py uses exec() without sandboxing

If this fails: Malicious config files can execute arbitrary code with full Python access - rm -rf /, network calls, data exfiltration - creating a remote code execution vulnerability

configurator.py:exec
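A safer alternative to exec()-based configuration is to treat overrides as data, never code. The sketch below parses --key=value overrides with ast.literal_eval; the override syntax mirrors configurator.py's, but the function is our illustration, not the repo's code.

```python
import ast

def apply_overrides(config, argv):
    """Apply --key=value overrides; values are parsed as Python literals only."""
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if key not in config:
            raise KeyError(f"unknown config key {key!r}")
        try:
            config[key] = ast.literal_eval(raw)  # literals only, no code runs
        except (ValueError, SyntaxError):
            config[key] = raw                    # fall back to a raw string
    return config

cfg = apply_overrides({"batch_size": 12, "dataset": "openwebtext"},
                      ["--batch_size=4", "--dataset=shakespeare"])
```

Unlike exec(), literal_eval can only produce values (numbers, strings, tuples, and so on), so a hostile config string cannot execute anything.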
critical Resource unguarded

GPU memory is assumed to hold batch_size × block_size tokens of activations (12 × 1024 = 12,288 tokens per batch by default) plus model parameters, with no check of available VRAM

If this fails: Large models or small GPUs cause CUDA out-of-memory errors that crash training without graceful degradation or helpful error messages about required vs available memory

bench.py:get_batch
critical Contract unguarded

Checkpoint files contain a 'model_args' key with GPTConfig-compatible parameters and that model architecture hasn't changed between save and load

If this fails: Loading checkpoints from different model versions or corrupted files fails silently or loads incompatible weights into wrong layers, leading to degraded performance that's hard to diagnose

train.py:checkpoint loading
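Validating the ModelCheckpoint contract before loading could look like this sketch; the key list comes from the Data Models section above, and the helper name is ours.

```python
REQUIRED_KEYS = {"model", "optimizer", "model_args",
                 "iter_num", "best_val_loss", "config"}

def validate_checkpoint(ckpt, current_model_args):
    """Reject checkpoints with missing keys or mismatched architecture."""
    missing = REQUIRED_KEYS - ckpt.keys()
    if missing:
        raise KeyError(f"checkpoint missing keys: {sorted(missing)}")
    # architecture must match between save time and load time
    for k in ("n_layer", "n_head", "n_embd", "block_size", "vocab_size"):
        saved, now = ckpt["model_args"].get(k), current_model_args.get(k)
        if saved != now:
            raise ValueError(f"{k} mismatch: checkpoint={saved}, current={now}")
    return True

args = {"n_layer": 6, "n_head": 6, "n_embd": 384,
        "block_size": 256, "vocab_size": 65}
ckpt = {"model": {}, "optimizer": {}, "model_args": dict(args),
        "iter_num": 0, "best_val_loss": 1e9, "config": {}}
ok = validate_checkpoint(ckpt, args)
```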
warning Scale unguarded

Input sequence length never exceeds the block_size limit set during model initialization, relying on data preprocessing to ensure this

If this fails: Sequences longer than block_size cause attention mask mismatches and index errors in the causal mask, breaking the transformer's autoregressive property

model.py:forward
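The sequence-length guard and the block_size-dependent causal mask can be sketched in pure Python; names are ours.

```python
def causal_mask(t, block_size):
    """Guard the sequence length, then build a lower-triangular mask."""
    if t > block_size:
        raise ValueError(
            f"cannot forward sequence of length {t}, block size is {block_size}"
        )
    # mask[i][j] is True where position i may attend to position j (j <= i)
    return [[j <= i for j in range(t)] for i in range(t)]

mask = causal_mask(4, block_size=8)
```

The lower-triangular shape is what preserves the autoregressive property: position i never attends to any later position.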
warning Ordering unguarded

Model is in eval() mode during validation and gets switched back to train() mode, but this isn't enforced if estimate_loss() is called directly

If this fails: Running validation in training mode applies dropout to the evaluation data, giving noisy loss estimates that can mislead checkpoint selection

train.py:estimate_loss
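One way to enforce the train/eval round-trip is a context manager; the sketch below uses a stand-in class with torch-like train()/eval() switching, since the pattern itself does not depend on PyTorch.

```python
from contextlib import contextmanager

class DummyModel:
    """Stand-in with torch-like mode switching."""
    def __init__(self):
        self.training = True
    def train(self):
        self.training = True
    def eval(self):
        self.training = False

@contextmanager
def eval_mode(model):
    model.eval()
    try:
        yield model
    finally:
        model.train()  # always restored, even if evaluation raises

m = DummyModel()
with eval_mode(m):
    in_eval = m.training  # False inside the block
after = m.training        # True again afterwards
```

The try/finally guarantees the mode flips back even when an exception escapes the evaluation code, which a manual eval()/train() pair does not.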
warning Temporal unguarded

The tiktoken tokenizer used for sampling exactly matches the tokenizer used during training, but this isn't verified

If this fails: Tokenizer version mismatches cause vocabulary differences where tokens get encoded/decoded incorrectly, producing garbled text output that's hard to trace back to the encoding mismatch

sample.py:tiktoken encoding
warning Environment unguarded

PyTorch 2.0's torch.compile() is assumed to work correctly with the specific model architecture and CUDA version; compile=True is the default

If this fails: Compilation failures on unsupported hardware or PyTorch versions cause cryptic errors or silent performance degradation, making it unclear whether to disable compilation

train.py:torch.compile
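A guarded-compile pattern would try the optimizer and fall back to the original callable on failure. In this sketch, compile_fn stands in for torch.compile and the wrapper name is ours.

```python
def maybe_compile(model_fn, compile_fn, enabled=True):
    """Return (callable, was_compiled); never let compilation kill training."""
    if not enabled:
        return model_fn, False
    try:
        return compile_fn(model_fn), True
    except Exception:
        return model_fn, False  # degrade gracefully to the eager callable

def failing_compiler(fn):
    raise RuntimeError("unsupported backend")

fn, compiled = maybe_compile(lambda x: x * 2, failing_compiler)
result = fn(21)
```

Returning the was_compiled flag also answers the diagnostic question the assumption raises: it makes silent fallback visible so the user knows whether compilation actually happened.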
warning Domain unguarded

Temperature parameter is positive (0.8 default) but negative values aren't prevented, which would invert probability distributions

If this fails: Negative temperature causes the model to strongly prefer low-probability tokens, generating nonsensical text that looks like successful sampling but with inverted semantics

sample.py:temperature scaling
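The inversion effect is easy to see with a stdlib softmax over temperature-scaled logits; function names are ours, not sample.py's.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, with the usual max-subtraction."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
cool = softmax(logits, temperature=0.5)  # sharper: favors the top logit
hot = softmax(logits, temperature=2.0)   # flatter distribution
bad = softmax(logits, temperature=-1.0)  # inverted: lowest logit wins
```

Dividing by a negative temperature flips the sign of every logit, so the lowest-scoring token becomes the most probable, which is the inverted-semantics failure described above.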

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

tokenized_datasets (file-store)
Memory-mapped binary files containing preprocessed token sequences for efficient random access during training
model_checkpoints (checkpoint)
Saved training state including model weights, optimizer state, configuration, and training progress for resumption
gradient_accumulator (buffer)
Accumulates gradients across multiple micro-batches before applying optimizer step to simulate larger batch sizes
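The gradient_accumulator pool works like this sketch: micro-batch gradients are summed into a buffer and applied once per accumulation window. Plain floats stand in for gradient tensors, and the names are ours.

```python
def train_steps(micro_grads, accum_steps, lr=0.1, param=0.0):
    """Accumulate micro-batch gradients, step the parameter once per window."""
    buffer, history = 0.0, []
    for step, g in enumerate(micro_grads, start=1):
        buffer += g / accum_steps        # scale so the sum is a mean
        if step % accum_steps == 0:
            param -= lr * buffer         # one optimizer step per window
            history.append(param)
            buffer = 0.0
    return history

# 4 micro-batches with accum_steps=2 produce exactly 2 optimizer steps
history = train_steps([1.0, 1.0, 3.0, 3.0], accum_steps=2)
```

Scaling each micro-gradient by 1/accum_steps makes the accumulated update equivalent to averaging over one large batch, which is how a small GPU simulates a larger effective batch size.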

Technology Stack

PyTorch (framework)
Provides tensor operations, automatic differentiation, neural network layers, and distributed training support
tiktoken (library)
Fast BPE tokenization compatible with GPT-2 for converting text to token sequences
transformers (library)
Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning
datasets (library)
Downloads and processes large text corpora like OpenWebText
wandb (infra)
Logs training metrics, loss curves, and hyperparameters for experiment tracking
numpy (library)
Handles binary data serialization and memory-mapped file access for efficient dataset loading


Frequently Asked Questions

What is nanoGPT used for?

nanoGPT trains GPT transformer models from scratch or finetunes pretrained ones. karpathy/nanogpt is a 9-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 15 files.

How is nanoGPT architected?

nanoGPT is organized into 5 architecture layers: Training orchestration, Model architecture, Data pipeline, Configuration system, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through nanoGPT?

Data moves through 6 stages: Preprocess text data into tokens → Sample training batches → Forward pass through transformer → Compute cross-entropy loss → Backward pass and optimization → Evaluate and checkpoint. Raw text is tokenized into binary files, random sequences are sampled from them to create input-target pairs, the model produces logits for cross-entropy loss against targets, and AdamW updates parameters with learning rate scheduling and gradient accumulation. This pipeline design reflects a multi-stage processing system.

What technologies does nanoGPT use?

The core stack includes PyTorch (Provides tensor operations, automatic differentiation, neural network layers, and distributed training support), tiktoken (Fast BPE tokenization compatible with GPT-2 for converting text to token sequences), transformers (Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning), datasets (Downloads and processes large text corpora like OpenWebText), wandb (Logs training metrics, loss curves, and hyperparameters for experiment tracking), numpy (Handles binary data serialization and memory-mapped file access for efficient dataset loading). A focused set of dependencies that keeps the build manageable.

What system dynamics does nanoGPT have?

nanoGPT exhibits 3 data pools (tokenized_datasets, model_checkpoints, gradient_accumulator), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does nanoGPT use?

4 design patterns detected: Configuration by execution, Memory-mapped data loading, Gradient accumulation, Mixed precision training.

How does nanoGPT compare to alternatives?

CodeSea has side-by-side architecture comparisons of nanoGPT with mingpt and litgpt. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns; see CodeSea's comparison pages for detailed analysis.

Analyzed on April 20, 2026 by CodeSea.