karpathy/nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Trains GPT transformer models from scratch or finetunes pretrained ones.
Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML training system spanning 15 analyzed files. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Raw text datasets are preprocessed into tokenized binary files using either GPT-2 BPE encoding or character mappings. During training, random sequences are sampled from these files to create input-target pairs. The GPT model processes input tokens through transformer layers to produce logits, which are used to compute cross-entropy loss against targets. Gradients flow backward to update parameters via AdamW optimization with learning rate scheduling and gradient accumulation.
- Preprocess text data into tokens — Data preparation scripts download raw text, tokenize it using tiktoken BPE or character mapping, and save token IDs as binary files for efficient loading (config: dataset)
- Sample training batches — get_batch() randomly selects starting positions in the tokenized data and extracts sequences of block_size tokens, creating input-target pairs shifted by one position [TokenizedDataset → TokenBatch] (config: batch_size, block_size)
- Forward pass through transformer — GPT model embeds input tokens, processes them through n_layer transformer blocks with causal self-attention and feedforward layers, then projects to vocabulary logits [TokenBatch → LogitTensor] (config: n_layer, n_head, n_embd, and 2 more)
- Compute cross-entropy loss — PyTorch cross_entropy function compares model logits against target tokens, computing the negative log-likelihood averaged across batch and sequence dimensions [LogitTensor]
- Backward pass and optimization — loss.backward() computes gradients, which accumulate across gradient_accumulation_steps micro-batches before the AdamW optimizer updates parameters with learning rate scheduling and weight decay (config: learning_rate, weight_decay, gradient_accumulation_steps, and 2 more); see the sketch after this list
- Evaluate and checkpoint — estimate_loss() periodically evaluates on train/val splits, and checkpoint_manager saves model state when validation loss improves or at regular intervals (config: eval_interval, always_save_checkpoint)
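To make the loop concrete, here is a minimal sketch of one training iteration in the style described above. It assumes the model's forward returns (logits, loss) when targets are given; get_lr and grad_clip are illustrative names, not necessarily those used in train.py.

```python
import torch

# One training iteration, sketched: sample a batch, accumulate gradients over
# several micro-batches, then take a single AdamW step at the scheduled LR.
def train_step(model, optimizer, get_batch, get_lr, iter_num,
               gradient_accumulation_steps, grad_clip=1.0):
    # set the learning rate for this iteration (cosine schedule, sketched later)
    lr = get_lr(iter_num)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    for micro_step in range(gradient_accumulation_steps):
        x, y = get_batch('train')                  # (B, T) token ids and shifted targets
        logits, loss = model(x, y)                 # forward pass returns logits and CE loss
        loss = loss / gradient_accumulation_steps  # scale so accumulated grads average out
        loss.backward()                            # gradients accumulate across micro-steps

    if grad_clip > 0.0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()                               # single AdamW update
    optimizer.zero_grad(set_to_none=True)
    return loss.item() * gradient_accumulation_steps  # unscaled last micro-batch loss, for logging
```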
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- TokenBatch (train.py) — tuple of (x: Tensor[B, T], y: Tensor[B, T]) where x is input token ids and y is target token ids shifted by one position. Created by get_batch() from memory-mapped binary files, consumed by the model forward pass to compute loss.
- GPTConfig (model.py) — dataclass with block_size: int, vocab_size: int, n_layer: int, n_head: int, n_embd: int, dropout: float, bias: bool. Configured via config files or command line, used to instantiate the GPT model, and stored in checkpoints for reproducibility (see the sketch at the end of this section).
- Checkpoint (train.py) — dict with keys: model (state_dict), optimizer (state_dict), model_args (GPTConfig), iter_num (int), best_val_loss (float), config (dict). Saved periodically during training to enable resumption, loaded at startup to continue from previous state.
- TokenizedDataset (data/*/prepare.py) — numpy memmap of uint16 token ids stored in train.bin and val.bin files. Created once by preprocessing scripts that tokenize raw text, then memory-mapped during training for efficient random access.
- LogitTensor (model.py) — Tensor[B, T, vocab_size] of unnormalized next-token scores (logits) for each position. Generated by the model forward pass, used to compute cross-entropy loss against target tokens or converted to probabilities for sampling.
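The two most load-bearing contracts are the model config and the checkpoint dict. A minimal sketch of both follows the field lists above; the default values and the save_checkpoint helper name are illustrative assumptions, not taken from the repository.

```python
from dataclasses import dataclass
import torch

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length
    vocab_size: int = 50304  # GPT-2 BPE vocab size, padded (assumed default)
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True

def save_checkpoint(path, model, optimizer, model_args, iter_num, best_val_loss, config):
    # the checkpoint dict described above: weights, optimizer state, and enough
    # metadata (model_args, iter_num, best_val_loss, config) to resume training
    checkpoint = {
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'model_args': model_args,
        'iter_num': iter_num,
        'best_val_loss': best_val_loss,
        'config': config,
    }
    torch.save(checkpoint, path)
```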
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Memory-mapped data files contain enough tokens that random sampling with ix = torch.randint(len(data) - block_size, (batch_size,)) never goes out of bounds, specifically that len(data) > block_size
If this fails: If a dataset has fewer tokens than block_size (1024 by default), torch.randint receives a non-positive upper bound, causing crashes or invalid indices that corrupt training batches
train.py:get_batch
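A sketch of the sampling logic with the guard it is missing. The function signature is simplified for illustration; only the randint expression is taken from the description above.

```python
import numpy as np
import torch

def get_batch(data_path, batch_size, block_size):
    # memory-map the preprocessed uint16 token file (train.bin / val.bin)
    data = np.memmap(data_path, dtype=np.uint16, mode='r')

    # the hidden assumption: this must be strictly positive, i.e. len(data) > block_size
    high = len(data) - block_size
    assert high > 0, f"dataset has {len(data)} tokens, need more than block_size={block_size}"

    ix = torch.randint(high, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```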
An assertion checks that config.n_embd is divisible by config.n_head, but nothing validates that both are positive integers
If this fails: If n_embd or n_head is zero, negative, or non-integer (e.g. from a corrupted config file), the assertion can still pass while the linear layers receive invalid dimensions, leading to silent NaN gradients
model.py:CausalSelfAttention.__init__
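A stricter check is a few lines. The assertion below mirrors the one described, plus the positivity validation said to be missing; validate_attention_config is a hypothetical helper, not code from model.py.

```python
def validate_attention_config(n_embd: int, n_head: int) -> None:
    # the existing guard: head dimension must divide evenly
    assert n_embd % n_head == 0, f"n_embd={n_embd} not divisible by n_head={n_head}"
    # the missing guards: both values must be positive integers
    assert isinstance(n_embd, int) and n_embd > 0, f"invalid n_embd={n_embd!r}"
    assert isinstance(n_head, int) and n_head > 0, f"invalid n_head={n_head!r}"
```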
Config files and command-line arguments contain only trusted Python code, since configurator.py passes them to exec() without sandboxing
If this fails: Malicious config files can execute arbitrary code with full Python access (deleting files, making network calls, exfiltrating data), creating a code execution vulnerability
configurator.py:exec
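The override mechanism boils down to evaluating user-supplied text as Python, which is why it cannot be sandboxed after the fact. A minimal sketch of the pattern (configurator.py's exact details may differ):

```python
import sys
from ast import literal_eval

# config files are executed wholesale: whatever Python they contain runs here
for arg in sys.argv[1:]:
    if not arg.startswith('--'):
        exec(open(arg).read())          # e.g. a file under config/
    else:
        # --key=value overrides an existing global of the same name
        key, val = arg[2:].split('=', 1)
        try:
            val = literal_eval(val)     # parse numbers/bools/strings safely
        except (SyntaxError, ValueError):
            pass                        # keep as a plain string
        assert key in globals(), f"unknown config key: {key}"
        globals()[key] = val
```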
GPU memory can hold activation tensors for batch_size * block_size tokens (12 * 1024 = 12,288 by default) plus model parameters, without any check of available VRAM
If this fails: Large models or small GPUs cause CUDA out-of-memory errors that crash training without graceful degradation or helpful error messages about required vs available memory
bench.py:get_batch
Checkpoint files contain a 'model_args' key with GPTConfig-compatible parameters and that model architecture hasn't changed between save and load
If this fails: Loading checkpoints from different model versions or corrupted files fails silently or loads incompatible weights into wrong layers, leading to degraded performance that's hard to diagnose
train.py:checkpoint loading
Input sequence length never exceeds the block_size limit set at model initialization; the code relies on data preprocessing and batch sampling to ensure this
If this fails: Sequences longer than block_size cause attention mask mismatches and index errors in the causal mask, breaking the transformer's autoregressive property
model.py:forward
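A hedged sketch of the guard a forward pass can apply before position embeddings and the causal mask are indexed; check_sequence_length is a hypothetical helper, and whether model.py already enforces something equivalent is not verified here.

```python
import torch

def check_sequence_length(idx: torch.Tensor, block_size: int) -> None:
    # idx has shape (B, T); T must not exceed the block_size the model was built with,
    # otherwise position embeddings and the (block_size x block_size) causal mask
    # would be indexed out of range
    b, t = idx.size()
    assert t <= block_size, (
        f"cannot forward sequence of length {t}, block size is only {block_size}"
    )
```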
The model is put into eval() mode during validation and switched back to train() mode afterwards, but this mode handling isn't enforced if estimate_loss() is called outside the normal training loop
If this fails: Evaluating in training mode leaves dropout active on validation data, giving noisy and inaccurate loss estimates that mislead checkpoint selection
train.py:estimate_loss
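A sketch of the evaluation helper and its mode toggling, following the description above. It assumes the model's forward returns (logits, loss) and that get_batch and eval_iters exist as named.

```python
import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters):
    # average the loss over eval_iters batches for each split, with the model in
    # eval mode (disables dropout), then restore training mode before returning
    out = {}
    model.eval()
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits, loss = model(x, y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```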
The tiktoken tokenizer used for sampling exactly matches the tokenizer used during training, but this isn't verified
If this fails: Tokenizer version mismatches cause vocabulary differences where tokens get encoded/decoded incorrectly, producing garbled text output that's hard to trace back to the encoding mismatch
sample.py:tiktoken encoding
PyTorch 2.0's torch.compile() works correctly with the specific model architecture and CUDA version; compile defaults to True
If this fails: Compilation failures on unsupported hardware or PyTorch versions cause cryptic errors or silent performance degradation, making it unclear whether to disable compilation
train.py:torch.compile
The temperature parameter is assumed positive (0.8 by default), but negative values aren't prevented, and they would invert the probability distribution
If this fails: A negative temperature flips the sign of the logits, making the model prefer its lowest-probability tokens and generate nonsensical text that still looks like a successful sampling run
sample.py:temperature scaling
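To see why the sign matters, here is a minimal sketch of temperature scaling and top-k filtering for one sampling step; sample_next_token is an illustrative name, not sample.py's verbatim code.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=None):
    # logits: (B, vocab_size) scores for the next token
    assert temperature > 0, "temperature must be positive; negative values invert preferences"
    logits = logits / temperature            # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = -float('inf')  # mask everything outside the top k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (B, 1) sampled token ids
```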
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- tokenized_datasets — Memory-mapped binary files containing preprocessed token sequences for efficient random access during training (see the sketch after this list)
- model_checkpoints — Saved training state including model weights, optimizer state, configuration, and training progress for resumption
- accumulated gradients — Gradients accumulated across multiple micro-batches before the optimizer step is applied, simulating larger batch sizes
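A sketch of how the tokenized_datasets pool is produced and consumed: tiktoken GPT-2 BPE on the write side, np.memmap on the read side. The prepare and load helpers are illustrative; the repository's per-dataset prepare.py scripts differ in detail.

```python
import numpy as np
import tiktoken

def prepare(text: str, out_path: str) -> None:
    # encode raw text with the GPT-2 BPE and store token ids as uint16,
    # which is enough for the ~50k-entry GPT-2 vocabulary
    enc = tiktoken.get_encoding("gpt2")
    ids = np.array(enc.encode_ordinary(text), dtype=np.uint16)
    ids.tofile(out_path)                      # e.g. train.bin or val.bin

def load(bin_path: str) -> np.memmap:
    # training never reads the whole file into RAM; it memory-maps it and
    # slices random windows out of it (see the get_batch sketch above)
    return np.memmap(bin_path, dtype=np.uint16, mode='r')
```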
Feedback Loops
- training_loop (training-loop, reinforcing) — Trigger: iter_num < max_iters. Action: Sample batch, forward pass, compute loss, backward pass, accumulate gradients, optimizer step. Exit: max_iters reached or manual termination.
- learning_rate_decay (convergence, balancing) — Trigger: iter_num progress. Action: Cosine annealing reduces the learning rate from its initial value to min_lr over lr_decay_iters (see the sketch after this list). Exit: min_lr reached.
- gradient_accumulation (gradient-accumulation, reinforcing) — Trigger: micro_step < gradient_accumulation_steps. Action: Accumulate gradients without optimizer step. Exit: accumulation steps reached, then optimizer.step().
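A sketch of the cosine schedule the learning_rate_decay loop describes. The default values shown (and the linear warmup) are illustrative, not taken verbatim from train.py.

```python
import math

def get_lr(it, learning_rate=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=600000):
    # 1) linear warmup from 0 to learning_rate over warmup_iters
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after lr_decay_iters, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, cosine-anneal from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```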
Delays
- checkpoint_saving (checkpoint-save, ~seconds) — Training pauses while model state is serialized to disk
- validation_evaluation (async-processing, ~eval_iters * forward_time) — Training loop pauses to compute validation loss over eval_iters batches
- model_compilation (compilation, ~first forward pass) — PyTorch 2.0 torch.compile() optimizes model on first use, causing initial slowdown
Control Points
- learning_rate (hyperparameter) — Controls: Optimizer step size affecting convergence speed and stability. Default: varies by config (1e-3 to 6e-4)
- gradient_accumulation_steps (architecture-switch) — Controls: Effective batch size by accumulating gradients across multiple micro-batches. Default: varies (1 to 40)
- compile (runtime-toggle) — Controls: Whether to use PyTorch 2.0 compilation for performance optimization. Default: True
- dtype (precision-mode) — Controls: Numeric precision (float32, bfloat16, float16), affecting memory usage and performance (see the sketch after this list). Default: bfloat16 if supported
- init_from (architecture-switch) — Controls: Whether to train from scratch or finetune from pretrained checkpoints (gpt2, gpt2-medium, etc.). Default: scratch or gpt2 variants
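These toggles typically interact with a small amount of setup code. A hedged sketch of how dtype selection and the compile switch can be wired in a PyTorch 2.x training script (not a verbatim excerpt from train.py):

```python
from contextlib import nullcontext
import torch
import torch.nn as nn

# runtime toggles (defaults follow the control-point descriptions above)
compile_model = True
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if device_type == 'cuda' and torch.cuda.is_bf16_supported() else 'float16'

# map the string setting to a torch dtype; on CPU just run in default precision
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

model = nn.Linear(8, 8)            # stand-in for the GPT model
if compile_model:
    model = torch.compile(model)   # PyTorch 2.0+; disable if compilation misbehaves

with ctx:                          # forward passes run under mixed precision on GPU
    out = model(torch.randn(4, 8))
```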
Technology Stack
- PyTorch — Provides tensor operations, automatic differentiation, neural network layers, and distributed training support
- tiktoken — Fast BPE tokenization compatible with GPT-2 for converting text to token sequences
- transformers — Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning
- datasets — Downloads and processes large text corpora like OpenWebText
- wandb — Logs training metrics, loss curves, and hyperparameters for experiment tracking
- numpy — Handles binary data serialization and memory-mapped file access for efficient dataset loading
Key Components
- GPT (transformer, model.py) — Implements the decoder-only transformer architecture with causal self-attention, computing logits for next-token prediction
- get_batch (loader, train.py) — Samples random sequences from tokenized dataset files, creating input-target pairs for training
- estimate_loss (validator, train.py) — Evaluates model performance on train/validation splits by computing average loss over multiple batches
- CausalSelfAttention (processor, model.py) — Computes self-attention with causal masking to prevent information leakage from future tokens
- configurator (resolver, configurator.py) — Dynamically loads configuration files and command-line overrides into the global namespace
- AdamW optimizer (optimizer, train.py) — Updates model parameters using gradient descent with weight decay and learning rate scheduling
- tokenizer (encoder, data/*/prepare.py) — Converts raw text to token IDs using either tiktoken GPT-2 BPE encoding or character-level mapping
- checkpoint_manager (store, train.py) — Saves and loads training state including model weights, optimizer state, and training progress
- sample_generator (decoder, sample.py) — Generates text by iteratively sampling from model logits using temperature scaling and top-k filtering
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Analyze on CodeSea · Compare nanoGPT
Related ML Training Repositories
Frequently Asked Questions
What is nanoGPT used for?
nanoGPT trains GPT transformer models from scratch or finetunes pretrained ones. karpathy/nanoGPT is a 9-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 15 analyzed files.
How is nanoGPT architected?
nanoGPT is organized into 5 architecture layers: Training orchestration, Model architecture, Data pipeline, Configuration system, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through nanoGPT?
Data moves through 6 stages: Preprocess text data into tokens → Sample training batches → Forward pass through transformer → Compute cross-entropy loss → Backward pass and optimization → Evaluate and checkpoint. Raw text is tokenized into binary files, random sequences are sampled from them into input-target pairs, the GPT model produces logits, cross-entropy loss is computed against the targets, and AdamW with learning rate scheduling and gradient accumulation updates the parameters.
What technologies does nanoGPT use?
The core stack includes PyTorch (Provides tensor operations, automatic differentiation, neural network layers, and distributed training support), tiktoken (Fast BPE tokenization compatible with GPT-2 for converting text to token sequences), transformers (Loads pretrained GPT-2 checkpoints from Hugging Face for finetuning), datasets (Downloads and processes large text corpora like OpenWebText), wandb (Logs training metrics, loss curves, and hyperparameters for experiment tracking), numpy (Handles binary data serialization and memory-mapped file access for efficient dataset loading). A focused set of dependencies that keeps the build manageable.
What system dynamics does nanoGPT have?
nanoGPT exhibits 3 data pools (tokenized_datasets, model_checkpoints), 3 feedback loops, 5 control points, 3 delays. The feedback loops handle training-loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does nanoGPT use?
4 design patterns detected: Configuration by execution, Memory-mapped data loading, Gradient accumulation, Mixed precision training.
How does nanoGPT compare to alternatives?
CodeSea has side-by-side architecture comparisons of nanoGPT with mingpt and litgpt. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages above for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.