karpathy/mingpt
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training pipeline
Trains minimal GPT transformers from scratch on character or number sequences
Under the hood, the system relies on 2 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 7-component ML training codebase spanning 9 analyzed files. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Text enters through either the BPE tokenizer (for real text) or synthetic datasets (for arithmetic/character tasks), gets converted to integer token sequences, then flows through the GPT transformer which applies multiple layers of causal self-attention and feedforward processing to predict the next token in each sequence. During training, these predictions are compared against target tokens using cross-entropy loss, and gradients flow backward through the transformer to update weights via AdamW optimization.
- Text Tokenization — BPETokenizer.encode() converts raw UTF-8 strings into integer sequences using byte-pair encoding rules, mapping each text chunk to vocabulary indices in the range [0, vocab_size) [Raw text strings → TokenSequence] (config: vocab_size)
- Dataset Batch Loading — DataLoader samples sequences from AdditionDataset or CharDataset, stacks them into batches of shape [batch_size, block_size], and moves tensors to the training device [TokenSequence → TrainingBatch] (config: batch_size, block_size, device)
- Token Embedding — GPT.forward() starts by converting token IDs to dense vectors via nn.Embedding lookup and adds learned positional embeddings for each sequence position [TrainingBatch → Embedded sequences] (config: n_embd, vocab_size)
- Transformer Processing — Each Block applies layer normalization, then CausalSelfAttention computes query-key-value matrices, masks future positions, and aggregates context, followed by a feedforward MLP with GELU activation [Embedded sequences → Contextualized representations] (config: n_layer, n_head, dropout)
- Next Token Prediction — Final linear layer (lm_head) projects the last hidden state to vocabulary size, producing logits that represent unnormalized probabilities for each possible next token [Contextualized representations → LanguageModelLogits] (config: vocab_size)
- Loss Computation — F.cross_entropy compares predicted logits against target tokens (input sequence shifted by one position), computing the negative log-likelihood loss for next-token prediction [LanguageModelLogits → Scalar loss]
- Gradient Update — Trainer.run() calls loss.backward() to compute gradients, applies gradient clipping via torch.nn.utils.clip_grad_norm_, then optimizer.step() updates model parameters using AdamW [Scalar loss → Updated model weights] (config: learning_rate, grad_norm_clip, weight_decay)
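The seven stages above can be condensed into a sketch of a single training step. `TinyLM` below is a stand-in for the real GPT module (the transformer blocks are omitted), and the hyperparameter values are illustrative defaults rather than minGPT's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in for GPT: embedding -> (blocks omitted) -> lm_head."""
    def __init__(self, vocab_size=16, n_embd=8, block_size=12):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)      # token embedding
        self.wpe = nn.Embedding(block_size, n_embd)      # positional embedding
        self.lm_head = nn.Linear(n_embd, vocab_size)     # next-token logits

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        h = self.wte(idx) + self.wpe(pos)                # [b, t, n_embd]
        return self.lm_head(h)                           # [b, t, vocab_size]

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

batch = torch.randint(0, 16, (4, 12))                    # [batch_size, block_size]
x, y = batch[:, :-1], batch[:, 1:]                       # input / target shifted by one

logits = model(x)                                        # next-token prediction
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
loss.backward()                                          # gradients flow backward
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_norm_clip
opt.step()                                               # AdamW update
opt.zero_grad(set_to_none=True)
```

The input/target split (`batch[:, :-1]` vs `batch[:, 1:]`) mirrors the Loss Computation stage: targets are the input sequence shifted by one position.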
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- TokenSequence (mingpt/bpe.py) — torch.LongTensor with shape [sequence_length] containing token IDs in the vocabulary range [0, vocab_size). Created by BPE-encoding raw text into integer tokens, fed through transformer layers, and used to compute next-token prediction loss.
- Model Config (mingpt/model.py) — CfgNode with n_layer: int, n_head: int, n_embd: int, vocab_size: int, block_size: int, dropout: float, bias: bool. Created with default values or loaded from presets like 'gpt2'; passed to the GPT constructor to define architecture dimensions and behavior.
- TrainingBatch (mingpt/trainer.py) — torch.LongTensor with shape [batch_size, sequence_length] containing batched token sequences from the dataset. Loaded by the DataLoader, split into input/target pairs, fed through the model to produce logits, and used for cross-entropy loss calculation and gradient updates.
- Attention Weights (mingpt/model.py) — torch.Tensor with shape [batch_size, n_head, sequence_length, sequence_length] containing attention scores after causal masking and softmax. Computed from query-key dot products, masked to prevent attention to future positions, normalized via softmax, then used to weight value vectors.
- LanguageModelLogits (mingpt/model.py) — torch.Tensor with shape [batch_size, sequence_length, vocab_size] containing unnormalized scores for next-token prediction. Output of the final linear layer after the transformer blocks; compared against target tokens for training loss, or sampled from for text generation.
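As a rough sanity check on the documented shapes, here is a sketch with plain tensors (no minGPT code involved; the mask and softmax below merely mimic the described behavior):

```python
import torch

batch_size, seq_len, n_head, n_embd, vocab_size = 2, 5, 4, 16, 50

tokens = torch.randint(0, vocab_size, (seq_len,))            # TokenSequence
batch = torch.randint(0, vocab_size, (batch_size, seq_len))  # TrainingBatch

# Attention weights: [batch, heads, seq, seq]; each row sums to 1 after softmax.
scores = torch.randn(batch_size, n_head, seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
att = torch.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)

logits = torch.randn(batch_size, seq_len, vocab_size)        # LanguageModelLogits

assert att.shape == (batch_size, n_head, seq_len, seq_len)
assert torch.allclose(att.sum(-1), torch.ones(batch_size, n_head, seq_len))
# Causal masking: position 0 can only attend to itself.
assert float(att[0, 0, 0, 1:].sum()) == 0.0
```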
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Input tensor x has shape (batch_size, sequence_length, n_embd) and sequence_length <= block_size, but only asserts n_embd % n_head == 0 during initialization
If this fails: when sequence_length > block_size, the causal mask indexing self.bias[:,:,:T,:T] fails with an IndexError, since the bias buffer is only allocated for block_size positions; if x has the wrong shape, matrix multiplications silently produce incorrect attention weights
mingpt/model.py:CausalSelfAttention.forward
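A hypothetical guard for this assumption might look like the following; minGPT ships no such `check_sequence_length` helper, so the name and placement are illustrative:

```python
import torch

def check_sequence_length(idx: torch.Tensor, block_size: int) -> None:
    """Hypothetical guard; the real forward pass relies on this implicitly."""
    _, t = idx.shape
    if t > block_size:
        raise ValueError(f"sequence length {t} exceeds block_size {block_size}")

# A sequence exactly at block_size passes; anything longer raises before
# the causal mask buffer would be indexed out of range.
check_sequence_length(torch.zeros(2, 8, dtype=torch.long), block_size=8)
```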
Input text is valid UTF-8 and can be encoded to bytes without errors, with no validation of encoding or handling of malformed unicode
If this fails: Malformed UTF-8 strings cause UnicodeEncodeError crashes during text.encode('utf-8'), breaking the entire tokenization pipeline and stopping training
mingpt/bpe.py:encode
GPU memory is sufficient to hold model parameters, optimizer state, and a full training batch simultaneously, with no memory monitoring or batch size adaptation
If this fails: CUDA out-of-memory errors crash training without recovery when batch_size * block_size * n_embd exceeds available VRAM, losing all training progress since last checkpoint
mingpt/trainer.py:run
DataLoader yields batches where each sample x[:-1] serves as input and x[1:] as target, but never validates this sequence alignment contract
If this fails: If dataset returns pre-shifted sequences or wrong shapes, model trains on corrupted input-target pairs, learning meaningless patterns and producing garbage text during generation
mingpt/trainer.py:run
vocab_size fits in available memory for embedding layers (wte: vocab_size * n_embd + lm_head: n_embd * vocab_size parameters), with no size validation
If this fails: Extremely large vocab_size values (e.g., 1M+ tokens) silently allocate gigabytes for embedding matrices, causing memory exhaustion during model initialization before any helpful error message
mingpt/model.py:GPT.__init__
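A back-of-the-envelope estimate makes the risk concrete. The helper below is illustrative (not part of minGPT) and assumes fp32 parameters and untied embedding/output weights:

```python
def embedding_param_count(vocab_size: int, n_embd: int, tied: bool = False) -> int:
    # wte contributes vocab_size * n_embd parameters; lm_head contributes
    # n_embd * vocab_size more unless the two weight matrices are tied.
    return vocab_size * n_embd * (1 if tied else 2)

# A 1M-token vocabulary at n_embd=768 costs ~6.1 GB in fp32 for embeddings alone.
params = embedding_param_count(vocab_size=1_000_000, n_embd=768)
gigabytes = params * 4 / 1e9   # 4 bytes per fp32 parameter
```

During training, AdamW's two moment buffers roughly triple this footprint on top of the parameters themselves.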
Internet connection exists for downloading encoder.json and vocab.bpe from OpenAI's servers, with no offline fallback or connection validation
If this fails: Network failures or OpenAI URL changes cause requests.get() to hang or fail, breaking tokenizer initialization and preventing any text processing in air-gapped environments
mingpt/bpe.py:get_encoder
Downloaded BPE files (encoder.json, vocab.bpe) remain valid and compatible with the expected format indefinitely, with no version checking or corruption detection
If this fails: If OpenAI updates file formats or files become corrupted during download, json.loads() fails with cryptic parsing errors, or merge operations produce wrong token sequences
mingpt/bpe.py:BPETokenizer.__init__
Generated arithmetic strings always fit within block_size tokens after BPE encoding, but never validates actual encoded length
If this fails: Long addition problems (many digits) get silently truncated by DataLoader collation, causing model to train on incomplete equations and fail on longer problems at test time
projects/adder/adder.py:AdditionDataset.__getitem__
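A hypothetical pre-flight check the dataset could run before yielding a sample (no such validation exists in minGPT; the function name is invented for illustration):

```python
def validate_encoded_length(token_ids: list, block_size: int) -> list:
    """Reject samples whose encoded length would be silently truncated."""
    if len(token_ids) > block_size:
        raise ValueError(
            f"encoded sample is {len(token_ids)} tokens, exceeding block_size {block_size}"
        )
    return token_ids

# A short sample passes through unchanged.
assert validate_encoded_length([7, 7, 1, 4], block_size=6) == [7, 7, 1, 4]
```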
Temperature parameter is assumed to be > 0 for sampling, but temperature=0 is never rejected, which divides by zero when scaling logits/temperature
If this fails: temperature=0 turns the scaled logits into ±inf and the softmax output into NaN, yielding an invalid probability distribution instead of the presumably intended deterministic greedy selection
mingpt/model.py:GPT.generate
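One conventional fix is to treat `temperature == 0` as greedy decoding. The wrapper below is a sketch of that convention, not minGPT's actual `generate` API:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample the next token; fall back to greedy argmax when temperature <= 0."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                      # deterministic greedy pick
    probs = torch.softmax(logits / temperature, dim=-1)   # safe: temperature > 0 here
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

logits = torch.tensor([[0.1, 2.0, -1.0]])
assert sample_next(logits, temperature=0.0).item() == 1   # greedy -> index of max logit
```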
grad_norm_clip value is reasonable for the model size (default 1.0), but never adapts to actual gradient magnitudes or provides feedback on clipping frequency
If this fails: Too aggressive clipping (small clip value with large model) prevents learning by constantly truncating gradients, while too loose clipping allows gradient explosions, both silently degrading training
mingpt/trainer.py:run
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Parameters — Learned transformer weights, including token embeddings, attention projections, and feedforward layers, that accumulate gradients during training
- Training Checkpoints — Periodic snapshots of model and optimizer state saved to disk during training for resumption and evaluation
- Tokenizer Cache — Loaded BPE merge rules and vocabulary mappings held in memory to avoid re-downloading OpenAI's tokenizer files on every instantiation
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Trainer.run() iterates over DataLoader batches. Action: Each iteration computes forward pass, loss, backward pass, and parameter update via optimizer.step(). Exit: Reaches max_iters limit or manual interruption.
- Learning Rate Scheduling (convergence, balancing) — Trigger: Trainer callback system after each training step. Action: Adjusts learning rate based on training progress or validation metrics. Exit: Training completion.
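The trigger/action/exit structure of these loops can be sketched generically. The class below is a toy modeled loosely on minGPT's Trainer callback mechanism, not its actual interface:

```python
from collections import defaultdict

class CallbackTrainer:
    """Minimal event/callback loop sketch (illustrative, not minGPT's Trainer)."""
    def __init__(self, max_iters=3):
        self.max_iters = max_iters
        self.callbacks = defaultdict(list)
        self.iter_num = 0

    def add_callback(self, event, fn):
        self.callbacks[event].append(fn)

    def trigger(self, event):
        for fn in self.callbacks[event]:
            fn(self)

    def run(self):
        while self.iter_num < self.max_iters:   # exit: max_iters limit
            # ...forward pass, loss, backward pass, optimizer.step() go here...
            self.iter_num += 1
            self.trigger('on_batch_end')        # e.g. hook for LR scheduling

seen = []
t = CallbackTrainer()
t.add_callback('on_batch_end', lambda tr: seen.append(tr.iter_num))
t.run()
assert seen == [1, 2, 3]
```

A learning-rate scheduler would simply be another `on_batch_end` callback that mutates the optimizer's `lr` based on `iter_num`.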
Delays
- Gradient Accumulation (batch-window, ~accumulation_steps iterations) — Gradients accumulate across mini-batches before optimizer.step() applies the combined update
- Checkpoint Saving (checkpoint-save, ~varies by model size) — Training pauses while model state dict is serialized to disk at configured intervals
Control Points
- Model Architecture Size (architecture-switch) — Controls: Number of layers, attention heads, and embedding dimensions via model_type presets like 'gpt-nano', 'gpt2'. Default: gpt-nano
- Training Hyperparameters (hyperparameter) — Controls: Learning rate, batch size, gradient clipping threshold, weight decay strength, and maximum training iterations. Default: learning_rate=5e-4
- Device Selection (device-selection) — Controls: Whether training runs on CPU or CUDA GPU based on availability. Default: auto
- Dropout Rate (hyperparameter) — Controls: Regularization strength during training by randomly zeroing attention and feedforward activations. Default: 0.1
Technology Stack
- PyTorch — Core deep learning framework providing tensor operations, automatic differentiation, and neural network modules
- regex — Pattern matching for byte-pair encoding tokenization rules in the BPE implementation
- transformers — Compatibility layer for loading pretrained GPT-2 models from the HuggingFace hub into the minGPT architecture
- requests — Downloads tokenizer vocabulary files and merge rules from OpenAI's servers
Key Components
- GPT (transformer, mingpt/model.py) — Core transformer model that processes token sequences through multiple attention and feedforward blocks to predict next tokens
- CausalSelfAttention (processor, mingpt/model.py) — Implements multi-head self-attention with causal masking to prevent the model from attending to future tokens during training
- BPETokenizer (encoder, mingpt/bpe.py) — Converts raw UTF-8 text into sequences of integer tokens using byte-pair encoding rules compatible with OpenAI's GPT-2 vocabulary
- Trainer (orchestrator, mingpt/trainer.py) — Manages the complete training loop, including data loading, forward/backward passes, gradient clipping, learning rate scheduling, and progress callbacks
- Block (processor, mingpt/model.py) — Single transformer block containing layer normalization, causal self-attention, and a feedforward network with residual connections
- AdditionDataset (loader, projects/adder/adder.py) — Generates synthetic arithmetic problems by creating random addition equations and formatting them as text sequences for GPT training
- CharDataset (loader, projects/chargpt/chargpt.py) — Loads text files and creates character-level training sequences by sliding a window over the text to generate input-target pairs
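The sliding-window idea behind CharDataset can be sketched as follows. This is a simplified stand-in, not the actual class:

```python
class CharWindowDataset:
    """Character-level windowing sketch in the spirit of CharDataset."""
    def __init__(self, text: str, block_size: int):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
        self.block_size = block_size
        self.data = text

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Grab block_size + 1 characters so input and target overlap by one shift.
        chunk = self.data[idx: idx + self.block_size + 1]
        ids = [self.stoi[ch] for ch in chunk]
        return ids[:-1], ids[1:]           # input, target shifted by one

ds = CharWindowDataset("hello world", block_size=4)
x, y = ds[0]
assert len(x) == 4 and len(y) == 4
assert x[1:] == y[:-1]                     # targets are inputs shifted by one
```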
Frequently Asked Questions
What is minGPT used for?
Trains minimal GPT transformers from scratch on character or number sequences. karpathy/mingpt is a 7-component ML training codebase written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 9 files.
How is minGPT architected?
minGPT is organized into 4 architecture layers: Model Layer, Tokenization Layer, Training Layer, Demo Layer. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through minGPT?
Data moves through 7 stages: Text Tokenization → Dataset Batch Loading → Token Embedding → Transformer Processing → Next Token Prediction → Loss Computation → Gradient Update. Text enters through either the BPE tokenizer (for real text) or synthetic datasets (for arithmetic/character tasks), is converted to integer token sequences, and flows through the GPT transformer, whose stacked causal self-attention and feedforward layers predict the next token at each position. During training, predictions are compared against target tokens using cross-entropy loss, and gradients flow backward through the transformer to update weights via AdamW optimization.
What technologies does minGPT use?
The core stack includes PyTorch (Core deep learning framework providing tensor operations, automatic differentiation, and neural network modules), regex (Pattern matching for byte-pair encoding tokenization rules in BPE implementation), transformers (Compatibility layer for loading pretrained GPT-2 models from HuggingFace hub into minGPT architecture), requests (Downloads tokenizer vocabulary files and merge rules from OpenAI's servers). A lean dependency footprint.
What system dynamics does minGPT have?
minGPT exhibits 3 data pools (model parameters, training checkpoints, and the in-memory tokenizer cache), 2 feedback loops, 4 control points, and 2 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does minGPT use?
4 design patterns detected: Configuration as Code, Callback System, Pretrained Model Loading, Causal Masking.
How does minGPT compare to alternatives?
CodeSea has side-by-side architecture comparisons of minGPT with nanogpt. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages above for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.