karpathy/mingpt
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training pipeline
Trains minimal GPT transformers from scratch on character or number sequences
Under the hood, the system relies on 2 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
A 7-component ML training codebase spanning 9 analyzed files. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Text enters through either the BPE tokenizer (for real text) or synthetic datasets (for arithmetic/character tasks), gets converted to integer token sequences, then flows through the GPT transformer which applies multiple layers of causal self-attention and feedforward processing to predict the next token in each sequence. During training, these predictions are compared against target tokens using cross-entropy loss, and gradients flow backward through the transformer to update weights via AdamW optimization.
- Text Tokenization — BPETokenizer.encode() converts raw UTF-8 strings into integer sequences using byte-pair encoding rules, mapping each text chunk to vocabulary indices in the range [0, vocab_size) [Raw text strings → TokenSequence] (config: vocab_size)
- Dataset Batch Loading — DataLoader samples sequences from AdditionDataset or CharDataset, stacks them into batches of shape [batch_size, block_size], and moves tensors to the training device [TokenSequence → TrainingBatch] (config: batch_size, block_size, device)
- Token Embedding — GPT.forward() starts by converting token IDs to dense vectors via nn.Embedding lookup and adds learned positional embeddings for each sequence position [TrainingBatch → Embedded sequences] (config: n_embd, vocab_size)
- Transformer Processing — Each Block applies layer normalization, then CausalSelfAttention computes query-key-value matrices, masks future positions, and aggregates context, followed by a feedforward MLP with GELU activation [Embedded sequences → Contextualized representations] (config: n_layer, n_head, dropout)
- Next Token Prediction — Final linear layer (lm_head) projects the last hidden state to vocabulary size, producing logits that represent unnormalized probabilities for each possible next token [Contextualized representations → LanguageModelLogits] (config: vocab_size)
- Loss Computation — F.cross_entropy compares predicted logits against target tokens (input sequence shifted by one position), computing the negative log-likelihood loss for next-token prediction [LanguageModelLogits → Scalar loss]
- Gradient Update — Trainer.run() calls loss.backward() to compute gradients, applies gradient clipping via torch.nn.utils.clip_grad_norm_, then optimizer.step() updates model parameters using AdamW [Scalar loss → Updated model weights] (config: learning_rate, grad_norm_clip, weight_decay)
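The seven stages above can be condensed into a sketch of a single training step. `TinyLM` below is a stand-in for the real GPT module (the transformer blocks are omitted), and the hyperparameter values are illustrative defaults rather than minGPT's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in for GPT: embedding -> (blocks omitted) -> lm_head."""
    def __init__(self, vocab_size=16, n_embd=8, block_size=12):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)      # token embedding
        self.wpe = nn.Embedding(block_size, n_embd)      # positional embedding
        self.lm_head = nn.Linear(n_embd, vocab_size)     # next-token logits

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        h = self.wte(idx) + self.wpe(pos)                # [b, t, n_embd]
        return self.lm_head(h)                           # [b, t, vocab_size]

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

batch = torch.randint(0, 16, (4, 12))                    # [batch_size, block_size]
x, y = batch[:, :-1], batch[:, 1:]                       # input / target shifted by one

logits = model(x)                                        # next-token prediction
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
loss.backward()                                          # gradients flow backward
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_norm_clip
opt.step()                                               # AdamW update
opt.zero_grad(set_to_none=True)
```

The input/target split (`batch[:, :-1]` vs `batch[:, 1:]`) mirrors the Loss Computation stage: targets are the input sequence shifted by one position.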
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- TokenSequence (mingpt/bpe.py) — torch.LongTensor with shape [sequence_length] containing token IDs in the vocabulary range [0, vocab_size). Created by BPE-encoding raw text into integer tokens, fed through transformer layers, and used to compute next-token prediction loss.
- Model Config (mingpt/model.py) — CfgNode with n_layer: int, n_head: int, n_embd: int, vocab_size: int, block_size: int, dropout: float, bias: bool. Created with default values or loaded from presets like 'gpt2'; passed to the GPT constructor to define architecture dimensions and behavior.
- TrainingBatch (mingpt/trainer.py) — torch.LongTensor with shape [batch_size, sequence_length] containing batched token sequences from the dataset. Loaded by the DataLoader, split into input/target pairs, fed through the model to produce logits, and used for cross-entropy loss calculation and gradient updates.
- Attention Weights (mingpt/model.py) — torch.Tensor with shape [batch_size, n_head, sequence_length, sequence_length] containing attention scores after causal masking and softmax. Computed from query-key dot products, masked to prevent attention to future positions, normalized via softmax, then used to weight value vectors.
- LanguageModelLogits (mingpt/model.py) — torch.Tensor with shape [batch_size, sequence_length, vocab_size] containing unnormalized scores for next-token prediction. Output of the final linear layer after the transformer blocks; compared against target tokens for training loss, or sampled from for text generation.
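As a rough sanity check on the documented shapes, here is a sketch with plain tensors (no minGPT code involved; the mask and softmax below merely mimic the described behavior):

```python
import torch

batch_size, seq_len, n_head, n_embd, vocab_size = 2, 5, 4, 16, 50

tokens = torch.randint(0, vocab_size, (seq_len,))            # TokenSequence
batch = torch.randint(0, vocab_size, (batch_size, seq_len))  # TrainingBatch

# Attention weights: [batch, heads, seq, seq]; each row sums to 1 after softmax.
scores = torch.randn(batch_size, n_head, seq_len, seq_len)
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
att = torch.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)

logits = torch.randn(batch_size, seq_len, vocab_size)        # LanguageModelLogits

assert att.shape == (batch_size, n_head, seq_len, seq_len)
assert torch.allclose(att.sum(-1), torch.ones(batch_size, n_head, seq_len))
# Causal masking: position 0 can only attend to itself.
assert float(att[0, 0, 0, 1:].sum()) == 0.0
```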
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
Input tensor x has shape (batch_size, sequence_length, n_embd) and sequence_length <= block_size, but only asserts n_embd % n_head == 0 during initialization
If this fails: when sequence_length > block_size, the causal mask indexing self.bias[:,:,:T,:T] fails with an IndexError, since the bias buffer is only allocated for block_size positions; if x has the wrong shape, matrix multiplications silently produce incorrect attention weights
mingpt/model.py:CausalSelfAttention.forward
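A hypothetical guard for this assumption might look like the following; minGPT ships no such `check_sequence_length` helper, so the name and placement are illustrative:

```python
import torch

def check_sequence_length(idx: torch.Tensor, block_size: int) -> None:
    """Hypothetical guard; the real forward pass relies on this implicitly."""
    _, t = idx.shape
    if t > block_size:
        raise ValueError(f"sequence length {t} exceeds block_size {block_size}")

# A sequence exactly at block_size passes; anything longer raises before
# the causal mask buffer would be indexed out of range.
check_sequence_length(torch.zeros(2, 8, dtype=torch.long), block_size=8)
```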
Input text is valid UTF-8 and can be encoded to bytes without errors, with no validation of encoding or handling of malformed unicode
If this fails: Malformed UTF-8 strings cause UnicodeEncodeError crashes during text.encode('utf-8'), breaking the entire tokenization pipeline and stopping training
mingpt/bpe.py:encode
GPU memory is sufficient to hold model parameters, optimizer state, and a full training batch simultaneously, with no memory monitoring or batch size adaptation
If this fails: CUDA out-of-memory errors crash training without recovery when batch_size * block_size * n_embd exceeds available VRAM, losing all training progress since last checkpoint
mingpt/trainer.py:run
DataLoader yields batches where each sample x[:-1] serves as input and x[1:] as target, but never validates this sequence alignment contract
If this fails: If dataset returns pre-shifted sequences or wrong shapes, model trains on corrupted input-target pairs, learning meaningless patterns and producing garbage text during generation
mingpt/trainer.py:run
vocab_size fits in available memory for embedding layers (wte: vocab_size * n_embd + lm_head: n_embd * vocab_size parameters), with no size validation
If this fails: Extremely large vocab_size values (e.g., 1M+ tokens) silently allocate gigabytes for embedding matrices, causing memory exhaustion during model initialization before any helpful error message
mingpt/model.py:GPT.__init__
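A back-of-the-envelope estimate makes the risk concrete. The helper below is illustrative (not part of minGPT) and assumes fp32 parameters and untied embedding/output weights:

```python
def embedding_param_count(vocab_size: int, n_embd: int, tied: bool = False) -> int:
    # wte contributes vocab_size * n_embd parameters; lm_head contributes
    # n_embd * vocab_size more unless the two weight matrices are tied.
    return vocab_size * n_embd * (1 if tied else 2)

# A 1M-token vocabulary at n_embd=768 costs ~6.1 GB in fp32 for embeddings alone.
params = embedding_param_count(vocab_size=1_000_000, n_embd=768)
gigabytes = params * 4 / 1e9   # 4 bytes per fp32 parameter
```

During training, AdamW's two moment buffers roughly triple this footprint on top of the parameters themselves.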
Internet connection exists for downloading encoder.json and vocab.bpe from OpenAI's servers, with no offline fallback or connection validation
If this fails: Network failures or OpenAI URL changes cause requests.get() to hang or fail, breaking tokenizer initialization and preventing any text processing in air-gapped environments
mingpt/bpe.py:get_encoder
Downloaded BPE files (encoder.json, vocab.bpe) remain valid and compatible with the expected format indefinitely, with no version checking or corruption detection
If this fails: If OpenAI updates file formats or files become corrupted during download, json.loads() fails with cryptic parsing errors, or merge operations produce wrong token sequences
mingpt/bpe.py:BPETokenizer.__init__
Generated arithmetic strings always fit within block_size tokens after BPE encoding, but never validates actual encoded length
If this fails: Long addition problems (many digits) get silently truncated by DataLoader collation, causing model to train on incomplete equations and fail on longer problems at test time
projects/adder/adder.py:AdditionDataset.__getitem__
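A hypothetical pre-flight check the dataset could run before yielding a sample (no such validation exists in minGPT; the function name is invented for illustration):

```python
def validate_encoded_length(token_ids: list, block_size: int) -> list:
    """Reject samples whose encoded length would be silently truncated."""
    if len(token_ids) > block_size:
        raise ValueError(
            f"encoded sample is {len(token_ids)} tokens, exceeding block_size {block_size}"
        )
    return token_ids

# A short sample passes through unchanged.
assert validate_encoded_length([7, 7, 1, 4], block_size=6) == [7, 7, 1, 4]
```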
Temperature parameter is assumed to be > 0 for sampling, but temperature=0 is never rejected, which divides by zero when scaling logits/temperature
If this fails: temperature=0 turns the scaled logits into ±inf and the softmax output into NaN, yielding an invalid probability distribution instead of the presumably intended deterministic greedy selection
mingpt/model.py:GPT.generate
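One conventional fix is to treat `temperature == 0` as greedy decoding. The wrapper below is a sketch of that convention, not minGPT's actual `generate` API:

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Sample the next token; fall back to greedy argmax when temperature <= 0."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                      # deterministic greedy pick
    probs = torch.softmax(logits / temperature, dim=-1)   # safe: temperature > 0 here
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

logits = torch.tensor([[0.1, 2.0, -1.0]])
assert sample_next(logits, temperature=0.0).item() == 1   # greedy -> index of max logit
```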
grad_norm_clip value is reasonable for the model size (default 1.0), but never adapts to actual gradient magnitudes or provides feedback on clipping frequency
If this fails: Too aggressive clipping (small clip value with large model) prevents learning by constantly truncating gradients, while too loose clipping allows gradient explosions, both silently degrading training
mingpt/trainer.py:run
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Parameters — Learned transformer weights, including token embeddings, attention projections, and feedforward layers, that accumulate gradients during training
- Training Checkpoints — Periodic snapshots of model and optimizer state saved to disk during training for resumption and evaluation
- Tokenizer Cache — Loaded BPE merge rules and vocabulary mappings held in memory to avoid re-downloading OpenAI's tokenizer files on every instantiation
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Trainer.run() iterates over DataLoader batches. Action: Each iteration computes forward pass, loss, backward pass, and parameter update via optimizer.step(). Exit: Reaches max_iters limit or manual interruption.
- Learning Rate Scheduling (convergence, balancing) — Trigger: Trainer callback system after each training step. Action: Adjusts learning rate based on training progress or validation metrics. Exit: Training completion.
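The trigger/action/exit structure of these loops can be sketched generically. The class below is a toy modeled loosely on minGPT's Trainer callback mechanism, not its actual interface:

```python
from collections import defaultdict

class CallbackTrainer:
    """Minimal event/callback loop sketch (illustrative, not minGPT's Trainer)."""
    def __init__(self, max_iters=3):
        self.max_iters = max_iters
        self.callbacks = defaultdict(list)
        self.iter_num = 0

    def add_callback(self, event, fn):
        self.callbacks[event].append(fn)

    def trigger(self, event):
        for fn in self.callbacks[event]:
            fn(self)

    def run(self):
        while self.iter_num < self.max_iters:   # exit: max_iters limit
            # ...forward pass, loss, backward pass, optimizer.step() go here...
            self.iter_num += 1
            self.trigger('on_batch_end')        # e.g. hook for LR scheduling

seen = []
t = CallbackTrainer()
t.add_callback('on_batch_end', lambda tr: seen.append(tr.iter_num))
t.run()
assert seen == [1, 2, 3]
```

A learning-rate scheduler would simply be another `on_batch_end` callback that mutates the optimizer's `lr` based on `iter_num`.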
Delays
- Gradient Accumulation (batch-window, ~accumulation_steps iterations) — Gradients accumulate across mini-batches before optimizer.step() applies the combined update
- Checkpoint Saving (checkpoint-save, ~varies by model size) — Training pauses while model state dict is serialized to disk at configured intervals
Control Points
- Model Architecture Size (architecture-switch) — Controls: Number of layers, attention heads, and embedding dimensions via model_type presets like 'gpt-nano', 'gpt2'. Default: gpt-nano
- Training Hyperparameters (hyperparameter) — Controls: Learning rate, batch size, gradient clipping threshold, weight decay strength, and maximum training iterations. Default: learning_rate=5e-4
- Device Selection (device-selection) — Controls: Whether training runs on CPU or CUDA GPU based on availability. Default: auto
- Dropout Rate (hyperparameter) — Controls: Regularization strength during training by randomly zeroing attention and feedforward activations. Default: 0.1
Technology Stack
- PyTorch — Core deep learning framework providing tensor operations, automatic differentiation, and neural network modules
- regex — Pattern matching for byte-pair encoding tokenization rules in the BPE implementation
- transformers — Compatibility layer for loading pretrained GPT-2 models from the HuggingFace hub into the minGPT architecture
- requests — Downloads tokenizer vocabulary files and merge rules from OpenAI's servers
Key Components
- GPT (transformer, mingpt/model.py) — Core transformer model that processes token sequences through multiple attention and feedforward blocks to predict next tokens
- CausalSelfAttention (processor, mingpt/model.py) — Implements multi-head self-attention with causal masking to prevent the model from attending to future tokens during training
- BPETokenizer (encoder, mingpt/bpe.py) — Converts raw UTF-8 text into sequences of integer tokens using byte-pair encoding rules compatible with OpenAI's GPT-2 vocabulary
- Trainer (orchestrator, mingpt/trainer.py) — Manages the complete training loop, including data loading, forward/backward passes, gradient clipping, learning rate scheduling, and progress callbacks
- Block (processor, mingpt/model.py) — Single transformer block containing layer normalization, causal self-attention, and a feedforward network with residual connections
- AdditionDataset (loader, projects/adder/adder.py) — Generates synthetic arithmetic problems by creating random addition equations and formatting them as text sequences for GPT training
- CharDataset (loader, projects/chargpt/chargpt.py) — Loads text files and creates character-level training sequences by sliding a window over the text to generate input-target pairs
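The sliding-window idea behind CharDataset can be sketched as follows. This is a simplified stand-in, not the actual class:

```python
class CharWindowDataset:
    """Character-level windowing sketch in the spirit of CharDataset."""
    def __init__(self, text: str, block_size: int):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
        self.block_size = block_size
        self.data = text

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # Grab block_size + 1 characters so input and target overlap by one shift.
        chunk = self.data[idx: idx + self.block_size + 1]
        ids = [self.stoi[ch] for ch in chunk]
        return ids[:-1], ids[1:]           # input, target shifted by one

ds = CharWindowDataset("hello world", block_size=4)
x, y = ds[0]
assert len(x) == 4 and len(y) == 4
assert x[1:] == y[:-1]                     # targets are inputs shifted by one
```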
Frequently Asked Questions
What is minGPT used for?
Trains minimal GPT transformers from scratch on character or number sequences. karpathy/mingpt is a 7-component ML training codebase written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 9 files.
How is minGPT architected?
minGPT is organized into 4 architecture layers: Model Layer, Tokenization Layer, Training Layer, Demo Layer. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through minGPT?
Data moves through 7 stages: Text Tokenization → Dataset Batch Loading → Token Embedding → Transformer Processing → Next Token Prediction → Loss Computation → Gradient Update. Text enters through either the BPE tokenizer (for real text) or synthetic datasets (for arithmetic/character tasks), is converted to integer token sequences, and flows through the GPT transformer, whose stacked causal self-attention and feedforward layers predict the next token at each position. During training, predictions are compared against target tokens using cross-entropy loss, and gradients flow backward through the transformer to update weights via AdamW optimization.
What technologies does minGPT use?
The core stack includes PyTorch (Core deep learning framework providing tensor operations, automatic differentiation, and neural network modules), regex (Pattern matching for byte-pair encoding tokenization rules in BPE implementation), transformers (Compatibility layer for loading pretrained GPT-2 models from HuggingFace hub into minGPT architecture), requests (Downloads tokenizer vocabulary files and merge rules from OpenAI's servers). A lean dependency footprint.
What system dynamics does minGPT have?
minGPT exhibits 3 data pools (model parameters, training checkpoints, and the in-memory tokenizer cache), 2 feedback loops, 4 control points, and 2 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does minGPT use?
4 design patterns detected: Configuration as Code, Callback System, Pretrained Model Loading, Causal Masking.
How does minGPT compare to alternatives?
CodeSea has side-by-side architecture comparisons of minGPT with nanogpt. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages above for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.