Hidden Assumptions in nanoGPT

12 assumptions this code never checks · 5 critical · spanning Shape, Domain, Environment, Resource, Contract, Scale, Ordering, Temporal

Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at karpathy/nanogpt and picked out the few most likely to cause trouble. The full list is just below.

These 3 are the ones most likely to cause trouble here. The rest are minor; they're under "Show everything".

Worth your attention first

If a dataset has no more tokens than block_size (1024 by default), len(data) - block_size is zero or negative, so torch.randint receives an invalid upper bound and batch sampling fails instead of producing valid start indices
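
A minimal sketch of a guard for this case, assuming start offsets are drawn from the range len(data) - block_size as described above. The helper name check_sampling_range is hypothetical and not part of nanoGPT; it only validates the bound before any sampling happens.

```python
def check_sampling_range(n_tokens: int, block_size: int) -> int:
    """Return the exclusive upper bound for random start offsets.

    Hypothetical helper, not part of nanoGPT. Each sample reads
    block_size input tokens plus one shifted target token, so the
    dataset needs at least block_size + 1 tokens.
    """
    upper = n_tokens - block_size
    if upper <= 0:
        raise ValueError(
            f"dataset has {n_tokens} tokens but block_size is {block_size}; "
            f"need at least {block_size + 1} tokens"
        )
    return upper
```

Calling this once at startup turns an obscure sampling failure into a message that names both numbers involved.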

If n_embd or n_head is zero, negative, or non-integer (for example, from a corrupted config file), the divisibility assertion can still pass, but the linear layers then get invalid dimensions, causing silent NaN gradients
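
One way to close this gap is to validate both values before any layer is built. The sketch below is illustrative only: validate_attention_dims is a hypothetical name, and the checks (integer type, positive value, divisibility) are assumptions about what a stricter config check could look like.

```python
def validate_attention_dims(n_embd, n_head) -> int:
    """Return the per-head dimension, or raise ValueError.

    Hypothetical helper, not part of nanoGPT.
    """
    for name, value in (("n_embd", n_embd), ("n_head", n_head)):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an int, got {type(value).__name__}")
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    if n_embd % n_head != 0:
        raise ValueError(f"n_embd ({n_embd}) must be divisible by n_head ({n_head})")
    return n_embd // n_head
```

The type and sign checks run first so that the modulo in the last check never sees a zero or a float.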

Malicious config files can execute arbitrary code with full Python access (rm -rf /, network calls, data exfiltration), creating an arbitrary code execution vulnerability
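
A possible mitigation, sketched under the assumption that config files are plain Python files made of simple assignments: parse the file instead of executing it, and accept only `name = literal` lines for known keys. The function name load_literal_config and the allowed_keys parameter are hypothetical.

```python
import ast


def load_literal_config(source: str, allowed_keys: set) -> dict:
    """Read `name = literal` assignments without executing any code.

    Hypothetical helper, not part of nanoGPT. Anything that is not a
    plain assignment of a literal to an allowed name is rejected.
    """
    config = {}
    for node in ast.parse(source).body:
        is_simple_assign = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if not is_simple_assign:
            raise ValueError(f"line {node.lineno}: only `name = literal` is allowed")
        key = node.targets[0].id
        if key not in allowed_keys:
            raise ValueError(f"line {node.lineno}: unknown config key {key!r}")
        try:
            # literal_eval accepts numbers, strings, booleans, None,
            # and containers of those; calls and names are rejected
            config[key] = ast.literal_eval(node.value)
        except (ValueError, TypeError) as exc:
            raise ValueError(
                f"line {node.lineno}: value for {key!r} is not a literal"
            ) from exc
    return config
```

The trade-off is that config files which compute values with expressions would have to be rewritten as plain literals.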

Show everything (9 more)

Resource

The GPU is assumed to have enough memory for batch_size * block_size tokens per batch (12 * 1024 = 12,288 by default) plus the model parameters; available VRAM is never checked

If this fails: Large models or small GPUs cause CUDA out-of-memory errors that crash training without graceful degradation or helpful error messages about required vs available memory

bench.py:get_batch

Contract

Checkpoint files contain a 'model_args' key with GPTConfig-compatible parameters, and the model architecture hasn't changed between save and load

If this fails: Loading checkpoints from different model versions or corrupted files fails silently or loads incompatible weights into wrong layers, leading to degraded performance that's hard to diagnose

train.py:checkpoint loading
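
The contract above could be checked explicitly before any weights are loaded. This is a sketch, not nanoGPT's loading code: check_model_args is a hypothetical helper, and the caller decides which architecture settings go into the expected dictionary.

```python
def check_model_args(checkpoint: dict, expected: dict) -> dict:
    """Compare checkpoint['model_args'] against expected settings.

    Hypothetical helper, not part of nanoGPT. Returns the saved
    model_args when every expected key matches.
    """
    if "model_args" not in checkpoint:
        raise KeyError("checkpoint has no 'model_args' entry")
    saved = checkpoint["model_args"]
    # Collect every mismatch so the error reports all of them at once
    mismatches = {
        key: (saved.get(key), want)
        for key, want in expected.items()
        if saved.get(key) != want
    }
    if mismatches:
        details = ", ".join(
            f"{key}: saved={got!r}, expected={want!r}"
            for key, (got, want) in sorted(mismatches.items())
        )
        raise ValueError(f"checkpoint architecture mismatch: {details}")
    return saved
```

For example, passing expected={"n_layer": 12, "n_embd": 768} would reject a checkpoint saved with a different depth or width before its weights reach the model.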

Scale

Input sequence length never exceeds the block_size limit set during model initialization, relying on data preprocessing to ensure this

If this fails: Sequences longer than block_size cause attention mask mismatches and index errors in the causal mask, breaking the transformer's autoregressive property

model.py:forward
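
One common way to keep inputs within the limit is to keep only the most recent block_size tokens before each forward pass. The sketch below works on a plain list of token ids; crop_context is a hypothetical name used for illustration.

```python
def crop_context(token_ids: list, block_size: int) -> list:
    """Keep at most the last block_size tokens of a sequence.

    Hypothetical helper, not part of nanoGPT. Shorter sequences are
    returned unchanged.
    """
    if block_size <= 0:
        raise ValueError(f"block_size must be positive, got {block_size}")
    return token_ids[-block_size:]
```

Cropping from the left keeps the tokens closest to the position being predicted, which is usually what an autoregressive model needs.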

Ordering

Model is in eval() mode during validation and gets switched back to train() mode, but this isn't enforced if estimate_loss() is called directly

If this fails: Running validation in training mode leaves dropout active, giving noisy, inaccurate loss estimates that mislead checkpoint selection

train.py:estimate_loss
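
The ordering could be enforced with a context manager that switches to eval mode and restores the previous mode on exit, even if evaluation raises. This sketch assumes only the torch.nn.Module interface (a training attribute plus eval() and train(mode)); the name eval_mode is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily put a model in eval mode.

    Hypothetical helper, not part of nanoGPT. Works with any object
    that exposes .training, .eval() and .train(mode), as
    torch.nn.Module does.
    """
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore whatever mode the caller was in, not always train()
        model.train(was_training)
```

Wrapping the body of a loss-estimation function in `with eval_mode(model):` makes the mode switch independent of who calls it.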

Temporal

The tiktoken tokenizer used for sampling exactly matches the tokenizer used during training, but this isn't verified

If this fails: Tokenizer version mismatches cause vocabulary differences where tokens get encoded/decoded incorrectly, producing garbled text output that's hard to trace back to the encoding mismatch

sample.py:tiktoken encoding
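
A lightweight check, sketched here as an assumption rather than anything nanoGPT does: encode a fixed probe string at training time, store the resulting token ids next to the checkpoint, and compare them again before sampling. The names PROBE_TEXT, tokenizer_fingerprint, and assert_same_tokenizer are hypothetical, and encode stands for whatever function turns text into token ids.

```python
PROBE_TEXT = "Hello, world! 123"


def tokenizer_fingerprint(encode, probe: str = PROBE_TEXT) -> tuple:
    """Token ids the given encode function produces for a fixed probe."""
    return tuple(encode(probe))


def assert_same_tokenizer(encode, recorded_fingerprint) -> None:
    """Raise ValueError if encode no longer reproduces the recorded ids.

    Hypothetical helper, not part of nanoGPT.
    """
    current = tokenizer_fingerprint(encode)
    if current != tuple(recorded_fingerprint):
        raise ValueError(
            "tokenizer output differs from the fingerprint recorded at training time"
        )
```

A single probe string cannot prove two vocabularies are identical, but it catches the common case where a different encoding is loaded by mistake.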

Environment

PyTorch 2.0 torch.compile() works correctly with the specific model architecture and CUDA version; compile defaults to True

If this fails: Compilation failures on unsupported hardware or PyTorch versions cause cryptic errors or silent performance degradation, making it unclear whether to disable compilation

train.py:torch.compile

Domain

Temperature parameter is positive (0.8 default) but negative values aren't prevented, which would invert probability distributions

If this fails: Negative temperature causes the model to strongly prefer low-probability tokens, generating nonsensical text that looks like successful sampling but with inverted semantics

sample.py:temperature scaling
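
Dividing logits by a negative temperature flips their sign, so the ranking of tokens reverses. A minimal sketch of a guarded temperature softmax in plain Python follows; softmax_with_temperature is a hypothetical name, and real sampling code would operate on tensors rather than lists.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, rejecting non-positive values.

    Hypothetical helper, not part of nanoGPT.
    """
    # `not (t > 0)` also rejects NaN, which compares False to everything
    if not temperature > 0:
        raise ValueError(f"temperature must be positive, got {temperature}")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With this guard, a zero or negative temperature fails immediately instead of producing output that looks like a successful sample.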

Resource

gradient_accumulation_steps divides evenly into the total training iterations, but remainder steps get processed differently

If this fails: Final training steps accumulate fewer gradients than specified, causing learning rate and effective batch size to change unpredictably near the end of training

train.py:gradient_accumulation

Contract

The 'openwebtext' dataset exists in data/openwebtext/train.bin with proper uint16 format, hardcoded without validation

If this fails: Missing or corrupted dataset files cause numpy memmap failures that crash benchmarking with unhelpful error messages about file access

bench.py:data loading
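
A pre-flight check on the file itself can give a clearer message than a memmap failure. The sketch below assumes each token is stored as 2 bytes, matching the uint16 format mentioned above; count_uint16_tokens is a hypothetical helper and checks only existence and size, not the content.

```python
import os


def count_uint16_tokens(path: str) -> int:
    """Return how many 2-byte tokens a file holds, or raise.

    Hypothetical helper, not part of nanoGPT.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"token file not found: {path}")
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"token file is empty: {path}")
    if size % 2 != 0:
        # A uint16 file must contain a whole number of 2-byte values
        raise ValueError(f"{path} is {size} bytes, not a multiple of 2")
    return size // 2
```

An odd byte count is a cheap signal of truncation; a file with an even size can still be corrupt, so this is a first filter rather than full validation.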

Frequently Asked Questions

What does nanoGPT assume that could break in production?

The one most likely to cause trouble: the code assumes memory-mapped data files contain enough tokens that random sampling with ix = torch.randint(len(data) - block_size, (batch_size,)) never goes out of bounds, specifically that len(data) > block_size. If a dataset has no more tokens than block_size (1024 by default), torch.randint receives a zero or negative upper bound and batch sampling fails instead of producing valid start indices.

How many hidden assumptions does nanoGPT have?

CodeSea found 12 assumptions nanoGPT relies on but never validates, 5 of them critical, spanning Shape, Domain, Environment, Resource, Contract, Scale, Ordering, Temporal. Most are routine; the analysis flags the three most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.