Hidden Assumptions in pytorch-lightning
12 assumptions this code never checks · 4 critical · spanning Contract, Shape, Environment, Domain, Resource, Scale, Temporal, Ordering
Every codebase relies on things it never checks, and most of them are routine. CodeSea analyzed lightning-ai/pytorch-lightning and picked out the 3 assumptions most likely to cause trouble here; they are listed first.
The rest are minor; they're under "Show everything".
If training_step returns the wrong type (e.g., a dict instead of a Tensor) or validation_step returns a non-dict, the trainer fails silently or produces cryptic errors during the backward pass
If the input has different spatial dimensions or channel counts, fc1 receives a tensor of the wrong size, causing a RuntimeError about mismatched dimensions during the forward pass
DataLoader fails with a TypeError when None is passed as num_workers, crashing training at the data-loading stage
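One way to guard the num_workers case is to resolve an unset value before it reaches the DataLoader. This is a minimal sketch, not Lightning or PyTorch code: resolve_num_workers is a hypothetical helper, and falling back to 0 (load data in the main process) is an assumed policy.

```python
from typing import Optional


def resolve_num_workers(num_workers: Optional[int]) -> int:
    """Return a non-negative int that is safe to pass as num_workers.

    Hypothetical helper: None falls back to 0 (assumed policy).
    """
    if num_workers is None:
        return 0
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(num_workers, bool) or not isinstance(num_workers, int):
        raise TypeError(
            f"num_workers must be an int or None, got {type(num_workers).__name__}"
        )
    if num_workers < 0:
        raise ValueError(f"num_workers must be >= 0, got {num_workers}")
    return num_workers
```

The resolved value would then be passed on, for example as DataLoader(dataset, num_workers=resolve_num_workers(cfg_value)), where cfg_value stands for whatever configuration field might be unset.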
Show everything (9 more)
All linear layer inner dimensions are assumed to be divisible by 16 for Float8 conversion (the decoder is filtered out), but the code never validates this constraint
If this fails: Float8 conversion fails silently or produces incorrect results when linear layers have dimensions not divisible by 16, leading to subtle numerical errors
examples/fabric/fp8_distributed_transformer/train.py:convert_to_float8_training
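The missing divisibility check could look like the sketch below. It assumes the layers expose in_features and out_features the way torch.nn.Linear does. Both function names are hypothetical, and the (module, name) filter signature is an assumption about how such a check would be wired into the conversion call. Stand-in objects are used so the sketch runs without torch installed.

```python
from types import SimpleNamespace

FLOAT8_ALIGNMENT = 16  # divisibility requirement described above


def dims_ok_for_float8(in_features: int, out_features: int) -> bool:
    """True if both dimensions are multiples of FLOAT8_ALIGNMENT."""
    return (
        in_features % FLOAT8_ALIGNMENT == 0
        and out_features % FLOAT8_ALIGNMENT == 0
    )


def filter_linear_for_float8(module, fqn: str) -> bool:
    """Hypothetical filter: accept only layers whose dimensions are convertible."""
    in_f = getattr(module, "in_features", None)
    out_f = getattr(module, "out_features", None)
    if in_f is None or out_f is None:
        return False  # not a linear-like layer
    return dims_ok_for_float8(in_f, out_f)


# Stand-ins for linear layers (assumed shapes, for illustration only).
hidden = SimpleNamespace(in_features=512, out_features=2048)
decoder = SimpleNamespace(in_features=512, out_features=1000)  # 1000 % 16 == 8

print(filter_linear_for_float8(hidden, "layers.0.linear1"))  # True
print(filter_linear_for_float8(decoder, "decoder"))          # False
```

Filtering by dimension rather than by name means a new layer with an unaligned size is skipped instead of being converted incorrectly.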
The checkpoint directory is assumed to be writable with enough disk space to serialize the model state_dict, but the code never checks filesystem permissions or available space
If this fails: Checkpoint saving fails mid-training with disk full or permission errors, losing training progress without graceful recovery
examples/fabric/build_your_own_trainer/trainer.py:_save_checkpoint
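A preflight check before training starts would surface both problems early. The sketch below uses only the Python standard library. preflight_checkpoint_dir is a hypothetical helper, and estimating required_bytes (for example from the size of the state_dict) is left to the caller.

```python
import os
import shutil


def preflight_checkpoint_dir(path: str, required_bytes: int) -> None:
    """Raise early if `path` is not writable or has too little free space."""
    os.makedirs(path, exist_ok=True)
    # W_OK: can create files; X_OK: can traverse the directory.
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError(f"checkpoint directory is not writable: {path}")
    free = shutil.disk_usage(path).free
    if free < required_bytes:
        raise OSError(
            f"only {free} bytes free in {path}, need {required_bytes}"
        )
```

This narrows the window for failure but does not close it: space can still run out between the check and the write, so the save itself should also handle OSError.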
The validation step is assumed to return a dict with string keys for metric names, but the dict structure and key types are not validated before logging
If this fails: Non-string keys or nested dicts cause logger failures or metric aggregation errors, breaking validation monitoring
examples/fabric/build_your_own_trainer/trainer.py:_run_validation
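A small structural check before logging would catch both failure modes named above. This is a sketch under assumptions: validate_metrics is a hypothetical helper, and it deliberately does not inspect value types, since metric values may be tensors or plain numbers.

```python
from collections.abc import Mapping


def validate_metrics(metrics) -> dict:
    """Check that `metrics` is a flat mapping from str names to values."""
    if not isinstance(metrics, Mapping):
        raise TypeError(
            f"expected a mapping of metrics, got {type(metrics).__name__}"
        )
    for key, value in metrics.items():
        if not isinstance(key, str):
            raise TypeError(f"metric name must be str, got {key!r}")
        if isinstance(value, Mapping):
            raise TypeError(
                f"metric {key!r} is a nested mapping; flatten it before logging"
            )
    return dict(metrics)
```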
GPU memory is assumed to handle batch_size=128 with 64x64x3 images plus the generator and discriminator models (roughly 200MB+ per batch), but available VRAM is never checked
If this fails: Training crashes with CUDA out of memory errors when GPU has insufficient memory, requiring manual batch size tuning
examples/fabric/dcgan/train_fabric.py:batch_size=128
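The raw input batch is only a small part of that estimate. The arithmetic below assumes float32 storage (4 bytes per value) and covers the images alone; activations, gradients, and optimizer state are not modeled. fits_in_free_memory is a hypothetical helper, and the 0.9 headroom factor is an arbitrary choice. On a CUDA machine the free-byte figure could come from torch.cuda.mem_get_info(), which returns free and total bytes.

```python
def batch_bytes(
    batch_size: int,
    channels: int,
    height: int,
    width: int,
    bytes_per_value: int = 4,  # float32 (assumption)
) -> int:
    """Bytes needed to hold one input batch, inputs only."""
    return batch_size * channels * height * width * bytes_per_value


def fits_in_free_memory(
    estimated_bytes: int, free_bytes: int, headroom: float = 0.9
) -> bool:
    """True if the estimate fits within a fraction of the free memory."""
    return estimated_bytes <= free_bytes * headroom


input_bytes = batch_bytes(128, 3, 64, 64)
print(input_bytes)          # 6291456
print(input_bytes / 2**20)  # 6.0 (MiB)
```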
Validation frequency is counted in completed epochs, but the count doesn't account for early stopping or interrupted training affecting validation timing
If this fails: Validation may not run at expected intervals when training is interrupted and resumed, potentially missing important metric checkpoints
examples/fabric/build_your_own_trainer/trainer.py:validation_frequency
The CelebA dataset is assumed to exist in the 'data/' directory in the proper format, but dataset integrity and file permissions are never validated
If this fails: Training fails with cryptic errors during data loading if dataset is corrupted, missing, or has wrong file structure
examples/fabric/dcgan/train_fabric.py:dataroot='data/'
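A cheap existence check before building the dataset turns the cryptic failure into a clear one. In this sketch, check_dataroot is a hypothetical helper. It only confirms that the directory exists and holds files; it does not verify image integrity or the folder layout a particular dataset class expects.

```python
from pathlib import Path


def check_dataroot(dataroot: str, min_files: int = 1) -> int:
    """Return the number of regular files under `dataroot`, or raise."""
    root = Path(dataroot)
    if not root.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {root.resolve()}")
    count = sum(1 for p in root.rglob("*") if p.is_file())
    if count < min_files:
        raise ValueError(
            f"{root} contains {count} files, expected at least {min_files}"
        )
    return count
```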
Model setup, optimizer configuration, and data preparation are assumed to happen in a specific order before the training loop starts, but this sequencing is neither enforced nor validated
If this fails: Accessing components before proper initialization (e.g., calling backward before fabric.setup) produces confusing errors about uninitialized state
examples/fabric/build_your_own_trainer/trainer.py:fit method
MNIST images are normalized with mean=0.1307 and std=0.3081, which are dataset-specific statistics, but these values are hardcoded and never checked against the data actually loaded
If this fails: Using different datasets or preprocessing pipelines with wrong normalization values leads to poor model convergence and incorrect results
examples/fabric/image_classifier/train_fabric.py:transform normalization
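One way to detect a mismatch is to compute the statistics from a sample of the data and compare them with the hardcoded values. The sketch below is pure Python and assumes pixel values already scaled to [0, 1]. Both function names and the 0.01 tolerance are hypothetical choices, not part of the example script.

```python
import math
from typing import Iterable, Tuple


def mean_and_std(pixels: Iterable[float]) -> Tuple[float, float]:
    """Population mean and standard deviation in a single pass."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in pixels:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("no pixel values")
    mean = total / n
    # Clamp at zero to absorb tiny negative values from rounding.
    var = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(var)


def check_normalization(
    pixels: Iterable[float],
    expected_mean: float,
    expected_std: float,
    tol: float = 0.01,
) -> Tuple[float, float]:
    """Raise if measured statistics differ from the expected ones by more than tol."""
    mean, std = mean_and_std(pixels)
    if abs(mean - expected_mean) > tol or abs(std - expected_std) > tol:
        raise ValueError(
            f"measured mean={mean:.4f}, std={std:.4f}; "
            f"expected mean={expected_mean}, std={expected_std}"
        )
    return mean, std
```

Calling check_normalization(sample_pixels, 0.1307, 0.3081) on a sample from a different dataset would then raise instead of training on badly scaled inputs; sample_pixels is a placeholder for whatever iterable of values the caller extracts.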
The early stopping callback is assumed to implement a 'should_stop' property or method that returns a boolean, but the callback interface is never validated
If this fails: Custom callbacks without proper interface cause AttributeError during training, breaking early stopping logic
examples/fabric/build_your_own_trainer/trainer.py:_should_stop_early
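The interface check could be as small as the sketch below. read_should_stop is a hypothetical helper. It accepts 'should_stop' as an attribute, a property, or a zero-argument method, matching the "property or method" wording above, and rejects anything that does not yield a boolean.

```python
def read_should_stop(callback) -> bool:
    """Read `should_stop` from a callback, validating the interface first."""
    if not hasattr(callback, "should_stop"):
        raise TypeError(
            f"{type(callback).__name__} does not define 'should_stop'"
        )
    value = callback.should_stop
    if callable(value):
        value = value()  # method form: call it with no arguments
    if not isinstance(value, bool):
        raise TypeError(
            f"'should_stop' must yield a bool, got {type(value).__name__}"
        )
    return value


class PatienceStopper:
    """Illustrative callback exposing `should_stop` as a property."""

    def __init__(self, bad_epochs: int, patience: int) -> None:
        self.bad_epochs = bad_epochs
        self.patience = patience

    @property
    def should_stop(self) -> bool:
        return self.bad_epochs >= self.patience


print(read_should_stop(PatienceStopper(bad_epochs=3, patience=3)))  # True
```

Running this check once when callbacks are registered, rather than on every epoch, would report a malformed callback before any training time is spent.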
See the full structural analysis of pytorch-lightning: the pipeline, data models, and system behavior that put these assumptions in context.
Frequently Asked Questions
What does pytorch-lightning assume that could break in production?
The one most likely to cause trouble: the LightningModule passed to fit() is assumed to implement training_step(batch, batch_idx) returning a loss Tensor and, optionally, validation_step(batch, batch_idx) returning a metrics dict, but these method signatures and return types are never validated. If training_step returns the wrong type (e.g., a dict instead of a Tensor) or validation_step returns a non-dict, the trainer fails silently or produces cryptic errors during the backward pass.
How many hidden assumptions does pytorch-lightning have?
CodeSea found 12 assumptions pytorch-lightning relies on but never validates, 4 of them critical, spanning Contract, Shape, Environment, Domain, Resource, Scale, Temporal, Ordering. Most are routine; the analysis flags the 3 most likely to cause trouble.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.