Hidden Assumptions in composer
11 assumptions this code never checks · 4 critical · spanning Shape, Contract, Ordering, Domain, Scale, Environment, Temporal, Resource
Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at mosaicml/composer and picked out the few most likely to cause trouble. The full list is just below.
The 3 below are the ones most likely to cause trouble here. The rest are minor; they're under "Show everything".
If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), ALiBi bias computation silently produces wrong attention scores or crashes with dimension mismatch
If a replacement function returns None or a module with different forward() signature, training silently uses wrong attention mechanism or crashes with 'NoneType has no attribute' errors
If called after optimizer initialization, optimizer state becomes misaligned with model parameters, causing training to diverge or crash with 'parameter count mismatch' errors
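The ordering problem above can be caught by comparing the parameters the optimizer holds with the parameters the model currently owns. The sketch below uses stand-in objects rather than real tensors. `find_stale_optimizer_params` and `FakeParam` are hypothetical names, not part of composer. The only thing assumed about the optimizer is the `param_groups` layout that PyTorch optimizers expose: a list of dicts, each with a "params" list.

```python
def find_stale_optimizer_params(model_params, param_groups):
    """Return optimizer-held parameters that the model no longer owns.

    Hypothetical helper. `model_params` is any iterable of parameter objects
    (for example `model.parameters()`); `param_groups` follows the PyTorch
    `optimizer.param_groups` layout. Parameters are compared by identity,
    because module surgery replaces objects rather than editing them.
    """
    live_params = list(model_params)  # keep references alive while comparing ids
    live_ids = {id(p) for p in live_params}
    return [
        p
        for group in param_groups
        for p in group["params"]
        if id(p) not in live_ids
    ]


class FakeParam:
    """Stand-in for a tensor parameter, used only for this illustration."""

    def __init__(self, name):
        self.name = name


# The optimizer was built before surgery and still references `old_attn`.
embed = FakeParam("embed")
old_attn = FakeParam("old_attn")
new_attn = FakeParam("new_attn")
param_groups = [{"params": [embed, old_attn]}]
model_after_surgery = [embed, new_attn]

stale = find_stale_optimizer_params(model_after_surgery, param_groups)
stale_names = [p.name for p in stale]
```

A non-empty result means the optimizer is still tracking parameters that surgery removed, so the optimizer should be rebuilt before training starts.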
Show everything (8 more)
GPT2Attention modules use causal attention (decoder-only) and have 'num_heads' attribute that divides evenly into the model's hidden dimension
If this fails: If applied to encoder-decoder models or models with non-standard attention shapes, ALiBi bias matrix has wrong causality or dimension, producing incorrect attention weights without error
composer/algorithms/alibi/attention_surgery_functions/_gpt2.py:gpt2_attention_converter
max_sequence_length parameter is reasonable for GPU memory (typically < 8192 tokens) and position_ids buffer fits in memory
If this fails: Very large max_sequence_length values (e.g., 1M tokens) cause OOM during buffer allocation or create inefficient attention computations that appear to hang
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
All model parameters are on the same device and next(new_module.parameters()).device returns a valid device
If this fails: If model has no parameters or parameters are on different devices, position_ids buffer is created on wrong device, causing 'tensor on different device' errors during forward pass
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
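A defensive version of the device lookup could check both edge cases before a buffer is created. This is a minimal sketch under stated assumptions: `resolve_single_device` is a hypothetical helper, and the parameters are stand-ins with a `.device` attribute rather than real tensors.

```python
from types import SimpleNamespace


def resolve_single_device(params, default="cpu"):
    """Return the one device shared by all params, or `default` if there are none.

    Hypothetical helper. `params` is any iterable of objects with a `.device`
    attribute, such as the result of `module.parameters()`.
    Raises ValueError when parameters live on more than one device.
    """
    devices = {str(p.device) for p in params}
    if not devices:
        # A parameterless module: calling next() on its parameters would
        # raise StopIteration, so fall back to an explicit default instead.
        return default
    if len(devices) > 1:
        raise ValueError(f"parameters span multiple devices: {sorted(devices)}")
    return devices.pop()


# Stand-in parameters for illustration only.
same_device = [SimpleNamespace(device="cuda:0"), SimpleNamespace(device="cuda:0")]
mixed_devices = [SimpleNamespace(device="cuda:0"), SimpleNamespace(device="cpu")]

chosen = resolve_single_device(same_device)
fallback = resolve_single_device([])
```

Raising on mixed devices is a design choice for this sketch; a sharded model may legitimately span devices and would need a different rule.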
Target modules have the expected position embedding attribute name (e.g., 'position_embeddings', 'wpe') and it's a torch.nn.Embedding layer
If this fails: If position embedding attribute doesn't exist or is not an Embedding, function silently does nothing or crashes with AttributeError, leaving positional information intact
composer/algorithms/alibi/attention_surgery_functions/utils.py:zero_and_freeze_expand_position_embeddings
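The attribute assumption above can be made explicit with a lookup that fails loudly instead of doing nothing. In this sketch `require_position_embedding` is a hypothetical helper, and `FakeEmbedding` and `FakeGPT2` are stand-ins; in real use the caller would pass `torch.nn.Embedding` as `expected_type`.

```python
def require_position_embedding(module, attr_names, expected_type):
    """Return (name, embedding) for the first matching attribute, or raise.

    Hypothetical helper. Raises TypeError if the attribute exists but has the
    wrong type, and AttributeError if none of the names are present.
    """
    for name in attr_names:
        candidate = getattr(module, name, None)
        if candidate is None:
            continue
        if not isinstance(candidate, expected_type):
            raise TypeError(
                f"{type(module).__name__}.{name} is {type(candidate).__name__}, "
                f"expected {expected_type.__name__}"
            )
        return name, candidate
    raise AttributeError(
        f"{type(module).__name__} has none of the attributes {list(attr_names)}"
    )


class FakeEmbedding:
    """Stand-in for an embedding layer, used only for this illustration."""


class FakeGPT2:
    """Stand-in module that stores its position embedding under `wpe`."""

    def __init__(self):
        self.wpe = FakeEmbedding()


found_name, found_embedding = require_position_embedding(
    FakeGPT2(), ("position_embeddings", "wpe"), FakeEmbedding
)
```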
ALiBi algorithm only needs to run once during Event.INIT and the model structure remains static afterward
If this fails: If model architecture changes during training (e.g., dynamic layer addition), ALiBi modifications are lost and position embeddings may be re-enabled, degrading performance without warning
composer/algorithms/alibi/alibi.py:Alibi.match
ALiBi bias slopes are computed using powers of 2 and the number of heads is compatible with the geometric progression pattern
If this fails: For unusual head counts or very large numbers of heads (>64), bias slopes become extremely small or large, causing numerical instability in attention scores
composer/algorithms/alibi/attention_surgery_functions/utils.py:register_alibi
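To see how the slopes depend on the head count, here is a minimal sketch of the geometric sequence described in the ALiBi paper. It is not composer's implementation. `alibi_slopes` is a hypothetical name, and the sketch rejects head counts that are not powers of two instead of interpolating.

```python
import math


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes for a power-of-two head count.

    The first slope is 2 ** (-8 / num_heads) and each later slope multiplies
    by that same ratio, so the last slope is always 2 ** -8.
    """
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError(f"num_heads must be a power of two, got {num_heads}")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (i + 1) for i in range(num_heads)]


slopes_8 = alibi_slopes(8)      # 1/2, 1/4, ..., 1/256
slopes_128 = alibi_slopes(128)  # more heads, same overall range
smallest_128 = min(slopes_128)
```

Under this formula the slopes stay between 2 ** -8 and 1 for any power-of-two head count. The cases most worth testing are therefore head counts that are not powers of two, where an implementation has to pick some interpolation rule.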
Algorithm classes imported in __init__.py have stable interfaces and their dependencies (like transformers library) are available at import time
If this fails: If transformers library is missing or incompatible version is installed, imports fail with MissingConditionalImportError even for users not using NLP algorithms
composer/algorithms/__init__.py
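One common way to keep an optional dependency from breaking unrelated imports is to probe for it before importing. The sketch below uses only the standard library. `is_installed` and `load_optional` are hypothetical helpers and do not describe how composer handles this.

```python
import importlib
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(package_name) is not None


def load_optional(package_name):
    """Import and return an optional package, or None when it is missing."""
    if not is_installed(package_name):
        return None
    return importlib.import_module(package_name)


# `json` ships with Python; the second name is deliberately nonexistent.
present = load_optional("json")
missing = load_optional("definitely_not_a_real_package_xyz")
```

Code that needs the dependency can then raise a clear error at the point of use, while users who never touch that feature import the package without trouble.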
The registry dict can hold references to all registered module types and replacement functions without memory pressure
If this fails: With hundreds of different module types registered, the registry could consume significant memory and slow down module lookup, especially in large model ensembles
composer/algorithms/alibi/attention_surgery_functions/utils.py:PolicyRegistry
See the full structural analysis of composer: the pipeline, data models, and system behavior that put these assumptions in context.
Frequently Asked Questions
What does composer assume that could break in production?
The one most likely to cause trouble: BERT attention modules are assumed to always have a 'num_heads' attribute and query/key tensors of shape (batch, num_heads, seq_len, head_dim), where seq_len <= max_sequence_length. If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), the ALiBi bias computation silently produces wrong attention scores or crashes with a dimension mismatch.
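The shape assumption in this answer can be turned into an explicit check. This is a minimal sketch that works on a plain shape tuple, so it runs without any tensor library. `check_attention_shape` is a hypothetical helper, not a composer function.

```python
def check_attention_shape(shape, num_heads, max_sequence_length):
    """Validate a (batch, num_heads, seq_len, head_dim) shape and return seq_len.

    Hypothetical helper. Raises ValueError with a specific message instead of
    letting a mismatch surface later as a wrong bias or a broadcast error.
    """
    if len(shape) != 4:
        raise ValueError(f"expected 4 dimensions, got {len(shape)}: {tuple(shape)}")
    _, heads, seq_len, _ = shape
    if heads != num_heads:
        raise ValueError(f"expected {num_heads} heads, got {heads}")
    if seq_len > max_sequence_length:
        raise ValueError(
            f"seq_len {seq_len} exceeds max_sequence_length {max_sequence_length}"
        )
    return seq_len


# With a real tensor, the caller would pass `tuple(query.shape)`.
seq_len = check_attention_shape((2, 12, 128, 64), num_heads=12, max_sequence_length=512)
```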
How many hidden assumptions does composer have?
CodeSea found 11 assumptions composer relies on but never validates, 4 of them critical, spanning Shape, Contract, Ordering, Domain, Scale, Environment, Temporal, Resource. Most are routine; the analysis flags the three most likely to cause trouble.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.