Hidden Assumptions in composer

Q: How many hidden assumptions does composer have?

CodeSea found 11 assumptions that composer relies on but never validates, 4 of them critical, spanning eight categories: Shape, Contract, Ordering, Domain, Scale, Environment, Temporal, and Resource. Most are routine; the analysis flags the three most likely to actually bite.

Q: What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.


Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at mosaicml/composer and picked out the 3 most likely to cause trouble here. The rest are minor; they're under "Show everything".

Worth your attention first

If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), the ALiBi bias computation either silently produces wrong attention scores or crashes with a dimension mismatch.
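One way to surface this early is an explicit shape check before the bias is applied. The sketch below is hypothetical: check_attention_shape is not part of composer, and the (batch, num_heads, seq_len, head_dim) layout is taken from the assumption as CodeSea describes it. It works on plain tuples, so with PyTorch it could be called as check_attention_shape(tuple(query.shape), ...).

```python
def check_attention_shape(shape, num_heads, max_sequence_length):
    """Raise ValueError if a query/key shape violates the assumed layout.

    `shape` is a 4-tuple laid out as (batch, num_heads, seq_len, head_dim).
    Hypothetical helper, not part of composer.
    """
    if len(shape) != 4:
        raise ValueError(f"expected 4 dimensions, got {len(shape)}: {shape}")
    _, heads, seq_len, _ = shape
    if heads != num_heads:
        raise ValueError(f"expected {num_heads} heads, got {heads}")
    if seq_len > max_sequence_length:
        raise ValueError(
            f"seq_len {seq_len} exceeds max_sequence_length {max_sequence_length}"
        )
    return seq_len
```

A failed check turns the silent wrong-score case into an immediate, readable error.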

If a replacement function returns None or a module with a different forward() signature, training either silently uses the wrong attention mechanism or crashes with 'NoneType has no attribute' errors.
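A wrapper around the replacement call can check both failure modes. This is a minimal sketch under stated assumptions: checked_replacement is a hypothetical name, and comparing forward() parameter names is only one possible definition of "same signature".

```python
import inspect


def checked_replacement(replace_fn, original, *args, **kwargs):
    """Call a replacement function and validate what it returns.

    Hypothetical wrapper: rejects None and any replacement whose forward()
    parameter names differ from those of the original module.
    """
    # Read the original signature first, in case replace_fn edits in place.
    old_params = list(inspect.signature(original.forward).parameters)
    new_module = replace_fn(original, *args, **kwargs)
    fn_name = getattr(replace_fn, "__name__", repr(replace_fn))
    if new_module is None:
        raise TypeError(f"{fn_name} returned None")
    new_params = list(inspect.signature(new_module.forward).parameters)
    if old_params != new_params:
        raise TypeError(
            f"forward() signature changed: {old_params} -> {new_params}"
        )
    return new_module
```

The check is deliberately strict. A replacement that legitimately adds an optional argument would need a looser comparison.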

If the surgery is called after optimizer initialization, the optimizer state becomes misaligned with the model parameters, causing training to diverge or crash with 'parameter count mismatch' errors.
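The misalignment can be detected by comparing object identities. The sketch below assumes only that each optimizer param group is a dict with a "params" list, which is how PyTorch's optimizer.param_groups is structured. find_untracked_params is a hypothetical helper.

```python
def find_untracked_params(model_params, param_groups):
    """Return model parameters that no optimizer param group references.

    Comparison is by identity, so any objects work. With PyTorch, pass
    model.parameters() and optimizer.param_groups.
    """
    tracked = {id(p) for group in param_groups for p in group["params"]}
    return [p for p in model_params if id(p) not in tracked]
```

A non-empty result after surgery suggests the optimizer was built before the modules were replaced and needs to be rebuilt.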

Show everything (8 more)

Domain

GPT2Attention modules use causal attention (decoder-only) and have a 'num_heads' attribute whose value divides evenly into the model's hidden dimension

If this fails: When applied to encoder-decoder models or models with non-standard attention shapes, the ALiBi bias matrix has the wrong causality or dimension and produces incorrect attention weights without raising an error

composer/algorithms/alibi/attention_surgery_functions/_gpt2.py:gpt2_attention_converter

Scale

max_sequence_length parameter is reasonable for GPU memory (typically < 8192 tokens) and position_ids buffer fits in memory

If this fails: Very large max_sequence_length values (e.g., 1M tokens) cause OOM during buffer allocation or create inefficient attention computations that appear to hang

composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
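The growth is easy to estimate by hand. The arithmetic below assumes a dense float32 bias of shape (num_heads, L, L). That layout is an assumption for illustration, and the actual buffers in composer may be shaped differently. dense_bias_bytes is a hypothetical helper.

```python
def dense_bias_bytes(num_heads, max_sequence_length, bytes_per_element=4):
    """Bytes needed for a dense (num_heads, L, L) bias tensor (assumed layout)."""
    return num_heads * max_sequence_length ** 2 * bytes_per_element


GIB = 1024 ** 3
for length in (512, 8192, 1_000_000):
    size = dense_bias_bytes(12, length)
    print(f"L={length}: {size / GIB:.2f} GiB")
```

With 12 heads, 8192 tokens already costs 3 GiB under this layout, and 1M tokens would need tens of terabytes, so a bounds check on max_sequence_length is cheap insurance.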

Environment

All model parameters are on the same device and next(new_module.parameters()).device returns a valid device

If this fails: When the model has no parameters or its parameters are on different devices, the position_ids buffer is created on the wrong device, causing 'tensor on different device' errors during the forward pass

composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
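Calling next() on an empty iterator raises StopIteration, and taking only the first parameter hides any device split. A minimal sketch of a stricter lookup follows. single_device is a hypothetical helper, and it only assumes each parameter exposes a hashable .device attribute, as PyTorch parameters do.

```python
def single_device(params):
    """Return the one device shared by all parameters, or raise ValueError.

    `params` is any iterable of objects with a `.device` attribute, such as
    module.parameters() in PyTorch. Hypothetical helper.
    """
    devices = {p.device for p in params}
    if not devices:
        raise ValueError("module has no parameters; cannot infer a device")
    if len(devices) > 1:
        names = sorted(str(d) for d in devices)
        raise ValueError(f"parameters span multiple devices: {names}")
    return devices.pop()
```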

Contract

Target modules have the expected position embedding attribute name (e.g., 'position_embeddings', 'wpe') and it's a torch.nn.Embedding layer

If this fails: When the position embedding attribute doesn't exist or is not an Embedding, the function either crashes with an AttributeError or silently does nothing, leaving the learned positional information intact

composer/algorithms/alibi/attention_surgery_functions/utils.py:zero_and_freeze_expand_position_embeddings
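The attribute contract can be made explicit with a small lookup that fails loudly in both cases. This is a sketch only: require_attr is a hypothetical helper, not a composer function.

```python
def require_attr(module, name, expected_type):
    """Fetch module.<name>, failing loudly if it is missing or mistyped."""
    if not hasattr(module, name):
        raise AttributeError(
            f"{type(module).__name__} has no attribute {name!r}"
        )
    value = getattr(module, name)
    if not isinstance(value, expected_type):
        raise TypeError(
            f"{name!r} is {type(value).__name__}, "
            f"expected {expected_type.__name__}"
        )
    return value
```

With PyTorch installed, a call might look like require_attr(embeddings, "position_embeddings", torch.nn.Embedding), where embeddings stands for whichever module is being converted.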

Temporal

ALiBi algorithm only needs to run once during Event.INIT and the model structure remains static afterward

If this fails: When the model architecture changes during training (e.g., dynamic layer addition), the ALiBi modifications are lost and position embeddings may be re-enabled, degrading performance without warning

composer/algorithms/alibi/alibi.py:Alibi.match

Domain

ALiBi bias slopes are computed using powers of 2 and the number of heads is compatible with the geometric progression pattern

If this fails: For unusual head counts or very large numbers of heads (>64), bias slopes become extremely small or large, causing numerical instability in attention scores

composer/algorithms/alibi/attention_surgery_functions/utils.py:register_alibi
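For a power-of-two head count n, the ALiBi paper defines the slopes as the geometric sequence 2^(-8i/n) for i = 1..n. The sketch below implements that formula and rejects other head counts outright. It is not a copy of composer's register_alibi, and refusing non-power-of-two counts is a simplification chosen for this example. alibi_slopes is a hypothetical name.

```python
def alibi_slopes(num_heads):
    """Per-head ALiBi slopes for a power-of-two number of heads.

    Follows the paper's geometric sequence 2^(-8 * i / n), i = 1..n.
    Raises ValueError for head counts that are not a power of two.
    """
    is_power_of_two = num_heads >= 1 and (num_heads & (num_heads - 1)) == 0
    if not is_power_of_two:
        raise ValueError(f"num_heads must be a power of two, got {num_heads}")
    return [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]
```

For 8 heads this yields slopes from 1/2 down to 1/256.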

Contract

Algorithm classes imported in __init__.py have stable interfaces and their dependencies (like transformers library) are available at import time

If this fails: When the transformers library is missing or an incompatible version is installed, imports fail with MissingConditionalImportError even for users who are not using the NLP algorithms

composer/algorithms/__init__.py
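One common mitigation is to probe for the optional package and defer the real import until an NLP algorithm is actually requested. The sketch below uses only the standard library. is_installed and require are hypothetical helpers, and the error they raise is a plain ImportError rather than composer's own exception type.

```python
import importlib
import importlib.util


def is_installed(package):
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package) is not None


def require(package, feature):
    """Import an optional package on demand, with a targeted error message."""
    if not is_installed(package):
        raise ImportError(
            f"{feature} needs the optional package {package!r}; "
            f"install it to use this feature"
        )
    return importlib.import_module(package)
```

Called from inside the algorithm rather than at module import time, require("transformers", "ALiBi") would leave users of other algorithms unaffected.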

Resource

The registry dict can hold references to all registered module types and replacement functions without memory pressure

If this fails: With hundreds of different module types registered, the registry could consume significant memory and slow down module lookup, especially in large model ensembles

composer/algorithms/alibi/attention_surgery_functions/utils.py:PolicyRegistry


Frequently Asked Questions

What does composer assume that could break in production?

The one most likely to cause trouble: BERT attention modules are assumed to always have a 'num_heads' attribute and query/key tensors of shape (batch, num_heads, seq_len, head_dim), where seq_len <= max_sequence_length. If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), the ALiBi bias computation either silently produces wrong attention scores or crashes with a dimension mismatch.
