mosaicml/composer
Supercharge Your Model Training
Optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring
Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 9-component ML training system. 397 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Data enters through DataLoaders wrapped by DataSpec objects that standardize batch format. The Trainer initializes the training state and starts the Engine, which dispatches Events at training milestones. At each event, Algorithms check if they match the current state and event, then apply transformations to the model, data, or training parameters. The modified state flows through the standard PyTorch training loop (forward pass, loss computation, backward pass, optimizer step) while Profilers and Loggers capture metrics. Checkpoints serialize the complete state including algorithm-specific information for later resumption.
- Initialize training infrastructure — Trainer creates State object with model, optimizers, data loaders, and initializes all algorithms and callbacks with their configurations. Sets up distributed training if multi-node, configures device placement, and establishes logging systems. [AlgorithmConfig → State] (config: parallelism.fsdp, parallelism.tp, training.device)
- Apply structural algorithms — Engine dispatches INIT event, triggering algorithms like ChannelsLast, BlurPool, and ALiBi to perform model surgery. These algorithms use module_surgery to traverse the model and replace layers — ALiBi removes position embeddings and modifies attention, BlurPool adds anti-aliasing filters to convolutions. [State → State] (config: algorithms.alibi.max_sequence_length, algorithms.blurpool.replace_convs, algorithms.channels_last.enabled)
- Load and transform training batch — DataSpec extracts batch from DataLoader and ensures proper device placement. Data augmentation algorithms like AugMix and ColOut check for BATCH_START event and apply their transformations to input tensors, modifying the batch in-place. [TrainingBatch → TrainingBatch] (config: data.batch_size, algorithms.augmix.severity, algorithms.colout.p_row)
- Execute forward pass with monitoring — Engine dispatches BEFORE_FORWARD event, then model processes the batch. Profiler captures GPU memory usage and compute metrics. Engine dispatches AFTER_FORWARD event, allowing algorithms to examine or modify the outputs before loss computation. [TrainingBatch → State] (config: model.precision, profiler.enabled)
- Compute loss and execute backward pass — Loss function processes model outputs and targets. Engine dispatches BEFORE_BACKWARD, then PyTorch autograd computes gradients. Algorithms can modify gradients or loss values. Engine dispatches AFTER_BACKWARD for gradient-related algorithms. [State → State] (config: training.loss, training.grad_clip_norm)
- Update model parameters — Optimizer steps update model parameters using computed gradients. LR schedulers adjust learning rates based on training progress. Engine dispatches BATCH_END event for algorithms that need to track training statistics or modify parameters post-update. [State → State] (config: optimizer.lr, scheduler.warmup_steps)
- Checkpoint and log metrics — At configured intervals, CheckpointSaver serializes complete training state including model weights, optimizer state, random seeds, and algorithm-specific data. Loggers record training metrics, performance statistics, and algorithm-specific measurements to configured backends. [State → CheckpointState] (config: checkpoint.save_interval, logging.log_level)
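Taken together, these stages form an event-dispatch loop: the Engine raises an event, and each algorithm decides whether to act. The sketch below is a deliberately minimal, self-contained illustration of that pattern. The names Event, State, Algorithm, and Engine mirror Composer's concepts, but every body here is a hypothetical toy, not Composer's code:

```python
from enum import Enum, auto

class Event(Enum):
    INIT = auto()
    BATCH_START = auto()
    BEFORE_FORWARD = auto()
    AFTER_FORWARD = auto()
    BEFORE_BACKWARD = auto()
    AFTER_BACKWARD = auto()
    BATCH_END = auto()

class State:
    """Mutable training state shared by every algorithm."""
    def __init__(self):
        self.batch = None
        self.applied = []  # records (algorithm_name, event_name) applications

class Algorithm:
    def match(self, event, state):
        """Decide whether this algorithm runs at this event."""
        raise NotImplementedError
    def apply(self, event, state):
        """Mutate the shared state in place."""
        raise NotImplementedError

class ChannelsLastLike(Algorithm):
    """Toy structural algorithm: performs its work once, at INIT."""
    def match(self, event, state):
        return event is Event.INIT
    def apply(self, event, state):
        state.applied.append(("channels_last", event.name))

class Engine:
    """Dispatches each event to every algorithm that matches it."""
    def __init__(self, state, algorithms):
        self.state = state
        self.algorithms = algorithms
    def run_event(self, event):
        for alg in self.algorithms:
            if alg.match(event, self.state):
                alg.apply(event, self.state)

engine = Engine(State(), [ChannelsLastLike()])
for ev in (Event.INIT, Event.BATCH_START, Event.BATCH_END):
    engine.run_event(ev)
# the toy algorithm fired only at INIT
```

The key design property this sketch captures is composability: algorithms never call each other, they only read and mutate the shared state when their event fires.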
Data Models
The data structures that flow between stages — the contracts that hold the system together.
composer/core/state.py — Training state object containing model: ComposerModel, optimizers: list[Optimizer], train_dataloader: DataLoader, eval_dataloaders: dict, batch: Any, outputs: Any, loss: Tensor, timestamp: Timestamp, max_duration: Time, precision: Precision, device: Device, rank_zero_seed: int
Created at training start, continuously updated during training loop, passed to all algorithms and callbacks for inspection and modification
composer/core/data_spec.py — Generic batch structure with inputs: Any (typically Tensor[B, ...]), targets: Any (typically Tensor[B] or Tensor[B, ...]), depending on task — vision uses Tensor[B, C, H, W] inputs, NLP uses token_ids: Tensor[B, seq_len]
Loaded from DataLoader, potentially transformed by data augmentation algorithms, fed to model forward pass, then used for loss computation
composer/algorithms/*/ — Algorithm-specific dataclasses like FSDPConfig(activation_checkpointing: bool, backward_prefetch: str, cpu_offload: bool, data_parallel_shard_degree: int), BlurPoolConfig(replace_convs: bool, min_channels: int), AugMixConfig(severity: int, width: int, alpha: float)
Parsed from config files or constructed programmatically, used to initialize Algorithm instances with specific hyperparameters
composer/core/event.py — Enum with values like INIT, EPOCH_START, BATCH_START, BEFORE_FORWARD, AFTER_FORWARD, BEFORE_BACKWARD, AFTER_BACKWARD, BATCH_END, EPOCH_END — each representing a specific point in the training timeline
Generated by the training engine at specific points in the training loop, used by algorithms to determine when to apply their transformations
composer/checkpoint/ — Serializable dict containing state_dict: dict, model: dict, optimizers: list[dict], lr_schedulers: list[dict], algorithms: list[dict], callbacks: list[dict], timestamp: dict, rank_zero_seed: int, train_metrics: dict, eval_metrics: dict
Assembled from current training state, serialized to storage, later deserialized to restore exact training state including random number generator states
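The checkpoint contract above can be illustrated with a small serialize-and-restore round trip. The build_checkpoint helper below is hypothetical; only the field names come from the structure listed above, and plain pickle stands in for whatever serialization backend the library actually uses:

```python
import io
import pickle
import random

def build_checkpoint(model_sd, optimizer_sds, timestamp, seed):
    """Assemble a serializable snapshot. The dict shape mirrors the fields
    listed above; the helper itself is hypothetical, not Composer's API."""
    return {
        "model": model_sd,
        "optimizers": optimizer_sds,
        "timestamp": timestamp,
        "rank_zero_seed": seed,
        "rng": random.getstate(),  # capture RNG state so resumption is exact
    }

ckpt = build_checkpoint(
    model_sd={"w": [0.1, 0.2]},
    optimizer_sds=[{"lr": 3e-4}],
    timestamp={"batch": 100},
    seed=42,
)
buf = io.BytesIO()
pickle.dump(ckpt, buf)        # serialize to storage
buf.seek(0)
restored = pickle.load(buf)   # later: deserialize and resume training
```

Capturing RNG state alongside weights and optimizer state is what makes resumption exact rather than merely approximate.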
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
BERT attention modules always have a 'num_heads' attribute and query/key tensors with shape (batch, num_heads, seq_len, head_dim), where seq_len <= max_sequence_length
If this fails: If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), ALiBi bias computation silently produces wrong attention scores or crashes with dimension mismatch
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_attention_converter
All registered ALiBi replacement functions will be called with valid torch.nn.Module instances and return modules compatible with the original's interface
If this fails: If a replacement function returns None or a module with different forward() signature, training silently uses wrong attention mechanism or crashes with 'NoneType has no attribute' errors
composer/algorithms/alibi/attention_surgery_functions/utils.py:PolicyRegistry.register
Model surgery must be applied before any optimizer state is created, since position embeddings are frozen and parameter counts change
If this fails: If called after optimizer initialization, optimizer state becomes misaligned with model parameters, causing training to diverge or crash with 'parameter count mismatch' errors
composer/algorithms/alibi/alibi.py:apply_alibi
GPT2Attention modules use causal attention (decoder-only) and have 'num_heads' attribute that divides evenly into the model's hidden dimension
If this fails: If applied to encoder-decoder models or models with non-standard attention shapes, ALiBi bias matrix has wrong causality or dimension, producing incorrect attention weights without error
composer/algorithms/alibi/attention_surgery_functions/_gpt2.py:gpt2_attention_converter
max_sequence_length parameter is reasonable for GPU memory (typically < 8192 tokens) and position_ids buffer fits in memory
If this fails: Very large max_sequence_length values (e.g., 1M tokens) cause OOM during buffer allocation or create inefficient attention computations that appear to hang
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
All model parameters are on the same device and next(new_module.parameters()).device returns a valid device
If this fails: If model has no parameters or parameters are on different devices, position_ids buffer is created on wrong device, causing 'tensor on different device' errors during forward pass
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
Target modules have the expected position embedding attribute name (e.g., 'position_embeddings', 'wpe') and it's a torch.nn.Embedding layer
If this fails: If position embedding attribute doesn't exist or is not an Embedding, function silently does nothing or crashes with AttributeError, leaving positional information intact
composer/algorithms/alibi/attention_surgery_functions/utils.py:zero_and_freeze_expand_position_embeddings
ALiBi algorithm only needs to run once during Event.INIT and the model structure remains static afterward
If this fails: If model architecture changes during training (e.g., dynamic layer addition), ALiBi modifications are lost and position embeddings may be re-enabled, degrading performance without warning
composer/algorithms/alibi/alibi.py:Alibi.match
ALiBi bias slopes are computed using powers of 2 and the number of heads is compatible with the geometric progression pattern
If this fails: For unusual head counts or very large numbers of heads (>64), bias slopes become extremely small or large, causing numerical instability in attention scores
composer/algorithms/alibi/attention_surgery_functions/utils.py:register_alibi
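The slope assumption above can be made concrete. A common way to compute ALiBi slopes, following the scheme from the ALiBi paper, is a geometric sequence in powers of 2, with interleaving for non-power-of-two head counts. This sketch is illustrative, not Composer's register_alibi:

```python
import math

def alibi_slopes(n_heads):
    """Per-head ALiBi bias slopes as a geometric sequence in powers of 2.
    For n a power of two, slope_i = 2**(-8 * (i + 1) / n); other head counts
    interleave slopes drawn from the two nearest powers of two."""
    def pow2_slopes(n):
        start = 2 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return pow2_slopes(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = pow2_slopes(2 * closest)[0::2][: n_heads - closest]
    return pow2_slopes(closest) + extra

slopes = alibi_slopes(8)  # 0.5, 0.25, ..., down to 2**-8
```

Each head gets a progressively weaker distance penalty; the geometric decay is why extreme head counts push the smallest slopes toward the limits of floating-point range.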
Algorithm classes imported in __init__.py have stable interfaces and their dependencies (like transformers library) are available at import time
If this fails: If transformers library is missing or incompatible version is installed, imports fail with MissingConditionalImportError even for users not using NLP algorithms
composer/algorithms/__init__.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Training State — Central mutable state containing the current model, batch, loss, timestamp, and algorithm-specific state that persists across training steps
- Checkpoint Storage — Persistent storage of complete training state snapshots enabling exact training resumption from any saved point
- Collection of initialized algorithm instances that match against events and apply transformations to training state
- Accumulated GPU utilization, memory usage, throughput measurements, and training statistics for performance analysis
Feedback Loops
- Event-Algorithm Loop (recursive, reinforcing) — Trigger: Engine dispatches training event. Action: Each algorithm checks match() against current state and event, applies transformations if matched. Exit: All algorithms processed for current event.
- Training Convergence Loop (training-loop, balancing) — Trigger: Training step completion. Action: Forward pass computes loss, backward pass computes gradients, optimizer updates parameters, loss feeds back to influence next iteration. Exit: Max duration reached or convergence criteria met.
- Checkpointing Loop (scheduled-job, reinforcing) — Trigger: Checkpoint interval reached. Action: Serialize current training state, save to storage, continue training. Exit: Training completes.
- Distributed Synchronization (backpressure, balancing) — Trigger: Gradient computation complete. Action: AllReduce gradients across workers, wait for all workers to complete before optimizer step. Exit: All workers synchronized.
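The Training Convergence Loop above is ordinary gradient descent with an exit condition. A dependency-free sketch on a toy quadratic loss (not Composer's trainer) shows the forward/backward/update/exit structure:

```python
def train(w, lr=0.1, max_steps=100, tol=1e-6):
    """Minimize loss(w) = (w - 3)^2 by gradient descent:
    forward, backward, optimizer step, then check the exit criteria."""
    for step in range(max_steps):
        loss = (w - 3.0) ** 2      # forward pass
        grad = 2.0 * (w - 3.0)     # backward pass (analytic gradient)
        w = w - lr * grad          # optimizer step
        if loss < tol:             # convergence exit
            break
    return w, step + 1

w_final, n_steps = train(0.0)      # converges well before max_steps
```

The loss computed in one iteration feeds back to shape the next parameter update, which is what makes this a balancing loop rather than open-loop processing.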
Delays
- Model Surgery Compilation (compilation, ~seconds to minutes) — One-time delay during INIT event while algorithms modify model architecture and PyTorch recompiles computation graphs
- Distributed Gradient Sync (async-processing, ~milliseconds) — Workers wait for gradient AllReduce completion before optimizer step, creating synchronization barrier
- Checkpoint Serialization (checkpoint-save, ~seconds) — Training pauses while complete state is serialized to disk, frequency controlled by save_interval
- Data Loading Buffer (batch-window, ~variable) — DataLoader prefetching creates pipeline parallelism between data loading and model computation
Control Points
- Algorithm Selection (feature-flag) — Controls: Which efficiency algorithms are active during training — each can be enabled/disabled independently. Default: configurable per algorithm
- Precision Mode (precision-mode) — Controls: Model computation precision (fp32, fp16, bf16) affecting memory usage and training speed. Default: fp32
- FSDP Sharding Strategy (architecture-switch) — Controls: How model parameters are sharded across workers — full_shard, shard_grad_op, or no_shard. Default: full_shard
- Activation Checkpointing (hyperparameter) — Controls: Whether to trade compute for memory by recomputing activations during backward pass. Default: False
- Backward Prefetch Strategy (performance-toggle) — Controls: When to prefetch parameters during backward pass — BACKWARD_PRE or BACKWARD_POST. Default: BACKWARD_POST
- Device Selection (device-selection) — Controls: Which compute device (CPU, GPU, TPU) to use for training. Default: cuda
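Control points like these typically surface as a single configuration object. The dataclass below is hypothetical: its field names and defaults are taken from the list above, not from Composer's actual FSDPConfig.

```python
from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    """Hypothetical config object mirroring the control points above."""
    precision: str = "fp32"                   # fp32 | fp16 | bf16
    sharding_strategy: str = "full_shard"     # full_shard | shard_grad_op | no_shard
    activation_checkpointing: bool = False    # recompute activations to save memory
    backward_prefetch: str = "BACKWARD_POST"  # BACKWARD_PRE | BACKWARD_POST
    device: str = "cuda"

# Override only what differs from the defaults
cfg = ParallelismConfig(precision="bf16", activation_checkpointing=True)
```

Grouping the knobs in one typed object keeps defaults explicit and makes any deviation from them visible at the call site.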
Technology Stack
- PyTorch — Core deep learning framework providing model definition, autograd, and distributed training primitives
- Transformers — Pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery
- TorchVision — Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut
- FSDP — Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models
- Pillow — Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion
- NumPy — Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations
- pytest — Testing framework with custom markers for GPU tests, distributed tests, and algorithm-specific test configurations
- Build tooling for packaging and distribution of the Composer library
Key Components
- Trainer (orchestrator) — Central coordinator that manages the complete training lifecycle — initializes models and optimizers, runs the training loop with event dispatch, handles distributed training setup, and coordinates checkpointing and evaluation. (composer/trainer/trainer.py)
- Engine (executor) — Core training loop executor that dispatches events at specific training milestones, allowing algorithms and callbacks to hook into the training process and modify behavior. (composer/core/engine.py)
- Algorithm (transformer) — Base class for efficiency algorithms that implement match() to detect when to run and apply() to modify the training state — enables composable training optimizations. (composer/core/algorithm.py)
- PolicyRegistry (registry) — Maps PyTorch module types to their ALiBi replacement functions, enabling automatic model surgery that replaces attention mechanisms with ALiBi-compatible versions. (composer/algorithms/alibi/attention_surgery_functions/utils.py)
- module_surgery (transformer) — Runtime model modification system that traverses module hierarchies and replaces matching modules with optimized versions — used by algorithms like BlurPool and ALiBi. (composer/utils/module_surgery.py)
- DataSpec (adapter) — Bridges different data loading patterns by wrapping DataLoaders and defining how to extract device-compatible batches, enabling consistent data handling across different model types. (composer/core/data_spec.py)
- CheckpointSaver (serializer) — Handles serialization of complete training state to persistent storage, including model weights, optimizer state, random seeds, and algorithm-specific state for exact resumption. (composer/checkpoint/checkpoint_saver.py)
- FSDPConfig (factory) — Configuration factory for Fully Sharded Data Parallel setup with parameters like activation_checkpointing, backward_prefetch, and cpu_offload controlling memory and communication optimization. (composer/utils/parallelism.py)
- Profiler (monitor) — Performance monitoring system that tracks GPU utilization, memory usage, and throughput metrics, and generates detailed performance reports for optimization. (composer/profiler/)
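The module_surgery idea, traversing the hierarchy and swapping matching modules via a type-to-replacement registry, can be sketched without PyTorch. All class names here are toy stand-ins, not Composer's implementation:

```python
class Conv:
    """Stand-in for a module type targeted by surgery (e.g. nn.Conv2d)."""

class BlurConv(Conv):
    """Stand-in for the replacement (e.g. a blur-pooled convolution)."""

class Module:
    """Tiny stand-in for nn.Module with named children."""
    def __init__(self, **children):
        self.children = children

def replace_module_classes(module, policies):
    """Recursively walk the hierarchy and swap any child whose exact type
    has a registered replacement policy. Returns the number of swaps."""
    count = 0
    for name, child in list(module.children.items()):
        if isinstance(child, Module):
            count += replace_module_classes(child, policies)
        elif type(child) in policies:
            module.children[name] = policies[type(child)](child)
            count += 1
    return count

model = Module(stem=Conv(), block=Module(conv=Conv()))
n_replaced = replace_module_classes(model, {Conv: lambda old: BlurConv()})
# both Conv instances, at any depth, are now BlurConv
```

Matching on exact type rather than isinstance is deliberate: it keeps already-replaced modules (BlurConv is a Conv subclass) from being replaced again on a second pass.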
Frequently Asked Questions
What is composer used for?
composer optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring. mosaicml/composer is a 9-component ML training library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 397 files.
How is composer architected?
composer is organized into 5 architecture layers: Training Orchestration, Algorithm Registry, Model Surgery, Distributed Infrastructure, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through composer?
Data moves through 7 stages: Initialize training infrastructure → Apply structural algorithms → Load and transform training batch → Execute forward pass with monitoring → Compute loss and execute backward pass → ... In short, DataSpec-wrapped DataLoaders feed batches into the Trainer-managed loop, the Engine dispatches Events at training milestones, matching Algorithms transform the model, data, or training parameters, and Profilers, Loggers, and Checkpoints capture and serialize state along the way. This pipeline design reflects a complex multi-stage processing system.
What technologies does composer use?
The core stack includes PyTorch (Core deep learning framework providing model definition, autograd, and distributed training primitives), Transformers (Provides pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery), TorchVision (Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut), FSDP (Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models), Pillow (Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion), NumPy (Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does composer have?
composer exhibits 4 data pools (Training State, Checkpoint Storage), 4 feedback loops, 6 control points, and 4 delays. The feedback loops include recursive event dispatch and the core training loop itself. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does composer use?
5 design patterns detected: Two-Way Callbacks, Module Surgery Registry, Event-Driven Architecture, Composable State Management, Functional Algorithm Interface.
How does composer compare to alternatives?
CodeSea offers side-by-side architecture comparisons of composer with pytorch-lightning, covering tech stack differences, pipeline design, system behavior, and code patterns.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.