mosaicml/composer

Supercharge Your Model Training

5,477 stars · Python · 9 components

Optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring

Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.

A 9-component ML training system. 397 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Data enters through DataLoaders wrapped by DataSpec objects that standardize batch format. The Trainer initializes the training state and starts the Engine, which dispatches Events at training milestones. At each event, Algorithms check if they match the current state and event, then apply transformations to the model, data, or training parameters. The modified state flows through the standard PyTorch training loop (forward pass, loss computation, backward pass, optimizer step) while Profilers and Loggers capture metrics. Checkpoints serialize the complete state including algorithm-specific information for later resumption.
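
To make this concrete, here is a minimal sketch using Composer's public Trainer API. The synthetic dataset and the ResNet wrapper are illustrative, and constructor arguments may differ slightly between Composer versions.

```python
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.algorithms import BlurPool, ChannelsLast
from composer.models import ComposerClassifier

# Wrap a plain torchvision model so Composer's Trainer can drive the loop.
model = ComposerClassifier(torchvision.models.resnet18(num_classes=10), num_classes=10)

# A tiny synthetic dataset standing in for a real DataLoader.
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))),
    batch_size=16,
)

# The Trainer builds the State, runs the Engine, and dispatches Events;
# each listed algorithm decides per event whether to modify that State.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[ChannelsLast(), BlurPool(replace_convs=True)],
)
trainer.fit()
```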

  1. Initialize training infrastructure — Trainer creates State object with model, optimizers, data loaders, and initializes all algorithms and callbacks with their configurations. Sets up distributed training if multi-node, configures device placement, and establishes logging systems. [AlgorithmConfig → State] (config: parallelism.fsdp, parallelism.tp, training.device)
  2. Apply structural algorithms — Engine dispatches INIT event, triggering algorithms like ChannelsLast, BlurPool, and ALiBi to perform model surgery. These algorithms use module_surgery to traverse the model and replace layers — ALiBi removes position embeddings and modifies attention, BlurPool adds anti-aliasing filters to convolutions. [State → State] (config: algorithms.alibi.max_sequence_length, algorithms.blurpool.replace_convs, algorithms.channels_last.enabled)
  3. Load and transform training batch — DataSpec extracts batch from DataLoader and ensures proper device placement. Data augmentation algorithms like AugMix and ColOut check for BATCH_START event and apply their transformations to input tensors, modifying the batch in-place. [TrainingBatch → TrainingBatch] (config: data.batch_size, algorithms.augmix.severity, algorithms.colout.p_row)
  4. Execute forward pass with monitoring — Engine dispatches BEFORE_FORWARD event, then model processes the batch. Profiler captures GPU memory usage and compute metrics. Engine dispatches AFTER_FORWARD event, allowing algorithms to examine or modify the outputs before loss computation. [TrainingBatch → State] (config: model.precision, profiler.enabled)
  5. Compute loss and execute backward pass — Loss function processes model outputs and targets. Engine dispatches BEFORE_BACKWARD, then PyTorch autograd computes gradients. Algorithms can modify gradients or loss values. Engine dispatches AFTER_BACKWARD for gradient-related algorithms. [State → State] (config: training.loss, training.grad_clip_norm)
  6. Update model parameters — Optimizer steps update model parameters using computed gradients. LR schedulers adjust learning rates based on training progress. Engine dispatches BATCH_END event for algorithms that need to track training statistics or modify parameters post-update. [State → State] (config: optimizer.lr, scheduler.warmup_steps)
  7. Checkpoint and log metrics — At configured intervals, CheckpointSaver serializes complete training state including model weights, optimizer state, random seeds, and algorithm-specific data. Loggers record training metrics, performance statistics, and algorithm-specific measurements to configured backends. [State → CheckpointState] (config: checkpoint.save_interval, logging.log_level)
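
Steps 2 through 6 all go through the same two-method interface: match decides whether an algorithm runs at a given event, and apply mutates the shared State. The ScaleLoss algorithm below is hypothetical and exists only to show the hook points.

```python
from composer.core import Algorithm, Event, State
from composer.loggers import Logger


class ScaleLoss(Algorithm):
    """Hypothetical algorithm: scales the loss right after it is computed."""

    def __init__(self, scale: float = 0.5):
        self.scale = scale

    def match(self, event: Event, state: State) -> bool:
        # The Engine calls this at every event; only run once the loss exists.
        return event == Event.AFTER_LOSS

    def apply(self, event: Event, state: State, logger: Logger) -> None:
        # Mutate the shared State in place, as built-in algorithms do
        # (assuming state.loss is a single tensor here).
        state.loss = state.loss * self.scale
```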

Data Models

The data structures that flow between stages — the contracts that hold the system together.

State composer/core/state.py
Training state object containing model: ComposerModel, optimizers: list[Optimizer], train_dataloader: DataLoader, eval_dataloaders: dict, batch: Any, outputs: Any, loss: Tensor, timestamp: Timestamp, max_duration: Time, precision: Precision, device: Device, rank_zero_seed: int
Created at training start, continuously updated during training loop, passed to all algorithms and callbacks for inspection and modification
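
A sketch, assuming Composer's Callback hooks (which receive this same State object), of read-only inspection of the fields listed above; the callback itself is illustrative.

```python
from composer.core import Callback, State
from composer.loggers import Logger


class LossPrinter(Callback):
    """Illustrative callback that inspects, but does not modify, the State."""

    def batch_end(self, state: State, logger: Logger) -> None:
        # state.timestamp tracks training progress; state.loss holds the last loss.
        print(f"batch {state.timestamp.batch.value}: loss={state.loss}")
```

Callbacks like this are passed to the Trainer alongside algorithms via its callbacks argument.
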
TrainingBatch composer/core/data_spec.py
Generic batch structure with inputs: Any (typically Tensor[B, ...]), targets: Any (typically Tensor[B] or Tensor[B, ...]), depending on task — vision uses Tensor[B, C, H, W] inputs, NLP uses token_ids: Tensor[B, seq_len]
Loaded from DataLoader, potentially transformed by data augmentation algorithms, fed to model forward pass, then used for loss computation
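
A sketch of wrapping a DataLoader in a DataSpec when the batch is not a plain (inputs, targets) pair; the dict batch layout and helper names here are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader

from composer.core import DataSpec


def collate_as_dict(samples):
    # Hypothetical batch layout: a dict with "inputs" [B, C, H, W] and "targets" [B].
    xs, ys = zip(*samples)
    return {"inputs": torch.stack(xs), "targets": torch.tensor(ys)}


def num_samples_in_batch(batch):
    return batch["inputs"].shape[0]


raw_loader = DataLoader(
    [(torch.randn(3, 32, 32), i % 10) for i in range(64)],
    batch_size=16,
    collate_fn=collate_as_dict,
)

# The DataSpec tells the Trainer how to count samples in this custom batch format.
train_dataspec = DataSpec(raw_loader, get_num_samples_in_batch=num_samples_in_batch)
# Trainer(model=model, train_dataloader=train_dataspec, ...)
```
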
AlgorithmConfig composer/algorithms/*/
Algorithm-specific dataclasses like FSDPConfig(activation_checkpointing: bool, backward_prefetch: str, cpu_offload: bool, data_parallel_shard_degree: int), BlurPoolConfig(replace_convs: bool, min_channels: int), AugMixConfig(severity: int, width: int, alpha: float)
Parsed from config files or constructed programmatically, used to initialize Algorithm instances with specific hyperparameters
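
A sketch of how such hyperparameters become algorithm instances; in Composer they are plain constructor arguments, and the parameter names here follow the configs above but may vary across versions.

```python
from composer.algorithms import AugMix, BlurPool

# Hyperparameters such as replace_convs, min_channels, severity, width, and alpha
# map directly onto the corresponding Algorithm constructors.
algorithms = [
    BlurPool(replace_convs=True, min_channels=16),
    AugMix(severity=3, width=3, alpha=1.0),
]
# Passed to the Trainer: Trainer(model=model, ..., algorithms=algorithms)
```
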
Event composer/core/event.py
Enum with values like INIT, EPOCH_START, BATCH_START, BEFORE_FORWARD, AFTER_FORWARD, BEFORE_BACKWARD, AFTER_BACKWARD, BATCH_END, EPOCH_END — each representing a specific point in the training timeline
Generated by the training engine at specific points in the training loop, used by algorithms to determine when to apply their transformations
CheckpointState composer/checkpoint/
Serializable dict containing state_dict: dict, model: dict, optimizers: list[dict], lr_schedulers: list[dict], algorithms: list[dict], callbacks: list[dict], timestamp: dict, rank_zero_seed: int, train_metrics: dict, eval_metrics: dict
Assembled from current training state, serialized to storage, later deserialized to restore exact training state including random number generator states
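
A sketch, assuming the Trainer's save_folder, save_interval, and load_path arguments, of writing this serialized state and resuming from it later; the checkpoint filename is illustrative since the naming pattern is configurable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.models import ComposerClassifier

model = ComposerClassifier(
    torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)),
    num_classes=10,
)
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))),
    batch_size=16,
)

# Save a full snapshot (model, optimizer, schedulers, RNG, algorithm state) every epoch.
Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    save_folder="./checkpoints",
    save_interval="1ep",
).fit()

# Resume later: load_path restores the exact timestamp, weights, and RNG state,
# then training continues toward the longer duration.
Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="4ep",
    load_path="./checkpoints/latest-rank0.pt",
).fit()
```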

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Shape unguarded

BERT attention modules always have a 'num_heads' attribute and query/key tensors with shape (batch, num_heads, seq_len, head_dim), where seq_len <= max_sequence_length

If this fails: If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), ALiBi bias computation silently produces wrong attention scores or crashes with dimension mismatch

composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_attention_converter
critical Contract unguarded

All registered ALiBi replacement functions will be called with valid torch.nn.Module instances and return modules compatible with the original's interface

If this fails: If a replacement function returns None or a module with different forward() signature, training silently uses wrong attention mechanism or crashes with 'NoneType has no attribute' errors

composer/algorithms/alibi/attention_surgery_functions/utils.py:PolicyRegistry.register
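
A sketch of that contract using composer.utils.module_surgery with a hypothetical replacement policy: each registered function receives a live module and must return either None (skip) or a module whose forward() is interface-compatible with the original.

```python
import torch

from composer.utils import module_surgery


def swap_linear(module: torch.nn.Linear, module_index: int):
    # Returning None skips this module; returning a new module swaps it in.
    if module.out_features < 8:
        return None
    # The replacement must accept the same inputs as the original forward().
    return torch.nn.Linear(module.in_features, module.out_features, bias=False)


model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Linear(32, 4))
replaced = module_surgery.replace_module_classes(
    model, policies={torch.nn.Linear: swap_linear}
)
print(f"replaced {len(replaced)} modules")  # only the first Linear qualifies
```
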
critical Ordering unguarded

Model surgery must be applied before any optimizer state is created, since position embeddings are frozen and parameter counts change

If this fails: If called after optimizer initialization, optimizer state becomes misaligned with model parameters, causing training to diverge or crash with 'parameter count mismatch' errors

composer/algorithms/alibi/alibi.py:apply_alibi
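
A sketch of the safe ordering, assuming the functional entry point composer.functional.apply_alibi; the bert-base-uncased checkpoint is illustrative and requires the transformers package.

```python
import torch
import transformers

import composer.functional as cf

model = transformers.AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Surgery first: ALiBi zeroes and freezes position embeddings and rewrites
#    attention, which changes which parameters will actually be trained.
cf.apply_alibi(model, max_sequence_length=1024)

# 2. Only then build the optimizer, so its state maps onto the post-surgery parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```
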
critical Domain weakly guarded

GPT2Attention modules use causal attention (decoder-only) and have a 'num_heads' attribute that divides evenly into the model's hidden dimension

If this fails: If applied to encoder-decoder models or models with non-standard attention shapes, ALiBi bias matrix has wrong causality or dimension, producing incorrect attention weights without error

composer/algorithms/alibi/attention_surgery_functions/_gpt2.py:gpt2_attention_converter
warning Scale unguarded

The max_sequence_length parameter is reasonable for GPU memory (typically < 8192 tokens) and the position_ids buffer fits in memory

If this fails: Very large max_sequence_length values (e.g., 1M tokens) cause OOM during buffer allocation or create inefficient attention computations that appear to hang

composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
warning Environment unguarded

All model parameters are on the same device and next(new_module.parameters()).device returns a valid device

If this fails: If model has no parameters or parameters are on different devices, position_ids buffer is created on wrong device, causing 'tensor on different device' errors during forward pass

composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
warning Contract unguarded

Target modules have the expected position embedding attribute name (e.g., 'position_embeddings', 'wpe') and it's a torch.nn.Embedding layer

If this fails: If position embedding attribute doesn't exist or is not an Embedding, function silently does nothing or crashes with AttributeError, leaving positional information intact

composer/algorithms/alibi/attention_surgery_functions/utils.py:zero_and_freeze_expand_position_embeddings
warning Temporal unguarded

ALiBi algorithm only needs to run once during Event.INIT and the model structure remains static afterward

If this fails: If model architecture changes during training (e.g., dynamic layer addition), ALiBi modifications are lost and position embeddings may be re-enabled, degrading performance without warning

composer/algorithms/alibi/alibi.py:Alibi.match
info Domain unguarded

ALiBi bias slopes are computed using powers of 2 and the number of heads is compatible with the geometric progression pattern

If this fails: For unusual head counts or very large numbers of heads (>64), bias slopes become extremely small or large, causing numerical instability in attention scores

composer/algorithms/alibi/attention_surgery_functions/utils.py:register_alibi
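
A sketch of the slope pattern this assumption refers to, following the ALiBi paper's recipe rather than Composer's exact implementation: for n heads the slopes form a geometric series in powers of 2^(-8/n), with an interleaving fallback when n is not a power of two.

```python
import math


def alibi_slopes(num_heads: int):
    """Per-head ALiBi bias slopes (ALiBi paper recipe, for illustration)."""

    def power_of_2_slopes(n):
        start = 2 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(num_heads).is_integer():
        return power_of_2_slopes(num_heads)
    # For non-power-of-two head counts, take the slopes of the nearest lower
    # power of two and interleave extras from the next power of two.
    closest = 2 ** math.floor(math.log2(num_heads))
    return (
        power_of_2_slopes(closest)
        + power_of_2_slopes(2 * closest)[0::2][: num_heads - closest]
    )


print(alibi_slopes(8))  # [2**-1, 2**-2, ..., 2**-8]
```
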
info Contract weakly guarded

Algorithm classes imported in __init__.py have stable interfaces and their dependencies (like transformers library) are available at import time

If this fails: If transformers library is missing or incompatible version is installed, imports fail with MissingConditionalImportError even for users not using NLP algorithms

composer/algorithms/__init__.py
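
A sketch of deferring an optional dependency so a missing transformers install only fails when an NLP algorithm is actually used; this is generic Python, not Composer's internal import helper.

```python
try:
    import transformers  # needed only by NLP surgery algorithms such as ALiBi
    _TRANSFORMERS_ERROR = None
except ImportError as err:
    transformers = None
    _TRANSFORMERS_ERROR = err


def require_transformers():
    """Raise a clear error at algorithm-use time, not at package-import time."""
    if transformers is None:
        raise ImportError(
            "This algorithm needs the transformers package: pip install transformers"
        ) from _TRANSFORMERS_ERROR
    return transformers
```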

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Training State (state-store)
Central mutable state containing current model, batch, loss, timestamp, and algorithm-specific state that persists across training steps
Checkpoint Storage (file-store)
Persistent storage of complete training state snapshots enabling exact training resumption from any saved point
Algorithm Registry (registry)
Collection of initialized algorithm instances that match against events and apply transformations to training state
Performance Metrics Buffer (buffer)
Accumulates GPU utilization, memory usage, throughput measurements, and training statistics for performance analysis

Feedback Loops

Delays

Control Points

Technology Stack

PyTorch (framework)
Core deep learning framework providing model definition, autograd, and distributed training primitives
Transformers (library)
Provides pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery
TorchVision (library)
Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut
FSDP (runtime)
Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models
Pillow (library)
Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion
NumPy (library)
Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations
Pytest (testing)
Testing framework with custom markers for GPU tests, distributed tests, and algorithm-specific test configurations
Setuptools (build)
Build system for packaging and distribution of the Composer library

Frequently Asked Questions

What is composer used for?

composer optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring. mosaicml/composer is a 9-component ML training system written in Python. Data flows through 7 distinct pipeline stages. The codebase contains 397 files.

How is composer architected?

composer is organized into 5 architecture layers: Training Orchestration, Algorithm Registry, Model Surgery, Distributed Infrastructure, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through composer?

Data moves through 7 stages: Initialize training infrastructure → Apply structural algorithms → Load and transform training batch → Execute forward pass with monitoring → Compute loss and execute backward pass → .... Data enters through DataLoaders wrapped by DataSpec objects that standardize batch format. The Trainer initializes the training state and starts the Engine, which dispatches Events at training milestones. At each event, Algorithms check if they match the current state and event, then apply transformations to the model, data, or training parameters. The modified state flows through the standard PyTorch training loop (forward pass, loss computation, backward pass, optimizer step) while Profilers and Loggers capture metrics. Checkpoints serialize the complete state including algorithm-specific information for later resumption. This pipeline design reflects a complex multi-stage processing system.

What technologies does composer use?

The core stack includes PyTorch (Core deep learning framework providing model definition, autograd, and distributed training primitives), Transformers (Provides pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery), TorchVision (Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut), FSDP (Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models), Pillow (Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion), NumPy (Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does composer have?

composer exhibits 4 data pools (Training State, Checkpoint Storage), 4 feedback loops, 6 control points, and 4 delays. The feedback loops cover recursive and training-loop dynamics. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does composer use?

5 design patterns detected: Two-Way Callbacks, Module Surgery Registry, Event-Driven Architecture, Composable State Management, Functional Algorithm Interface.

How does composer compare to alternatives?

CodeSea has side-by-side architecture comparisons of composer with pytorch-lightning, covering tech stack differences, pipeline design, system behavior, and code patterns.

Analyzed on April 20, 2026 by CodeSea.