mosaicml/composer
Supercharge Your Model Training
Optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring
Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 9-component ML training system. 397 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Data enters through DataLoaders wrapped by DataSpec objects that standardize batch format. The Trainer initializes the training state and starts the Engine, which dispatches Events at training milestones. At each event, Algorithms check if they match the current state and event, then apply transformations to the model, data, or training parameters. The modified state flows through the standard PyTorch training loop (forward pass, loss computation, backward pass, optimizer step) while Profilers and Loggers capture metrics. Checkpoints serialize the complete state including algorithm-specific information for later resumption.
- Initialize training infrastructure — Trainer creates State object with model, optimizers, data loaders, and initializes all algorithms and callbacks with their configurations. Sets up distributed training if multi-node, configures device placement, and establishes logging systems. [AlgorithmConfig → State] (config: parallelism.fsdp, parallelism.tp, training.device)
- Apply structural algorithms — Engine dispatches INIT event, triggering algorithms like ChannelsLast, BlurPool, and ALiBi to perform model surgery. These algorithms use module_surgery to traverse the model and replace layers — ALiBi removes position embeddings and modifies attention, BlurPool adds anti-aliasing filters to convolutions. [State → State] (config: algorithms.alibi.max_sequence_length, algorithms.blurpool.replace_convs, algorithms.channels_last.enabled)
- Load and transform training batch — DataSpec extracts batch from DataLoader and ensures proper device placement. Data augmentation algorithms like AugMix and ColOut check for BATCH_START event and apply their transformations to input tensors, modifying the batch in-place. [TrainingBatch → TrainingBatch] (config: data.batch_size, algorithms.augmix.severity, algorithms.colout.p_row)
- Execute forward pass with monitoring — Engine dispatches BEFORE_FORWARD event, then model processes the batch. Profiler captures GPU memory usage and compute metrics. Engine dispatches AFTER_FORWARD event, allowing algorithms to examine or modify the outputs before loss computation. [TrainingBatch → State] (config: model.precision, profiler.enabled)
- Compute loss and execute backward pass — Loss function processes model outputs and targets. Engine dispatches BEFORE_BACKWARD, then PyTorch autograd computes gradients. Algorithms can modify gradients or loss values. Engine dispatches AFTER_BACKWARD for gradient-related algorithms. [State → State] (config: training.loss, training.grad_clip_norm)
- Update model parameters — Optimizer steps update model parameters using computed gradients. LR schedulers adjust learning rates based on training progress. Engine dispatches BATCH_END event for algorithms that need to track training statistics or modify parameters post-update. [State → State] (config: optimizer.lr, scheduler.warmup_steps)
- Checkpoint and log metrics — At configured intervals, CheckpointSaver serializes complete training state including model weights, optimizer state, random seeds, and algorithm-specific data. Loggers record training metrics, performance statistics, and algorithm-specific measurements to configured backends. [State → CheckpointState] (config: checkpoint.save_interval, logging.log_level)
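Taken together, these stages form an event-dispatch loop: the Engine raises an event, and each algorithm decides whether to act. The sketch below is a deliberately minimal, self-contained illustration of that pattern. The names Event, State, Algorithm, and Engine mirror Composer's concepts, but every body here is a hypothetical toy, not Composer's code:

```python
from enum import Enum, auto

class Event(Enum):
    INIT = auto()
    BATCH_START = auto()
    BEFORE_FORWARD = auto()
    AFTER_FORWARD = auto()
    BEFORE_BACKWARD = auto()
    AFTER_BACKWARD = auto()
    BATCH_END = auto()

class State:
    """Mutable training state shared by every algorithm."""
    def __init__(self):
        self.batch = None
        self.applied = []  # records (algorithm_name, event_name) applications

class Algorithm:
    def match(self, event, state):
        """Decide whether this algorithm runs at this event."""
        raise NotImplementedError
    def apply(self, event, state):
        """Mutate the shared state in place."""
        raise NotImplementedError

class ChannelsLastLike(Algorithm):
    """Toy structural algorithm: performs its work once, at INIT."""
    def match(self, event, state):
        return event is Event.INIT
    def apply(self, event, state):
        state.applied.append(("channels_last", event.name))

class Engine:
    """Dispatches each event to every algorithm that matches it."""
    def __init__(self, state, algorithms):
        self.state = state
        self.algorithms = algorithms
    def run_event(self, event):
        for alg in self.algorithms:
            if alg.match(event, self.state):
                alg.apply(event, self.state)

engine = Engine(State(), [ChannelsLastLike()])
for ev in (Event.INIT, Event.BATCH_START, Event.BATCH_END):
    engine.run_event(ev)
# the toy algorithm fired only at INIT
```

The key design property this sketch captures is composability: algorithms never call each other, they only read and mutate the shared state when their event fires.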
Data Models
The data structures that flow between stages — the contracts that hold the system together.
composer/core/state.py — Training state object containing model: ComposerModel, optimizers: list[Optimizer], train_dataloader: DataLoader, eval_dataloaders: dict, batch: Any, outputs: Any, loss: Tensor, timestamp: Timestamp, max_duration: Time, precision: Precision, device: Device, rank_zero_seed: int
Created at training start, continuously updated during training loop, passed to all algorithms and callbacks for inspection and modification
composer/core/data_spec.py — Generic batch structure with inputs: Any (typically Tensor[B, ...]), targets: Any (typically Tensor[B] or Tensor[B, ...]), depending on task — vision uses Tensor[B, C, H, W] inputs, NLP uses token_ids: Tensor[B, seq_len]
Loaded from DataLoader, potentially transformed by data augmentation algorithms, fed to model forward pass, then used for loss computation
composer/algorithms/*/ — Algorithm-specific dataclasses like FSDPConfig(activation_checkpointing: bool, backward_prefetch: str, cpu_offload: bool, data_parallel_shard_degree: int), BlurPoolConfig(replace_convs: bool, min_channels: int), AugMixConfig(severity: int, width: int, alpha: float)
Parsed from config files or constructed programmatically, used to initialize Algorithm instances with specific hyperparameters
composer/core/event.py — Enum with values like INIT, EPOCH_START, BATCH_START, BEFORE_FORWARD, AFTER_FORWARD, BEFORE_BACKWARD, AFTER_BACKWARD, BATCH_END, EPOCH_END — each representing a specific point in the training timeline
Generated by the training engine at specific points in the training loop, used by algorithms to determine when to apply their transformations
composer/checkpoint/ — Serializable dict containing state_dict: dict, model: dict, optimizers: list[dict], lr_schedulers: list[dict], algorithms: list[dict], callbacks: list[dict], timestamp: dict, rank_zero_seed: int, train_metrics: dict, eval_metrics: dict
Assembled from current training state, serialized to storage, later deserialized to restore exact training state including random number generator states
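The checkpoint contract above can be illustrated with a small serialize-and-restore round trip. The build_checkpoint helper below is hypothetical; only the field names come from the structure listed above, and plain pickle stands in for whatever serialization backend the library actually uses:

```python
import io
import pickle
import random

def build_checkpoint(model_sd, optimizer_sds, timestamp, seed):
    """Assemble a serializable snapshot. The dict shape mirrors the fields
    listed above; the helper itself is hypothetical, not Composer's API."""
    return {
        "model": model_sd,
        "optimizers": optimizer_sds,
        "timestamp": timestamp,
        "rank_zero_seed": seed,
        "rng": random.getstate(),  # capture RNG state so resumption is exact
    }

ckpt = build_checkpoint(
    model_sd={"w": [0.1, 0.2]},
    optimizer_sds=[{"lr": 3e-4}],
    timestamp={"batch": 100},
    seed=42,
)
buf = io.BytesIO()
pickle.dump(ckpt, buf)        # serialize to storage
buf.seek(0)
restored = pickle.load(buf)   # later: deserialize and resume training
```

Capturing RNG state alongside weights and optimizer state is what makes resumption exact rather than merely approximate.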
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
BERT attention modules always have a 'num_heads' attribute and query/key tensors with shape (batch, num_heads, seq_len, head_dim), where seq_len <= max_sequence_length
If this fails: If attention tensors exceed max_sequence_length or have different shapes (e.g., from different model variants), ALiBi bias computation silently produces wrong attention scores or crashes with dimension mismatch
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_attention_converter
All registered ALiBi replacement functions will be called with valid torch.nn.Module instances and return modules compatible with the original's interface
If this fails: If a replacement function returns None or a module with different forward() signature, training silently uses wrong attention mechanism or crashes with 'NoneType has no attribute' errors
composer/algorithms/alibi/attention_surgery_functions/utils.py:PolicyRegistry.register
Model surgery must be applied before any optimizer state is created, since position embeddings are frozen and parameter counts change
If this fails: If called after optimizer initialization, optimizer state becomes misaligned with model parameters, causing training to diverge or crash with 'parameter count mismatch' errors
composer/algorithms/alibi/alibi.py:apply_alibi
GPT2Attention modules use causal attention (decoder-only) and have 'num_heads' attribute that divides evenly into the model's hidden dimension
If this fails: If applied to encoder-decoder models or models with non-standard attention shapes, ALiBi bias matrix has wrong causality or dimension, producing incorrect attention weights without error
composer/algorithms/alibi/attention_surgery_functions/_gpt2.py:gpt2_attention_converter
max_sequence_length parameter is reasonable for GPU memory (typically < 8192 tokens) and position_ids buffer fits in memory
If this fails: Very large max_sequence_length values (e.g., 1M tokens) cause OOM during buffer allocation or create inefficient attention computations that appear to hang
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
All model parameters are on the same device and next(new_module.parameters()).device returns a valid device
If this fails: If model has no parameters or parameters are on different devices, position_ids buffer is created on wrong device, causing 'tensor on different device' errors during forward pass
composer/algorithms/alibi/attention_surgery_functions/_bert.py:bert_embedding_converter
Target modules have the expected position embedding attribute name (e.g., 'position_embeddings', 'wpe') and it's a torch.nn.Embedding layer
If this fails: If position embedding attribute doesn't exist or is not an Embedding, function silently does nothing or crashes with AttributeError, leaving positional information intact
composer/algorithms/alibi/attention_surgery_functions/utils.py:zero_and_freeze_expand_position_embeddings
ALiBi algorithm only needs to run once during Event.INIT and the model structure remains static afterward
If this fails: If model architecture changes during training (e.g., dynamic layer addition), ALiBi modifications are lost and position embeddings may be re-enabled, degrading performance without warning
composer/algorithms/alibi/alibi.py:Alibi.match
ALiBi bias slopes are computed using powers of 2 and the number of heads is compatible with the geometric progression pattern
If this fails: For unusual head counts or very large numbers of heads (>64), bias slopes become extremely small or large, causing numerical instability in attention scores
composer/algorithms/alibi/attention_surgery_functions/utils.py:register_alibi
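The slope assumption above can be made concrete. A common way to compute ALiBi slopes, following the scheme from the ALiBi paper, is a geometric sequence in powers of 2, with interleaving for non-power-of-two head counts. This sketch is illustrative, not Composer's register_alibi:

```python
import math

def alibi_slopes(n_heads):
    """Per-head ALiBi bias slopes as a geometric sequence in powers of 2.
    For n a power of two, slope_i = 2**(-8 * (i + 1) / n); other head counts
    interleave slopes drawn from the two nearest powers of two."""
    def pow2_slopes(n):
        start = 2 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]
    if math.log2(n_heads).is_integer():
        return pow2_slopes(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = pow2_slopes(2 * closest)[0::2][: n_heads - closest]
    return pow2_slopes(closest) + extra

slopes = alibi_slopes(8)  # 0.5, 0.25, ..., down to 2**-8
```

Each head gets a progressively weaker distance penalty; the geometric decay is why extreme head counts push the smallest slopes toward the limits of floating-point range.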
Algorithm classes imported in __init__.py have stable interfaces and their dependencies (like transformers library) are available at import time
If this fails: If transformers library is missing or incompatible version is installed, imports fail with MissingConditionalImportError even for users not using NLP algorithms
composer/algorithms/__init__.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Training State — Central mutable state containing the current model, batch, loss, timestamp, and algorithm-specific state that persists across training steps
- Checkpoint Storage — Persistent storage of complete training state snapshots enabling exact training resumption from any saved point
- Collection of initialized algorithm instances that match against events and apply transformations to training state
- Accumulated GPU utilization, memory usage, throughput measurements, and training statistics for performance analysis
Feedback Loops
- Event-Algorithm Loop (recursive, reinforcing) — Trigger: Engine dispatches training event. Action: Each algorithm checks match() against current state and event, applies transformations if matched. Exit: All algorithms processed for current event.
- Training Convergence Loop (training-loop, balancing) — Trigger: Training step completion. Action: Forward pass computes loss, backward pass computes gradients, optimizer updates parameters, loss feeds back to influence next iteration. Exit: Max duration reached or convergence criteria met.
- Checkpointing Loop (scheduled-job, reinforcing) — Trigger: Checkpoint interval reached. Action: Serialize current training state, save to storage, continue training. Exit: Training completes.
- Distributed Synchronization (backpressure, balancing) — Trigger: Gradient computation complete. Action: AllReduce gradients across workers, wait for all workers to complete before optimizer step. Exit: All workers synchronized.
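The Training Convergence Loop above is ordinary gradient descent with an exit condition. A dependency-free sketch on a toy quadratic loss (not Composer's trainer) shows the forward/backward/update/exit structure:

```python
def train(w, lr=0.1, max_steps=100, tol=1e-6):
    """Minimize loss(w) = (w - 3)^2 by gradient descent:
    forward, backward, optimizer step, then check the exit criteria."""
    for step in range(max_steps):
        loss = (w - 3.0) ** 2      # forward pass
        grad = 2.0 * (w - 3.0)     # backward pass (analytic gradient)
        w = w - lr * grad          # optimizer step
        if loss < tol:             # convergence exit
            break
    return w, step + 1

w_final, n_steps = train(0.0)      # converges well before max_steps
```

The loss computed in one iteration feeds back to shape the next parameter update, which is what makes this a balancing loop rather than open-loop processing.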
Delays
- Model Surgery Compilation (compilation, ~seconds to minutes) — One-time delay during INIT event while algorithms modify model architecture and PyTorch recompiles computation graphs
- Distributed Gradient Sync (async-processing, ~milliseconds) — Workers wait for gradient AllReduce completion before optimizer step, creating synchronization barrier
- Checkpoint Serialization (checkpoint-save, ~seconds) — Training pauses while complete state is serialized to disk, frequency controlled by save_interval
- Data Loading Buffer (batch-window, ~variable) — DataLoader prefetching creates pipeline parallelism between data loading and model computation
Control Points
- Algorithm Selection (feature-flag) — Controls: Which efficiency algorithms are active during training — each can be enabled/disabled independently. Default: configurable per algorithm
- Precision Mode (precision-mode) — Controls: Model computation precision (fp32, fp16, bf16) affecting memory usage and training speed. Default: fp32
- FSDP Sharding Strategy (architecture-switch) — Controls: How model parameters are sharded across workers — full_shard, shard_grad_op, or no_shard. Default: full_shard
- Activation Checkpointing (hyperparameter) — Controls: Whether to trade compute for memory by recomputing activations during backward pass. Default: False
- Backward Prefetch Strategy (performance-toggle) — Controls: When to prefetch parameters during backward pass — BACKWARD_PRE or BACKWARD_POST. Default: BACKWARD_POST
- Device Selection (device-selection) — Controls: Which compute device (CPU, GPU, TPU) to use for training. Default: cuda
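Control points like these typically surface as a single configuration object. The dataclass below is hypothetical: its field names and defaults are taken from the list above, not from Composer's actual FSDPConfig.

```python
from dataclasses import dataclass

@dataclass
class ParallelismConfig:
    """Hypothetical config object mirroring the control points above."""
    precision: str = "fp32"                   # fp32 | fp16 | bf16
    sharding_strategy: str = "full_shard"     # full_shard | shard_grad_op | no_shard
    activation_checkpointing: bool = False    # recompute activations to save memory
    backward_prefetch: str = "BACKWARD_POST"  # BACKWARD_PRE | BACKWARD_POST
    device: str = "cuda"

# Override only what differs from the defaults
cfg = ParallelismConfig(precision="bf16", activation_checkpointing=True)
```

Grouping the knobs in one typed object keeps defaults explicit and makes any deviation from them visible at the call site.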
Technology Stack
- PyTorch — Core deep learning framework providing model definition, autograd, and distributed training primitives
- Transformers — Pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery
- TorchVision — Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut
- FSDP — Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models
- Pillow — Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion
- NumPy — Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations
- pytest — Testing framework with custom markers for GPU tests, distributed tests, and algorithm-specific test configurations
- Build tooling for packaging and distribution of the Composer library
Key Components
- Trainer (orchestrator) — Central coordinator that manages the complete training lifecycle — initializes models and optimizers, runs the training loop with event dispatch, handles distributed training setup, and coordinates checkpointing and evaluation. (composer/trainer/trainer.py)
- Engine (executor) — Core training loop executor that dispatches events at specific training milestones, allowing algorithms and callbacks to hook into the training process and modify behavior. (composer/core/engine.py)
- Algorithm (transformer) — Base class for efficiency algorithms that implement match() to detect when to run and apply() to modify the training state — enables composable training optimizations. (composer/core/algorithm.py)
- PolicyRegistry (registry) — Maps PyTorch module types to their ALiBi replacement functions, enabling automatic model surgery that replaces attention mechanisms with ALiBi-compatible versions. (composer/algorithms/alibi/attention_surgery_functions/utils.py)
- module_surgery (transformer) — Runtime model modification system that traverses module hierarchies and replaces matching modules with optimized versions — used by algorithms like BlurPool and ALiBi. (composer/utils/module_surgery.py)
- DataSpec (adapter) — Bridges different data loading patterns by wrapping DataLoaders and defining how to extract device-compatible batches, enabling consistent data handling across different model types. (composer/core/data_spec.py)
- CheckpointSaver (serializer) — Handles serialization of complete training state to persistent storage, including model weights, optimizer state, random seeds, and algorithm-specific state for exact resumption. (composer/checkpoint/checkpoint_saver.py)
- FSDPConfig (factory) — Configuration factory for Fully Sharded Data Parallel setup with parameters like activation_checkpointing, backward_prefetch, and cpu_offload controlling memory and communication optimization. (composer/utils/parallelism.py)
- Profiler (monitor) — Performance monitoring system that tracks GPU utilization, memory usage, and throughput metrics, and generates detailed performance reports for optimization. (composer/profiler/)
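The module_surgery idea, traversing the hierarchy and swapping matching modules via a type-to-replacement registry, can be sketched without PyTorch. All class names here are toy stand-ins, not Composer's implementation:

```python
class Conv:
    """Stand-in for a module type targeted by surgery (e.g. nn.Conv2d)."""

class BlurConv(Conv):
    """Stand-in for the replacement (e.g. a blur-pooled convolution)."""

class Module:
    """Tiny stand-in for nn.Module with named children."""
    def __init__(self, **children):
        self.children = children

def replace_module_classes(module, policies):
    """Recursively walk the hierarchy and swap any child whose exact type
    has a registered replacement policy. Returns the number of swaps."""
    count = 0
    for name, child in list(module.children.items()):
        if isinstance(child, Module):
            count += replace_module_classes(child, policies)
        elif type(child) in policies:
            module.children[name] = policies[type(child)](child)
            count += 1
    return count

model = Module(stem=Conv(), block=Module(conv=Conv()))
n_replaced = replace_module_classes(model, {Conv: lambda old: BlurConv()})
# both Conv instances, at any depth, are now BlurConv
```

Matching on exact type rather than isinstance is deliberate: it keeps already-replaced modules (BlurConv is a Conv subclass) from being replaced again on a second pass.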
Frequently Asked Questions
What is composer used for?
composer optimizes deep learning model training with efficient algorithms, distributed scaling, and performance monitoring. mosaicml/composer is a 9-component ML training library written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 397 files.
How is composer architected?
composer is organized into 5 architecture layers: Training Orchestration, Algorithm Registry, Model Surgery, Distributed Infrastructure, and 1 more. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through composer?
Data moves through 7 stages: Initialize training infrastructure → Apply structural algorithms → Load and transform training batch → Execute forward pass with monitoring → Compute loss and execute backward pass → ... In short, DataSpec-wrapped DataLoaders feed batches into the Trainer-managed loop, the Engine dispatches Events at training milestones, matching Algorithms transform the model, data, or training parameters, and Profilers, Loggers, and Checkpoints capture and serialize state along the way. This pipeline design reflects a complex multi-stage processing system.
What technologies does composer use?
The core stack includes PyTorch (Core deep learning framework providing model definition, autograd, and distributed training primitives), Transformers (Provides pre-trained model architectures (BERT, GPT-2, RoBERTa) that algorithms like ALiBi modify through model surgery), TorchVision (Computer vision utilities and transformations used by image augmentation algorithms like AugMix and ColOut), FSDP (Fully Sharded Data Parallel implementation for memory-efficient distributed training of large models), Pillow (Image processing library used by data augmentation algorithms to manipulate PIL Images before tensor conversion), NumPy (Numerical computing for algorithm implementations, particularly in data augmentation and mathematical transformations), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does composer have?
composer exhibits 4 data pools (Training State, Checkpoint Storage), 4 feedback loops, 6 control points, and 4 delays. The feedback loops include recursive event dispatch and the core training loop itself. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does composer use?
5 design patterns detected: Two-Way Callbacks, Module Surgery Registry, Event-Driven Architecture, Composable State Management, Functional Algorithm Interface.
How does composer compare to alternatives?
CodeSea offers side-by-side architecture comparisons of composer with pytorch-lightning, covering tech stack differences, pipeline design, system behavior, and code patterns.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.