huggingface/accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Makes PyTorch training scripts work seamlessly across single GPU, multi-GPU, and distributed setups
Under the hood, the system uses 2 feedback loops, 2 data pools, and 4 control points to manage its runtime behavior.
A 6-component library. 197 files analyzed. Data flows through 4 distinct pipeline stages.
How Data Flows Through the System
Training begins when the user instantiates Accelerator, which detects hardware and creates the AcceleratorState singleton. The user calls accelerator.prepare() on their PyTorch model, optimizer, and dataloader — this wraps them with the appropriate distributed backends (DDP, FSDP, DeepSpeed) and mixed precision handlers (fp16, bf16, fp8). During training, batches flow from the prepared dataloader through the wrapped model, with Accelerator handling gradient synchronization and precision conversions automatically.
- Initialize distributed environment — The AcceleratorState singleton detects available hardware (GPUs, TPUs), determines the distributed backend (DDP, FSDP, DeepSpeed), and sets up process groups for multi-device communication (config: compute_environment, distributed_type, num_processes, +1 more)
- Prepare PyTorch objects — The prepare() method wraps the user's model, optimizer, and dataloader with backend-specific implementations, applying FSDP sharding, DeepSpeed ZeRO partitioning, or DDP replication based on configuration [AcceleratorState → TrainingBatch] (config: mixed_precision, use_cpu, gpu_ids)
- Execute training step — The forward pass runs through the prepared model with automatic mixed precision, the backward() call handles gradient synchronization across devices, and the optimizer step updates sharded parameters correctly [TrainingBatch → TrainingBatch] (config: mixed_precision)
- Track memory usage — The MemoryTracker background thread continuously samples GPU and CPU memory at configurable intervals, storing timestamps and usage data for post-training analysis [TrainingBatch → MemorySnapshot]
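A minimal sketch of stages 1–3 with the public Accelerator API (model, optimizer, and dataloader construction is elided; stage 4's memory tracking lives in the benchmark scripts, not the core loop):

```python
# Stages 1-3 in roughly ten lines; assumes model/optimizer/dataloader exist.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")    # 1. detect hardware, create AcceleratorState
model, optimizer, dataloader = accelerator.prepare(  # 2. wrap with DDP/FSDP/DeepSpeed + precision
    model, optimizer, dataloader
)

for batch in dataloader:                             # 3. training step
    outputs = model(**batch)
    accelerator.backward(outputs.loss)               # gradient sync and loss scaling happen here
    optimizer.step()
    optimizer.zero_grad()
```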
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- AcceleratorState (src/accelerate/state.py) — singleton containing distributed_type: DistributedType, device: torch.device, num_processes: int, process_index: int, mixed_precision: str, and backend-specific configuration. Created once per process during Accelerator initialization; holds global training configuration throughout execution.
- TrainingBatch (benchmarks/fsdp2/utils.py) — dict with input_ids: torch.Tensor[batch_size, seq_len], attention_mask: torch.Tensor[batch_size, seq_len], labels: torch.Tensor[batch_size, seq_len] for language modeling tasks. Created by the DataLoader from tokenized datasets; flows through model forward/backward, then is garbage collected.
- FP8 recipe config (src/accelerate/utils/dataclasses.py) — dataclass with backend: str, opt_level: str, margin: int, interval: int, fp8_format: str controlling fp8 mixed precision behavior. Built from user config; passed to backend-specific fp8 conversion functions during model preparation.
- MemorySnapshot (benchmarks/fsdp2/measure_utils.py) — dict with timestamps: list[float], gpu_memory: list[int], cpu_memory: list[int] tracking memory usage over time. Continuously updated by a background thread during training; saved to JSON for analysis after completion.
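Restated as Python types for quick reference (illustrative only; the benchmarks pass plain dicts with these keys, and the field names come from the descriptions above):

```python
# Illustrative contracts for the two benchmark-side structures.
from typing import TypedDict
import torch

class TrainingBatch(TypedDict):
    input_ids: torch.Tensor       # [batch_size, seq_len]
    attention_mask: torch.Tensor  # [batch_size, seq_len]
    labels: torch.Tensor          # [batch_size, seq_len]

class MemorySnapshot(TypedDict):
    timestamps: list[float]       # sample times from the background thread
    gpu_memory: list[int]         # per-sample GPU readings
    cpu_memory: list[int]         # per-sample CPU readings
```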
Hidden Assumptions
Things this code relies on but never validates; when the system changes, these are what cause silent failures.
model(**batch, use_cache=False) expects batch to be a dictionary with specific keys like 'input_ids', 'attention_mask', and 'labels' that match the model's forward signature
If this fails: a batch with missing or unexpected keys makes the model forward pass fail with a cryptic KeyError or TypeError that is difficult to debug across distributed processes
benchmarks/fsdp2/main.py:train
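An illustrative guard that would surface this assumption early (hypothetical; the benchmark calls model(**batch) directly):

```python
# Hypothetical fail-fast check before the forward pass.
REQUIRED_KEYS = {"input_ids", "attention_mask", "labels"}

missing = REQUIRED_KEYS - batch.keys()
if missing:
    raise ValueError(f"batch is missing keys the model expects: {sorted(missing)}")
outputs = model(**batch, use_cache=False)
```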
Assumes process.memory_info().rss returns memory in bytes and that this measurement accurately represents peak usage for memory profiling
If this fails: should rss use different units on some systems, or exclude relevant memory (such as shared libraries), memory reports will be off by orders of magnitude, leading to incorrect capacity planning
benchmarks/big_model_inference/measures_util.py:PeakCPUMemory.peak_monitor
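For reference, psutil does document memory_info() values in bytes; an explicit conversion (illustrative) keeps the units visible:

```python
# rss is reported in bytes; convert explicitly to avoid unit confusion.
import os
import psutil

rss_bytes = psutil.Process(os.getpid()).memory_info().rss
print(f"resident set size: {rss_bytes / 2**20:.1f} MiB")
```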
Assumes msamp.initialize() is called before the model is moved to a device, and that prepare() order doesn't matter for correct FP8 setup
If this fails: moving the model to CUDA before msamp.initialize() can make FP8 conversion fail silently or produce incorrect gradients, causing training divergence without clear error messages
benchmarks/fp8/ms_amp/ddp.py:train_baseline
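The ordering this implies, sketched under the assumption that MS-AMP follows its apex-style initialize signature:

```python
# Assumed MS-AMP pattern: wrap for FP8 first, move to device second.
import msamp

model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
model = model.to("cuda")  # device move only after FP8 wrapping
```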
Assumes models like 'facebook/opt-30b' and 'EleutherAI/gpt-neox-20b' fit in available system memory and that sharded versions exist at the specified paths
If this fails: with insufficient RAM (30 GB+ for opt-30b) or sharded model files missing from the expected Hugging Face Hub locations, the benchmark crashes with OOM or 404 errors instead of falling back gracefully
benchmarks/big_model_inference/big_model_inference.py:DEFAULT_MODELS
Assumes AutoTokenizer.from_pretrained() returns a tokenizer that accepts sentence1/sentence2 pairs and that the examples dict contains these exact keys
If this fails: a changed dataset format, or a tokenizer without pair encoding, makes tokenization fail with a KeyError during dataset preprocessing, breaking all FP8 benchmarks that depend on this utility
benchmarks/fp8/ms_amp/fp8_utils.py:tokenize_function
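The pair encoding this assumes is the standard transformers call, assuming tokenizer is the AutoTokenizer instance and the sentence1/sentence2 columns come from a GLUE-style dataset:

```python
# Standard pair encoding; raises KeyError if the dataset columns are renamed.
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
    )
```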
Assumes TransformerEngine is installed with CUDA support and that te.recipe.DelayedScaling works with the specific PyTorch version
If this fails: TE compiled without CUDA, or against an incompatible CUDA version, makes FP8 operations silently fall back to FP32, rendering performance comparisons meaningless while appearing to succeed
benchmarks/fp8/transformer_engine/ddp.py:train_baseline
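A sketch of the TransformerEngine pattern in question, using TE's public fp8_autocast and DelayedScaling APIs (recipe arguments illustrative):

```python
# Per the assumption above, a TE build without CUDA support can degrade
# silently instead of raising here.
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    outputs = model(**batch)
```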
Assumes CPU memory monitoring can capture true peak usage without sleep() and that memory doesn't change faster than the monitoring loop can sample
If this fails: memory spikes between samples, or a tight loop that itself degrades system performance, make peak measurements inaccurate, leading to wrong memory-optimization decisions
benchmarks/big_model_inference/measures_util.py:peak_monitor
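A minimal sketch of the polling pattern at issue; any spike shorter than one loop iteration is invisible to it:

```python
# Illustrative peak monitor: a tight loop only observes spikes that
# persist across at least one memory_info() read.
import threading
import psutil

peak_rss = 0
stop = threading.Event()

def peak_monitor() -> None:
    global peak_rss
    proc = psutil.Process()
    while not stop.is_set():
        peak_rss = max(peak_rss, proc.memory_info().rss)

thread = threading.Thread(target=peak_monitor, daemon=True)
thread.start()
# ... run the workload ...
stop.set()
thread.join()
```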
Assumes the BERT-base-cased model (110M parameters) fits comfortably in memory for FSDP sharding tests and represents realistic FP8 performance characteristics
If this fails: testing larger models without adjusting batch size or memory settings may make FSDP shard too aggressively or run out of memory, giving misleading FP8-vs-baseline comparisons
benchmarks/fp8/torchao/fsdp.py:MODEL_NAME
Assumes the function signature takes first_layer_name and last_layer_name parameters to control which linear layers get FP8 conversion, matching torchao's API expectations
If this fails: a change in torchao's layer-filtering API, or a model architecture that doesn't match the expected naming conventions, may make FP8 conversion skip intended layers, reducing the expected performance benefit
benchmarks/fp8/torchao/distrib_deepspeed.py:filter_linear_layers
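A hedged reconstruction of that contract using torchao's module_filter_fn hook; convert_to_float8_training is torchao's public entry point, but the filtering policy and layer names below are assumptions, not the benchmark's code:

```python
# Assumed wiring: keep the first/last linear layers in high precision,
# a common FP8 practice. Layer names here are hypothetical.
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def filter_linear_layers(module, fqn, first_layer_name, last_layer_name):
    if not isinstance(module, nn.Linear):
        return False
    return fqn not in (first_layer_name, last_layer_name)

convert_to_float8_training(
    model,
    module_filter_fn=lambda m, fqn: filter_linear_layers(m, fqn, "first_linear", "last_linear"),
)
```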
Assumes the init_fn callable returns exactly 5 objects (model, optimizer, dataloader, accelerator, memory_tracker) in that specific order
If this fails: an init_fn that returns a different number of values, or reorders them, makes tuple unpacking fail with a ValueError, breaking the benchmark evaluation framework
benchmarks/fsdp2/main.py:evaluate
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Global AcceleratorState — singleton that persists distributed training configuration across all components for the process lifetime
- Memory usage buffer — circular buffer that accumulates memory samples from the background monitoring thread for analysis and visualization
Feedback Loops
- Training loop with gradient accumulation (training-loop, reinforcing) — Trigger: User calls backward() multiple times before an optimizer step. Action: Gradients accumulate across microbatches without synchronization until the accelerator.accumulate() context manager reaches a sync step (exposed as accelerator.sync_gradients) and triggers the all-reduce. Exit: Configured gradient accumulation steps reached.
- Dynamic loss scaling (auto-scale, balancing) — Trigger: Gradient overflow detected in mixed precision training. Action: Loss scale decreases and optimizer step is skipped, preventing parameter corruption from infinite gradients. Exit: Stable gradients for consecutive steps allow scale increase.
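The accumulation loop in practice, using accelerate's documented accumulate() pattern:

```python
# Gradient all-reduce fires only on every 4th microbatch; on the others,
# the prepared optimizer's step is skipped internally.
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)  # the fp16 path also applies dynamic loss scaling here
        optimizer.step()
        optimizer.zero_grad()
```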
Delays
- Initial model sharding (compilation, ~5-30 seconds) — FSDP wraps each transformer layer, analyzing parameter dependencies and distributing weights across devices before first forward pass
- Gradient synchronization (async-processing, varies with model size) — All-reduce communication happens asynchronously during the backward pass, overlapping computation with network traffic
- Memory snapshot capture (checkpoint-save, ~1-5 seconds) — PyTorch memory profiler captures detailed allocation traces, temporarily pausing execution to write snapshot files
Control Points
- mixed_precision (precision-mode) — Controls: Switches between fp32, fp16, bf16, or fp8 computation modes, affecting memory usage and training speed. Default: no
- use_cpu (device-selection) — Controls: Forces training on CPU instead of available accelerators, useful for debugging or memory-constrained scenarios. Default: false
- num_processes (architecture-switch) — Controls: Number of parallel processes for distributed training, typically matches number of GPUs. Default: -1
- gradient_accumulation_steps (hyperparameter) — Controls: Number of forward/backward passes before gradient synchronization and optimizer step, enabling larger effective batch sizes. Default: 1
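How these knobs surface in code; the same values can also be set via accelerate config or at launch time:

```python
# Three of the four control points are plain Accelerator arguments.
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision="bf16",         # precision-mode: "no" | "fp16" | "bf16" | "fp8"
    cpu=True,                       # device-selection: force CPU (the use_cpu control)
    gradient_accumulation_steps=4,  # microbatches per optimizer step
)
# num_processes is a launcher-level switch:
#   accelerate launch --num_processes 8 train.py
```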
Technology Stack
- PyTorch — core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps
- DeepSpeed — memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding
- TransformerEngine — NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations
- Transformers — pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples
- psutil — system monitoring for CPU memory usage tracking during training profiling and resource analysis
- Matplotlib — visualization of memory usage traces and performance benchmarks from training runs
Key Components
- Accelerator (orchestrator, src/accelerate/accelerator.py) — main coordination class that detects hardware configuration, wraps PyTorch objects with appropriate distributed backends, and provides a unified interface for training operations
- AcceleratorState (registry, src/accelerate/state.py) — singleton that holds global distributed training configuration including device type, process rank, world size, and backend settings
- MemoryTracker (monitor, benchmarks/fsdp2/measure_utils.py) — background thread that continuously samples GPU and CPU memory usage during training, providing detailed memory profiling for optimization analysis
- FullyShardedDataParallelPlugin (adapter, src/accelerate/utils/fsdp_utils.py) — configures PyTorch FSDP settings including sharding strategy, mixed precision policy, and CPU offloading based on user configuration
- DeepSpeedPlugin (adapter, src/accelerate/utils/deepspeed.py) — manages DeepSpeed ZeRO configuration, parameter partitioning, and optimizer state sharding for memory-efficient large-model training
- convert_model (transformer, src/accelerate/utils/transformer_engine.py) — replaces PyTorch Linear layers with TransformerEngine fp8 equivalents, enabling automatic mixed precision training with 8-bit floating point
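A sketch of how the adapter plugins reach the orchestrator (both plugin classes are importable from accelerate.utils; options elided):

```python
# The plugin carries the backend-specific settings; prepare() applies them.
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()  # sharding strategy, offload, etc.
accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")
model, optimizer = accelerator.prepare(model, optimizer)  # FSDP wrapping happens here
```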
Frequently Asked Questions
What is accelerate used for?
accelerate makes PyTorch training scripts work seamlessly across single GPU, multi-GPU, and distributed setups. huggingface/accelerate is a 6-component library written in Python; data flows through 4 distinct pipeline stages, and the codebase contains 197 files.
How is accelerate architected?
accelerate is organized into 3 architecture layers: Orchestration Layer, Backend Adapters, Device Utilities. Data flows through 4 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through accelerate?
Data moves through 4 stages: Initialize distributed environment → Prepare PyTorch objects → Execute training step → Track memory usage. The Accelerator detects hardware on instantiation, prepare() wraps the model, optimizer, and dataloader with the chosen distributed backend and mixed precision handlers, and training batches then flow through the wrapped model with gradient synchronization and precision conversion handled automatically. This pipeline design keeps the data transformation process straightforward.
What technologies does accelerate use?
The core stack includes PyTorch (Core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps), DeepSpeed (Memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding), TransformerEngine (NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations), Transformers (Provides pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples), psutil (System monitoring for CPU memory usage tracking during training profiling and resource analysis), Matplotlib (Visualization of memory usage traces and performance benchmarks from training runs). A focused set of dependencies that keeps the build manageable.
What system dynamics does accelerate have?
accelerate exhibits 2 data pools (Global AcceleratorState, Memory usage buffer), 2 feedback loops, 4 control points, 3 delays. The feedback loops handle training-loop and auto-scale. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does accelerate use?
4 design patterns detected: Adapter Pattern for Backend Abstraction, Singleton State Management, Context Manager for Resource Control, Background Monitoring with Threading.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.