huggingface/accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support

9,614 stars · Python · 6 components

Makes PyTorch training scripts work seamlessly across single GPU, multi-GPU, and distributed setups

Training begins when the user instantiates Accelerator, which detects the hardware and creates the AcceleratorState singleton. The user calls accelerator.prepare() on their PyTorch model, optimizer, and dataloader; this wraps them with the appropriate distributed backends (DDP, FSDP, DeepSpeed) and mixed-precision handlers (fp16, bf16, fp8). During training, batches flow from the prepared dataloader through the wrapped model, with Accelerator handling gradient synchronization and precision conversions automatically.

Under the hood, the system uses 2 feedback loops, 2 data pools, and 4 control points to manage its runtime behavior.

A 6-component library. 197 files analyzed. Data flows through 4 distinct pipeline stages.

How Data Flows Through the System


  1. Initialize distributed environment — AcceleratorState singleton detects available hardware (GPUs, TPUs), determines distributed backend (DDP, FSDP, DeepSpeed), and sets up process groups for multi-device communication (config: compute_environment, distributed_type, num_processes +1)
  2. Prepare PyTorch objects — The prepare() method wraps the user's model, optimizer, and dataloader with backend-specific implementations, applying FSDP sharding, DeepSpeed ZeRO partitioning, or DDP replication based on configuration [AcceleratorState → TrainingBatch] (config: mixed_precision, use_cpu, gpu_ids)
  3. Execute training step — The forward pass runs through the prepared model with automatic mixed precision, the backward() call handles gradient synchronization across devices, and the optimizer step updates sharded parameters correctly [TrainingBatch → TrainingBatch] (config: mixed_precision)
  4. Track memory usage — The MemoryTracker background thread continuously samples GPU and CPU memory at configurable intervals, storing timestamps and usage data for post-training analysis [TrainingBatch → MemorySnapshot]
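Stage 1 relies on a shared-state pattern: every instantiation of AcceleratorState sees the same underlying configuration, so hardware detected once is visible to all components. A minimal pure-Python sketch of that pattern (the class and field names here are illustrative, not Accelerate's actual implementation):

```python
class SharedState:
    """Illustrative shared-state singleton: all instances alias one state
    dict, mirroring how AcceleratorState exposes one global configuration."""
    _shared_state = {}

    def __init__(self, **overrides):
        # Every instance shares the same dict (the Borg pattern), so an
        # attribute set through one instance is visible through all others.
        self.__dict__ = self._shared_state
        if not self._shared_state:            # first initialization only
            self.device = "cpu"               # stand-in for hardware detection
            self.num_processes = 1
        self.__dict__.update(overrides)

a = SharedState(num_processes=4)   # first init: detects "hardware", applies override
b = SharedState()                  # later init: sees the same shared state
print(b.num_processes)             # -> 4
```

Distinct instances, one state: `a is not b`, yet both read and write the same configuration.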

Data Models

The data structures that flow between stages — the contracts that hold the system together.

AcceleratorState src/accelerate/state.py
singleton containing distributed_type: DistributedType, device: torch.device, num_processes: int, process_index: int, mixed_precision: str, and backend-specific configuration
Created once per process during Accelerator initialization, holds global training configuration throughout execution
TrainingBatch benchmarks/fsdp2/utils.py
dict with input_ids: torch.Tensor[batch_size, seq_len], attention_mask: torch.Tensor[batch_size, seq_len], labels: torch.Tensor[batch_size, seq_len] for language modeling tasks
Created by DataLoader from tokenized datasets, flows through model forward/backward, then garbage collected
FP8RecipeKwargs src/accelerate/utils/dataclasses.py
dataclass with backend: str, opt_level: str, margin: int, interval: int, fp8_format: str controlling fp8 mixed precision behavior
Built from user config, passed to backend-specific fp8 conversion functions during model preparation
MemorySnapshot benchmarks/fsdp2/measure_utils.py
dict with timestamps: list[float], gpu_memory: list[int], cpu_memory: list[int] tracking memory usage over time
Continuously updated by background thread during training, saved to JSON for analysis after completion
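Configuration contracts like FP8RecipeKwargs are plain dataclasses, which makes them easy to validate at construction time. A sketch using the field names from the table above (the default values and the allowed-backend check are illustrative assumptions, not Accelerate's actual defaults):

```python
from dataclasses import dataclass, asdict

@dataclass
class FP8Recipe:
    """Sketch of an fp8 recipe config mirroring the FP8RecipeKwargs
    fields listed above; defaults here are illustrative only."""
    backend: str = "te"            # e.g. TransformerEngine vs. MS-AMP
    opt_level: str = "O2"
    margin: int = 0
    interval: int = 1
    fp8_format: str = "HYBRID"

    def __post_init__(self):
        # Validate eagerly so a typo fails at construction, not mid-training.
        if self.backend not in ("te", "msamp"):
            raise ValueError(f"unknown fp8 backend: {self.backend}")

recipe = FP8Recipe(backend="te", fp8_format="E4M3")
print(asdict(recipe))
```

Building the recipe once and passing it to the backend-specific conversion functions keeps all fp8 knobs in a single, serializable object.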

Hidden Assumptions

Things this code relies on but never validates: the assumptions that cause silent failures when the system changes.

critical Shape unguarded

model(**batch, use_cache=False) expects batch to be a dictionary with specific keys like 'input_ids', 'attention_mask', and 'labels' that match the model's forward signature

If this fails: If batch contains unexpected keys or is missing required keys, the model forward pass fails with a cryptic KeyError or TypeError, making debugging difficult across distributed processes

benchmarks/fsdp2/main.py:train
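A cheap guard for this assumption is to validate the batch's keys against the expected set before the forward call. A sketch (the REQUIRED_KEYS set and error messages are illustrative; the real required keys come from the model's forward signature):

```python
REQUIRED_KEYS = {"input_ids", "attention_mask", "labels"}  # per the forward signature

def validate_batch(batch: dict) -> dict:
    """Fail fast with a readable message instead of a cryptic
    KeyError/TypeError deep inside a distributed forward pass."""
    missing = REQUIRED_KEYS - batch.keys()
    extra = batch.keys() - REQUIRED_KEYS
    if missing:
        raise ValueError(f"batch is missing required keys: {sorted(missing)}")
    if extra:
        raise ValueError(f"batch has unexpected keys: {sorted(extra)}")
    return batch

# A well-formed batch passes through unchanged:
validate_batch({"input_ids": [1], "attention_mask": [1], "labels": [1]})
```

Calling `model(**validate_batch(batch), use_cache=False)` turns a distributed-rank crash into one clear error on every process.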
critical Domain unguarded

Assumes process.memory_info().rss returns memory in bytes and that this measurement accurately represents peak usage for memory profiling

If this fails: If rss is in different units on some systems or doesn't include all relevant memory (like shared libraries), memory reports will be wrong by orders of magnitude, leading to incorrect capacity planning

benchmarks/big_model_inference/measures_util.py:PeakCPUMemory.peak_monitor
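One way to make the unit assumption explicit is to normalize every raw reading at a single choke point. psutil documents rss as bytes on all platforms, but the similar resource.getrusage(...).ru_maxrss is kilobytes on Linux and bytes on macOS, so a converter like this (function and parameter names are illustrative) keeps reports comparable:

```python
import sys

def rss_to_mib(raw: int, source: str = "psutil") -> float:
    """Normalize a resident-set-size reading to MiB.

    psutil's memory_info().rss is documented as bytes everywhere;
    getrusage().ru_maxrss is KiB on Linux but bytes on macOS.
    """
    if source == "psutil":
        nbytes = raw
    elif source == "getrusage":
        nbytes = raw if sys.platform == "darwin" else raw * 1024
    else:
        raise ValueError(f"unknown source: {source}")
    return nbytes / 2**20

print(rss_to_mib(512 * 2**20))   # 512 MiB reported by psutil -> 512.0
```

Routing all measurements through one converter means a wrong-units bug shows up in one function, not scattered across every report.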
critical Ordering unguarded

Assumes msamp.initialize() must be called before moving the model to the device, and that prepare() order doesn't matter for correct FP8 setup

If this fails: If model is moved to CUDA before msamp.initialize(), FP8 conversion may fail silently or produce incorrect gradients, leading to training divergence without clear error messages

benchmarks/fp8/ms_amp/ddp.py:train_baseline
warning Resource unguarded

Assumes models like 'facebook/opt-30b' and 'EleutherAI/gpt-neox-20b' can be loaded with the available system memory and that sharded versions exist at the specified paths

If this fails: If system has insufficient RAM (30GB+ for opt-30b) or if sharded model files don't exist at expected HuggingFace Hub locations, the benchmark crashes with OOM or 404 errors instead of graceful fallback

benchmarks/big_model_inference/big_model_inference.py:DEFAULT_MODELS
warning Contract unguarded

Assumes AutoTokenizer.from_pretrained() returns a tokenizer that accepts sentence1/sentence2 pairs and that the examples dict contains these exact keys

If this fails: If the dataset format changes or tokenizer doesn't support pair encoding, tokenization fails with KeyError during dataset preprocessing, breaking all FP8 benchmarks that depend on this utility

benchmarks/fp8/ms_amp/fp8_utils.py:tokenize_function
warning Environment unguarded

Assumes TransformerEngine is properly installed with CUDA support and that te.recipe.DelayedScaling works with the specific PyTorch version

If this fails: If TE is compiled without CUDA or with incompatible CUDA version, FP8 operations silently fall back to FP32, making performance comparisons meaningless while appearing to succeed

benchmarks/fp8/transformer_engine/ddp.py:train_baseline
warning Temporal weakly guarded

Assumes CPU memory monitoring can capture the true peak usage without sleep() and that memory doesn't change faster than the monitoring loop can sample

If this fails: If memory spikes occur between samples or if the tight loop affects system performance, peak measurements will be inaccurate, leading to wrong memory optimization decisions

benchmarks/big_model_inference/measures_util.py:peak_monitor
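The sampling-gap risk is easy to see with a tiny monitor sketch: a background thread polls a reader and records the maximum, so any spike shorter than the poll interval is invisible. The reader below is a stub standing in for a real process.memory_info().rss call; class and method names are illustrative:

```python
import threading
import time

class PeakMonitor:
    """Background sampler that records the max value a reader returns.
    Spikes shorter than the poll interval can be missed entirely."""
    def __init__(self, read_fn, interval=0.001):
        self.read_fn, self.interval = read_fn, interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak = max(self.peak, self.read_fn())
            time.sleep(self.interval)   # without this, the loop burns a core

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# Stubbed readings: a brief "spike" to 900 between steady-state values.
samples = iter([100, 900, 300])
with PeakMonitor(lambda: next(samples, 300)) as mon:
    time.sleep(0.1)                     # give the monitor time to observe it
print(mon.peak)
```

Here the spike lasts one full poll, so it is caught; a real allocation freed between two polls would never appear in `peak`, which is exactly the weakly guarded assumption above.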
info Scale unguarded

Assumes the BERT-base-cased model (110M parameters) fits comfortably in memory for FSDP sharding tests and represents realistic FP8 performance characteristics

If this fails: If testing on larger models without adjusting batch size or memory settings, FSDP may shard too aggressively or run out of memory, giving misleading FP8 vs baseline comparisons

benchmarks/fp8/torchao/fsdp.py:MODEL_NAME
info Domain unguarded

Assumes the function's first_layer_name and last_layer_name parameters control which linear layers get FP8 conversion, matching torchao's API expectations

If this fails: If torchao changes its layer filtering API or if model architecture doesn't match expected naming conventions, FP8 conversion may skip intended layers, reducing expected performance benefits

benchmarks/fp8/torchao/distrib_deepspeed.py:filter_linear_layers
info Contract unguarded

Assumes the init_fn callable returns exactly 5 objects (model, optimizer, dataloader, accelerator, memory_tracker) in that specific order

If this fails: If init_fn returns different number of values or reorders them, tuple unpacking fails with ValueError, breaking the benchmark evaluation framework

benchmarks/fsdp2/main.py:evaluate
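A defensive caller can check the arity before unpacking, turning a bare ValueError into an actionable message. In this sketch, init_fn and the EXPECTED names are stand-ins for the benchmark's actual initializer contract:

```python
EXPECTED = ("model", "optimizer", "dataloader", "accelerator", "memory_tracker")

def unpack_init(init_fn):
    """Validate that init_fn returned exactly the 5 expected objects
    before tuple unpacking silently assigns them in the wrong order."""
    result = tuple(init_fn())
    if len(result) != len(EXPECTED):
        raise TypeError(
            f"init_fn returned {len(result)} values, expected "
            f"{len(EXPECTED)}: {EXPECTED}"
        )
    return result

# A conforming initializer unpacks cleanly:
model, opt, dl, acc, tracker = unpack_init(lambda: ("m", "o", "d", "a", "t"))
print(model, tracker)
```

Note the check only catches arity mismatches; a reordering with the same length still needs type checks (e.g. isinstance tests) to detect.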

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Global AcceleratorState (state-store)
Singleton that persists distributed training configuration across all components during process lifetime
Memory usage buffer (buffer)
Circular buffer that accumulates memory samples from background monitoring thread for analysis and visualization
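A bounded circular buffer like the one described above can be built on collections.deque with maxlen, which drops the oldest sample automatically once full. Field and class names here are illustrative, not the benchmark's actual structures:

```python
from collections import deque

class MemoryBuffer:
    """Fixed-capacity sample store: appending beyond maxlen evicts the
    oldest entry, so memory use stays bounded during long runs."""
    def __init__(self, capacity: int):
        self.timestamps = deque(maxlen=capacity)
        self.gpu_memory = deque(maxlen=capacity)

    def record(self, ts: float, gpu_bytes: int):
        self.timestamps.append(ts)
        self.gpu_memory.append(gpu_bytes)

buf = MemoryBuffer(capacity=3)
for i in range(5):
    buf.record(float(i), i * 100)
print(list(buf.gpu_memory))   # oldest two samples evicted -> [200, 300, 400]
```

The trade-off: eviction keeps the monitoring thread's footprint constant, but analysis after the run only sees the most recent window of samples.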


Technology Stack

PyTorch (framework)
Core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps
DeepSpeed (library)
Memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding
TransformerEngine (compute)
NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations
Transformers (library)
Provides pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples
psutil (library)
System monitoring for CPU memory usage tracking during training profiling and resource analysis
Matplotlib (library)
Visualization of memory usage traces and performance benchmarks from training runs


Frequently Asked Questions

What is accelerate used for?

accelerate makes PyTorch training scripts work seamlessly across single-GPU, multi-GPU, and distributed setups. huggingface/accelerate is a 6-component library written in Python; data flows through 4 distinct pipeline stages, and the codebase contains 197 files.

How is accelerate architected?

accelerate is organized into 3 architecture layers: Orchestration Layer, Backend Adapters, Device Utilities. Data flows through 4 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through accelerate?

Data moves through 4 stages: Initialize distributed environment → Prepare PyTorch objects → Execute training step → Track memory usage. Training begins when the user instantiates Accelerator, which detects the hardware and creates the AcceleratorState singleton. The user calls accelerator.prepare() on their PyTorch model, optimizer, and dataloader; this wraps them with the appropriate distributed backends (DDP, FSDP, DeepSpeed) and mixed-precision handlers (fp16, bf16, fp8). During training, batches flow from the prepared dataloader through the wrapped model, with Accelerator handling gradient synchronization and precision conversions automatically. This pipeline design keeps the data transformation process straightforward.

What technologies does accelerate use?

The core stack includes PyTorch (Core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps), DeepSpeed (Memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding), TransformerEngine (NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations), Transformers (Provides pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples), psutil (System monitoring for CPU memory usage tracking during training profiling and resource analysis), Matplotlib (Visualization of memory usage traces and performance benchmarks from training runs). A focused set of dependencies that keeps the build manageable.

What system dynamics does accelerate have?

accelerate exhibits 2 data pools (Global AcceleratorState, Memory usage buffer), 2 feedback loops, 4 control points, and 3 delays. The feedback loops handle the training loop and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does accelerate use?

4 design patterns detected: Adapter Pattern for Backend Abstraction, Singleton State Management, Context Manager for Resource Control, Background Monitoring with Threading.

Analyzed on April 20, 2026 by CodeSea.