huggingface/accelerate
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
Makes PyTorch training scripts work seamlessly across single GPU, multi-GPU, and distributed setups
Under the hood, the system uses 2 feedback loops, 2 data pools, and 4 control points to manage its runtime behavior.
A 6-component library. 197 files analyzed. Data flows through 4 distinct pipeline stages.
How Data Flows Through the System
Training begins when the user instantiates Accelerator, which detects hardware and creates the AcceleratorState singleton. The user calls accelerator.prepare() on their PyTorch model, optimizer, and dataloader — this wraps them with the appropriate distributed backends (DDP, FSDP, DeepSpeed) and mixed precision handlers (fp16, bf16, fp8). During training, batches flow from the prepared dataloader through the wrapped model, with Accelerator handling gradient synchronization and precision conversions automatically.
- Initialize distributed environment — The AcceleratorState singleton detects available hardware (GPUs, TPUs), determines the distributed backend (DDP, FSDP, DeepSpeed), and sets up process groups for multi-device communication (config: compute_environment, distributed_type, num_processes, +1 more)
- Prepare PyTorch objects — The prepare() method wraps the user's model, optimizer, and dataloader with backend-specific implementations, applying FSDP sharding, DeepSpeed ZeRO partitioning, or DDP replication based on configuration [AcceleratorState → TrainingBatch] (config: mixed_precision, use_cpu, gpu_ids)
- Execute training step — The forward pass runs through the prepared model with automatic mixed precision, the backward() call handles gradient synchronization across devices, and the optimizer step updates sharded parameters correctly [TrainingBatch → TrainingBatch] (config: mixed_precision)
- Track memory usage — The MemoryTracker background thread continuously samples GPU and CPU memory at configurable intervals, storing timestamps and usage data for post-training analysis [TrainingBatch → MemorySnapshot]
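A minimal sketch of stages 1–3 with the public Accelerator API (model, optimizer, and dataloader construction is elided; stage 4's memory tracking lives in the benchmark scripts, not the core loop):

```python
# Stages 1-3 in roughly ten lines; assumes model/optimizer/dataloader exist.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")    # 1. detect hardware, create AcceleratorState
model, optimizer, dataloader = accelerator.prepare(  # 2. wrap with DDP/FSDP/DeepSpeed + precision
    model, optimizer, dataloader
)

for batch in dataloader:                             # 3. training step
    outputs = model(**batch)
    accelerator.backward(outputs.loss)               # gradient sync and loss scaling happen here
    optimizer.step()
    optimizer.zero_grad()
```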
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- AcceleratorState (src/accelerate/state.py) — singleton containing distributed_type: DistributedType, device: torch.device, num_processes: int, process_index: int, mixed_precision: str, and backend-specific configuration. Created once per process during Accelerator initialization; holds global training configuration throughout execution.
- TrainingBatch (benchmarks/fsdp2/utils.py) — dict with input_ids: torch.Tensor[batch_size, seq_len], attention_mask: torch.Tensor[batch_size, seq_len], labels: torch.Tensor[batch_size, seq_len] for language modeling tasks. Created by the DataLoader from tokenized datasets; flows through model forward/backward, then is garbage collected.
- FP8 recipe config (src/accelerate/utils/dataclasses.py) — dataclass with backend: str, opt_level: str, margin: int, interval: int, fp8_format: str controlling fp8 mixed precision behavior. Built from user config; passed to backend-specific fp8 conversion functions during model preparation.
- MemorySnapshot (benchmarks/fsdp2/measure_utils.py) — dict with timestamps: list[float], gpu_memory: list[int], cpu_memory: list[int] tracking memory usage over time. Continuously updated by a background thread during training; saved to JSON for analysis after completion.
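Restated as Python types for quick reference (illustrative only; the benchmarks pass plain dicts with these keys, and the field names come from the descriptions above):

```python
# Illustrative contracts for the two benchmark-side structures.
from typing import TypedDict
import torch

class TrainingBatch(TypedDict):
    input_ids: torch.Tensor       # [batch_size, seq_len]
    attention_mask: torch.Tensor  # [batch_size, seq_len]
    labels: torch.Tensor          # [batch_size, seq_len]

class MemorySnapshot(TypedDict):
    timestamps: list[float]       # sample times from the background thread
    gpu_memory: list[int]         # per-sample GPU readings
    cpu_memory: list[int]         # per-sample CPU readings
```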
Hidden Assumptions
Things this code relies on but never validates; when the system changes, these are what cause silent failures.
model(**batch, use_cache=False) expects batch to be a dictionary with specific keys like 'input_ids', 'attention_mask', and 'labels' that match the model's forward signature
If this fails: a batch with missing or unexpected keys makes the model forward pass fail with a cryptic KeyError or TypeError that is difficult to debug across distributed processes
benchmarks/fsdp2/main.py:train
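An illustrative guard that would surface this assumption early (hypothetical; the benchmark calls model(**batch) directly):

```python
# Hypothetical fail-fast check before the forward pass.
REQUIRED_KEYS = {"input_ids", "attention_mask", "labels"}

missing = REQUIRED_KEYS - batch.keys()
if missing:
    raise ValueError(f"batch is missing keys the model expects: {sorted(missing)}")
outputs = model(**batch, use_cache=False)
```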
Assumes process.memory_info().rss returns memory in bytes and that this measurement accurately represents peak usage for memory profiling
If this fails: should rss use different units on some systems, or exclude relevant memory (such as shared libraries), memory reports will be off by orders of magnitude, leading to incorrect capacity planning
benchmarks/big_model_inference/measures_util.py:PeakCPUMemory.peak_monitor
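For reference, psutil does document memory_info() values in bytes; an explicit conversion (illustrative) keeps the units visible:

```python
# rss is reported in bytes; convert explicitly to avoid unit confusion.
import os
import psutil

rss_bytes = psutil.Process(os.getpid()).memory_info().rss
print(f"resident set size: {rss_bytes / 2**20:.1f} MiB")
```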
Assumes msamp.initialize() is called before the model is moved to a device, and that prepare() order doesn't matter for correct FP8 setup
If this fails: moving the model to CUDA before msamp.initialize() can make FP8 conversion fail silently or produce incorrect gradients, causing training divergence without clear error messages
benchmarks/fp8/ms_amp/ddp.py:train_baseline
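The ordering this implies, sketched under the assumption that MS-AMP follows its apex-style initialize signature:

```python
# Assumed MS-AMP pattern: wrap for FP8 first, move to device second.
import msamp

model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")
model = model.to("cuda")  # device move only after FP8 wrapping
```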
Assumes models like 'facebook/opt-30b' and 'EleutherAI/gpt-neox-20b' fit in available system memory and that sharded versions exist at the specified paths
If this fails: with insufficient RAM (30 GB+ for opt-30b) or sharded model files missing from the expected Hugging Face Hub locations, the benchmark crashes with OOM or 404 errors instead of falling back gracefully
benchmarks/big_model_inference/big_model_inference.py:DEFAULT_MODELS
Assumes AutoTokenizer.from_pretrained() returns a tokenizer that accepts sentence1/sentence2 pairs and that the examples dict contains these exact keys
If this fails: a changed dataset format, or a tokenizer without pair encoding, makes tokenization fail with a KeyError during dataset preprocessing, breaking all FP8 benchmarks that depend on this utility
benchmarks/fp8/ms_amp/fp8_utils.py:tokenize_function
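The pair encoding this assumes is the standard transformers call, assuming tokenizer is the AutoTokenizer instance and the sentence1/sentence2 columns come from a GLUE-style dataset:

```python
# Standard pair encoding; raises KeyError if the dataset columns are renamed.
def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
    )
```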
Assumes TransformerEngine is installed with CUDA support and that te.recipe.DelayedScaling works with the specific PyTorch version
If this fails: TE compiled without CUDA, or against an incompatible CUDA version, makes FP8 operations silently fall back to FP32, rendering performance comparisons meaningless while appearing to succeed
benchmarks/fp8/transformer_engine/ddp.py:train_baseline
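A sketch of the TransformerEngine pattern in question, using TE's public fp8_autocast and DelayedScaling APIs (recipe arguments illustrative):

```python
# Per the assumption above, a TE build without CUDA support can degrade
# silently instead of raising here.
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    outputs = model(**batch)
```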
Assumes CPU memory monitoring can capture true peak usage without sleep() and that memory doesn't change faster than the monitoring loop can sample
If this fails: memory spikes between samples, or a tight loop that itself degrades system performance, make peak measurements inaccurate, leading to wrong memory-optimization decisions
benchmarks/big_model_inference/measures_util.py:peak_monitor
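A minimal sketch of the polling pattern at issue; any spike shorter than one loop iteration is invisible to it:

```python
# Illustrative peak monitor: a tight loop only observes spikes that
# persist across at least one memory_info() read.
import threading
import psutil

peak_rss = 0
stop = threading.Event()

def peak_monitor() -> None:
    global peak_rss
    proc = psutil.Process()
    while not stop.is_set():
        peak_rss = max(peak_rss, proc.memory_info().rss)

thread = threading.Thread(target=peak_monitor, daemon=True)
thread.start()
# ... run the workload ...
stop.set()
thread.join()
```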
Assumes the BERT-base-cased model (110M parameters) fits comfortably in memory for FSDP sharding tests and represents realistic FP8 performance characteristics
If this fails: testing larger models without adjusting batch size or memory settings may make FSDP shard too aggressively or run out of memory, giving misleading FP8-vs-baseline comparisons
benchmarks/fp8/torchao/fsdp.py:MODEL_NAME
Assumes the function signature takes first_layer_name and last_layer_name parameters to control which linear layers get FP8 conversion, matching torchao's API expectations
If this fails: a change in torchao's layer-filtering API, or a model architecture that doesn't match the expected naming conventions, may make FP8 conversion skip intended layers, reducing the expected performance benefit
benchmarks/fp8/torchao/distrib_deepspeed.py:filter_linear_layers
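A hedged reconstruction of that contract using torchao's module_filter_fn hook; convert_to_float8_training is torchao's public entry point, but the filtering policy and layer names below are assumptions, not the benchmark's code:

```python
# Assumed wiring: keep the first/last linear layers in high precision,
# a common FP8 practice. Layer names here are hypothetical.
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def filter_linear_layers(module, fqn, first_layer_name, last_layer_name):
    if not isinstance(module, nn.Linear):
        return False
    return fqn not in (first_layer_name, last_layer_name)

convert_to_float8_training(
    model,
    module_filter_fn=lambda m, fqn: filter_linear_layers(m, fqn, "first_linear", "last_linear"),
)
```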
Assumes the init_fn callable returns exactly 5 objects (model, optimizer, dataloader, accelerator, memory_tracker) in that specific order
If this fails: an init_fn that returns a different number of values, or reorders them, makes tuple unpacking fail with a ValueError, breaking the benchmark evaluation framework
benchmarks/fsdp2/main.py:evaluate
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Global AcceleratorState — singleton that persists distributed training configuration across all components for the process lifetime
- Memory usage buffer — circular buffer that accumulates memory samples from the background monitoring thread for analysis and visualization
Feedback Loops
- Training loop with gradient accumulation (training-loop, reinforcing) — Trigger: User calls backward() multiple times before an optimizer step. Action: Gradients accumulate across microbatches without synchronization until the accelerator.accumulate() context manager reaches a sync step (exposed as accelerator.sync_gradients) and triggers the all-reduce. Exit: Configured gradient accumulation steps reached.
- Dynamic loss scaling (auto-scale, balancing) — Trigger: Gradient overflow detected in mixed precision training. Action: Loss scale decreases and optimizer step is skipped, preventing parameter corruption from infinite gradients. Exit: Stable gradients for consecutive steps allow scale increase.
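The accumulation loop in practice, using accelerate's documented accumulate() pattern:

```python
# Gradient all-reduce fires only on every 4th microbatch; on the others,
# the prepared optimizer's step is skipped internally.
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)  # the fp16 path also applies dynamic loss scaling here
        optimizer.step()
        optimizer.zero_grad()
```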
Delays
- Initial model sharding (compilation, ~5-30 seconds) — FSDP wraps each transformer layer, analyzing parameter dependencies and distributing weights across devices before first forward pass
- Gradient synchronization (async-processing, varies with model size) — All-reduce communication happens asynchronously during the backward pass, overlapping computation with network traffic
- Memory snapshot capture (checkpoint-save, ~1-5 seconds) — PyTorch memory profiler captures detailed allocation traces, temporarily pausing execution to write snapshot files
Control Points
- mixed_precision (precision-mode) — Controls: Switches between fp32, fp16, bf16, or fp8 computation modes, affecting memory usage and training speed. Default: no
- use_cpu (device-selection) — Controls: Forces training on CPU instead of available accelerators, useful for debugging or memory-constrained scenarios. Default: false
- num_processes (architecture-switch) — Controls: Number of parallel processes for distributed training, typically matches number of GPUs. Default: -1
- gradient_accumulation_steps (hyperparameter) — Controls: Number of forward/backward passes before gradient synchronization and optimizer step, enabling larger effective batch sizes. Default: 1
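How these knobs surface in code; the same values can also be set via accelerate config or at launch time:

```python
# Three of the four control points are plain Accelerator arguments.
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision="bf16",         # precision-mode: "no" | "fp16" | "bf16" | "fp8"
    cpu=True,                       # device-selection: force CPU (the use_cpu control)
    gradient_accumulation_steps=4,  # microbatches per optimizer step
)
# num_processes is a launcher-level switch:
#   accelerate launch --num_processes 8 train.py
```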
Technology Stack
- PyTorch — core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps
- DeepSpeed — memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding
- TransformerEngine — NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations
- Transformers — pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples
- psutil — system monitoring for CPU memory usage tracking during training profiling and resource analysis
- Matplotlib — visualization of memory usage traces and performance benchmarks from training runs
Key Components
- Accelerator (orchestrator, src/accelerate/accelerator.py) — main coordination class that detects hardware configuration, wraps PyTorch objects with appropriate distributed backends, and provides a unified interface for training operations
- AcceleratorState (registry, src/accelerate/state.py) — singleton that holds global distributed training configuration including device type, process rank, world size, and backend settings
- MemoryTracker (monitor, benchmarks/fsdp2/measure_utils.py) — background thread that continuously samples GPU and CPU memory usage during training, providing detailed memory profiling for optimization analysis
- FullyShardedDataParallelPlugin (adapter, src/accelerate/utils/fsdp_utils.py) — configures PyTorch FSDP settings including sharding strategy, mixed precision policy, and CPU offloading based on user configuration
- DeepSpeedPlugin (adapter, src/accelerate/utils/deepspeed.py) — manages DeepSpeed ZeRO configuration, parameter partitioning, and optimizer state sharding for memory-efficient large-model training
- convert_model (transformer, src/accelerate/utils/transformer_engine.py) — replaces PyTorch Linear layers with TransformerEngine fp8 equivalents, enabling automatic mixed precision training with 8-bit floating point
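A sketch of how the adapter plugins reach the orchestrator (both plugin classes are importable from accelerate.utils; options elided):

```python
# The plugin carries the backend-specific settings; prepare() applies them.
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()  # sharding strategy, offload, etc.
accelerator = Accelerator(fsdp_plugin=fsdp_plugin, mixed_precision="bf16")
model, optimizer = accelerator.prepare(model, optimizer)  # FSDP wrapping happens here
```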
Frequently Asked Questions
What is accelerate used for?
accelerate makes PyTorch training scripts work seamlessly across single GPU, multi-GPU, and distributed setups. huggingface/accelerate is a 6-component library written in Python; data flows through 4 distinct pipeline stages, and the codebase contains 197 files.
How is accelerate architected?
accelerate is organized into 3 architecture layers: Orchestration Layer, Backend Adapters, Device Utilities. Data flows through 4 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through accelerate?
Data moves through 4 stages: Initialize distributed environment → Prepare PyTorch objects → Execute training step → Track memory usage. The Accelerator detects hardware on instantiation, prepare() wraps the model, optimizer, and dataloader with the chosen distributed backend and mixed precision handlers, and training batches then flow through the wrapped model with gradient synchronization and precision conversion handled automatically. This pipeline design keeps the data transformation process straightforward.
What technologies does accelerate use?
The core stack includes PyTorch (Core ML framework providing model definitions, distributed training primitives (DDP, FSDP), and automatic differentiation that Accelerate wraps), DeepSpeed (Memory-efficient training backend for large models using ZeRO parameter partitioning and gradient/optimizer state sharding), TransformerEngine (NVIDIA library for fp8 mixed precision training, replacing PyTorch Linear layers with optimized 8-bit implementations), Transformers (Provides pre-trained model architectures (BERT, GPT, etc.) and tokenizers used in benchmarks and examples), psutil (System monitoring for CPU memory usage tracking during training profiling and resource analysis), Matplotlib (Visualization of memory usage traces and performance benchmarks from training runs). A focused set of dependencies that keeps the build manageable.
What system dynamics does accelerate have?
accelerate exhibits 2 data pools (Global AcceleratorState, Memory usage buffer), 2 feedback loops, 4 control points, 3 delays. The feedback loops handle training-loop and auto-scale. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does accelerate use?
4 design patterns detected: Adapter Pattern for Backend Abstraction, Singleton State Management, Context Manager for Resource Control, Background Monitoring with Threading.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.