huggingface/peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
Applies parameter-efficient fine-tuning methods to pretrained models using various adapter architectures
Users create a PeftConfig specifying the adapter method and hyperparameters, then call get_peft_model() which wraps their base model with adapter functionality. During training, input tensors flow through the base model with active adapters adding their learned parameters to the forward pass. Adapter weights are saved separately from the base model and can be loaded, merged, or composed with other adapters.
Under the hood, the system uses 2 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
An 8-component library. 381 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
- Configuration creation — User instantiates a method-specific config like LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj', 'v_proj']) specifying which layers to adapt and hyperparameters (config: lora_r, lora_alpha, lora_dropout +1)
- Model wrapping — get_peft_model() uses PEFT_TYPE_TO_MODEL_MAPPING to find the correct tuner class, then creates a PeftModel wrapper that replaces target modules with adapter-enabled versions [PeftConfig → PeftModel] (config: peft_type, task_type, target_modules)
- Layer replacement — The tuner implementation (e.g., LoraModel) identifies target modules by name patterns, replaces them with adapter layers (e.g., LoRALayer), and initializes adapter parameters [PeftModel → LoRALayer] (config: target_modules, r, lora_alpha +1)
- Forward pass adaptation — During training, input tensors flow through base model layers, but adapter layers compute additional outputs like (B @ A) * scaling and add them to the original layer output [TrainingBatch → TrainingBatch] (config: lora_alpha, scaling)
- Gradient accumulation — Gradients are computed only for adapter parameters while base model weights remain frozen, enabling parameter-efficient training with much smaller memory footprint [TrainingBatch]
- Adapter persistence — save_pretrained() extracts only the adapter weights (not base model) and saves them with the config as a small checkpoint, typically <1GB vs >10GB for full models [PeftModel]
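The checkpoint-size claim in the last stage follows from simple arithmetic. A back-of-the-envelope sketch (plain Python, no dependencies; the layer dimensions are illustrative, not taken from any specific model):

```python
# For one d_out x d_in linear layer, full fine-tuning updates d_out * d_in
# weights, while a rank-r LoRA adapter trains only B (d_out x r) plus
# A (r x d_in). This is why adapter checkpoints are orders of magnitude
# smaller than full-model checkpoints.

def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters added by one rank-r LoRA adapter."""
    return d_out * r + r * d_in

def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated by full fine-tuning of the same layer."""
    return d_out * d_in

# A 4096 x 4096 attention projection, rank 16:
full = full_param_count(4096, 4096)      # 16,777,216 weights
lora = lora_param_count(4096, 4096, 16)  # 131,072 weights
ratio = full / lora

print(f"full: {full:,}  lora: {lora:,}  savings: {ratio:.0f}x")  # savings: 128x
```

Scaled over every adapted layer, this is the gap between a sub-1GB adapter checkpoint and a >10GB full-model checkpoint.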
Data Models
The data structures that flow between stages — the contracts that hold the system together.
src/peft/config.py — Base dataclass with peft_type: PeftType, task_type: TaskType, inference_mode: bool, plus method-specific subclasses like LoraConfig with r: int, lora_alpha: int, target_modules: list[str], lora_dropout: float
Created by user code, passed to get_peft_model(), used by tuner implementations to configure adapter parameters
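A minimal sketch of the shape these configs take, written as a plain dataclass. These are illustrative stand-ins, not the real classes from src/peft/config.py; field names mirror the description above:

```python
from dataclasses import dataclass, field

# Stripped-down stand-ins for the config hierarchy described above.

@dataclass
class PeftConfigSketch:
    peft_type: str = "LORA"        # which adapter method
    task_type: str = "CAUSAL_LM"   # which model outputs to use
    inference_mode: bool = False   # merged/fast vs separate/flexible

@dataclass
class LoraConfigSketch(PeftConfigSketch):
    r: int = 8                     # bottleneck rank
    lora_alpha: int = 16           # scaling numerator (scaling = alpha / r)
    lora_dropout: float = 0.0
    target_modules: list = field(default_factory=lambda: ["q_proj", "v_proj"])

cfg = LoraConfigSketch(r=16, lora_alpha=32)
print(cfg.peft_type, cfg.r, cfg.lora_alpha / cfg.r)  # LORA 16 2.0
```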
src/peft/peft_model.py — Wrapper class containing base_model: PreTrainedModel, peft_config: dict[str, PeftConfig], active_adapters: list[str], plus state tracking for merged/enabled adapters
Created by get_peft_model(), used throughout training/inference, manages adapter lifecycle and state transitions
src/peft/tuners/lora/layer.py — Module containing lora_A: nn.ModuleDict[str, nn.Linear], lora_B: nn.ModuleDict[str, nn.Linear], scaling: dict[str, float] for low-rank decomposition W + (B @ A) * scaling
Replaces original linear layers in base model, accumulates gradients during training, can be merged back into base weights
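The decomposition W + (B @ A) * scaling can be sketched in dependency-free Python. Real adapter layers use nn.Linear modules and tensors; here plain lists stand in, and the tiny matrices are made up for illustration:

```python
# y = W x + (B @ A) x * scaling, with scaling = lora_alpha / r.

def matvec(m, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def lora_forward(W, A, B, x, lora_alpha, r):
    scaling = lora_alpha / r
    base = matvec(W, x)              # frozen base-layer output
    delta = matvec(B, matvec(A, x))  # low-rank path: x -> A x -> B (A x)
    return [b + scaling * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 "base weight" (identity)
A = [[1.0, 1.0]]              # rank-1 down-projection (1 x 2)
B = [[0.5], [0.5]]            # rank-1 up-projection (2 x 1)
x = [2.0, 4.0]

y = lora_forward(W, A, B, x, lora_alpha=2, r=1)
print(y)  # [8.0, 10.0]
```

Merging an adapter back into base weights amounts to adding scaling * (B @ A) to W once, after which the extra forward-pass computation disappears.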
src/peft/peft_model.py — Dataclass with name: str, module_type: str, enabled: bool, active_adapters: list[str], merged_adapters: list[str], available_adapters: list[str], devices: dict[str, list[str]]
Generated by get_model_info(), provides runtime state of all adapters attached to a PeftModel
examples/*/ — Dict with input_ids: torch.Tensor[batch_size, seq_len], attention_mask: torch.Tensor[batch_size, seq_len], labels: torch.Tensor[batch_size, seq_len] for language modeling tasks
Tokenized from raw text, batched by DataCollator, consumed by model.forward() during training
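A sketch of what such a collator produces, in plain Python lists rather than tensors. The -100 label fill follows the usual Hugging Face convention for masking padding out of the loss; the function and token values here are illustrative, not the real DataCollator:

```python
# Pad tokenized sequences to equal length, build attention masks, and mask
# padding positions out of the labels.

def collate(sequences, pad_id=0):
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask, labels = [], [], []
    for s in sequences:
        pad = max_len - len(s)
        input_ids.append(s + [pad_id] * pad)
        attention_mask.append([1] * len(s) + [0] * pad)
        labels.append(s + [-100] * pad)  # -100: ignored by the LM loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = collate([[5, 6, 7], [8, 9]])
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```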
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The dataframe df contains exactly two numeric columns with names matching metric_x and metric_y parameters, and these columns have no NaN or infinite values
If this fails: If metric columns are missing, contain NaN values, or have mismatched data types, the Pareto frontier computation silently fails or produces wrong dominance relationships between model configurations
examples/adamss_finetuning/glue_adamss_asa_example.py:compute_pareto_frontier
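One way to make this assumption explicit is to validate metric values before computing dominance, so non-finite entries fail loudly instead of silently corrupting the frontier. A hypothetical sketch (the function and sample points are illustrative, not the example script's actual code):

```python
import math

def pareto_frontier(points):
    """Return points not dominated by any other (maximize x, minimize y)."""
    for x, y in points:
        if not (math.isfinite(x) and math.isfinite(y)):
            raise ValueError(f"non-finite metric value: ({x}, {y})")
    frontier = []
    for p in points:
        # p is dominated if some other point is at least as good on both axes.
        dominated = any(q[0] >= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

pts = [(0.90, 0.4), (0.85, 0.3), (0.80, 0.5)]  # (accuracy up, loss down)
print(pareto_frontier(pts))  # [(0.9, 0.4), (0.85, 0.3)]
```

With a NaN in the input, the loud ValueError replaces the silent wrong-dominance failure described above.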
The manual update_and_allocate() calls happen at exactly the right training steps and are never combined with AdamssAsaCallback; the code assumes users follow the exclusive usage pattern documented in comments
If this fails: If both manual calls and callback are used together, ASA state becomes inconsistent leading to incorrect subspace allocation decisions and degraded adapter performance
examples/adamss_finetuning/glue_adamss_asa_manual_example.py:update_and_allocate
CUDA device 'cuda:0' exists and is available when torch.cuda.is_available() returns True, with sufficient VRAM for face alignment model plus ControlNet inference
If this fails: If cuda:0 is occupied by another process or lacks sufficient memory, face alignment initialization fails with cryptic CUDA out-of-memory errors during evaluation
examples/boft_controlnet/eval.py:device_selection
The system has sufficient memory to load a 4-bit quantized phi3-mini model plus multiple task-specific adapters simultaneously - typically requiring 8-12GB GPU memory
If this fails: On systems with <8GB VRAM, model loading fails with CUDA OOM during adapter composition, but the error message doesn't indicate the specific memory requirements
examples/arrow_multitask/arrow_phi3_mini.py:quantization
All metrics follow the hardcoded preference directions in metric_preferences dict (e.g., 'test_accuracy' should be maximized, 'train_loss' minimized) and forgetting metrics contain asterisks for pattern matching
If this fails: If new metrics are added without updating preferences or existing metrics change semantics (e.g., a loss that should be maximized), Pareto frontier computation inverts dominance relationships and recommends worse model configurations
method_comparison/app.py:metric_preferences
The local server at localhost:8000 can handle num_requests (default 32) concurrent connections without rate limiting or connection refused errors
If this fails: When the server connection pool is exhausted, some async requests hang indefinitely while others succeed, leading to incomplete performance measurements and timeouts
examples/bdlora_finetuning/chat.py:async_requests
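A common mitigation is to cap in-flight requests with a semaphore and wrap each request in a timeout, so an exhausted server pool surfaces as an error rather than a hang. A stdlib-only sketch (the request coroutine is a stand-in for the script's real HTTP client, and all names are illustrative):

```python
import asyncio

async def fake_request(i):
    await asyncio.sleep(0.01)  # stand-in for an HTTP call to localhost:8000
    return f"response-{i}"

async def run_all(num_requests=32, max_concurrent=8, timeout=5.0):
    sem = asyncio.Semaphore(max_concurrent)  # never exceed server capacity

    async def guarded(i):
        async with sem:
            # wait_for turns a hung connection into a TimeoutError
            return await asyncio.wait_for(fake_request(i), timeout=timeout)

    return await asyncio.gather(*(guarded(i) for i in range(num_requests)))

results = asyncio.run(run_all())
print(len(results))  # 32
```

Bounding concurrency below the server's pool size keeps measurements complete even when num_requests exceeds what the server can accept at once.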
Adapter checkpoint files are in safetensors format with specific key naming conventions that match the PeftModel.from_pretrained() expectations
If this fails: If checkpoints are saved in different formats or have mismatched tensor names, loading fails silently and the model runs with random adapter weights instead of trained ones
examples/boft_controlnet/test_controlnet.py:safetensors_loading
ASA callback triggers happen at consistent intervals during training and the model's subspace allocation state remains coherent across callback invocations
If this fails: If training is interrupted and resumed, or if callback timing is inconsistent due to hardware issues, subspace allocation becomes desynchronized leading to suboptimal adapter performance
examples/adamss_finetuning/image_classification_adamss_asa.py:asa_callback
The hardcoded subset sizes (100 train samples, 50 validation samples) provide meaningful signal for AdaMSS hyperparameter validation across different model architectures
If this fails: For models requiring larger datasets to show adapter effectiveness, the tiny test dataset produces misleading results that don't correlate with full-scale performance
examples/adamss_finetuning/test_adamss_quick.py:dataset_subset
HF_TOKEN environment variable contains a valid HuggingFace authentication token when push_to_hub is True, and the token has write permissions to the specified hub_model_id repository
If this fails: Upload attempts fail with authentication errors but training continues normally, resulting in trained adapters that exist only locally without any error indication
examples/alora_finetuning/alora_finetuning.py:hf_token
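A hypothetical preflight check for this assumption: fail fast before training starts if push_to_hub is requested without credentials, instead of discovering the missing token only after hours of training. The function name and error message are illustrative, not the example script's actual code:

```python
import os

def check_hub_credentials(push_to_hub: bool) -> bool:
    """Raise before training if an upload is requested but no token is set."""
    if not push_to_hub:
        return True
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "push_to_hub=True but HF_TOKEN is unset; "
            "the trained adapter would exist only locally."
        )
    return True

os.environ["HF_TOKEN"] = "hf_dummy_token_for_demo"  # demo value only
print(check_hub_credentials(push_to_hub=True))  # True
```

A fuller check could also verify write permission on hub_model_id before training, but even this minimal guard turns a silent post-training failure into an immediate one.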
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Adapter Registry — PeftModel tracks multiple adapters by name, maintaining their configs, states (enabled/disabled/merged), and device locations
- Base Model Cache — Stores a reference to the original base model and can restore original weights when adapters are unmerged
- Downloads and caches adapter checkpoints from HF Hub using transformers' caching mechanism
Feedback Loops
- Adaptive Rank Adjustment (training-loop, balancing) — Trigger: AdaLoRA importance score computation. Action: Prune low-importance adapter dimensions and grow high-importance ones. Exit: Training completion or rank convergence.
- Multi-Adapter Composition (recursive, reinforcing) — Trigger: Multiple adapters enabled simultaneously. Action: Each adapter layer applies its transformation to the output of the previous adapter or base layer. Exit: All active adapters processed.
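The multi-adapter composition loop can be sketched in a few lines. Plain floats stand in for tensors, and the base layer and adapter deltas here are made up for illustration:

```python
# Each enabled adapter adds its delta to the running output; the enable/disable
# toggle (a Control Point below) decides which adapters participate.

def base_layer(x):
    return 2.0 * x  # frozen base transformation

def compose(x, adapter_deltas, enabled):
    y = base_layer(x)
    for name, delta in adapter_deltas.items():
        if enabled.get(name, False):  # runtime enable/disable toggle
            y = y + delta(x)          # each adapter adds its contribution
    return y

adapters = {"task_a": lambda x: 0.1 * x, "task_b": lambda x: 0.5 * x}

print(compose(10.0, adapters, {"task_a": True, "task_b": True}))   # 26.0
print(compose(10.0, adapters, {"task_a": True, "task_b": False}))  # 21.0
```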
Delays
- Model Download (cache-ttl, ~Seconds to minutes) — First-time loading of base models or adapters from HuggingFace Hub requires download time
- Layer Replacement (compilation, ~Milliseconds) — Initial wrapping with get_peft_model() requires traversing model graph and replacing target modules
Control Points
- Adapter Enable/Disable (runtime-toggle) — Controls: Whether specific adapters contribute to forward pass computations. Default: Per-adapter boolean state
- Inference Mode (feature-flag) — Controls: Whether adapters are merged into base weights for faster inference vs kept separate for flexibility. Default: inference_mode boolean in PeftConfig
- Target Module Selection (architecture-switch) — Controls: Which layers receive adapters (query/value projections, all linear layers, specific named modules). Default: target_modules list in config
- Rank Hyperparameter (hyperparameter) — Controls: Bottleneck dimension determining adapter capacity and trainable parameter count. Default: r integer in LoraConfig
Technology Stack
- PyTorch — Provides tensor operations, autograd, and nn.Module base classes for implementing adapter layers and managing gradients
- Transformers — Supplies base model architectures, tokenizers, and training utilities that PEFT adapters are applied to
- Accelerate — Handles distributed training, mixed precision, and device management for PEFT models across multiple GPUs/TPUs
- Safetensors — Serializes adapter weights in a secure, efficient format for saving/loading checkpoints
- HuggingFace Hub — Hosts and serves pretrained adapters, enabling easy sharing and reuse of PEFT checkpoints
- BitsAndBytes — Enables quantized training with PEFT adapters on top of 4-bit/8-bit base models for memory efficiency
Key Components
- get_peft_model (factory) — Creates a PeftModel wrapper around a base model using the specified adapter configuration; the main entry point for applying PEFT methods (src/peft/mapping.py)
- PeftModel (orchestrator) — Manages multiple adapters attached to a base model, handles adapter state transitions (enable/disable/merge), and routes forward passes through active adapters (src/peft/peft_model.py)
- LoraModel (transformer) — Implements Low-Rank Adaptation by replacing target linear layers with LoRALayer modules that learn low-rank updates W + (B @ A) * scaling (src/peft/tuners/lora/model.py)
- TaskType (registry) — Enum defining supported task types (CAUSAL_LM, SEQ_2_SEQ_LM, TOKEN_CLS, etc.) that determine which model outputs to use for different objectives (src/peft/utils/peft_types.py)
- AdaLoraModel (optimizer) — Extends LoRA with adaptive rank selection, dynamically adjusting the rank of each adapter during training based on importance scores (src/peft/tuners/adalora/model.py)
- PromptTuningModel (transformer) — Prepends learnable prompt tokens to input embeddings instead of modifying model weights; useful for frozen model adaptation (src/peft/tuners/prompt_tuning/model.py)
- HubMixin (adapter) — Provides save_pretrained() and from_pretrained() methods for uploading/downloading adapter weights to/from HuggingFace Hub (src/peft/utils/hub_utils.py)
- PEFT_TYPE_TO_MODEL_MAPPING (registry) — Maps PeftType enum values to their corresponding implementation classes (LORA -> LoraModel, ADALORA -> AdaLoraModel, etc.) (src/peft/mapping.py)
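The registry pattern behind PEFT_TYPE_TO_MODEL_MAPPING can be sketched with an enum-keyed dict. The classes below are empty stand-ins, not the real implementations in src/peft/mapping.py:

```python
from enum import Enum

class PeftType(Enum):
    LORA = "LORA"
    ADALORA = "ADALORA"

class LoraModelSketch: ...
class AdaLoraModelSketch: ...

# The registry: enum key -> tuner class, consulted at wrap time.
TYPE_TO_MODEL = {
    PeftType.LORA: LoraModelSketch,
    PeftType.ADALORA: AdaLoraModelSketch,
}

def resolve_tuner(peft_type):
    try:
        return TYPE_TO_MODEL[peft_type]
    except KeyError:
        raise ValueError(f"No tuner registered for {peft_type}") from None

print(resolve_tuner(PeftType.LORA).__name__)  # LoraModelSketch
```

This is what lets get_peft_model() stay method-agnostic: adding a new adapter method means registering one more entry, not changing the factory.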
Frequently Asked Questions
What is peft used for?
peft applies parameter-efficient fine-tuning methods to pretrained models using various adapter architectures. huggingface/peft is an 8-component library written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 381 files.
How is peft architected?
peft is organized into 4 architecture layers: Configuration Layer, Model Wrapper Layer, Tuner Implementation Layer, Integration Layer. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through peft?
Data moves through 6 stages: Configuration creation → Model wrapping → Layer replacement → Forward pass adaptation → Gradient accumulation → .... Users create a PeftConfig specifying the adapter method and hyperparameters, then call get_peft_model() which wraps their base model with adapter functionality. During training, input tensors flow through the base model with active adapters adding their learned parameters to the forward pass. Adapter weights are saved separately from the base model and can be loaded, merged, or composed with other adapters. This pipeline design reflects a complex multi-stage processing system.
What technologies does peft use?
The core stack includes PyTorch (Provides tensor operations, autograd, and nn.Module base classes for implementing adapter layers and managing gradients), Transformers (Supplies base model architectures, tokenizers, and training utilities that PEFT adapters are applied to), Accelerate (Handles distributed training, mixed precision, and device management for PEFT models across multiple GPUs/TPUs), Safetensors (Serializes adapter weights in a secure, efficient format for saving/loading checkpoints), HuggingFace Hub (Hosts and serves pretrained adapters, enabling easy sharing and reuse of PEFT checkpoints), BitsAndBytes (Enables quantized training with PEFT adapters on top of 4-bit/8-bit base models for memory efficiency). A focused set of dependencies that keeps the build manageable.
What system dynamics does peft have?
peft exhibits 3 data pools (Adapter Registry, Base Model Cache), 2 feedback loops, 4 control points, and 2 delays. The feedback loops cover training-loop (balancing) and recursive (reinforcing) dynamics. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does peft use?
4 design patterns detected: Adapter Pattern, Strategy Pattern, Registry Pattern, Mixin Pattern.
How does peft compare to alternatives?
CodeSea has side-by-side architecture comparisons of peft with unsloth and trl. These comparisons show tech stack differences, pipeline design, system behavior, and code patterns.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.