hiyouga/llamafactory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training
Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 9-component ML training system. 278 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Training begins when users specify model, dataset, and training method through CLI or web interface. The system loads the base model with optional quantization and wraps it with efficient adapters (LoRA/QLoRA). Raw datasets are processed through conversation templates, tokenized, and converted to training batches with proper label masking. The training loop alternates between forward passes that compute loss on instruction-following tasks, backward passes that update only the adapter parameters, and periodic evaluation. For inference, the trained model processes chat messages through the same template system to generate responses.
- Parse configuration and arguments — The CLI parser in main() processes command line arguments and config files to create ModelArguments, TrainingArguments, DataArguments, and FinetuningArguments dataclasses with validation (config: model_name_or_path, template, finetuning_type +2)
- Load and configure base model — load_model_and_tokenizer() loads the specified model with quantization settings, applies flash attention and RoPE scaling, wraps with LoRA adapters if specified, and configures the tokenizer with proper special tokens [ModelArguments → Model] (config: quantization_bit, flash_attn, rope_scaling +2)
- Process training datasets — get_dataset() loads datasets from HuggingFace or local files, applies conversation templates through get_template_and_fix_tokenizer(), then preprocess_supervised_dataset() tokenizes conversations and creates training batches with proper label masking [HF Dataset → TrainingBatch] (config: dataset, template, cutoff_len +1)
- Execute training loop — CustomSeq2SeqTrainer orchestrates the training with forward passes computing cross-entropy loss on conversation completions, backward passes updating only LoRA parameters via gradient descent, and periodic evaluation on validation sets [TrainingBatch → Trained model] (config: learning_rate, per_device_train_batch_size, gradient_accumulation_steps +1)
- Save model and adapters — The trainer saves LoRA adapter weights, merges them with base model if requested, and exports in HuggingFace format with proper configuration files and tokenizer settings [Trained model → Saved checkpoint] (config: output_dir, save_strategy, save_steps)
- Serve model for inference — ChatModel loads the fine-tuned model and serves it through FastAPI endpoints, processing ChatMessage objects through conversation templates and generating responses with configurable sampling parameters [ChatMessage → ResponseMetadata] (config: temperature, top_p, max_new_tokens +1)
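For the final serving stage, here is a minimal sketch of what a client request could look like. The /v1/chat/completions path follows the OpenAI-compatible convention described above; the host, port, model name, and message contents are illustrative assumptions, not values taken from the repository.

```python
import requests

# Hypothetical local endpoint; host, port, and model name are assumptions.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-finetuned-model",  # illustrative model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what LoRA fine-tuning does."},
    ],
    # Sampling knobs corresponding to the config listed above.
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```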
Data Models
The data structures that flow between stages — the contracts that hold the system together.
src/llamafactory/data/processor/supervised.py — dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], where labels use IGNORE_INDEX (-100) for positions that shouldn't contribute to loss
Created from tokenized conversation data with proper masking for instruction-following tasks, flows through the model for forward/backward passes, then discarded after gradient computation
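To make the label masking concrete, here is a minimal, self-contained sketch of how prompt tokens can be masked with IGNORE_INDEX so that only the response contributes to the loss. The token IDs and the prompt/response split are invented for illustration and are not taken from the repository's preprocessing code.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

# Illustrative token IDs: 5 prompt tokens followed by 4 response tokens.
prompt_ids = [1, 523, 1792, 29962, 1724]
response_ids = [739, 338, 263, 2]

input_ids = torch.tensor([prompt_ids + response_ids])  # shape [B=1, seq_len=9]
attention_mask = torch.ones_like(input_ids)

# Labels copy the inputs but mask the prompt span, so the model is only
# penalized for the response it should learn to produce.
labels = input_ids.clone()
labels[0, : len(prompt_ids)] = IGNORE_INDEX

batch = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
print(batch["labels"])  # tensor([[-100, -100, -100, -100, -100, 739, 338, 263, 2]])
```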
src/llamafactory/hparams/model_args.py — dataclass with model_name_or_path: str, adapter_name_or_path: Optional[str], template: Optional[str], flash_attn: str, rope_scaling: Optional[str], and 20+ model configuration fields
Parsed from command line or config files, validated for compatibility, then used throughout model loading and training setup to control architecture and optimization settings
src/llamafactory/api/protocol.py — Pydantic model with role: Role (system/user/assistant/tool), content: str or list[dict] for multimodal, tool_calls: Optional[list[ToolCall]], name: Optional[str]
Received from API clients as JSON, validated through Pydantic, converted to internal conversation format via templates, then processed by the model for response generation
src/llamafactory/data/mm_plugin.py — dict containing images: Optional[Tensor[B, C, H, W]], videos: Optional[Tensor[B, F, C, H, W]], audio: Optional[Tensor[B, seq_len]], with processor-specific keys for different VLM architectures
Extracted from conversation data containing image/video/audio URLs, processed through architecture-specific multimodal processors, then combined with text tokens for unified model input
src/llamafactory/hparams/training_args.py — dataclass extending HF TrainingArguments with output_dir: str, learning_rate: float, num_train_epochs: int, per_device_train_batch_size: int, gradient_accumulation_steps: int, plus LoRA-specific and optimization parameters
Assembled from CLI args and config files, validated for hardware constraints and method compatibility, then passed to HuggingFace Trainer to control the entire training process
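Since these argument dataclasses are assembled from CLI args and config files, a rough sketch of the general pattern using HuggingFace's HfArgumentParser may help. The dataclass below is a small hypothetical subset, not the repository's actual ModelArguments, and the real parser adds further validation.

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class SimpleModelArguments:
    """Hypothetical subset of model-related options, for illustration only."""
    model_name_or_path: str = field(metadata={"help": "Base model to load."})
    template: Optional[str] = field(default=None, metadata={"help": "Chat template name."})
    quantization_bit: Optional[int] = field(default=None, metadata={"help": "4 or 8."})


# Invoke e.g.:
#   python parse_args.py --model_name_or_path gpt2 --output_dir out --learning_rate 1e-4
parser = HfArgumentParser((SimpleModelArguments, TrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses()
print(model_args.model_name_or_path, training_args.learning_rate)
```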
src/llamafactory/model/adapter.py — PEFT LoraConfig with r: int (rank), lora_alpha: int, lora_dropout: float, target_modules: list[str], task_type: TaskType, plus method-specific parameters for QLoRA, LongLoRA variants
Constructed from ModelArguments during model loading, used by PEFT library to wrap the base model with efficient low-rank adapters, then controls which parameters receive gradients during training
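As a sketch of how a configuration of this shape wraps a base model, the snippet below uses the PEFT library directly. The rank, alpha, dropout, and target module names are illustrative defaults rather than the repository's values, and a small GPT-2 stands in for the base model.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative hyperparameters; in LLaMA-Factory these come from the finetuning arguments.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; Llama-style models use q_proj/v_proj etc.
    task_type=TaskType.CAUSAL_LM,
)

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model
peft_model = get_peft_model(base_model, lora_config)

# Only the low-rank adapter matrices are trainable; the base weights stay frozen.
peft_model.print_trainable_parameters()
```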
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The API_KEY environment variable, if set, contains a valid bearer token string without parsing or format validation
If this fails: If API_KEY contains stray whitespace, newlines, or unexpected encoding, authentication will silently fail with confusing 401 errors instead of clear validation messages
src/llamafactory/api/app.py:create_app
GPU memory cleanup every 300 seconds is sufficient to prevent out-of-memory crashes regardless of request volume, model size, or batch sizes
If this fails: High-throughput APIs serving large models could accumulate GPU memory faster than the cleanup interval, leading to CUDA OOM errors between sweeps
src/llamafactory/api/app.py:sweeper
Image URLs (like qianwen-res.oss-cn-beijing.aliyuncs.com) will remain accessible and return images in formats the model processor expects
If this fails: If external image URLs become inaccessible, return 404s, or serve different content types, multimodal inference will fail with cryptic tensor shape errors rather than clear network/format errors
scripts/api_example/test_image.py:messages
The vocab_size hardcoded as 32768 matches the actual vocabulary size of all models used in benchmarking
If this fails: Benchmarking models with different vocabulary sizes (like 128K vocab models) will generate invalid token IDs, leading to embedding lookup errors or meaningless performance metrics
scripts/bench_qwen.py:DummyDataset.__init__
All grade inputs will be exactly 'A', 'B', or 'C' strings, and the hours list will have the same length as grades
If this fails: Passing grades like 'A+', 'D', or mismatched list lengths will cause KeyError or index errors instead of graceful validation failures
scripts/api_example/test_toolcall.py:calculate_gpa
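If graceful failure were wanted here, a defensive variant of the GPA calculation could validate its inputs up front. This is only a sketch under the assumption of a simple 4-point scale; it is not the script's actual implementation.

```python
def calculate_gpa_safe(grades: list[str], hours: list[int]) -> float:
    """Weighted GPA with explicit validation instead of silent KeyError/IndexError."""
    grade_points = {"A": 4.0, "B": 3.0, "C": 2.0}  # assumed scale

    if len(grades) != len(hours):
        raise ValueError(f"grades ({len(grades)}) and hours ({len(hours)}) differ in length")
    unknown = [g for g in grades if g not in grade_points]
    if unknown:
        raise ValueError(f"unsupported grades: {unknown}; expected one of {sorted(grade_points)}")

    total_hours = sum(hours)
    if total_hours == 0:
        raise ValueError("total credit hours must be positive")
    return sum(grade_points[g] * h for g, h in zip(grades, hours)) / total_hours


print(calculate_gpa_safe(["A", "B", "C"], [3, 4, 3]))  # 3.0
```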
All .bin files in the input directory contain PyTorch state dict fragments that can be safely merged by updating an OrderedDict
If this fails: If .bin files contain non-tensor data, have conflicting keys, or are corrupted, torch.load will fail or silently create invalid merged state dictionaries
scripts/convert_ckpt/llamafy_baichuan2.py:save_weight
The hardcoded image token calculation (18 * 18 // (2 * 2)) matches the specific vision encoder architecture being benchmarked
If this fails: Different vision models with other patch sizes or pooling strategies will produce tensor shape mismatches in multimodal forward passes, causing silent incorrect results or crashes
scripts/bench_qwen.py:DummyDataset
URL path structure always follows the pattern '/lang/' and can be safely replaced with string manipulation
If this fails: Complex URL paths, encoded characters, or paths without language prefixes will cause invalid redirects that break navigation or lose query parameters
docs/_static/js/switcher.js:select.addEventListener
Source checkpoint files contain only tensors in formats that safetensors.safe_open() and torch.load() can handle without compatibility issues
If this fails: Mixed checkpoint formats, custom tensor types, or version mismatches between safetensors and PyTorch will cause conversion failures with unclear error messages
scripts/convert_ckpt/llamafy_qwen.py:qwen_state_dict
GPU memory cleanup task will continue running throughout the FastAPI application lifecycle without being cancelled or blocked
If this fails: If the cleanup task gets cancelled by asyncio or blocked by long-running operations, memory will accumulate indefinitely until the process crashes
src/llamafactory/api/app.py:lifespan
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry — Maps model names to their configurations, supported features, and template requirements across 100+ model architectures
- Checkpoint Directory — Accumulates training checkpoints, adapter weights, optimizer states, and training logs throughout the training process
- Caches processed datasets and tokenized examples to avoid recomputation across training runs
- Stores base model weights, LoRA adapters, and merged checkpoints in safetensors or PyTorch format
Feedback Loops
- Training optimization loop (training-loop, balancing) — Trigger: Start of each training step. Action: Forward pass computes loss on instruction data, backward pass updates LoRA parameters, optimizer applies gradients with learning rate scheduling. Exit: Reaches max_steps or num_train_epochs.
- Gradient accumulation cycle (gradient-accumulation, balancing) — Trigger: Each micro-batch in training step. Action: Accumulates gradients across gradient_accumulation_steps micro-batches before applying optimizer step. Exit: Reaches gradient_accumulation_steps count. (A minimal sketch of this pattern follows the list.)
- Learning rate scheduling (training-loop, balancing) — Trigger: Each optimizer step completion. Action: Adjusts learning rate based on warmup schedule and decay strategy. Exit: Training completion.
- Checkpoint saving cycle (checkpoint-save, reinforcing) — Trigger: Every save_steps training steps or epoch completion. Action: Saves model state, optimizer state, and training metrics to disk. Exit: Training completion or disk space exhaustion.
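To make the gradient accumulation cycle above concrete, here is the standard PyTorch accumulation idiom with a stand-in model and synthetic data; it is not the HuggingFace Trainer's internal implementation.

```python
import torch

# Stand-in model, optimizer, and data; the accumulation pattern is what matters.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

gradient_accumulation_steps = 4
micro_batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(12)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient matches one large batch.
    (loss / gradient_accumulation_steps).backward()

    if step % gradient_accumulation_steps == 0:
        optimizer.step()       # one optimizer step per accumulation cycle
        optimizer.zero_grad()  # start the next cycle with clean gradients
```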
Delays
- Model loading delay (warmup, ~10-60 seconds depending on model size and quantization) — Blocks training start while model weights are loaded from disk and quantized if specified
- Dataset preprocessing delay (batch-window, variable with dataset size and tokenization complexity) — All conversation data must be tokenized and batched before training can begin
- Checkpoint save delay (checkpoint-save, ~5-30 seconds per checkpoint depending on model size) — Training pauses while model state is serialized to disk at regular intervals
- Distributed setup delay (warmup, ~5-15 seconds for multi-GPU initialization) — All processes must synchronize and establish communication before distributed training begins
Control Points
- Finetuning method selection (architecture-switch) — Controls: Whether to use LoRA, QLoRA, full parameter training, or freeze methods - fundamentally changes which parameters receive gradients. Default: finetuning_type parameter
- Quantization precision (precision-mode) — Controls: Model precision (4-bit, 8-bit, fp16, bf16) affecting memory usage and training stability. Default: quantization_bit parameter. (A minimal 4-bit loading sketch follows the list.)
- Learning rate schedule (hyperparameter) — Controls: Optimizer step size and warmup behavior affecting convergence speed and stability. Default: learning_rate, warmup_steps, lr_scheduler_type parameters
- Attention mechanism (architecture-switch) — Controls: Whether to use flash attention, sdpa, or eager attention affecting memory efficiency and speed. Default: flash_attn parameter
- Template selection (architecture-switch) — Controls: Conversation formatting and special token usage for different model families. Default: template parameter
- Batch size scaling (hyperparameter) — Controls: Memory usage and gradient noise through micro_batch_size and gradient_accumulation_steps. Default: per_device_train_batch_size, gradient_accumulation_steps parameters
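For the quantization control point noted above, here is a minimal sketch of how 4-bit loading is commonly configured through bitsandbytes via transformers. The model name and dtype choices are illustrative assumptions, not the repository's defaults, and a CUDA GPU with bitsandbytes installed is required.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit (QLoRA-style) setup; values here are common choices, not project defaults.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # any causal LM; this name is an assumption
    quantization_config=bnb_config,
    device_map="auto",
)
```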
Technology Stack
- PyTorch — Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training
- HuggingFace Transformers — Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing
- PEFT — Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters
- bitsandbytes — 4-bit and 8-bit quantization for memory-efficient training and inference of large models
- Accelerate — Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization
- FastAPI — HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support
- Gradio — Web-based user interface for interactive training configuration, real-time monitoring, and model testing
- HuggingFace Datasets — Efficient loading and processing of training datasets from HuggingFace Hub and local files with streaming support
Key Components
- ChatModel (orchestrator, src/llamafactory/chat/chat_model.py) — Coordinates model loading, tokenizer setup, and generation pipeline with support for different inference engines (HuggingFace, vLLM, LlamaCpp) and handles both text and multimodal conversations
- CustomSeq2SeqTrainer (orchestrator, src/llamafactory/train/sft/trainer.py) — Extends HuggingFace Trainer with custom loss computation, prediction processing, and evaluation metrics specifically designed for instruction-following and conversation fine-tuning
- get_dataset (loader, src/llamafactory/data/loader.py) — Loads and combines multiple datasets from HuggingFace Hub or local files, applies dataset-specific formatting, handles streaming for large datasets, and manages train/eval splits
- get_template_and_fix_tokenizer (adapter, src/llamafactory/data/template.py) — Selects conversation templates for different model families (Llama, Qwen, ChatGLM, etc.), configures special tokens, and modifies tokenizer settings for proper instruction formatting (see the chat-template sketch after this list)
- load_model_and_tokenizer (factory, src/llamafactory/model/loader.py) — Creates model instances with proper quantization (4-bit, 8-bit), applies efficient training methods (LoRA, QLoRA, full), configures flash attention and RoPE scaling, and handles multimodal processors
- preprocess_supervised_dataset (processor, src/llamafactory/data/processor/supervised.py) — Converts raw conversation data into tokenized training examples with proper label masking, handles prompt-response separation for instruction following, and applies conversation templates
- run_sft (executor, src/llamafactory/train/sft/workflow.py) — Orchestrates the complete supervised fine-tuning pipeline from dataset loading through model training, handles distributed training setup, checkpoint saving, and evaluation
- MultiModalPlugin (processor, src/llamafactory/data/mm_plugin.py) — Handles multimodal data processing for vision-language models by downloading media from URLs, applying model-specific preprocessing, and integrating visual/audio tokens with text
- create_app (gateway, src/llamafactory/api/app.py) — Creates FastAPI application with OpenAI-compatible endpoints for chat completions, handles streaming responses, authentication, and CORS configuration for model serving
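To illustrate template-based conversation formatting, the generic HuggingFace apply_chat_template API gives the flavor of what get_template_and_fix_tokenizer produces. LLaMA-Factory maintains its own template registry, so treat this as an analogous sketch rather than the repository's code path; the model name is an assumption.

```python
from transformers import AutoTokenizer

# Any chat model with a built-in chat template works; this name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does LoRA change during fine-tuning?"},
]

# Render the conversation with the model family's special tokens and role markers,
# appending the assistant prefix so generation starts in the right place.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```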
Frequently Asked Questions
What is LlamaFactory used for?
LlamaFactory fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training. hiyouga/llamafactory is a 9-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 278 files.
How is LlamaFactory architected?
LlamaFactory is organized into 5 architecture layers: Interface Layer, Configuration Layer, Training Engine, Model Management, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through LlamaFactory?
Data moves through 6 stages: Parse configuration and arguments → Load and configure base model → Process training datasets → Execute training loop → Save model and adapters → Serve model for inference. Users specify the model, dataset, and training method through the CLI or web interface; the base model is loaded (optionally quantized) and wrapped with LoRA/QLoRA adapters, datasets are templated, tokenized, and label-masked into batches, and the training loop updates only the adapter parameters with periodic evaluation. This pipeline design reflects a complex multi-stage processing system.
What technologies does LlamaFactory use?
The core stack includes PyTorch (Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training), HuggingFace Transformers (Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing), PEFT (Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters), bitsandbytes (4-bit and 8-bit quantization for memory-efficient training and inference of large models), Accelerate (Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization), FastAPI (HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does LlamaFactory have?
LlamaFactory exhibits 4 data pools (Model Registry, Checkpoint Directory), 4 feedback loops, 6 control points, 4 delays. The feedback loops handle training-loop and gradient-accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does LlamaFactory use?
5 design patterns detected: Parameter-Efficient Training, Template-Based Conversation Formatting, Multi-Method Training Pipeline, Extensible Model Registry, Multimodal Data Integration.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.