hiyouga/llamafactory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training
Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.
A 9-component ML training system. 278 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Training begins when users specify model, dataset, and training method through CLI or web interface. The system loads the base model with optional quantization and wraps it with efficient adapters (LoRA/QLoRA). Raw datasets are processed through conversation templates, tokenized, and converted to training batches with proper label masking. The training loop alternates between forward passes that compute loss on instruction-following tasks, backward passes that update only the adapter parameters, and periodic evaluation. For inference, the trained model processes chat messages through the same template system to generate responses.
- Parse configuration and arguments — The CLI parser in main() processes command line arguments and config files to create ModelArguments, TrainingArguments, DataArguments, and FinetuningArguments dataclasses with validation (config: model_name_or_path, template, finetuning_type +2)
- Load and configure base model — load_model_and_tokenizer() loads the specified model with quantization settings, applies flash attention and RoPE scaling, wraps with LoRA adapters if specified, and configures the tokenizer with proper special tokens [ModelArguments → Model] (config: quantization_bit, flash_attn, rope_scaling +2)
- Process training datasets — get_dataset() loads datasets from HuggingFace or local files, applies conversation templates through get_template_and_fix_tokenizer(), then preprocess_supervised_dataset() tokenizes conversations and creates training batches with proper label masking [HF Dataset → TrainingBatch] (config: dataset, template, cutoff_len +1)
- Execute training loop — CustomSeq2SeqTrainer orchestrates the training with forward passes computing cross-entropy loss on conversation completions, backward passes updating only LoRA parameters via gradient descent, and periodic evaluation on validation sets [TrainingBatch → Trained model] (config: learning_rate, per_device_train_batch_size, gradient_accumulation_steps +1)
- Save model and adapters — The trainer saves LoRA adapter weights, merges them with base model if requested, and exports in HuggingFace format with proper configuration files and tokenizer settings [Trained model → Saved checkpoint] (config: output_dir, save_strategy, save_steps)
- Serve model for inference — ChatModel loads the fine-tuned model and serves it through FastAPI endpoints, processing ChatMessage objects through conversation templates and generating responses with configurable sampling parameters [ChatMessage → ResponseMetadata] (config: temperature, top_p, max_new_tokens +1)
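For the final serving stage, here is a minimal sketch of what a client request could look like. The /v1/chat/completions path follows the OpenAI-compatible convention described above; the host, port, model name, and message contents are illustrative assumptions, not values taken from the repository.

```python
import requests

# Hypothetical local endpoint; host, port, and model name are assumptions.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-finetuned-model",  # illustrative model name
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what LoRA fine-tuning does."},
    ],
    # Sampling knobs corresponding to the config listed above.
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```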
Data Models
The data structures that flow between stages — the contracts that hold the system together.
src/llamafactory/data/processor/supervised.py — dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], where labels use IGNORE_INDEX (-100) for positions that shouldn't contribute to loss
Created from tokenized conversation data with proper masking for instruction-following tasks, flows through the model for forward/backward passes, then discarded after gradient computation
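To make the label masking concrete, here is a minimal, self-contained sketch of how prompt tokens can be masked with IGNORE_INDEX so that only the response contributes to the loss. The token IDs and the prompt/response split are invented for illustration and are not taken from the repository's preprocessing code.

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

# Illustrative token IDs: 5 prompt tokens followed by 4 response tokens.
prompt_ids = [1, 523, 1792, 29962, 1724]
response_ids = [739, 338, 263, 2]

input_ids = torch.tensor([prompt_ids + response_ids])  # shape [B=1, seq_len=9]
attention_mask = torch.ones_like(input_ids)

# Labels copy the inputs but mask the prompt span, so the model is only
# penalized for the response it should learn to produce.
labels = input_ids.clone()
labels[0, : len(prompt_ids)] = IGNORE_INDEX

batch = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
print(batch["labels"])  # tensor([[-100, -100, -100, -100, -100, 739, 338, 263, 2]])
```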
src/llamafactory/hparams/model_args.py — dataclass with model_name_or_path: str, adapter_name_or_path: Optional[str], template: Optional[str], flash_attn: str, rope_scaling: Optional[str], and 20+ model configuration fields
Parsed from command line or config files, validated for compatibility, then used throughout model loading and training setup to control architecture and optimization settings
src/llamafactory/api/protocol.py — Pydantic model with role: Role (system/user/assistant/tool), content: str or list[dict] for multimodal, tool_calls: Optional[list[ToolCall]], name: Optional[str]
Received from API clients as JSON, validated through Pydantic, converted to internal conversation format via templates, then processed by the model for response generation
src/llamafactory/data/mm_plugin.py — dict containing images: Optional[Tensor[B, C, H, W]], videos: Optional[Tensor[B, F, C, H, W]], audio: Optional[Tensor[B, seq_len]], with processor-specific keys for different VLM architectures
Extracted from conversation data containing image/video/audio URLs, processed through architecture-specific multimodal processors, then combined with text tokens for unified model input
src/llamafactory/hparams/training_args.py — dataclass extending HF TrainingArguments with output_dir: str, learning_rate: float, num_train_epochs: int, per_device_train_batch_size: int, gradient_accumulation_steps: int, plus LoRA-specific and optimization parameters
Assembled from CLI args and config files, validated for hardware constraints and method compatibility, then passed to HuggingFace Trainer to control the entire training process
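Since these argument dataclasses are assembled from CLI args and config files, a rough sketch of the general pattern using HuggingFace's HfArgumentParser may help. The dataclass below is a small hypothetical subset, not the repository's actual ModelArguments, and the real parser adds further validation.

```python
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser, TrainingArguments


@dataclass
class SimpleModelArguments:
    """Hypothetical subset of model-related options, for illustration only."""
    model_name_or_path: str = field(metadata={"help": "Base model to load."})
    template: Optional[str] = field(default=None, metadata={"help": "Chat template name."})
    quantization_bit: Optional[int] = field(default=None, metadata={"help": "4 or 8."})


# Invoke e.g.:
#   python parse_args.py --model_name_or_path gpt2 --output_dir out --learning_rate 1e-4
parser = HfArgumentParser((SimpleModelArguments, TrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses()
print(model_args.model_name_or_path, training_args.learning_rate)
```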
src/llamafactory/model/adapter.py — PEFT LoraConfig with r: int (rank), lora_alpha: int, lora_dropout: float, target_modules: list[str], task_type: TaskType, plus method-specific parameters for QLoRA, LongLoRA variants
Constructed from ModelArguments during model loading, used by PEFT library to wrap the base model with efficient low-rank adapters, then controls which parameters receive gradients during training
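As a sketch of how a configuration of this shape wraps a base model, the snippet below uses the PEFT library directly. The rank, alpha, dropout, and target module names are illustrative defaults rather than the repository's values, and a small GPT-2 stands in for the base model.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative hyperparameters; in LLaMA-Factory these come from the finetuning arguments.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; Llama-style models use q_proj/v_proj etc.
    task_type=TaskType.CAUSAL_LM,
)

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model
peft_model = get_peft_model(base_model, lora_config)

# Only the low-rank adapter matrices are trainable; the base weights stay frozen.
peft_model.print_trainable_parameters()
```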
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The API_KEY environment variable, if set, contains a valid bearer token string without parsing or format validation
If this fails: If API_KEY contains stray whitespace, newlines, or unexpected encoding, authentication will silently fail with confusing 401 errors instead of clear validation messages
src/llamafactory/api/app.py:create_app
GPU memory cleanup every 300 seconds is sufficient to prevent out-of-memory crashes regardless of request volume, model size, or batch sizes
If this fails: High-throughput APIs serving large models could accumulate GPU memory faster than the cleanup interval, leading to CUDA OOM errors between sweeps
src/llamafactory/api/app.py:sweeper
Image URLs (like qianwen-res.oss-cn-beijing.aliyuncs.com) will remain accessible and return images in formats the model processor expects
If this fails: If external image URLs become inaccessible, return 404s, or serve different content types, multimodal inference will fail with cryptic tensor shape errors rather than clear network/format errors
scripts/api_example/test_image.py:messages
The vocab_size hardcoded as 32768 matches the actual vocabulary size of all models used in benchmarking
If this fails: Benchmarking models with different vocabulary sizes (like 128K vocab models) will generate invalid token IDs, leading to embedding lookup errors or meaningless performance metrics
scripts/bench_qwen.py:DummyDataset.__init__
All grade inputs will be exactly 'A', 'B', or 'C' strings, and the hours list will have the same length as grades
If this fails: Passing grades like 'A+', 'D', or mismatched list lengths will cause KeyError or index errors instead of graceful validation failures
scripts/api_example/test_toolcall.py:calculate_gpa
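If graceful failure were wanted here, a defensive variant of the GPA calculation could validate its inputs up front. This is only a sketch under the assumption of a simple 4-point scale; it is not the script's actual implementation.

```python
def calculate_gpa_safe(grades: list[str], hours: list[int]) -> float:
    """Weighted GPA with explicit validation instead of silent KeyError/IndexError."""
    grade_points = {"A": 4.0, "B": 3.0, "C": 2.0}  # assumed scale

    if len(grades) != len(hours):
        raise ValueError(f"grades ({len(grades)}) and hours ({len(hours)}) differ in length")
    unknown = [g for g in grades if g not in grade_points]
    if unknown:
        raise ValueError(f"unsupported grades: {unknown}; expected one of {sorted(grade_points)}")

    total_hours = sum(hours)
    if total_hours == 0:
        raise ValueError("total credit hours must be positive")
    return sum(grade_points[g] * h for g, h in zip(grades, hours)) / total_hours


print(calculate_gpa_safe(["A", "B", "C"], [3, 4, 3]))  # 3.0
```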
All .bin files in the input directory contain PyTorch state dict fragments that can be safely merged by updating an OrderedDict
If this fails: If .bin files contain non-tensor data, have conflicting keys, or are corrupted, torch.load will fail or silently create invalid merged state dictionaries
scripts/convert_ckpt/llamafy_baichuan2.py:save_weight
The hardcoded image token calculation (18 * 18 // (2 * 2)) matches the specific vision encoder architecture being benchmarked
If this fails: Different vision models with other patch sizes or pooling strategies will produce tensor shape mismatches in multimodal forward passes, causing silent incorrect results or crashes
scripts/bench_qwen.py:DummyDataset
URL path structure always follows the pattern '/lang/' and can be safely replaced with string manipulation
If this fails: Complex URL paths, encoded characters, or paths without language prefixes will cause invalid redirects that break navigation or lose query parameters
docs/_static/js/switcher.js:select.addEventListener
Source checkpoint files contain only tensors in formats that safetensors.safe_open() and torch.load() can handle without compatibility issues
If this fails: Mixed checkpoint formats, custom tensor types, or version mismatches between safetensors and PyTorch will cause conversion failures with unclear error messages
scripts/convert_ckpt/llamafy_qwen.py:qwen_state_dict
GPU memory cleanup task will continue running throughout the FastAPI application lifecycle without being cancelled or blocked
If this fails: If the cleanup task gets cancelled by asyncio or blocked by long-running operations, memory will accumulate indefinitely until the process crashes
src/llamafactory/api/app.py:lifespan
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry — Maps model names to their configurations, supported features, and template requirements across 100+ model architectures
- Checkpoint Directory — Accumulates training checkpoints, adapter weights, optimizer states, and training logs throughout the training process
- Caches processed datasets and tokenized examples to avoid recomputation across training runs
- Stores base model weights, LoRA adapters, and merged checkpoints in safetensors or PyTorch format
Feedback Loops
- Training optimization loop (training-loop, balancing) — Trigger: Start of each training step. Action: Forward pass computes loss on instruction data, backward pass updates LoRA parameters, optimizer applies gradients with learning rate scheduling. Exit: Reaches max_steps or num_train_epochs.
- Gradient accumulation cycle (gradient-accumulation, balancing) — Trigger: Each micro-batch in training step. Action: Accumulates gradients across gradient_accumulation_steps micro-batches before applying optimizer step. Exit: Reaches gradient_accumulation_steps count. (A minimal sketch of this pattern follows the list.)
- Learning rate scheduling (training-loop, balancing) — Trigger: Each optimizer step completion. Action: Adjusts learning rate based on warmup schedule and decay strategy. Exit: Training completion.
- Checkpoint saving cycle (checkpoint-save, reinforcing) — Trigger: Every save_steps training steps or epoch completion. Action: Saves model state, optimizer state, and training metrics to disk. Exit: Training completion or disk space exhaustion.
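To make the gradient accumulation cycle above concrete, here is the standard PyTorch accumulation idiom with a stand-in model and synthetic data; it is not the HuggingFace Trainer's internal implementation.

```python
import torch

# Stand-in model, optimizer, and data; the accumulation pattern is what matters.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

gradient_accumulation_steps = 4
micro_batches = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(12)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient matches one large batch.
    (loss / gradient_accumulation_steps).backward()

    if step % gradient_accumulation_steps == 0:
        optimizer.step()       # one optimizer step per accumulation cycle
        optimizer.zero_grad()  # start the next cycle with clean gradients
```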
Delays
- Model loading delay (warmup, ~10-60 seconds depending on model size and quantization) — Blocks training start while model weights are loaded from disk and quantized if specified
- Dataset preprocessing delay (batch-window, variable with dataset size and tokenization complexity) — All conversation data must be tokenized and batched before training can begin
- Checkpoint save delay (checkpoint-save, ~5-30 seconds per checkpoint depending on model size) — Training pauses while model state is serialized to disk at regular intervals
- Distributed setup delay (warmup, ~5-15 seconds for multi-GPU initialization) — All processes must synchronize and establish communication before distributed training begins
Control Points
- Finetuning method selection (architecture-switch) — Controls: Whether to use LoRA, QLoRA, full parameter training, or freeze methods - fundamentally changes which parameters receive gradients. Default: finetuning_type parameter
- Quantization precision (precision-mode) — Controls: Model precision (4-bit, 8-bit, fp16, bf16) affecting memory usage and training stability. Default: quantization_bit parameter. (A minimal 4-bit loading sketch follows the list.)
- Learning rate schedule (hyperparameter) — Controls: Optimizer step size and warmup behavior affecting convergence speed and stability. Default: learning_rate, warmup_steps, lr_scheduler_type parameters
- Attention mechanism (architecture-switch) — Controls: Whether to use flash attention, sdpa, or eager attention affecting memory efficiency and speed. Default: flash_attn parameter
- Template selection (architecture-switch) — Controls: Conversation formatting and special token usage for different model families. Default: template parameter
- Batch size scaling (hyperparameter) — Controls: Memory usage and gradient noise through micro_batch_size and gradient_accumulation_steps. Default: per_device_train_batch_size, gradient_accumulation_steps parameters
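For the quantization control point noted above, here is a minimal sketch of how 4-bit loading is commonly configured through bitsandbytes via transformers. The model name and dtype choices are illustrative assumptions, not the repository's defaults, and a CUDA GPU with bitsandbytes installed is required.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit (QLoRA-style) setup; values here are common choices, not project defaults.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # any causal LM; this name is an assumption
    quantization_config=bnb_config,
    device_map="auto",
)
```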
Technology Stack
- PyTorch — Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training
- HuggingFace Transformers — Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing
- PEFT — Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters
- bitsandbytes — 4-bit and 8-bit quantization for memory-efficient training and inference of large models
- Accelerate — Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization
- FastAPI — HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support
- Gradio — Web-based user interface for interactive training configuration, real-time monitoring, and model testing
- HuggingFace Datasets — Efficient loading and processing of training datasets from HuggingFace Hub and local files with streaming support
Key Components
- ChatModel (orchestrator, src/llamafactory/chat/chat_model.py) — Coordinates model loading, tokenizer setup, and generation pipeline with support for different inference engines (HuggingFace, vLLM, LlamaCpp) and handles both text and multimodal conversations
- CustomSeq2SeqTrainer (orchestrator, src/llamafactory/train/sft/trainer.py) — Extends HuggingFace Trainer with custom loss computation, prediction processing, and evaluation metrics specifically designed for instruction-following and conversation fine-tuning
- get_dataset (loader, src/llamafactory/data/loader.py) — Loads and combines multiple datasets from HuggingFace Hub or local files, applies dataset-specific formatting, handles streaming for large datasets, and manages train/eval splits
- get_template_and_fix_tokenizer (adapter, src/llamafactory/data/template.py) — Selects conversation templates for different model families (Llama, Qwen, ChatGLM, etc.), configures special tokens, and modifies tokenizer settings for proper instruction formatting (see the chat-template sketch after this list)
- load_model_and_tokenizer (factory, src/llamafactory/model/loader.py) — Creates model instances with proper quantization (4-bit, 8-bit), applies efficient training methods (LoRA, QLoRA, full), configures flash attention and RoPE scaling, and handles multimodal processors
- preprocess_supervised_dataset (processor, src/llamafactory/data/processor/supervised.py) — Converts raw conversation data into tokenized training examples with proper label masking, handles prompt-response separation for instruction following, and applies conversation templates
- run_sft (executor, src/llamafactory/train/sft/workflow.py) — Orchestrates the complete supervised fine-tuning pipeline from dataset loading through model training, handles distributed training setup, checkpoint saving, and evaluation
- MultiModalPlugin (processor, src/llamafactory/data/mm_plugin.py) — Handles multimodal data processing for vision-language models by downloading media from URLs, applying model-specific preprocessing, and integrating visual/audio tokens with text
- create_app (gateway, src/llamafactory/api/app.py) — Creates FastAPI application with OpenAI-compatible endpoints for chat completions, handles streaming responses, authentication, and CORS configuration for model serving
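To illustrate template-based conversation formatting, the generic HuggingFace apply_chat_template API gives the flavor of what get_template_and_fix_tokenizer produces. LLaMA-Factory maintains its own template registry, so treat this as an analogous sketch rather than the repository's code path; the model name is an assumption.

```python
from transformers import AutoTokenizer

# Any chat model with a built-in chat template works; this name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What does LoRA change during fine-tuning?"},
]

# Render the conversation with the model family's special tokens and role markers,
# appending the assistant prefix so generation starts in the right place.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```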
Frequently Asked Questions
What is LlamaFactory used for?
LlamaFactory fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training. hiyouga/llamafactory is a 9-component ML training system written in Python; data flows through 6 distinct pipeline stages, and the codebase contains 278 files.
How is LlamaFactory architected?
LlamaFactory is organized into 5 architecture layers: Interface Layer, Configuration Layer, Training Engine, Model Management, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through LlamaFactory?
Data moves through 6 stages: Parse configuration and arguments → Load and configure base model → Process training datasets → Execute training loop → Save model and adapters → Serve model for inference. Users specify the model, dataset, and training method through the CLI or web interface; the base model is loaded (optionally quantized) and wrapped with LoRA/QLoRA adapters, datasets are templated, tokenized, and label-masked into batches, and the training loop updates only the adapter parameters with periodic evaluation. This pipeline design reflects a complex multi-stage processing system.
What technologies does LlamaFactory use?
The core stack includes PyTorch (Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training), HuggingFace Transformers (Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing), PEFT (Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters), bitsandbytes (4-bit and 8-bit quantization for memory-efficient training and inference of large models), Accelerate (Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization), FastAPI (HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does LlamaFactory have?
LlamaFactory exhibits 4 data pools (Model Registry, Checkpoint Directory), 4 feedback loops, 6 control points, 4 delays. The feedback loops handle training-loop and gradient-accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does LlamaFactory use?
5 design patterns detected: Parameter-Efficient Training, Template-Based Conversation Formatting, Multi-Method Training Pipeline, Extensible Model Registry, Multimodal Data Integration.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.