hiyouga/llamafactory

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

70,341 stars · Python · 9 components

Fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training


Under the hood, the system uses 4 feedback loops, 4 data pools, and 6 control points to manage its runtime behavior.

A 9-component ML training system. 278 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

Training begins when users specify model, dataset, and training method through CLI or web interface. The system loads the base model with optional quantization and wraps it with efficient adapters (LoRA/QLoRA). Raw datasets are processed through conversation templates, tokenized, and converted to training batches with proper label masking. The training loop alternates between forward passes that compute loss on instruction-following tasks, backward passes that update only the adapter parameters, and periodic evaluation. For inference, the trained model processes chat messages through the same template system to generate responses.

  1. Parse configuration and arguments — The CLI parser in main() processes command line arguments and config files to create ModelArguments, TrainingArguments, DataArguments, and FinetuningArguments dataclasses with validation (config: model_name_or_path, template, finetuning_type +2)
  2. Load and configure base model — load_model_and_tokenizer() loads the specified model with quantization settings, applies flash attention and RoPE scaling, wraps with LoRA adapters if specified, and configures the tokenizer with proper special tokens [ModelArguments → Model] (config: quantization_bit, flash_attn, rope_scaling +2; see the sketch after this list)
  3. Process training datasets — get_dataset() loads datasets from HuggingFace or local files, applies conversation templates through get_template_and_fix_tokenizer(), then preprocess_supervised_dataset() tokenizes conversations and creates training batches with proper label masking [HF Dataset → TrainingBatch] (config: dataset, template, cutoff_len +1)
  4. Execute training loop — CustomSeq2SeqTrainer orchestrates the training with forward passes computing cross-entropy loss on conversation completions, backward passes updating only LoRA parameters via gradient descent, and periodic evaluation on validation sets [TrainingBatch → Trained model] (config: learning_rate, per_device_train_batch_size, gradient_accumulation_steps +1)
  5. Save model and adapters — The trainer saves LoRA adapter weights, merges them with base model if requested, and exports in HuggingFace format with proper configuration files and tokenizer settings [Trained model → Saved checkpoint] (config: output_dir, save_strategy, save_steps)
  6. Serve model for inference — ChatModel loads the fine-tuned model and serves it through FastAPI endpoints, processing ChatMessage objects through conversation templates and generating responses with configurable sampling parameters [ChatMessage → ResponseMetadata] (config: temperature, top_p, max_new_tokens +1)
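
A condensed sketch of stages 2 and 4, written against the public transformers and PEFT APIs rather than LLaMA-Factory's internal load_model_and_tokenizer() and CustomSeq2SeqTrainer, looks roughly like the following; the model name and hyperparameters are illustrative.

  import torch
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            BitsAndBytesConfig, Trainer, TrainingArguments)

  model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  # Stage 2: load the base model with optional 4-bit quantization (QLoRA).
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      quantization_config=BitsAndBytesConfig(load_in_4bit=True),
      torch_dtype=torch.bfloat16,
  )
  model = prepare_model_for_kbit_training(model)

  # Wrap with LoRA adapters: only the low-rank matrices receive gradients.
  model = get_peft_model(model, LoraConfig(
      r=8, lora_alpha=16, lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
  ))
  model.print_trainable_parameters()  # typically well under 1% trainable

  # Stage 4: the HF Trainer (subclassed as CustomSeq2SeqTrainer) drives the
  # forward/backward/evaluation loop once datasets are attached.
  args = TrainingArguments(
      output_dir="out", learning_rate=1e-4,
      per_device_train_batch_size=2, gradient_accumulation_steps=8,
  )
  # trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
  # trainer.train()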

Data Models

The data structures that flow between stages — the contracts that hold the system together.

TrainingBatch (src/llamafactory/data/processor/supervised.py)
dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len] where labels use IGNORE_INDEX (-100) for positions that shouldn't contribute to loss
Created from tokenized conversation data with proper masking for instruction-following tasks, flows through the model for forward/backward passes, then discarded after gradient computation
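
A minimal sketch of what this masking amounts to; token IDs and lengths here are made up.

  import torch

  IGNORE_INDEX = -100  # positions with this label are skipped by the loss

  prompt_ids = [1, 529, 3148, 1001]  # hypothetical prompt tokens
  response_ids = [12968, 29901, 2]   # hypothetical response tokens

  input_ids = torch.tensor([prompt_ids + response_ids])  # [B, seq_len]
  attention_mask = torch.ones_like(input_ids)
  labels = torch.tensor([[IGNORE_INDEX] * len(prompt_ids) + response_ids])

  batch = {"input_ids": input_ids, "attention_mask": attention_mask,
           "labels": labels}
  # model(**batch).loss is then computed only over the response positions.
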
ModelArguments (src/llamafactory/hparams/model_args.py)
dataclass with model_name_or_path: str, adapter_name_or_path: Optional[str], template: Optional[str], flash_attn: str, rope_scaling: Optional[str], and 20+ model configuration fields
Parsed from command line or config files, validated for compatibility, then used throughout model loading and training setup to control architecture and optimization settings
ChatMessage (src/llamafactory/api/protocol.py)
Pydantic model with role: Role (system/user/assistant/tool), content: str or list[dict] for multimodal, tool_calls: Optional[list[ToolCall]], name: Optional[str]
Received from API clients as JSON, validated through Pydantic, converted to internal conversation format via templates, then processed by the model for response generation
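
A simplified Pydantic v2 sketch of this contract; the real model in src/llamafactory/api/protocol.py carries more fields and validators.

  from enum import Enum
  from typing import Optional, Union
  from pydantic import BaseModel

  class Role(str, Enum):
      SYSTEM = "system"
      USER = "user"
      ASSISTANT = "assistant"
      TOOL = "tool"

  class ChatMessage(BaseModel):
      role: Role
      content: Union[str, list[dict], None] = None  # list[dict] for multimodal parts
      tool_calls: Optional[list[dict]] = None       # simplified; real type is list[ToolCall]
      name: Optional[str] = None

  msg = ChatMessage.model_validate({"role": "user", "content": "Hello!"})
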
MultiModalInputs (src/llamafactory/data/mm_plugin.py)
dict containing images: Optional[Tensor[B, C, H, W]], videos: Optional[Tensor[B, F, C, H, W]], audio: Optional[Tensor[B, seq_len]], with processor-specific keys for different VLM architectures
Extracted from conversation data containing image/video/audio URLs, processed through architecture-specific multimodal processors, then combined with text tokens for unified model input
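
A shape-only sketch of such a dict; the exact keys vary by VLM processor, so treat these as illustrative.

  import torch

  B, C, H, W, F, A = 2, 3, 336, 336, 8, 16000  # batch, channels, height, width, frames, audio samples
  mm_inputs = {
      "images": torch.zeros(B, C, H, W),     # pixel values per image
      "videos": torch.zeros(B, F, C, H, W),  # F sampled frames per video
      "audio": torch.zeros(B, A),            # raw waveform samples
  }
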
TrainingArguments (src/llamafactory/hparams/training_args.py)
dataclass extending HF TrainingArguments with output_dir: str, learning_rate: float, num_train_epochs: int, per_device_train_batch_size: int, gradient_accumulation_steps: int, plus LoRA-specific and optimization parameters
Assembled from CLI args and config files, validated for hardware constraints and method compatibility, then passed to HuggingFace Trainer to control the entire training process
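
The parsing pattern (stage 1) follows HuggingFace's HfArgumentParser; a simplified sketch with made-up dataclass fields:

  from dataclasses import dataclass, field
  from typing import Optional
  from transformers import HfArgumentParser, TrainingArguments

  @dataclass
  class ModelArguments:
      model_name_or_path: str = field(metadata={"help": "HF model ID or local path"})
      template: Optional[str] = None

  # Reads sys.argv, e.g.: python train.py --model_name_or_path gpt2 --output_dir out
  parser = HfArgumentParser((ModelArguments, TrainingArguments))
  model_args, training_args = parser.parse_args_into_dataclasses()
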
LoRAConfig (src/llamafactory/model/adapter.py)
PEFT LoraConfig with r: int (rank), lora_alpha: int, lora_dropout: float, target_modules: list[str], task_type: TaskType, plus method-specific parameters for QLoRA, LongLoRA variants
Constructed from ModelArguments during model loading, used by PEFT library to wrap the base model with efficient low-rank adapters, then controls which parameters receive gradients during training
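
A quick way to confirm the wrapping froze the base weights is to count parameters with requires_grad, assuming model is the PEFT-wrapped model from the earlier sketch.

  trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
  total = sum(p.numel() for p in model.parameters())
  print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")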

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

[critical] Environment (unguarded)

The API_KEY environment variable, if set, contains a valid bearer token string; the code never parses or validates its format

If this fails: If API_KEY is set to malformed JSON, contains newlines, or uses unexpected encoding, authentication will silently fail with confusing 401 errors instead of clear validation messages

src/llamafactory/api/app.py:create_app
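
A hedged sketch of the missing guard; the format rule here (non-empty, single line, printable ASCII) is a guess at a reasonable bearer-token shape, not LLaMA-Factory's actual policy.

  import os

  def validate_api_key() -> None:
      key = os.getenv("API_KEY")
      if key is None:
          return  # auth disabled on purpose
      if not key or "\n" in key or not key.isascii() or not key.isprintable():
          raise RuntimeError("API_KEY is set but is not a valid bearer token string")
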
[critical] Resource (unguarded)

GPU memory cleanup every 300 seconds is sufficient to prevent out-of-memory crashes regardless of request volume, model size, or batch sizes

If this fails: High-throughput APIs serving large models could accumulate GPU memory faster than the sweeper reclaims it, leading to CUDA OOM errors between sweeps

src/llamafactory/api/app.py:sweeper
[warning] Domain (unguarded)

Image URLs (like qianwen-res.oss-cn-beijing.aliyuncs.com) will remain accessible and return images in formats the model processor expects

If this fails: If external image URLs become inaccessible, return 404s, or serve different content types, multimodal inference will fail with cryptic tensor shape errors rather than clear network/format errors

scripts/api_example/test_image.py:messages
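
A defensive sketch: fetch and decode the image up front so network or format problems surface as clear errors instead of tensor-shape failures downstream (requests and Pillow assumed; the URL is truncated).

  import io
  import requests
  from PIL import Image

  url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/..."  # truncated example
  resp = requests.get(url, timeout=10)
  resp.raise_for_status()  # clear error on 404s and network failures
  image = Image.open(io.BytesIO(resp.content)).convert("RGB")  # clear error on bad formats
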
[warning] Contract (unguarded)

The vocab_size hardcoded as 32768 matches the actual vocabulary size of all models used in benchmarking

If this fails: Benchmarking models with different vocabulary sizes (like 128K vocab models) will generate invalid token IDs, leading to embedding lookup errors or meaningless performance metrics

scripts/bench_qwen.py:DummyDataset.__init__
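
A sketch of deriving the vocabulary size from the model config instead of hardcoding 32768, so the dummy data can never exceed the embedding table; the model ID is illustrative.

  from transformers import AutoConfig

  vocab_size = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct").vocab_size
  # torch.randint(0, vocab_size, (batch, seq_len)) now always yields valid token IDs
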
[warning] Domain (unguarded)

All grade inputs will be exactly 'A', 'B', or 'C' strings, and the hours list will have the same length as grades

If this fails: Passing grades like 'A+', 'D', or mismatched list lengths will cause KeyError or index errors instead of graceful validation failures

scripts/api_example/test_toolcall.py:calculate_gpa
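
A validating variant of such a calculate_gpa; the grade-point mapping is the conventional 4.0 scale and is an assumption, not necessarily the script's.

  def calculate_gpa(grades: list[str], hours: list[int]) -> float:
      points = {"A": 4.0, "B": 3.0, "C": 2.0}
      if len(grades) != len(hours):
          raise ValueError("grades and hours must have the same length")
      unknown = [g for g in grades if g not in points]
      if unknown:
          raise ValueError(f"unsupported grades: {unknown}")
      if sum(hours) <= 0:
          raise ValueError("total hours must be positive")
      return sum(points[g] * h for g, h in zip(grades, hours)) / sum(hours)
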
[critical] Ordering (unguarded)

All .bin files in the input directory contain PyTorch state dict fragments that can be safely merged by updating an OrderedDict

If this fails: If .bin files contain non-tensor data, have conflicting keys, or are corrupted, torch.load will fail or silently create invalid merged state dictionaries

scripts/convert_ckpt/llamafy_baichuan2.py:save_weight
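
A sketch of merging .bin shards with an explicit key-collision check, so conflicting or corrupt shards fail loudly instead of silently overwriting entries; the directory name is illustrative.

  from collections import OrderedDict
  from pathlib import Path
  import torch

  merged = OrderedDict()
  for shard in sorted(Path("input_dir").glob("*.bin")):
      state = torch.load(shard, map_location="cpu")
      overlap = merged.keys() & state.keys()
      if overlap:
          raise ValueError(f"{shard.name} re-defines keys: {sorted(overlap)[:5]}")
      merged.update(state)
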
[warning] Scale (unguarded)

The hardcoded image token calculation (18 * 18 // (2 * 2)) matches the specific vision encoder architecture being benchmarked

If this fails: Different vision models with other patch sizes or pooling strategies will produce tensor shape mismatches in multimodal forward passes, causing silent incorrect results or crashes

scripts/bench_qwen.py:DummyDataset
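
The same number can be derived from processor or config values instead of hardcoded literals; attribute names vary across VLMs, so the values below are assumptions to adapt.

  image_size, patch_size, merge_size = 252, 14, 2  # illustrative values
  patches_per_side = image_size // patch_size      # 18
  num_image_tokens = patches_per_side ** 2 // merge_size ** 2  # 81, same as 18 * 18 // (2 * 2)
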
[info] Environment (unguarded)

URL path structure always follows the pattern '/lang/' and can be safely replaced with string manipulation

If this fails: Complex URL paths, encoded characters, or paths without language prefixes will cause invalid redirects that break navigation or lose query parameters

docs/_static/js/switcher.js:select.addEventListener
[warning] Contract (weakly guarded)

Source checkpoint files contain only tensors in formats that safetensors.safe_open() and torch.load() can handle without compatibility issues

If this fails: Mixed checkpoint formats, custom tensor types, or version mismatches between safetensors and PyTorch will cause conversion failures with unclear error messages

scripts/convert_ckpt/llamafy_qwen.py:qwen_state_dict
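
A sketch of a loader that handles both formats explicitly, so format problems surface at load time; error handling beyond the format branch is omitted.

  import torch
  from safetensors import safe_open

  def load_state_dict(path: str) -> dict:
      if path.endswith(".safetensors"):
          with safe_open(path, framework="pt", device="cpu") as f:
              return {key: f.get_tensor(key) for key in f.keys()}
      return torch.load(path, map_location="cpu")
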
[warning] Temporal (unguarded)

GPU memory cleanup task will continue running throughout the FastAPI application lifecycle without being cancelled or blocked

If this fails: If the cleanup task gets cancelled by asyncio or blocked by long-running operations, memory will accumulate indefinitely until the process crashes

src/llamafactory/api/app.py:lifespan
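
A sketch of a lifespan-managed sweeper that logs and survives exceptions instead of dying silently; the 300-second interval and cleanup calls mirror the ones described above, while the try/except wrapper is the addition.

  import asyncio
  import gc
  import logging
  from contextlib import asynccontextmanager

  import torch
  from fastapi import FastAPI

  async def sweeper() -> None:
      while True:
          await asyncio.sleep(300)
          try:
              gc.collect()
              if torch.cuda.is_available():
                  torch.cuda.empty_cache()
          except Exception:
              logging.exception("GPU memory sweep failed; will retry")

  @asynccontextmanager
  async def lifespan(app: FastAPI):
      task = asyncio.create_task(sweeper())
      yield
      task.cancel()  # make cancellation explicit at shutdown

  app = FastAPI(lifespan=lifespan)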

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model Registry (registry)
Maps model names to their configurations, supported features, and template requirements across 100+ model architectures
Checkpoint Directory (file-store)
Accumulates training checkpoints, adapter weights, optimizer states, and training logs throughout the training process
Dataset Cache (cache)
Caches processed datasets and tokenized examples to avoid recomputation across training runs
Model Weight Store (file-store)
Stores base model weights, LoRA adapters, and merged checkpoints in safetensors or PyTorch format

Feedback Loops

Four loops detected, including the training loop (the forward/backward/update cycle) and gradient accumulation (micro-batch gradients summed before each optimizer step).

Delays

Four delays detected.

Control Points

Six control points detected.

Technology Stack

PyTorch (framework)
Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training
HuggingFace Transformers (framework)
Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing
PEFT (library)
Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters
bitsandbytes (library)
4-bit and 8-bit quantization for memory-efficient training and inference of large models
Accelerate (library)
Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization
FastAPI (framework)
HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support
Gradio (framework)
Web-based user interface for interactive training configuration, real-time monitoring, and model testing
Datasets (library)
Efficient loading and processing of training datasets from HuggingFace Hub and local files with streaming support


Frequently Asked Questions

What is LlamaFactory used for?

Fine-tunes 100+ LLMs and VLMs using efficient methods like LoRA, QLoRA, and full parameter training. hiyouga/llamafactory is a 9-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 278 files.

How is LlamaFactory architected?

LlamaFactory is organized into 5 architecture layers: Interface Layer, Configuration Layer, Training Engine, Model Management, and 1 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through LlamaFactory?

Data moves through 6 stages: Parse configuration and arguments → Load and configure base model → Process training datasets → Execute training loop → Save model and adapters → Serve model for inference. The full walkthrough of each stage appears under "How Data Flows Through the System" above. This pipeline design reflects a complex multi-stage processing system.

What technologies does LlamaFactory use?

The core stack includes PyTorch (Core tensor operations, automatic differentiation, and model execution across CPUs and GPUs with support for distributed training), HuggingFace Transformers (Pre-trained model loading, tokenization, and the Trainer framework that orchestrates the training loop with logging and checkpointing), PEFT (Parameter-efficient fine-tuning methods including LoRA, QLoRA, and AdaLoRA that wrap base models with trainable adapters), bitsandbytes (4-bit and 8-bit quantization for memory-efficient training and inference of large models), Accelerate (Multi-GPU and distributed training coordination with automatic mixed precision and gradient synchronization), FastAPI (HTTP API server providing OpenAI-compatible endpoints for chat completions and model serving with async support), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does LlamaFactory have?

LlamaFactory exhibits 4 data pools (including the Model Registry and Checkpoint Directory), 4 feedback loops, 6 control points, and 4 delays. The feedback loops include the training loop and gradient accumulation. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does LlamaFactory use?

5 design patterns detected: Parameter-Efficient Training, Template-Based Conversation Formatting, Multi-Method Training Pipeline, Extensible Model Registry, Multimodal Data Integration.

Analyzed on April 20, 2026 by CodeSea.