haotian-liu/llava
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Trains and evaluates vision-language models that answer questions about images using instruction tuning
Under the hood, the system uses 2 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 62 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training begins by loading conversation datasets where each example pairs text instructions with images. The LazySupervisedDataset streams these conversations, applies image preprocessing through CLIP transforms, and formats them into instruction-following sequences. During forward pass, images flow through the vision encoder while text gets tokenized, then a learnable projection layer maps visual features into the language model's embedding space. The model generates next-token predictions autoregressively, with loss computed only on assistant responses (human tokens masked with IGNORE_INDEX). For inference, the same pipeline processes single conversations through the model to generate responses.
- Load conversation data — LazySupervisedDataset reads JSON files containing conversations with image paths, applies lazy loading to handle large datasets efficiently [DataArguments] (config: data_path, image_folder, lazy_preprocess)
- Preprocess images — process_images function loads PIL images, applies CLIP-style transforms (resize, normalize), and converts to tensors matching vision encoder expectations (config: image_aspect_ratio)
- Format conversation — Conversation class applies chat templates (v1, llama2, etc.), inserts special tokens (<image>, <im_start>, <im_end>), and creates training targets by masking human tokens [Conversation] (config: version, mm_use_im_start_end, mm_use_im_patch_token)
- Tokenize with image placeholders — tokenizer_image_token converts text to token IDs while preserving IMAGE_TOKEN_INDEX (-200) positions for visual feature insertion; see the sketch after this list (config: model_max_length)
- Forward pass — LlavaLlamaForCausalLM processes images through CLIP encoder, projects visual features via mm_projector, embeds text tokens, and runs through language model [ConversationBatch] (config: mm_vision_select_layer, mm_projector_type, vision_tower)
- Compute loss — Model calculates cross-entropy loss only on assistant response tokens (positions where labels != IGNORE_INDEX), enabling instruction following
- Parameter update — LlavaTrainer applies gradients through the optimizer, supports LoRA for efficient fine-tuning and quantization for memory efficiency (config: lora_enable, bits, freeze_mm_mlp_adapter, and 1 more)
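To make the tokenization step concrete, here is a minimal sketch of how a prompt containing `<image>` placeholders can be turned into token IDs carrying the -200 sentinel. The real tokenizer_image_token in llava/mm_utils.py additionally handles BOS-token offsets and optional tensor output, so treat this as an illustration rather than the repo's exact code.

```python
# Simplified sketch (not the repo's implementation): split the prompt on the
# "<image>" placeholder, tokenize each text chunk with a HuggingFace tokenizer,
# and splice in IMAGE_TOKEN_INDEX wherever an image will later be expanded
# into visual features.
IMAGE_TOKEN_INDEX = -200  # matches llava/constants.py

def tokenize_with_image_tokens(prompt: str, tokenizer) -> list[int]:
    chunks = [tokenizer(c, add_special_tokens=False).input_ids
              for c in prompt.split("<image>")]
    input_ids = [tokenizer.bos_token_id]
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(IMAGE_TOKEN_INDEX)  # placeholder swapped for patch features later
        input_ids.extend(chunk)
    return input_ids
```

During the forward pass, the model replaces each -200 position with the projected CLIP patch embeddings before running the language model.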
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ConversationBatch (llava/train/train.py) — dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], images: Tensor[B, C, H, W] where masked positions use IGNORE_INDEX (-100) for loss computation
Created by preprocessing conversations with images, consumed by model forward pass for loss calculation
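For illustration, that contract can be written down as a TypedDict. The class name ConversationBatch follows the label used above, and the fixed 336x336 image size is an assumption for the sketch; the actual collator returns a plain dict, and the image size depends on the vision tower.

```python
from typing import TypedDict
import torch

class ConversationBatch(TypedDict):
    # Illustrative contract only; shapes use batch size B and sequence length L.
    input_ids: torch.Tensor       # (B, L) token IDs, with IMAGE_TOKEN_INDEX (-200) marking image slots
    attention_mask: torch.Tensor  # (B, L) 1 for real tokens, 0 for padding
    labels: torch.Tensor          # (B, L) copy of input_ids with human-turn tokens set to IGNORE_INDEX (-100)
    images: torch.Tensor          # (B, 3, 336, 336) CLIP-preprocessed pixel values
```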
Conversation (llava/conversation.py) — dataclass with system: str, roles: List[str], messages: List[List[str]], offset: int, sep_style: SeparatorStyle, version: str defining conversation format and turn structure
Template loaded from predefined formats, populated with user/assistant turns, rendered to text prompt
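Roughly how the evaluation and serving code builds a prompt from one of these templates; the template name "v1" and the question text are placeholders here, and the exact template depends on the model version.

```python
from llava.conversation import conv_templates

# Copy a predefined template, add a user turn containing the <image> placeholder,
# leave the assistant turn open for generation, and render the prompt string.
conv = conv_templates["v1"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat is shown in this picture?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()  # fed to tokenizer_image_token next
```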
Evaluation question (llava/eval/model_vqa.py) — dict with question_id: str, question: str, image: PIL.Image, answer_choices: Optional[List[str]] for evaluation datasets
Loaded from benchmark JSON files, processed through model, scored against ground truth
ModelArguments (llava/train/train.py) — dataclass with model_name_or_path: str, vision_tower: str, mm_projector_type: str, mm_vision_select_layer: int controlling model architecture choices
Parsed from command line args, used to configure model components and projection layers
TrainingArguments (llava/train/train.py) — HuggingFace TrainingArguments extended with mm_vision_select_layer: int, freeze_mm_mlp_adapter: bool, model_max_length: int, bits: int for quantization and multimodal training
Configuration object that controls all aspects of training including optimization, data loading, and model saving
Hidden Assumptions
Things this code relies on but never validates; these are what cause silent failures when the system changes.
All images in the dataset have compatible dimensions after CLIP preprocessing - the process_images function expects PIL Images that can be resized and normalized, but never validates image format, color channels, or handles corrupted files
If this fails: If the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor-shape errors during training, potentially hours into a training run
llava/train/train.py:LazySupervisedDataset
The special token IMAGE_TOKEN_INDEX (-200) will never collide with actual vocabulary tokens from the tokenizer - the tokenizer_image_token function uses this hardcoded value to mark image positions without checking tokenizer vocabulary bounds
If this fails: If a tokenizer has vocabulary IDs that include -200 (or if token IDs are remapped), image tokens could overwrite real text tokens, causing the model to inject image features at wrong positions and produce garbled outputs
llava/constants.py:IMAGE_TOKEN_INDEX
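A defensive check along these lines (not present in the repo) would make the assumption explicit; it assumes a loaded HuggingFace tokenizer bound to the name `tokenizer`.

```python
from llava.constants import IMAGE_TOKEN_INDEX

# Vocabulary IDs from HuggingFace tokenizers are non-negative, so a negative
# sentinel can only collide if token IDs are remapped somewhere downstream.
assert IMAGE_TOKEN_INDEX < 0, "image sentinel must stay outside the tokenizer ID range"
assert IMAGE_TOKEN_INDEX not in tokenizer.get_vocab().values()
```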
Conversation messages follow strict alternating user/assistant structure where messages[0][1] is always a tuple when images are present - the code checks type(messages[0][1]) is tuple to detect image conversations but doesn't validate the full conversation structure
If this fails: If conversation data has non-alternating turns, missing roles, or inconsistent message formats, the model will generate prompts with wrong speaker attribution, leading to confused training where the model learns incorrect turn-taking patterns
llava/conversation.py:get_prompt
All conversation sequences will fit within model_max_length tokens after image token insertion - the tokenizer truncates sequences but doesn't account for IMAGE_TOKEN_INDEX placeholders that get replaced with variable-length visual features
If this fails: Long conversations with images could exceed context window after visual feature insertion, causing CUDA out-of-memory errors or silently truncated visual information that breaks image-text alignment
llava/train/train.py:model_max_length
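A back-of-the-envelope check (not in the repo) shows how quickly this bites, assuming the default ViT-L/14 tower at 336px resolution, where each placeholder expands to 24 x 24 = 576 patch features.

```python
PATCHES_PER_IMAGE = 576  # assumes 336px input and 14px patches (24 * 24)

def effective_length(num_text_tokens: int, num_images: int) -> int:
    # Each image contributes PATCHES_PER_IMAGE embeddings in place of one placeholder token.
    return num_text_tokens - num_images + num_images * PATCHES_PER_IMAGE

print(effective_length(2048, 4))  # 4348 -- already past a 4096-token context window
```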
OpenAI API will always eventually succeed after rate limit retries - the infinite retry loop with time.sleep(NUM_SECONDS_TO_SLEEP) assumes temporary rate limits but doesn't handle permanent failures, quota exhaustion, or API key revocation
If this fails: Evaluation scripts can hang indefinitely if OpenAI API keys are invalid, quota is exceeded, or service is discontinued, blocking automated evaluation pipelines without clear error messages
llava/eval/eval_gpt_review.py:get_eval
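One way to bound the loop, sketched here with a hypothetical `request_fn` wrapper around the OpenAI call rather than the repo's actual function, is to cap attempts and back off exponentially.

```python
import time

def get_eval_with_retries(request_fn, max_attempts: int = 5, base_delay: float = 10.0):
    """Retry a flaky API call a bounded number of times instead of looping forever."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # surface permanent failures (bad key, exhausted quota) to the caller
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
```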
Vision encoder and language model checkpoints are compatible with current CUDA/PyTorch environment - the model loading doesn't validate device compatibility, precision requirements, or CUDA compute capability
If this fails: Models could load successfully but fail during forward pass with CUDA version mismatches, unsupported operations on older GPUs, or precision errors that manifest as NaN losses
llava/model/builder.py:load_pretrained_model
CLIP image preprocessor transforms produce tensors with expected shape (3, 336, 336) that match the vision encoder input requirements - no validation that the image_processor output dimensions align with the vision tower's expected input
If this fails: If CLIP model configurations mismatch between preprocessing and vision encoder (different image sizes, normalization), the model silently processes wrong-shaped inputs, leading to degraded performance that's hard to diagnose
llava/mm_utils.py:process_images
Dataset files remain unchanged during training - LazySupervisedDataset caches file paths but doesn't check file modification times or handle concurrent writes to image directories
If this fails: If training data is modified during long training runs (images replaced, JSON files updated), the trainer could load mismatched images and conversation data, corrupting batch consistency
llava/train/train.py:LlavaTrainer
Model responses can be reliably classified as yes/no by checking for presence of 'No', 'not', 'no' keywords in the first sentence - this simple heuristic doesn't handle nuanced responses, multiple sentences, or complex negations
If this fails: Evaluation scores become unreliable when models generate nuanced answers like 'I cannot determine' or 'It appears unlikely', leading to incorrect benchmark results that don't reflect true model capability
llava/eval/eval_pope.py:eval_pope
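A simplified version of the heuristic (paraphrased from eval_pope, not copied verbatim) shows the failure mode: any answer without an explicit negation keyword is scored as "yes".

```python
def classify_answer(text: str) -> str:
    # Look for negation keywords among the answer's words; default to "yes" otherwise.
    words = text.replace(".", " ").replace(",", " ").split()
    return "no" if any(w in ("No", "not", "no") for w in words) else "yes"

classify_answer("No, there is no dog in the image.")        # -> "no"
classify_answer("I cannot determine that from the image.")  # -> "yes" (misclassified)
```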
Loss masking with IGNORE_INDEX (-100) correctly excludes human tokens from gradient computation - assumes PyTorch CrossEntropyLoss handles -100 as intended without checking for label preprocessing edge cases
If this fails: If label preprocessing creates unexpected values or if loss function behavior changes between PyTorch versions, human tokens might contribute to loss, degrading instruction-following capability
llava/constants.py:IGNORE_INDEX
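The contract itself is easy to verify in isolation; a minimal demonstration of PyTorch's ignore_index behavior (the vocabulary size is illustrative):

```python
import torch
from torch import nn

logits = torch.randn(4, 32000)               # 4 token positions, 32k-entry vocabulary
labels = torch.tensor([-100, -100, 17, 42])  # first two (human-turn) positions are masked
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits, labels)
# Only the last two positions contribute to the loss; -100 targets are skipped entirely.
```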
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model checkpoints — Saved model states including vision encoder, projection layers, and language model weights at training intervals
- Conversation templates — Predefined chat formats (v0, v1, llama2, mpt) that control prompt structure and special token usage
- Evaluation predictions — Generated predictions stored as JSONL for benchmark comparison and GPT-4 evaluation
Feedback Loops
- Training loop (training-loop, reinforcing) — Trigger: Batch loss computation. Action: LlavaTrainer computes gradients, updates parameters via AdamW, logs metrics, saves checkpoints. Exit: Max steps or convergence reached.
- Evaluation iteration (convergence, balancing) — Trigger: Model checkpoint creation. Action: eval_model runs inference on benchmark datasets, computes accuracy metrics, compares against baselines. Exit: All evaluation datasets processed.
Delays
- Lazy data loading (async-processing, ~Variable based on image I/O) — Dataset only loads and processes images when requested during training iteration
- Distributed synchronization (eventual-consistency, ~Per training step) — Gradient synchronization across multiple GPUs introduces communication overhead
- Vision encoder warmup (warmup, ~Initial forward passes) — CLIP model requires several iterations to optimize CUDA kernels and memory allocation
Control Points
- Model architecture switch (architecture-switch) — Controls: Choice of language model backbone (Llama, Vicuna, MPT) and vision encoder. Config: model_name_or_path
- Quantization precision (precision-mode) — Controls: Model precision (4-bit, 8-bit, fp16) affecting memory usage and training speed. Config: bits
- Vision layer selection (hyperparameter) — Controls: Which CLIP layer features to extract (-1 for final layer, -2 for penultimate). Config: mm_vision_select_layer
- Training stage control (feature-flag) — Controls: Whether to freeze the vision encoder, tune only the projection layer, or enable LoRA. Config: freeze_backbone, tune_mm_mlp_adapter, lora_enable
- Conversation format (runtime-toggle) — Controls: Chat template version affecting prompt structure and model behavior. Config: version
Technology Stack
- PyTorch — Core tensor computation and model training with CUDA acceleration for vision-language model training
- HuggingFace Transformers — Pretrained language model loading, tokenization, and training infrastructure with distributed support
- CLIP — Vision encoder that extracts visual features from images for projection into language model space
- DeepSpeed — Distributed training optimization with ZeRO stages for large model training across multiple GPUs
- LoRA/PEFT — Parameter-efficient fine-tuning enabling model adaptation with reduced memory requirements
- Gradio — Web interface for model demos and interactive evaluation of vision-language capabilities
- REST API server for model serving with async request handling and worker management
Key Components
- LlavaLlamaForCausalLM (processor, llava/model/llava_llama.py) — Core multimodal model combining CLIP vision encoder with Llama language model through projection layers for visual question answering
- LlavaTrainer (orchestrator, llava/train/train.py) — Extended HuggingFace trainer handling multimodal data preprocessing, conversation formatting, and distributed training coordination
- LazySupervisedDataset (loader, llava/train/train.py) — Dataset loader that streams conversation data with images, applies transformations lazily, and formats for instruction tuning
- load_pretrained_model (factory, llava/model/builder.py) — Model factory that initializes pretrained vision-language model with proper tokenizer, image processor, and device placement
- eval_model (processor, llava/eval/model_vqa.py) — Evaluation orchestrator that runs model inference on benchmark datasets, manages conversation formatting, and generates predictions
- Conversation (formatter, llava/conversation.py) — Conversation manager that handles turn-taking, special tokens, and prompt formatting for different model versions and chat templates
- process_images (transformer, llava/mm_utils.py) — Image preprocessor that handles resizing, normalization, and tensor conversion for vision encoder input
- tokenizer_image_token (encoder, llava/mm_utils.py) — Text tokenizer that handles special image tokens, preserves image positions, and manages sequence length constraints
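To make the projection step concrete, here is a hedged sketch of what a projector selected by mm_projector_type might look like. The hidden sizes (1024 for CLIP ViT-L/14, 4096 for a 7B Llama) and the exact construction are assumptions, not a copy of the repo's builder.

```python
import torch.nn as nn

def build_projector(projector_type: str = "mlp2x_gelu",
                    vision_hidden: int = 1024, lm_hidden: int = 4096) -> nn.Module:
    # Map CLIP patch features into the language model's embedding space.
    if projector_type == "linear":
        return nn.Linear(vision_hidden, lm_hidden)
    if projector_type == "mlp2x_gelu":
        return nn.Sequential(
            nn.Linear(vision_hidden, lm_hidden),
            nn.GELU(),
            nn.Linear(lm_hidden, lm_hidden),
        )
    raise ValueError(f"unknown mm_projector_type: {projector_type}")
```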
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Frequently Asked Questions
What is LLaVA used for?
LLaVA trains and evaluates vision-language models that answer questions about images using instruction tuning. haotian-liu/llava is an 8-component ML training system written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 62 files.
How is LLaVA architected?
LLaVA is organized into 4 architecture layers: Model Core, Training Pipeline, Evaluation Suite, Serving Infrastructure. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through LLaVA?
Data moves through 7 stages: Load conversation data → Preprocess images → Format conversation → Tokenize with image placeholders → Forward pass → Compute loss → Parameter update. See "How Data Flows Through the System" above for the full walkthrough of each stage.
What technologies does LLaVA use?
The core stack includes PyTorch (Core tensor computation and model training with CUDA acceleration for vision-language model training), HuggingFace Transformers (Pretrained language model loading, tokenization, and training infrastructure with distributed support), CLIP (Vision encoder that extracts visual features from images for projection into language model space), DeepSpeed (Distributed training optimization with ZeRO stages for large model training across multiple GPUs), LoRA/PEFT (Parameter-efficient fine-tuning enabling model adaptation with reduced memory requirements), Gradio (Web interface for model demos and interactive evaluation of vision-language capabilities), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does LLaVA have?
LLaVA has 3 data pools (model checkpoints, conversation templates, evaluation predictions), 2 feedback loops, 5 control points, and 3 delays. The feedback loops cover the training loop and evaluation convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does LLaVA use?
4 design patterns detected: Instruction Tuning, Multimodal Fusion, Stage-wise Training, Lazy Data Loading.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.