haotian-liu/llava
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Trains and evaluates vision-language models that answer questions about images using instruction tuning
Under the hood, the system uses 2 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 62 files analyzed. Data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Training begins by loading conversation datasets where each example pairs text instructions with images. The LazySupervisedDataset streams these conversations, applies image preprocessing through CLIP transforms, and formats them into instruction-following sequences. During forward pass, images flow through the vision encoder while text gets tokenized, then a learnable projection layer maps visual features into the language model's embedding space. The model generates next-token predictions autoregressively, with loss computed only on assistant responses (human tokens masked with IGNORE_INDEX). For inference, the same pipeline processes single conversations through the model to generate responses.
- Load conversation data — LazySupervisedDataset reads JSON files containing conversations with image paths, applies lazy loading to handle large datasets efficiently [DataArguments] (config: data_path, image_folder, lazy_preprocess)
- Preprocess images — process_images function loads PIL images, applies CLIP-style transforms (resize, normalize), and converts to tensors matching vision encoder expectations (config: image_aspect_ratio)
- Format conversation — Conversation class applies chat templates (v1, llama2, etc.), inserts special tokens (<image>, <im_start>, <im_end>), and creates training targets by masking human tokens [Conversation] (config: version, mm_use_im_start_end, mm_use_im_patch_token)
- Tokenize with image placeholders — tokenizer_image_token converts text to token IDs while preserving IMAGE_TOKEN_INDEX (-200) positions for visual feature insertion; see the sketch after this list (config: model_max_length)
- Forward pass — LlavaLlamaForCausalLM processes images through CLIP encoder, projects visual features via mm_projector, embeds text tokens, and runs through language model [ConversationBatch] (config: mm_vision_select_layer, mm_projector_type, vision_tower)
- Compute loss — Model calculates cross-entropy loss only on assistant response tokens (positions where labels != IGNORE_INDEX), enabling instruction following
- Parameter update — LlavaTrainer applies gradients through the optimizer, supports LoRA for efficient fine-tuning and quantization for memory efficiency (config: lora_enable, bits, freeze_mm_mlp_adapter, and 1 more)
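To make the tokenization step concrete, here is a minimal sketch of how a prompt containing `<image>` placeholders can be turned into token IDs carrying the -200 sentinel. The real tokenizer_image_token in llava/mm_utils.py additionally handles BOS-token offsets and optional tensor output, so treat this as an illustration rather than the repo's exact code.

```python
# Simplified sketch (not the repo's implementation): split the prompt on the
# "<image>" placeholder, tokenize each text chunk with a HuggingFace tokenizer,
# and splice in IMAGE_TOKEN_INDEX wherever an image will later be expanded
# into visual features.
IMAGE_TOKEN_INDEX = -200  # matches llava/constants.py

def tokenize_with_image_tokens(prompt: str, tokenizer) -> list[int]:
    chunks = [tokenizer(c, add_special_tokens=False).input_ids
              for c in prompt.split("<image>")]
    input_ids = [tokenizer.bos_token_id]
    for i, chunk in enumerate(chunks):
        if i > 0:
            input_ids.append(IMAGE_TOKEN_INDEX)  # placeholder swapped for patch features later
        input_ids.extend(chunk)
    return input_ids
```

During the forward pass, the model replaces each -200 position with the projected CLIP patch embeddings before running the language model.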
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ConversationBatch (llava/train/train.py) — dict with input_ids: Tensor[B, seq_len], attention_mask: Tensor[B, seq_len], labels: Tensor[B, seq_len], images: Tensor[B, C, H, W] where masked positions use IGNORE_INDEX (-100) for loss computation
Created by preprocessing conversations with images, consumed by model forward pass for loss calculation
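For illustration, that contract can be written down as a TypedDict. The class name ConversationBatch follows the label used above, and the fixed 336x336 image size is an assumption for the sketch; the actual collator returns a plain dict, and the image size depends on the vision tower.

```python
from typing import TypedDict
import torch

class ConversationBatch(TypedDict):
    # Illustrative contract only; shapes use batch size B and sequence length L.
    input_ids: torch.Tensor       # (B, L) token IDs, with IMAGE_TOKEN_INDEX (-200) marking image slots
    attention_mask: torch.Tensor  # (B, L) 1 for real tokens, 0 for padding
    labels: torch.Tensor          # (B, L) copy of input_ids with human-turn tokens set to IGNORE_INDEX (-100)
    images: torch.Tensor          # (B, 3, 336, 336) CLIP-preprocessed pixel values
```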
Conversation (llava/conversation.py) — dataclass with system: str, roles: List[str], messages: List[List[str]], offset: int, sep_style: SeparatorStyle, version: str defining conversation format and turn structure
Template loaded from predefined formats, populated with user/assistant turns, rendered to text prompt
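Roughly how the evaluation and serving code builds a prompt from one of these templates; the template name "v1" and the question text are placeholders here, and the exact template depends on the model version.

```python
from llava.conversation import conv_templates

# Copy a predefined template, add a user turn containing the <image> placeholder,
# leave the assistant turn open for generation, and render the prompt string.
conv = conv_templates["v1"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat is shown in this picture?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()  # fed to tokenizer_image_token next
```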
Evaluation question (llava/eval/model_vqa.py) — dict with question_id: str, question: str, image: PIL.Image, answer_choices: Optional[List[str]] for evaluation datasets
Loaded from benchmark JSON files, processed through model, scored against ground truth
ModelArguments (llava/train/train.py) — dataclass with model_name_or_path: str, vision_tower: str, mm_projector_type: str, mm_vision_select_layer: int controlling model architecture choices
Parsed from command line args, used to configure model components and projection layers
TrainingArguments (llava/train/train.py) — HuggingFace TrainingArguments extended with mm_vision_select_layer: int, freeze_mm_mlp_adapter: bool, model_max_length: int, bits: int for quantization and multimodal training
Configuration object that controls all aspects of training including optimization, data loading, and model saving
Hidden Assumptions
Things this code relies on but never validates; these are what cause silent failures when the system changes.
All images in the dataset have compatible dimensions after CLIP preprocessing - the process_images function expects PIL Images that can be resized and normalized, but never validates image format, color channels, or handles corrupted files
If this fails: If the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor-shape errors during training, potentially hours into a training run
llava/train/train.py:LazySupervisedDataset
The special token IMAGE_TOKEN_INDEX (-200) will never collide with actual vocabulary tokens from the tokenizer - the tokenizer_image_token function uses this hardcoded value to mark image positions without checking tokenizer vocabulary bounds
If this fails: If a tokenizer has vocabulary IDs that include -200 (or if token IDs are remapped), image tokens could overwrite real text tokens, causing the model to inject image features at wrong positions and produce garbled outputs
llava/constants.py:IMAGE_TOKEN_INDEX
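A defensive check along these lines (not present in the repo) would make the assumption explicit; it assumes a loaded HuggingFace tokenizer bound to the name `tokenizer`.

```python
from llava.constants import IMAGE_TOKEN_INDEX

# Vocabulary IDs from HuggingFace tokenizers are non-negative, so a negative
# sentinel can only collide if token IDs are remapped somewhere downstream.
assert IMAGE_TOKEN_INDEX < 0, "image sentinel must stay outside the tokenizer ID range"
assert IMAGE_TOKEN_INDEX not in tokenizer.get_vocab().values()
```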
Conversation messages follow strict alternating user/assistant structure where messages[0][1] is always a tuple when images are present - the code checks type(messages[0][1]) is tuple to detect image conversations but doesn't validate the full conversation structure
If this fails: If conversation data has non-alternating turns, missing roles, or inconsistent message formats, the model will generate prompts with wrong speaker attribution, leading to confused training where the model learns incorrect turn-taking patterns
llava/conversation.py:get_prompt
All conversation sequences will fit within model_max_length tokens after image token insertion - the tokenizer truncates sequences but doesn't account for IMAGE_TOKEN_INDEX placeholders that get replaced with variable-length visual features
If this fails: Long conversations with images could exceed context window after visual feature insertion, causing CUDA out-of-memory errors or silently truncated visual information that breaks image-text alignment
llava/train/train.py:model_max_length
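A back-of-the-envelope check (not in the repo) shows how quickly this bites, assuming the default ViT-L/14 tower at 336px resolution, where each placeholder expands to 24 x 24 = 576 patch features.

```python
PATCHES_PER_IMAGE = 576  # assumes 336px input and 14px patches (24 * 24)

def effective_length(num_text_tokens: int, num_images: int) -> int:
    # Each image contributes PATCHES_PER_IMAGE embeddings in place of one placeholder token.
    return num_text_tokens - num_images + num_images * PATCHES_PER_IMAGE

print(effective_length(2048, 4))  # 4348 -- already past a 4096-token context window
```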
OpenAI API will always eventually succeed after rate limit retries - the infinite retry loop with time.sleep(NUM_SECONDS_TO_SLEEP) assumes temporary rate limits but doesn't handle permanent failures, quota exhaustion, or API key revocation
If this fails: Evaluation scripts can hang indefinitely if OpenAI API keys are invalid, quota is exceeded, or service is discontinued, blocking automated evaluation pipelines without clear error messages
llava/eval/eval_gpt_review.py:get_eval
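One way to bound the loop, sketched here with a hypothetical `request_fn` wrapper around the OpenAI call rather than the repo's actual function, is to cap attempts and back off exponentially.

```python
import time

def get_eval_with_retries(request_fn, max_attempts: int = 5, base_delay: float = 10.0):
    """Retry a flaky API call a bounded number of times instead of looping forever."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # surface permanent failures (bad key, exhausted quota) to the caller
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
```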
Vision encoder and language model checkpoints are compatible with current CUDA/PyTorch environment - the model loading doesn't validate device compatibility, precision requirements, or CUDA compute capability
If this fails: Models could load successfully but fail during forward pass with CUDA version mismatches, unsupported operations on older GPUs, or precision errors that manifest as NaN losses
llava/model/builder.py:load_pretrained_model
CLIP image preprocessor transforms produce tensors with expected shape (3, 336, 336) that match the vision encoder input requirements - no validation that the image_processor output dimensions align with the vision tower's expected input
If this fails: If CLIP model configurations mismatch between preprocessing and vision encoder (different image sizes, normalization), the model silently processes wrong-shaped inputs, leading to degraded performance that's hard to diagnose
llava/mm_utils.py:process_images
Dataset files remain unchanged during training - LazySupervisedDataset caches file paths but doesn't check file modification times or handle concurrent writes to image directories
If this fails: If training data is modified during long training runs (images replaced, JSON files updated), the trainer could load mismatched images and conversation data, corrupting batch consistency
llava/train/train.py:LlavaTrainer
Model responses can be reliably classified as yes/no by checking for presence of 'No', 'not', 'no' keywords in the first sentence - this simple heuristic doesn't handle nuanced responses, multiple sentences, or complex negations
If this fails: Evaluation scores become unreliable when models generate nuanced answers like 'I cannot determine' or 'It appears unlikely', leading to incorrect benchmark results that don't reflect true model capability
llava/eval/eval_pope.py:eval_pope
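A simplified version of the heuristic (paraphrased from eval_pope, not copied verbatim) shows the failure mode: any answer without an explicit negation keyword is scored as "yes".

```python
def classify_answer(text: str) -> str:
    # Look for negation keywords among the answer's words; default to "yes" otherwise.
    words = text.replace(".", " ").replace(",", " ").split()
    return "no" if any(w in ("No", "not", "no") for w in words) else "yes"

classify_answer("No, there is no dog in the image.")        # -> "no"
classify_answer("I cannot determine that from the image.")  # -> "yes" (misclassified)
```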
Loss masking with IGNORE_INDEX (-100) correctly excludes human tokens from gradient computation - assumes PyTorch CrossEntropyLoss handles -100 as intended without checking for label preprocessing edge cases
If this fails: If label preprocessing creates unexpected values or if loss function behavior changes between PyTorch versions, human tokens might contribute to loss, degrading instruction-following capability
llava/constants.py:IGNORE_INDEX
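The contract itself is easy to verify in isolation; a minimal demonstration of PyTorch's ignore_index behavior (the vocabulary size is illustrative):

```python
import torch
from torch import nn

logits = torch.randn(4, 32000)               # 4 token positions, 32k-entry vocabulary
labels = torch.tensor([-100, -100, 17, 42])  # first two (human-turn) positions are masked
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits, labels)
# Only the last two positions contribute to the loss; -100 targets are skipped entirely.
```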
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model checkpoints — Saved model states including vision encoder, projection layers, and language model weights at training intervals
- Conversation templates — Predefined chat formats (v0, v1, llama2, mpt) that control prompt structure and special token usage
- Evaluation predictions — Generated predictions stored as JSONL for benchmark comparison and GPT-4 evaluation
Feedback Loops
- Training loop (training-loop, reinforcing) — Trigger: Batch loss computation. Action: LlavaTrainer computes gradients, updates parameters via AdamW, logs metrics, saves checkpoints. Exit: Max steps or convergence reached.
- Evaluation iteration (convergence, balancing) — Trigger: Model checkpoint creation. Action: eval_model runs inference on benchmark datasets, computes accuracy metrics, compares against baselines. Exit: All evaluation datasets processed.
Delays
- Lazy data loading (async-processing, ~Variable based on image I/O) — Dataset only loads and processes images when requested during training iteration
- Distributed synchronization (eventual-consistency, ~Per training step) — Gradient synchronization across multiple GPUs introduces communication overhead
- Vision encoder warmup (warmup, ~Initial forward passes) — CLIP model requires several iterations to optimize CUDA kernels and memory allocation
Control Points
- Model architecture switch (architecture-switch) — Controls: Choice of language model backbone (Llama, Vicuna, MPT) and vision encoder. Config: model_name_or_path
- Quantization precision (precision-mode) — Controls: Model precision (4-bit, 8-bit, fp16) affecting memory usage and training speed. Config: bits
- Vision layer selection (hyperparameter) — Controls: Which CLIP layer features to extract (-1 for final layer, -2 for penultimate). Config: mm_vision_select_layer
- Training stage control (feature-flag) — Controls: Whether to freeze the vision encoder, tune only the projection layer, or enable LoRA. Config: freeze_backbone, tune_mm_mlp_adapter, lora_enable
- Conversation format (runtime-toggle) — Controls: Chat template version affecting prompt structure and model behavior. Config: version
Technology Stack
- PyTorch — Core tensor computation and model training with CUDA acceleration for vision-language model training
- HuggingFace Transformers — Pretrained language model loading, tokenization, and training infrastructure with distributed support
- CLIP — Vision encoder that extracts visual features from images for projection into language model space
- DeepSpeed — Distributed training optimization with ZeRO stages for large model training across multiple GPUs
- LoRA/PEFT — Parameter-efficient fine-tuning enabling model adaptation with reduced memory requirements
- Gradio — Web interface for model demos and interactive evaluation of vision-language capabilities
- REST API server for model serving with async request handling and worker management
Key Components
- LlavaLlamaForCausalLM (processor, llava/model/llava_llama.py) — Core multimodal model combining CLIP vision encoder with Llama language model through projection layers for visual question answering
- LlavaTrainer (orchestrator, llava/train/train.py) — Extended HuggingFace trainer handling multimodal data preprocessing, conversation formatting, and distributed training coordination
- LazySupervisedDataset (loader, llava/train/train.py) — Dataset loader that streams conversation data with images, applies transformations lazily, and formats for instruction tuning
- load_pretrained_model (factory, llava/model/builder.py) — Model factory that initializes pretrained vision-language model with proper tokenizer, image processor, and device placement
- eval_model (processor, llava/eval/model_vqa.py) — Evaluation orchestrator that runs model inference on benchmark datasets, manages conversation formatting, and generates predictions
- Conversation (formatter, llava/conversation.py) — Conversation manager that handles turn-taking, special tokens, and prompt formatting for different model versions and chat templates
- process_images (transformer, llava/mm_utils.py) — Image preprocessor that handles resizing, normalization, and tensor conversion for vision encoder input
- tokenizer_image_token (encoder, llava/mm_utils.py) — Text tokenizer that handles special image tokens, preserves image positions, and manages sequence length constraints
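To make the projection step concrete, here is a hedged sketch of what a projector selected by mm_projector_type might look like. The hidden sizes (1024 for CLIP ViT-L/14, 4096 for a 7B Llama) and the exact construction are assumptions, not a copy of the repo's builder.

```python
import torch.nn as nn

def build_projector(projector_type: str = "mlp2x_gelu",
                    vision_hidden: int = 1024, lm_hidden: int = 4096) -> nn.Module:
    # Map CLIP patch features into the language model's embedding space.
    if projector_type == "linear":
        return nn.Linear(vision_hidden, lm_hidden)
    if projector_type == "mlp2x_gelu":
        return nn.Sequential(
            nn.Linear(vision_hidden, lm_hidden),
            nn.GELU(),
            nn.Linear(lm_hidden, lm_hidden),
        )
    raise ValueError(f"unknown mm_projector_type: {projector_type}")
```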
Explore the interactive analysis
See the full architecture map, data flow, and code patterns visualization.
Frequently Asked Questions
What is LLaVA used for?
LLaVA trains and evaluates vision-language models that answer questions about images using instruction tuning. haotian-liu/llava is an 8-component ML training system written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 62 files.
How is LLaVA architected?
LLaVA is organized into 4 architecture layers: Model Core, Training Pipeline, Evaluation Suite, Serving Infrastructure. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through LLaVA?
Data moves through 7 stages: Load conversation data → Preprocess images → Format conversation → Tokenize with image placeholders → Forward pass → Compute loss → Parameter update. See "How Data Flows Through the System" above for the full walkthrough of each stage.
What technologies does LLaVA use?
The core stack includes PyTorch (Core tensor computation and model training with CUDA acceleration for vision-language model training), HuggingFace Transformers (Pretrained language model loading, tokenization, and training infrastructure with distributed support), CLIP (Vision encoder that extracts visual features from images for projection into language model space), DeepSpeed (Distributed training optimization with ZeRO stages for large model training across multiple GPUs), LoRA/PEFT (Parameter-efficient fine-tuning enabling model adaptation with reduced memory requirements), Gradio (Web interface for model demos and interactive evaluation of vision-language capabilities), and 1 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does LLaVA have?
LLaVA has 3 data pools (model checkpoints, conversation templates, evaluation predictions), 2 feedback loops, 5 control points, and 3 delays. The feedback loops cover the training loop and evaluation convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does LLaVA use?
4 design patterns detected: Instruction Tuning, Multimodal Fusion, Stage-wise Training, Lazy Data Loading.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.