haotian-liu/llava
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Visual instruction tuning framework building multimodal LLMs with GPT-4V capabilities
Images and text flow through vision encoder, multimodal projector, and language model for instruction following
Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.
Structural Verdict
A 10-component ML training project with 4 connections. 62 files analyzed. Loosely coupled — components are relatively independent.
How Data Flows Through the System
Images and text flow through vision encoder, multimodal projector, and language model for instruction following
- Load Data — Load image-text instruction pairs from configured datasets (config: data_path, image_folder, lazy_preprocess)
- Process Images — Encode images through vision tower and project to language model dimension (config: vision_tower, mm_vision_select_layer, mm_projector_type)
- Format Conversation — Apply conversation templates with special tokens for multimodal input (config: version, mm_use_im_start_end, mm_use_im_patch_token)
- Train Model — Fine-tune language model with multimodal inputs and instruction targets (config: model_name_or_path, freeze_backbone, tune_mm_mlp_adapter +1)
- Evaluate — Test on visual reasoning benchmarks with automated GPT-4 scoring
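The Load Data stage above can be sketched as a lazily-evaluated dataset. This is an illustrative sketch, not the repo's actual class: the class name and returned keys are hypothetical, and only the `data_path`, `image_folder`, and `lazy_preprocess` ideas come from the configuration listed above.

```python
import json
from pathlib import Path

class LazyInstructionDataset:
    """Sketch of lazy loading for image-text instruction pairs.

    The annotation file is read eagerly (it is small), but per-sample
    work, including touching image files, is deferred to __getitem__,
    mirroring the `lazy_preprocess` option. Names are illustrative.
    """

    def __init__(self, data_path: str, image_folder: str):
        with open(data_path) as f:
            # A list of records like {"image": ..., "conversations": ...}.
            self.records = json.load(f)
        self.image_folder = Path(image_folder)

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # The image path is only resolved here, not at construction time.
        image_path = self.image_folder / rec["image"]
        return {"image_path": str(image_path),
                "conversations": rec["conversations"]}
```

Deferring image decoding keeps dataset construction cheap when the annotation file references hundreds of thousands of images.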
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry — active worker models and their capabilities
- Conversation History — chat message history and context
- Processed image tensors
Feedback Loops
- Worker Heartbeat (polling, balancing) — Trigger: Timer every 15 seconds. Action: Workers send status to controller. Exit: Worker shutdown.
- GPT-4 Retry (retry, balancing) — Trigger: Rate limit or API error. Action: Sleep and retry API call. Exit: Successful response or max retries.
- Training Loop (training-loop, reinforcing) — Trigger: Start training. Action: Forward pass, compute loss, backprop. Exit: Max steps reached.
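The GPT-4 retry loop above follows a standard sleep-and-retry shape. A minimal sketch, with the API call stubbed out as any callable (the real `get_eval` in llava/eval/eval_gpt_review.py wraps the OpenAI client; the function name and backoff schedule here are illustrative):

```python
import time

def call_with_retry(api_call, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Sleep-and-retry loop: retry on any error, exit on success or max retries."""
    for attempt in range(max_retries):
        try:
            return api_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted: surface the last error
            # Back off before the next attempt (rate limits, transient errors).
            sleep(base_delay * (attempt + 1))
```

Injecting `sleep` as a parameter keeps the loop testable without real delays.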
Delays & Async Processing
- Heartbeat Expiration (cache-ttl, ~30 seconds) — Worker marked as inactive
- API Rate Limiting (rate-limit, ~0.5-3 seconds) — Evaluation pipeline pauses
- Image Processing (async-processing) — Latency in multimodal inference
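The heartbeat expiration above can be sketched as a periodic sweep over worker timestamps. This is a simplified stand-in for the controller's logic, not the repo's exact code; the function name and the dict-of-timestamps representation are assumptions:

```python
import time

HEARTBEAT_EXPIRATION = 30.0  # seconds; matches the ~30 s TTL above

def remove_stale_workers(workers, now=None):
    """Drop workers whose last heartbeat is older than the expiration window.

    `workers` maps worker name -> last heartbeat timestamp (seconds).
    Returns the names that were marked inactive.
    """
    now = time.time() if now is None else now
    # Collect first, then delete, so we never mutate while iterating.
    stale = [name for name, beat in workers.items()
             if now - beat > HEARTBEAT_EXPIRATION]
    for name in stale:
        del workers[name]
    return stale
```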
Control Points
- Vision Tower Selection (env-var) — Controls: which vision encoder to use. Config key: vision_tower
- Conversation Version (runtime-toggle) — Controls: chat format and special tokens. Config key: version
- Temperature (threshold) — Controls: GPT-4 evaluation randomness. Default: 0.2
- Freeze Backbone (feature-flag) — Controls: whether to freeze language model weights. Config key: freeze_backbone
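A common way to wire up control points like these is a small resolver that layers runtime overrides over environment variables over defaults. The resolver below is hypothetical (the repo passes these as training arguments); only the config key names and the 0.2 temperature default come from the list above:

```python
import os

# Illustrative defaults; the keys are real config names from the repo,
# the resolver itself is a sketch.
DEFAULTS = {
    "vision_tower": None,        # which vision encoder to use
    "version": "v0",             # conversation template
    "temperature": 0.2,          # GPT-4 evaluation randomness
    "freeze_backbone": False,    # freeze language model weights
}

def resolve_control(name, overrides=None, env=os.environ):
    """Resolve a control point: runtime override > env var > default."""
    if overrides and name in overrides:
        return overrides[name]
    env_key = name.upper()
    if env_key in env:
        raw = env[env_key]
        default = DEFAULTS[name]
        # Coerce env strings toward the default's type.
        if isinstance(default, bool):
            return raw.lower() in ("1", "true", "yes")
        if isinstance(default, float):
            return float(raw)
        return raw
    return DEFAULTS[name]
```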
Technology Stack
- PyTorch — Deep learning framework
- Transformers — Model architecture and tokenization
- Gradio — Web interface for demos
- FastAPI — API serving backend
- Ray — Distributed evaluation processing
- DeepSpeed — Training optimization
- PEFT — Parameter-efficient fine-tuning
- OpenAI API — GPT-4 evaluation
- Pillow (PIL) — Image processing
- Data manipulation for benchmarks
Key Components
- Conversation (class) — Manages conversation history and prompt formatting with different separator styles — llava/conversation.py
- load_pretrained_model (function) — Factory function to load and initialize pretrained LLaVA models with vision components — llava/model/builder.py
- LlavaLlamaForCausalLM (class) — Core multimodal architecture combining Llama language model with vision tower — llava/model/llava_arch.py
- LLaVATrainer (class) — Custom trainer for multimodal instruction tuning with vision-language data — llava/train/llava_trainer.py
- eval_model (function) — Evaluates LLaVA models on visual question answering benchmarks — llava/eval/model_vqa.py
- get_eval (function) — Uses GPT-4 to evaluate model responses with automatic retry logic — llava/eval/eval_gpt_review.py
- process_images (function) — Preprocesses images for model input with proper aspect ratio handling — llava/mm_utils.py
- tokenizer_image_token (function) — Tokenizes text with special image tokens for multimodal processing — llava/mm_utils.py
- ControllerManager (class) — Manages worker nodes and load balancing for distributed serving — llava/serve/controller.py
- build_demo (function) — Creates Gradio web interface for interactive multimodal conversations — llava/serve/gradio_web_server.py
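Of these, `tokenizer_image_token` is the glue between text and vision: it splits the prompt on the `<image>` placeholder and splices in a sentinel token id that is later replaced by vision features. A simplified sketch with a pluggable tokenizer (the real code uses a HuggingFace tokenizer and handles BOS tokens; the toy tokenizer in the test is illustrative):

```python
IMAGE_TOKEN_INDEX = -200  # sentinel id later swapped for vision features

def tokenizer_image_token(prompt, tokenize, image_token="<image>"):
    """Tokenize text around `<image>` placeholders, splicing in the sentinel id.

    `tokenize` is any callable str -> list[int]. This follows the same
    split-and-join idea as the repo's function, minus special-token handling.
    """
    chunks = [tokenize(chunk) for chunk in prompt.split(image_token)]
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel per <image> occurrence.
            input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(chunk)
    return input_ids
```

Using a negative sentinel guarantees it can never collide with a real vocabulary id.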
Configuration
cog.yaml (yaml)
- build.gpu (boolean) — default: true
- build.python_version (string) — default: 3.11
- build.python_packages (array) — default: torch==2.0.1, accelerate==0.21.0, bitsandbytes==0.41.0, deepspeed==0.9.5, einops-exts==0.0.4, einops==0.6.1, gradio==3.35.2, gradio_client==0.2.9, httpx==0.24.0, markdown2==2.4.10, numpy==1.26.0, peft==0.4.0, scikit-learn==1.2.2, sentencepiece==0.1.99, shortuuid==1.0.11, timm==0.6.13, tokenizers==0.13.3, torch==2.0.1, torchvision==0.15.2, transformers==4.31.0, wandb==0.15.12, wavedrom==2.0.3.post3, Pygments==2.16.1
- build.run (array) — default: curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget
- predict (string) — default: predict.py:Predictor
llava/conversation.py (python-dataclass)
- system (str)
- roles (List[str])
- messages (List[List[str]])
- offset (int)
- sep_style (SeparatorStyle) — default: SeparatorStyle.SINGLE
- sep (str) — default: "###"
- sep2 (str) — default: None
- version (str) — default: "Unknown"
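Rendered back as Python, the fields above correspond roughly to the following dataclass. This sketch implements only the SINGLE separator style; the real class in llava/conversation.py supports several styles and more bookkeeping, and the system prompt text here is a placeholder:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List

class SeparatorStyle(Enum):
    SINGLE = auto()

@dataclass
class Conversation:
    """Sketch of the Conversation dataclass with the fields listed above."""
    system: str = "A chat between a user and an assistant."
    roles: List[str] = field(default_factory=lambda: ["Human", "Assistant"])
    messages: List[List[str]] = field(default_factory=list)
    offset: int = 0
    sep_style: SeparatorStyle = SeparatorStyle.SINGLE
    sep: str = "###"

    def append_message(self, role: str, message: str) -> None:
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        # SINGLE style: system prompt, then "Role: message" turns,
        # each terminated by the "###" separator.
        ret = self.system + self.sep
        for role, message in self.messages:
            ret += role + ": " + (message or "") + self.sep
        return ret
```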
llava/serve/controller.py (python-dataclass)
- model_names (List[str])
- speed (int)
- queue_length (int)
- check_heart_beat (bool)
- last_heart_beat (str)
llava/train/train.py (python-dataclass)
- model_name_or_path (Optional[str]) — default: "facebook/opt-125m"
- version (Optional[str]) — default: "v0"
- freeze_backbone (bool) — default: False
- tune_mm_mlp_adapter (bool) — default: False
- vision_tower (Optional[str]) — default: None
- mm_vision_select_layer (Optional[int]) — default: -1 (the last layer)
- pretrain_mm_mlp_adapter (Optional[str]) — default: None
- mm_projector_type (Optional[str]) — default: 'linear'
- +4 more parameters
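As Python, the arguments above correspond roughly to this dataclass (a subset; the real one in llava/train/train.py has a few more fields, and the class name here assumes the repo's `ModelArguments` naming):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    """Sketch of the training-side model arguments listed above."""
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    version: Optional[str] = field(default="v0")
    freeze_backbone: bool = field(default=False)
    tune_mm_mlp_adapter: bool = field(default=False)
    vision_tower: Optional[str] = field(default=None)
    mm_vision_select_layer: Optional[int] = field(default=-1)  # last layer
    pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
    mm_projector_type: Optional[str] = field(default="linear")
```

Dataclasses like this are typically fed to HuggingFace's `HfArgumentParser`, so every field doubles as a `--flag` on the training command line.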
Science Pipeline
- Load Images — PIL.Image.open then convert to RGB [variable (H, W, 3) → (3, H, W)] — llava/mm_utils.py
- Vision Encoding — Vision tower forward pass with layer selection [(3, 336, 336) → (576, 1024)] — llava/model/multimodal_encoder.py
- Multimodal Projection — Linear/MLP projection to language model dimension [(576, 1024) → (576, 4096)] — llava/model/multimodal_projector.py
- Token Integration — Replace image tokens with vision features in the text sequence [(seq_len, 4096) + (576, 4096) → (seq_len + 575, 4096)] — llava/model/llava_arch.py
- Language Generation — Autoregressive generation with multimodal context [(seq_len + 575, 4096) → (output_len, vocab_size)] — llava/model/llava_arch.py
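The shape arithmetic in the pipeline above can be traced without any tensors. A sketch in pure Python; the 576/1024/4096 constants come from the stages listed above, while `vocab_size` is illustrative:

```python
def trace_shapes(seq_len, num_patches=576, vision_dim=1024, hidden=4096,
                 vocab_size=32000):
    """Trace tensor shapes through the pipeline stages; pure shape math."""
    stages = []
    stages.append(("vision_encoding", (num_patches, vision_dim)))   # (576, 1024)
    stages.append(("projection", (num_patches, hidden)))            # (576, 4096)
    # One <image> token in the text is replaced by num_patches features,
    # so the merged sequence grows by num_patches - 1 (hence "+ 575").
    merged_len = seq_len + num_patches - 1
    stages.append(("token_integration", (merged_len, hidden)))
    stages.append(("logits", (merged_len, vocab_size)))
    return stages
```

This makes the otherwise-cryptic "seq_len + 575" explicit: 576 patch features minus the one placeholder token they replace.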
Assumptions & Constraints
- [warning] Assumes images are PIL Image objects, but no type checking is enforced (shape)
- [critical] Assumes the input_ids tensor shape matches image token positions, without validation (shape)
- [info] Assumes image_aspect_ratio 'square' produces consistent dimensions across datasets (format)
- [warning] The vision tower output dimension must match the projector input, but there is no runtime check (dependency)
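The last constraint is the cheapest to guard against. A sketch of the missing runtime check (the function name is hypothetical; the 1024 → 4096 dimensions match the projection stage above):

```python
def check_projector_compat(vision_out_dim: int, projector_in_dim: int) -> None:
    """Guard for the flagged gap: the vision tower's output dimension
    must equal the projector's input dimension, or matmul fails later
    with a much less readable error."""
    if vision_out_dim != projector_in_dim:
        raise ValueError(
            f"Vision tower outputs dim {vision_out_dim} but the projector "
            f"expects {projector_in_dim}; check vision_tower and "
            f"mm_projector_type settings."
        )
```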
Frequently Asked Questions
What is LLaVA used for?
LLaVA (haotian-liu/llava) is a visual instruction tuning framework for building multimodal LLMs with GPT-4V-level capabilities. It is a 10-component ML training project written in Python. Loosely coupled — components are relatively independent. The codebase contains 62 files.
How is LLaVA architected?
LLaVA is organized into 4 architecture layers: Model Layer, Training Layer, Evaluation Layer, Serving Layer. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.
How does data flow through LLaVA?
Data moves through 5 stages: Load Data → Process Images → Format Conversation → Train Model → Evaluate. Images and text flow through the vision encoder, multimodal projector, and language model for instruction following. This pipeline design reflects a complex multi-stage processing system.
What technologies does LLaVA use?
The core stack includes PyTorch (Deep learning framework), Transformers (Model architecture and tokenization), Gradio (Web interface for demos), FastAPI (API serving backend), Ray (Distributed evaluation processing), DeepSpeed (Training optimization), and 4 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does LLaVA have?
LLaVA exhibits 3 data pools (Model Registry, Conversation History), 3 feedback loops, 4 control points, 3 delays. The feedback loops handle polling and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does LLaVA use?
5 design patterns detected: Conversation Templates, Multimodal Tokenization, Distributed Evaluation, Modular Architecture, Configuration Dataclasses.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.