haotian-liu/llava

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

24,632 stars · Python · 10 components · 4 connections

Visual instruction tuning framework building multimodal LLMs with GPT-4V capabilities

Images and text flow through vision encoder, multimodal projector, and language model for instruction following

Under the hood, the system uses 3 feedback loops, 3 data pools, and 4 control points to manage its runtime behavior.

Structural Verdict

A 10-component ML training system with 4 connections. 62 files analyzed. Loosely coupled — components are relatively independent.

How Data Flows Through the System


  1. Load Data — Load image-text instruction pairs from configured datasets (config: data_path, image_folder, lazy_preprocess)
  2. Process Images — Encode images through vision tower and project to language model dimension (config: vision_tower, mm_vision_select_layer, mm_projector_type)
  3. Format Conversation — Apply conversation templates with special tokens for multimodal input (config: version, mm_use_im_start_end, mm_use_im_patch_token)
  4. Train Model — Fine-tune language model with multimodal inputs and instruction targets (config: model_name_or_path, freeze_backbone, tune_mm_mlp_adapter +1)
  5. Evaluate — Test on visual reasoning benchmarks with automated GPT-4 scoring
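The configuration points listed in these steps can be sketched as dataclasses, in the style of HuggingFace training scripts; the grouping and default values below are assumptions for illustration, not the actual llava/train/train.py definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataArguments:
    # Step 1: where instruction pairs and images are loaded from
    data_path: str = "path/to/llava_instruct.json"
    image_folder: Optional[str] = None
    lazy_preprocess: bool = True   # tokenize lazily instead of up front

@dataclass
class ModelArguments:
    # Steps 2-4: vision tower, projector, and base language model
    model_name_or_path: str = "lmsys/vicuna-7b-v1.5"
    vision_tower: str = "openai/clip-vit-large-patch14-336"
    mm_vision_select_layer: int = -2       # which tower layer to read features from
    mm_projector_type: str = "mlp2x_gelu"  # linear vs. 2-layer MLP projector
    freeze_backbone: bool = False          # freeze the LLM, train only the adapter
    tune_mm_mlp_adapter: bool = False
```

Splitting data and model concerns into separate argument dataclasses keeps each stage's knobs discoverable and lets a CLI parser populate them independently.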

System Behavior

How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

- Model Registry (state-store) — active worker models and their capabilities
- Conversation History (in-memory) — chat message history and context
- Image Cache (file-store) — processed image tensors

Technology Stack

- PyTorch (framework) — deep learning framework
- Transformers (library) — model architecture and tokenization
- Gradio (framework) — web interface for demos
- FastAPI (framework) — API serving backend
- Ray (framework) — distributed evaluation processing
- DeepSpeed (library) — training optimization
- PEFT (library) — parameter-efficient fine-tuning
- OpenAI API (library) — GPT-4 evaluation
- Pillow (library) — image processing
- Pandas (library) — data manipulation for benchmarks

Key Components

Configuration

cog.yaml (yaml)

llava/conversation.py (python-dataclass)

llava/serve/controller.py (python-dataclass)

llava/train/train.py (python-dataclass)

Science Pipeline

  1. Load Images — PIL.Image.open then convert RGB [Variable (H, W, 3) → (3, H, W)] llava/mm_utils.py
  2. Vision Encoding — Vision tower forward pass with layer selection [(3, 336, 336) → (576, 1024)] llava/model/multimodal_encoder.py
  3. Multimodal Projection — Linear/MLP projection to language model dimension [(576, 1024) → (576, 4096)] llava/model/multimodal_projector.py
  4. Token Integration — Replace image tokens with vision features in text sequence [(seq_len, 4096) + (576, 4096) → (seq_len + 575, 4096)] llava/model/llava_arch.py
  5. Language Generation — Autoregressive generation with multimodal context [(seq_len + 575, 4096) → (output_len, vocab_size)] llava/model/llava_arch.py
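The shape transitions above can be traced with a toy NumPy sketch: random matrices stand in for the CLIP vision tower and the trained projector, so only the dimensions follow the pipeline; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: vision tower output for one 336x336 image:
# 576 patch tokens (a 24x24 grid) with hidden size 1024
patch_features = rng.standard_normal((576, 1024))

# Step 3: project patch features to the LLM hidden size (4096)
W = rng.standard_normal((1024, 4096)) * 0.02
image_tokens = patch_features @ W               # (576, 4096)

# Step 4: splice image tokens into the text embedding sequence
# in place of a single <image> placeholder token
text_embeds = rng.standard_normal((32, 4096))   # seq_len = 32
image_pos = 5                                   # index of the <image> token
fused = np.concatenate([text_embeds[:image_pos],
                        image_tokens,
                        text_embeds[image_pos + 1:]])

print(fused.shape)  # (32 + 576 - 1, 4096) = (607, 4096)
```

The net effect is the seq_len + 575 growth noted in step 4: one placeholder token is removed and 576 vision tokens take its place.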


Frequently Asked Questions

What is LLaVA used for?

LLaVA (haotian-liu/llava) is a visual instruction tuning framework for building multimodal LLMs with GPT-4V-level capabilities. It is a 10-component ML training system written in Python, loosely coupled — components are relatively independent. The codebase contains 62 files.

How is LLaVA architected?

LLaVA is organized into 4 architecture layers: Model Layer, Training Layer, Evaluation Layer, Serving Layer. Loosely coupled — components are relatively independent. This layered structure keeps concerns separated and modules independent.

How does data flow through LLaVA?

Data moves through 5 stages: Load Data → Process Images → Format Conversation → Train Model → Evaluate. Images and text flow through the vision encoder, multimodal projector, and language model for instruction following. This pipeline design reflects a complex multi-stage processing system.

What technologies does LLaVA use?

The core stack includes PyTorch (Deep learning framework), Transformers (Model architecture and tokenization), Gradio (Web interface for demos), FastAPI (API serving backend), Ray (Distributed evaluation processing), DeepSpeed (Training optimization), and 4 more. This broad technology surface reflects a mature project with many integration points.

What system dynamics does LLaVA have?

LLaVA exhibits 3 data pools (Model Registry, Conversation History, Image Cache), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle polling and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does LLaVA use?

5 design patterns detected: Conversation Templates, Multimodal Tokenization, Distributed Evaluation, Modular Architecture, Configuration Dataclasses.
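The Conversation Templates and Configuration Dataclasses patterns can be illustrated together with a small sketch. The roles, separator, and get_prompt method mirror the style of llava/conversation.py, but this is a simplified assumption, not the actual class.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """A minimal two-role chat template with a system prompt."""
    system: str
    roles: tuple = ("USER", "ASSISTANT")
    messages: list = field(default_factory=list)
    sep: str = " "

    def append_message(self, role: str, text: str) -> None:
        self.messages.append((role, text))

    def get_prompt(self) -> str:
        # Flatten system prompt and turns into one training/inference string
        parts = [self.system]
        for role, text in self.messages:
            parts.append(f"{role}: {text}")
        return self.sep.join(parts)

conv = Conversation(system="A chat between a curious user and an AI assistant.")
conv.append_message("USER", "<image>\nWhat is shown in this picture?")
conv.append_message("ASSISTANT", "")
print(conv.get_prompt())
```

Keeping templates as dataclass instances lets each model version register its own roles and separators while sharing one prompt-assembly path.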

Analyzed on March 31, 2026 by CodeSea.