Hidden Assumptions in LLaVA

12 assumptions this code never checks · 4 critical · spanning Shape, Domain, Ordering, Scale, Resource, Environment, Contract, Temporal

Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at haotian-liu/llava and picked out the 3 most likely to cause trouble. Those come first; the rest are minor and listed under "Show everything".

Worth your attention first

If the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor shape errors during training, potentially hours into a run
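
One way to surface this before training starts is a pre-flight pass over the images. The sketch below is an illustration under stated assumptions, not LLaVA's own code: `to_rgb` is a hypothetical helper, and it assumes a Pillow-style image object that exposes `.mode` and `.convert()`.

```python
def to_rgb(image):
    """Return a 3-channel RGB version of a Pillow-style image.

    Hypothetical helper: assumes `image` exposes `.mode` and `.convert()`,
    as PIL.Image.Image does.
    """
    mode = getattr(image, "mode", None)
    if mode is None or not hasattr(image, "convert"):
        raise TypeError(f"expected a PIL-style image, got {type(image).__name__}")
    if mode == "RGB":
        return image  # already in the layout a 3-channel transform expects
    # Grayscale ("L"), CMYK, RGBA and similar modes are converted up front,
    # so a channel mismatch cannot surface later inside a training batch.
    return image.convert("RGB")
```

With Pillow installed this would typically be called as `to_rgb(Image.open(path))`. Wrapping that call in a `try`/`except` that logs the offending path is one way to catch unreadable files in the same pass.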

If a tokenizer has vocabulary IDs that include -200 (or if token IDs are remapped), image tokens could overwrite real text tokens, causing the model to inject image features at wrong positions and produce garbled outputs
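
A cheap guard for this is to confirm, once at startup, that the sentinel value never appears among the tokenizer's real IDs. The following is a minimal sketch: `sentinel_is_free` is a hypothetical helper, and the only value taken from the text above is -200.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel value quoted in the text above


def sentinel_is_free(vocab_ids, sentinel=IMAGE_TOKEN_INDEX):
    """Return True if no real token ID collides with the image sentinel."""
    return all(token_id != sentinel for token_id in vocab_ids)


def require_free_sentinel(vocab_ids, sentinel=IMAGE_TOKEN_INDEX):
    """Fail fast, with a readable message, when the sentinel is taken."""
    if not sentinel_is_free(vocab_ids, sentinel):
        raise ValueError(
            f"token ID {sentinel} is used by the tokenizer and cannot "
            "double as the image placeholder"
        )
```

With a Hugging Face tokenizer, the IDs could be obtained from `tokenizer.get_vocab().values()`; that wiring is an assumption about how the check would be hooked in, not something the analysis states.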

If conversation data has non-alternating turns, missing roles, or inconsistent message formats, the prompt builder will produce prompts with the wrong speaker attribution, and the model will train on incorrect turn-taking patterns
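
A dataset lint along these lines could flag such samples before they reach the prompt builder. This is a sketch under assumptions: it supposes each turn is a dict with `from` and `value` keys and that the roles are labeled `human` and `gpt`; `validate_turns` is a hypothetical name.

```python
EXPECTED_ROLES = ("human", "gpt")  # assumed role labels, in speaking order


def validate_turns(conversation, roles=EXPECTED_ROLES):
    """Return a list of problems; an empty list means the sample looks sane."""
    problems = []
    for i, turn in enumerate(conversation):
        if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
            problems.append(f"turn {i}: missing 'from' or 'value'")
            continue
        expected = roles[i % len(roles)]  # roles must strictly alternate
        if turn["from"] != expected:
            problems.append(
                f"turn {i}: expected role {expected!r}, got {turn['from']!r}"
            )
    return problems
```

Returning a list of problems rather than raising on the first one makes it practical to scan a whole dataset and report every bad sample in a single pass.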

Show everything (9 more)

Scale

All conversation sequences will fit within model_max_length tokens after image token insertion - the tokenizer truncates sequences but doesn't account for IMAGE_TOKEN_INDEX placeholders that get replaced with variable-length visual features

If this fails: Long conversations with images could exceed context window after visual feature insertion, causing CUDA out-of-memory errors or silently truncated visual information that breaks image-text alignment

llava/train/train.py:model_max_length
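
The length mismatch can be made explicit by computing the sequence length after each placeholder is expanded. The sketch below assumes every image placeholder expands to a fixed number of visual features; the value 576 used in the usage note is illustrative (a 24 x 24 patch grid), and the real count depends on the vision tower. Both function names are hypothetical.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder ID quoted earlier in this report


def expanded_length(input_ids, features_per_image,
                    image_token_index=IMAGE_TOKEN_INDEX):
    """Length of the sequence once each placeholder becomes visual features."""
    n_images = sum(1 for token in input_ids if token == image_token_index)
    # Each placeholder occupies one slot now and features_per_image later.
    return len(input_ids) - n_images + n_images * features_per_image


def fits_context(input_ids, model_max_length, features_per_image):
    """True if the expanded sequence still fits the context window."""
    return expanded_length(input_ids, features_per_image) <= model_max_length
```

For example, `fits_context(ids, 2048, 576)` would reject a sample with four images regardless of how short its text is, since the visual features alone need 2304 positions.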

Resource

OpenAI API will always eventually succeed after rate limit retries - the infinite retry loop with time.sleep(NUM_SECONDS_TO_SLEEP) assumes temporary rate limits but doesn't handle permanent failures, quota exhaustion, or API key revocation

If this fails: Evaluation scripts can hang indefinitely if OpenAI API keys are invalid, quota is exceeded, or service is discontinued, blocking automated evaluation pipelines without clear error messages

llava/eval/eval_gpt_review.py:get_eval
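
A bounded retry with backoff is the usual alternative to an unbounded loop. The sketch below is generic and does not reproduce `get_eval`: `call_with_retries` and `RetryBudgetExceeded` are hypothetical names, and `fn` stands in for whatever function issues the API request.

```python
import time


class RetryBudgetExceeded(RuntimeError):
    """Raised when every attempt failed, so callers see a clear error."""


def call_with_retries(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn() up to max_attempts times with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == max_attempts:
                # Give up loudly instead of looping forever.
                raise RetryBudgetExceeded(
                    f"gave up after {max_attempts} attempts"
                ) from exc
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
```

Narrowing `retry_on` to the exception types that signal a temporary rate limit would let permanent failures, such as an invalid key, propagate on the first attempt.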

Environment

Vision encoder and language model checkpoints are compatible with current CUDA/PyTorch environment - the model loading doesn't validate device compatibility, precision requirements, or CUDA compute capability

If this fails: Models could load successfully but fail during forward pass with CUDA version mismatches, unsupported operations on older GPUs, or precision errors that manifest as NaN losses

llava/model/builder.py:load_pretrained_model

Contract

CLIP image preprocessor transforms produce tensors with expected shape (3, 336, 336) that match the vision encoder input requirements - no validation that image_processor output dimensions align with vision_tower expected input

If this fails: If CLIP model configurations mismatch between preprocessing and vision encoder (different image sizes, normalization), the model silently processes wrong-shaped inputs, leading to degraded performance that's hard to diagnose

llava/mm_utils.py:process_images
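
An explicit shape assertion between preprocessing and the vision encoder would turn this silent mismatch into an immediate error. A minimal sketch, assuming the expected shape (3, 336, 336) quoted above and an optional leading batch dimension; `check_image_shape` is a hypothetical helper.

```python
def check_image_shape(shape, expected=(3, 336, 336)):
    """Validate a (C, H, W) or (N, C, H, W) shape against the expected one."""
    shape = tuple(shape)
    expected = tuple(expected)
    # Accept an optional batch dimension, then compare the trailing dims.
    if len(shape) not in (3, 4) or shape[-3:] != expected:
        raise ValueError(
            f"image tensor has shape {shape}, expected (..., "
            f"{expected[0]}, {expected[1]}, {expected[2]})"
        )
    return shape
```

Because the function only needs something convertible to a tuple, it could be fed a tensor's `.shape` attribute directly.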

Temporal

Dataset files remain unchanged during training - LazySupervisedDataset caches file paths but doesn't check file modification times or handle concurrent writes to image directories

If this fails: If training data is modified during long training runs (images replaced, JSON files updated), the trainer could load mismatched images and conversation data, corrupting batch consistency

llava/train/train.py:LlavaTrainer
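
One lightweight way to detect this is to record a fingerprint of each data file at startup and compare it at checkpoints. The sketch below uses file size and modification time only, which is an assumption about what counts as "changed"; a content hash would be stricter but slower. All three function names are hypothetical.

```python
import os


def fingerprint(path):
    """Return (size, mtime_ns) for a file, or None if it no longer exists."""
    try:
        stat = os.stat(path)
    except FileNotFoundError:
        return None
    return (stat.st_size, stat.st_mtime_ns)


def snapshot(paths):
    """Record the current fingerprint of every path."""
    return {path: fingerprint(path) for path in paths}


def changed_since(before):
    """Return the sorted paths whose fingerprint differs from the snapshot."""
    return sorted(path for path, old in before.items()
                  if fingerprint(path) != old)
```

Calling `changed_since` at each checkpoint and aborting on a non-empty result would bound how long a run can continue on modified data.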

Domain

Model responses can be reliably classified as yes/no by checking for presence of 'No', 'not', 'no' keywords in the first sentence - this simple heuristic doesn't handle nuanced responses, multiple sentences, or complex negations

If this fails: Evaluation scores become unreliable when models generate nuanced answers like 'I cannot determine' or 'It appears unlikely', leading to incorrect benchmark results that don't reflect true model capability

llava/eval/eval_pope.py:eval_pope
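
One way to make the heuristic's blind spots visible is to add a third bucket for answers that contain neither a clear yes nor a clear no. This sketch is a variation on the keyword check described above, not the code in `eval_pope.py`; `classify_yes_no` and the keyword sets are assumptions.

```python
import re

NEGATIVE_WORDS = {"no", "not"}
POSITIVE_WORDS = {"yes"}


def classify_yes_no(answer):
    """Classify the first sentence as 'yes', 'no', or 'unknown'."""
    first_sentence = answer.strip().split(".")[0].lower()
    # Whole-word matching, so "cannot" or "nothing" do not count as "no".
    words = set(re.findall(r"[a-z']+", first_sentence))
    has_no = bool(words & NEGATIVE_WORDS)
    has_yes = bool(words & POSITIVE_WORDS)
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return "unknown"  # hedged, contradictory, or off-format answers
```

Reporting the share of `unknown` answers next to the benchmark score would show how much of the result rests on the heuristic rather than on the model.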

Scale

Loss masking with IGNORE_INDEX (-100) correctly excludes human tokens from gradient computation - assumes PyTorch CrossEntropyLoss handles -100 as intended without checking for label preprocessing edge cases

If this fails: If label preprocessing creates unexpected values or if loss function behavior changes between PyTorch versions, human tokens might contribute to loss, degrading instruction-following capability

llava/constants.py:IGNORE_INDEX
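
The masking contract can be checked directly on the label sequence before it reaches the loss. The sketch below assumes assistant replies are described by half-open `(start, end)` index spans; that representation and both function names are hypothetical. The value -100 matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # value quoted above


def mask_labels(input_ids, assistant_spans, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels only inside assistant spans."""
    labels = [ignore_index] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


def unmasked_outside(labels, assistant_spans, ignore_index=IGNORE_INDEX):
    """Return positions that would contribute to the loss but should not."""
    allowed = set()
    for start, end in assistant_spans:
        allowed.update(range(start, end))
    return [i for i, label in enumerate(labels)
            if label != ignore_index and i not in allowed]
```

Running `unmasked_outside` on a few samples per dataset is a quick regression test for the preprocessing edge cases mentioned above.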

Ordering

Multiple choice predictions contain single letter answers (A, B, C, D, E) that can be directly mapped to choice indices - doesn't handle multi-character responses, explanations, or malformed model outputs

If this fails: Models that generate explanatory text or uncertain responses get random choice assignments, artificially inflating or deflating benchmark scores and masking model reasoning capabilities

llava/eval/eval_science_qa.py:get_pred_idx
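
A stricter parser can return "no answer" instead of assigning a choice when the output is malformed. The following is a sketch, not the logic of `get_pred_idx`: `parse_choice` is a hypothetical function, and the two accepted answer formats are assumptions about what model output typically looks like.

```python
import re

ANSWER_PATTERNS = (
    r"\(?([A-Za-z])\)?[.:]?",              # "A", "(b)", "C."
    r".*\banswer is \(?([A-Za-z])\)?\.?",  # "The answer is D."
)


def parse_choice(prediction, options="ABCDE"):
    """Return the index of the chosen option, or None if it is unparseable."""
    text = prediction.strip()
    for pattern in ANSWER_PATTERNS:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE | re.DOTALL)
        if match:
            letter = match.group(1).upper()
            # A letter outside the option set counts as unparseable too.
            return options.index(letter) if letter in options else None
    return None
```

Counting `None` results separately, rather than mapping them to a random choice, keeps unparseable outputs from moving the score in either direction.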

Contract

All conversation templates (v0, v1, llama2, mpt) use consistent role naming and separator conventions - the get_prompt method assumes roles[0] is always the user role without validation

If this fails: If custom conversation templates have different role orderings or naming, prompts will have incorrect speaker labels, confusing the model about who is asking questions versus providing answers

llava/conversation.py:Conversation

See the full structural analysis of LLaVA: the pipeline, data models, and system behavior that put these assumptions in context.

Frequently Asked Questions

What does LLaVA assume that could break in production?

The one most likely to cause trouble: all images in the dataset have compatible dimensions after CLIP preprocessing. The process_images function expects PIL Images that can be resized and normalized, but it never validates image format or color channels and does not handle corrupted files. If this fails, for example because the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor shape errors during training, potentially hours into a training run.

How many hidden assumptions does LLaVA have?

CodeSea found 12 assumptions LLaVA relies on but never validates, 4 of them critical, spanning Shape, Domain, Ordering, Scale, Resource, Environment, Contract, and Temporal. Most are routine; the analysis flags the 3 most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails, often silently.