Hidden Assumptions in LLaVA
12 assumptions this code never checks · 4 critical · spanning Shape, Domain, Ordering, Scale, Resource, Environment, Contract, Temporal
Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at haotian-liu/llava and picked out the 3 assumptions most likely to cause trouble here. The rest are minor; they are listed under "Show everything".
If the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor shape errors during training, potentially hours into a training run.
If a tokenizer has vocabulary IDs that include -200 (or if token IDs are remapped), image tokens could overwrite real text tokens, causing the model to inject image features at the wrong positions and produce garbled outputs.
If conversation data has non-alternating turns, missing roles, or inconsistent message formats, the model will build prompts with the wrong speaker attribution, so training teaches it incorrect turn-taking patterns.
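The conversation-format risk above can be caught before training with a cheap pre-flight check. The following is a minimal sketch, not LLaVA's own code: the function name find_turn_errors is hypothetical, and the "from"/"value" keys and the "human"/"gpt" role names are assumptions about the dataset layout that should be adjusted to match the actual data.

```python
from typing import Dict, List

# Assumed role names and their expected order; adjust to your dataset.
EXPECTED_ROLES = ("human", "gpt")


def find_turn_errors(conversation: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the turns look sane."""
    if not conversation:
        return ["conversation is empty"]
    errors = []
    for i, turn in enumerate(conversation):
        role = turn.get("from")
        expected = EXPECTED_ROLES[i % 2]  # roles must strictly alternate
        if role is None:
            errors.append(f"turn {i}: missing 'from' field")
        elif role != expected:
            errors.append(f"turn {i}: expected role {expected!r}, got {role!r}")
        if not isinstance(turn.get("value"), str):
            errors.append(f"turn {i}: 'value' is missing or not a string")
    return errors
```

Running this over every record once, and failing fast on a non-empty result, moves the error from "confused model after hours of training" to "clear message before the first step".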
Show everything (9 more)
All conversation sequences will fit within model_max_length tokens after image token insertion - the tokenizer truncates sequences but doesn't account for IMAGE_TOKEN_INDEX placeholders that get replaced with variable-length visual features
If this fails: Long conversations with images could exceed context window after visual feature insertion, causing CUDA out-of-memory errors or silently truncated visual information that breaks image-text alignment
llava/train/train.py:model_max_length
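One way to make this assumption explicit is to compute the post-expansion length before a sample reaches the model. This is a sketch under stated assumptions: it treats every image placeholder as expanding to a fixed number of visual features, and the value 576 assumes a 336-pixel input cut into 14-pixel patches. The function names are hypothetical.

```python
from typing import List

IMAGE_TOKEN_INDEX = -200  # placeholder ID mentioned earlier in this report
PATCHES_PER_IMAGE = 576   # assumption: 336px input, 14px patches -> 24 * 24


def expanded_length(input_ids: List[int],
                    patches_per_image: int = PATCHES_PER_IMAGE) -> int:
    """Sequence length after each placeholder is replaced by visual features."""
    n_images = sum(1 for token in input_ids if token == IMAGE_TOKEN_INDEX)
    # A placeholder occupies one slot before expansion and
    # patches_per_image slots after it.
    return len(input_ids) + n_images * (patches_per_image - 1)


def fits_context(input_ids: List[int], model_max_length: int) -> bool:
    """True if the expanded sequence still fits in the context window."""
    return expanded_length(input_ids) <= model_max_length
```

A sample with four images and almost no text already needs more than 2,300 positions under these numbers, so a tokenizer-level truncation to 2,048 tokens would not be enough on its own.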
OpenAI API will always eventually succeed after rate limit retries - the infinite retry loop with time.sleep(NUM_SECONDS_TO_SLEEP) assumes temporary rate limits but doesn't handle permanent failures, quota exhaustion, or API key revocation
If this fails: Evaluation scripts can hang indefinitely if OpenAI API keys are invalid, quota is exceeded, or service is discontinued, blocking automated evaluation pipelines without clear error messages
llava/eval/eval_gpt_review.py:get_eval
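A bounded retry is the usual remedy for a loop that assumes eventual success. The sketch below is generic and does not call the OpenAI client: call_with_retries and RetriesExhausted are hypothetical names, and which exception types count as retryable is left to the caller, since that depends on the client library version in use.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class RetriesExhausted(RuntimeError):
    """Raised when every attempt failed with a retryable error."""


def call_with_retries(
    fn: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying only on `retryable` errors, at most max_attempts times."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

Errors outside the retryable tuple, such as an authentication failure, propagate immediately instead of being retried, which is what turns a silent hang into a visible failure.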
Vision encoder and language model checkpoints are compatible with current CUDA/PyTorch environment - the model loading doesn't validate device compatibility, precision requirements, or CUDA compute capability
If this fails: Models could load successfully but fail during forward pass with CUDA version mismatches, unsupported operations on older GPUs, or precision errors that manifest as NaN losses
llava/model/builder.py:load_pretrained_model
CLIP image preprocessor transforms produce tensors with expected shape (3, 336, 336) that match the vision encoder input requirements - no validation that image_processor output dimensions align with vision_tower expected input
If this fails: If CLIP model configurations mismatch between preprocessing and vision encoder (different image sizes, normalization), the model silently processes wrong-shaped inputs, leading to degraded performance that's hard to diagnose
llava/mm_utils.py:process_images
Dataset files remain unchanged during training - LazySupervisedDataset caches file paths but doesn't check file modification times or handle concurrent writes to image directories
If this fails: If training data is modified during long training runs (images replaced, JSON files updated), the trainer could load mismatched images and conversation data, corrupting batch consistency
llava/train/train.py:LlavaTrainer
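A lightweight way to detect this is to record a fingerprint of each data file at startup and compare it periodically, for example at checkpoint time. The following sketch uses file size and modification time only; snapshot and changed_files are hypothetical helpers, and a content hash would be a stricter (and slower) alternative.

```python
import os
from typing import Dict, Iterable, List, Tuple

Fingerprint = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


def snapshot(paths: Iterable[str]) -> Dict[str, Fingerprint]:
    """Record size and modification time for each file."""
    result = {}
    for path in paths:
        st = os.stat(path)
        result[path] = (st.st_size, st.st_mtime_ns)
    return result


def changed_files(before: Dict[str, Fingerprint]) -> List[str]:
    """Return paths that were modified or deleted since the snapshot."""
    changed = []
    for path, fingerprint in before.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            changed.append(path)  # a deleted file counts as changed
            continue
        if (st.st_size, st.st_mtime_ns) != fingerprint:
            changed.append(path)
    return changed
```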
Model responses can be reliably classified as yes/no by checking for presence of 'No', 'not', 'no' keywords in the first sentence - this simple heuristic doesn't handle nuanced responses, multiple sentences, or complex negations
If this fails: Evaluation scores become unreliable when models generate nuanced answers like 'I cannot determine' or 'It appears unlikely', leading to incorrect benchmark results that don't reflect true model capability
llava/eval/eval_pope.py:eval_pope
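A stricter classifier can make the failure visible by refusing to guess. The sketch below is not the logic in eval_pope; it is a hypothetical alternative that accepts only answers that begin with "yes" or "no" and labels everything else "unknown", so ambiguous responses can be counted separately instead of being folded into one class.

```python
import re

# Matches an answer whose first word is "yes" or "no", ignoring
# leading whitespace, quotes, or other punctuation.
LEADING_YES_NO = re.compile(r"^\W*(yes|no)\b", re.IGNORECASE)


def classify_yes_no(answer: str) -> str:
    """Return 'yes', 'no', or 'unknown' for a free-text model answer."""
    match = LEADING_YES_NO.match(answer)
    return match.group(1).lower() if match else "unknown"
```

This is deliberately conservative: an answer such as "There is no dog" is reported as unknown rather than no. Reporting the unknown rate alongside accuracy shows how much of a score rests on unparseable answers.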
Loss masking with IGNORE_INDEX (-100) correctly excludes human tokens from gradient computation - assumes PyTorch CrossEntropyLoss handles -100 as intended without checking for label preprocessing edge cases
If this fails: If label preprocessing creates unexpected values or if loss function behavior changes between PyTorch versions, human tokens might contribute to loss, degrading instruction-following capability
llava/constants.py:IGNORE_INDEX
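The masking assumption can be checked on plain label lists before they are turned into tensors. This is a sketch with hypothetical helper names: it flags any label that is neither IGNORE_INDEX nor a valid vocabulary ID (for example, a stray -200 image placeholder), and reports how much of a sample is actually supervised. The vocab_size argument is an assumption supplied by the caller.

```python
from typing import List

IGNORE_INDEX = -100  # value named in this report; also PyTorch's default ignore_index


def find_bad_labels(labels: List[int], vocab_size: int) -> List[int]:
    """Positions whose label is neither IGNORE_INDEX nor a valid token ID."""
    return [
        i for i, value in enumerate(labels)
        if value != IGNORE_INDEX and not 0 <= value < vocab_size
    ]


def supervised_fraction(labels: List[int]) -> float:
    """Share of positions that contribute to the loss."""
    if not labels:
        return 0.0
    return sum(1 for value in labels if value != IGNORE_INDEX) / len(labels)
```

A supervised fraction of 0.0 (everything masked) or 1.0 (nothing masked) on a conversational sample is a strong hint that the masking step did not run as intended.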
Multiple choice predictions contain single letter answers (A, B, C, D, E) that can be directly mapped to choice indices - doesn't handle multi-character responses, explanations, or malformed model outputs
If this fails: Models that generate explanatory text or uncertain responses get random choice assignments, artificially inflating or deflating benchmark scores and masking model reasoning capabilities
llava/eval/eval_science_qa.py:get_pred_idx
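Returning an explicit "no answer" is one alternative to assigning a choice when the output cannot be parsed. The sketch below is not get_pred_idx; parse_choice is a hypothetical helper that accepts a bare letter such as "B", "(c)", or "B.", plus the phrase "answer is X" as an assumed convention, and returns None for anything else.

```python
import re
from typing import Optional, Sequence

# A response consisting only of one letter, optionally parenthesised or
# followed by "." or ":".
BARE_LETTER = re.compile(r"^\s*\(?([A-Ea-e])\)?[.:]?\s*$")
# Assumed phrasing; the letter must be uppercase and stand alone as a word.
ANSWER_PHRASE = re.compile(r"\b[Aa]nswer is \(?([A-E])\b")


def parse_choice(prediction: str, options: Sequence[str] = "ABCDE") -> Optional[int]:
    """Return the index of the chosen option, or None if it cannot be parsed."""
    match = BARE_LETTER.match(prediction) or ANSWER_PHRASE.search(prediction)
    if match is None:
        return None
    letter = match.group(1).upper()
    return options.index(letter) if letter in options else None
```

Scoring None as incorrect, and reporting how often it occurs, keeps unparseable outputs from moving the benchmark number in either direction by chance.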
All conversation templates (v0, v1, llama2, mpt) use consistent role naming and separator conventions - the get_prompt method assumes roles[0] is always the user role without validation
If this fails: If custom conversation templates have different role orderings or naming, prompts will have incorrect speaker labels, confusing the model about who is asking questions versus providing answers
llava/conversation.py:Conversation
See the full structural analysis of LLaVA: the pipeline, data models, and system behavior that put these assumptions in context.
Full analysis of haotian-liu/llava →
Frequently Asked Questions
What does LLaVA assume that could break in production?
The one most likely to cause trouble: all images in the dataset have compatible dimensions after CLIP preprocessing. The process_images function expects PIL Images that can be resized and normalized, but it never validates image format or color channels and does not handle corrupted files. If this fails: when the dataset contains grayscale images, CMYK images, or corrupted image files, the CLIP transforms will fail with cryptic tensor shape errors during training, potentially hours into a training run.
How many hidden assumptions does LLaVA have?
CodeSea found 12 assumptions LLaVA relies on but never validates, 4 of them critical, spanning Shape, Domain, Ordering, Scale, Resource, Environment, Contract, and Temporal. Most are routine; the analysis flags the 3 most likely to actually bite.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.