Hidden Assumptions in FastChat
11 assumptions this code never checks · 3 critical · spanning Shape, Domain, Scale, Contract, Ordering, Environment, Resource
Every codebase relies on things it never checks. Most of them are routine. CodeSea looked at lm-sys/fastchat and picked out the few most likely to cause trouble. The full list is just below.
The 3 listed next are the critical ones. The remaining 8 are minor; they're under "Show everything".
If the original conversation ends with a human message (odd length), that final turn gets silently dropped without any warning or logging, potentially losing important context
If ShareGPT changes its HTML format or uses different copy indicators, the regex pattern `code_lang_pattern` will fail to match, leaving malformed code blocks in the training data that could teach the model incorrect markdown formatting
If the function is called before the tokenizer is set up or in a different process context, it will raise a NameError or AttributeError, causing the entire conversation splitting pipeline to crash
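The last of these three, the dependence on a tokenizer that is set up elsewhere, can be turned into an explicit check. This is a minimal sketch, not FastChat's code: `_tokenizer`, `set_tokenizer`, and `count_tokens` are hypothetical names, and the tokenizer is modelled as any callable that returns a list of tokens.

```python
from typing import Callable, List, Optional

# Hypothetical module-level global, standing in for a tokenizer that a
# worker process is expected to initialise before any splitting happens.
_tokenizer: Optional[Callable[[str], List]] = None


def set_tokenizer(tokenizer: Callable[[str], List]) -> None:
    """Install the tokenizer (for example from a worker initializer)."""
    global _tokenizer
    _tokenizer = tokenizer


def count_tokens(text: str) -> int:
    """Count tokens, failing with a clear message if setup was skipped."""
    if _tokenizer is None:
        raise RuntimeError(
            "tokenizer is not initialised; call set_tokenizer() in this process first"
        )
    return len(_tokenizer(text))
```

A descriptive `RuntimeError` raised at the call site is easier to diagnose than a `NameError` surfacing from inside a worker process.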
Show everything (8 more)
Each conversation turn adds exactly 6 extra tokens beyond the tokenized content (`length = len(tokenizer(c['value']).input_ids) + 6`), presumably for conversation formatting tokens
If this fails: If different models use different numbers of special tokens for conversation formatting, the splitting will be inaccurate: conversations might exceed max_length for models needing more tokens, or be unnecessarily short for models needing fewer
fastchat/data/split_long_conversation.py:split_one_sample
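One way to surface the fixed `+ 6` is to make the per-turn overhead a parameter. The sketch below is an illustration under assumptions: `turn_lengths` and `per_turn_overhead` are hypothetical names, and the tokenizer is simplified to a callable that returns a list of tokens rather than an object with `.input_ids`.

```python
from typing import Callable, Dict, List


def turn_lengths(
    conversations: List[Dict[str, str]],
    tokenize: Callable[[str], List],
    per_turn_overhead: int = 6,
) -> List[int]:
    """Token count per turn plus a configurable formatting overhead.

    The default of 6 mirrors the constant quoted above; a model whose
    conversation template uses more or fewer special tokens would pass
    its own value.
    """
    return [len(tokenize(c["value"])) + per_turn_overhead for c in conversations]
```

With a whitespace tokenizer such as `str.split`, a turn whose value is `"a b"` has a length of 8 under the default overhead and 4 when `per_turn_overhead=2`.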
Model adapters will correctly map each model architecture to its specific SeparatorStyle enum value, with no overlap or ambiguity between styles
If this fails: If two different models accidentally use the same separator style, or a model gets mapped to the wrong style, conversation formatting will be incorrect, leading to degraded model performance as the prompt format won't match what the model was trained on
fastchat/conversation.py:SeparatorStyle
Conversation splitting maintains perfect alternation between human and assistant messages, with the assertion `assert (end_idx - start_idx) % 2 == 0` enforcing even-length chunks
If this fails: If the conversation data has consecutive messages from the same speaker (like multiple human messages in a row), the splitting logic will create malformed training samples where roles don't alternate properly, confusing the model during training
fastchat/data/split_long_conversation.py:make_sample
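An even-length check does not by itself prove that speakers alternate. A small validator could confirm alternation before a sample is split. This is a sketch under assumptions: `roles_alternate` is a hypothetical helper, and the `"from"` key with `"human"` and `"gpt"` values reflects the ShareGPT-style layout assumed here.

```python
from typing import Dict, List


def roles_alternate(
    conversations: List[Dict[str, str]],
    first: str = "human",
    second: str = "gpt",
) -> bool:
    """Return True if turns strictly alternate first, second, first, ..."""
    expected = (first, second)
    return all(
        turn.get("from") == expected[i % 2]
        for i, turn in enumerate(conversations)
    )
```

Samples that fail the check could be skipped or logged rather than passed on to training.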
ShareGPT conversations contain HTML that can be safely processed by markdownify and bs4, with no malicious or deeply nested content that could cause parsing failures
If this fails: If ShareGPT data contains pathological HTML (extremely deep nesting, malformed tags, or adversarial content), the HTML parsing could hang, consume excessive memory, or crash, blocking the entire data cleaning pipeline
fastchat/data/clean_sharegpt.py
The `max_length` global variable represents the model's actual context window limit and accounts for all tokens including special tokens, position embeddings, and any model-specific overhead
If this fails: If max_length doesn't account for model-specific overhead or if different model variants have different effective context limits, split conversations might still exceed the model's true capacity during training, causing OOM errors or truncation
fastchat/data/split_long_conversation.py:split_one_sample
All code blocks in ShareGPT follow valid markdown syntax after cleaning, with language identifiers that are recognized by markdown parsers
If this fails: If the extracted language identifiers contain special characters, spaces, or invalid syntax, the resulting markdown will render incorrectly in documentation tools and could confuse models trained on this data about proper code formatting
fastchat/data/clean_sharegpt.py:reformat_code
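A conservative guard on extracted language identifiers might look like the following. It is a sketch, not the behavior of `reformat_code`: `safe_fence_lang` and the allowed character set are assumptions chosen for illustration.

```python
import re

# Assumed whitelist: letters, digits, and a few symbols seen in common
# language names (c++, c#, objective-c), capped at 32 characters.
_LANG_ID = re.compile(r"^[A-Za-z0-9_+#.-]{1,32}$")


def safe_fence_lang(lang: str) -> str:
    """Return the identifier if it looks valid, otherwise an empty string."""
    lang = lang.strip()
    return lang if _LANG_ID.match(lang) else ""
```

Falling back to an empty identifier yields a plain fenced block, which still renders correctly even when the language tag is lost.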
The ProcessPoolExecutor has sufficient memory to process multiple ShareGPT files concurrently, with each worker able to load and process entire JSON files in memory
If this fails: If ShareGPT files are very large (multi-GB) and multiple workers try to process them simultaneously, the system could run out of memory, causing the cleaning process to crash or swap heavily and become extremely slow
fastchat/data/clean_sharegpt.py
All unwanted 'Copy code' text in ShareGPT appears within code blocks delimited by triple backticks, never in regular conversation text where it might be legitimately part of the discussion
If this fails: If users were discussing code copying in their conversations and used the phrase 'Copy code' outside of actual code blocks, this legitimate content would be incorrectly removed, potentially creating nonsensical conversations in the training data
fastchat/data/clean_sharegpt.py:copy_code_pattern
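To keep the phrase intact in ordinary prose, removal can be restricted to fenced spans. The sketch below is not the actual `copy_code_pattern`: `strip_copy_code_in_fences` is a hypothetical helper, and it assumes the widget text sits between the language tag and the code, so the first occurrence inside each fence is replaced with a newline.

```python
import re

# Matches a complete fenced block, from one triple backtick to the next.
_FENCE = re.compile(r"```.*?```", re.DOTALL)


def strip_copy_code_in_fences(text: str) -> str:
    """Replace the first 'Copy code' inside each fenced block with a newline.

    Text outside fenced blocks is returned unchanged.
    """
    return _FENCE.sub(
        lambda m: m.group(0).replace("Copy code", "\n", 1),
        text,
    )
```

A sentence such as "Click Copy code to copy." outside any fence passes through untouched, while a fence that opens with `pythonCopy code` has the language tag separated from the code by a newline.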
See the full structural analysis of FastChat: the pipeline, data models, and system behavior that put these assumptions in context.
Full analysis of lm-sys/fastchat →
Frequently Asked Questions
What does FastChat assume that could break in production?
The one most likely to cause trouble: the code assumes the conversations list has an even number of elements (human-assistant pairs) and enforces this by truncating odd-length conversations with `conversations = conversations[: len(conversations) // 2 * 2]`. If this fails: when the original conversation ends with a human message (odd length), that final turn is silently dropped without any warning or logging, potentially losing important context.
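The truncation described in this answer can be made visible with a log message. This is a minimal sketch: `truncate_to_pairs` is a hypothetical wrapper around the same slicing expression, and the warning text is illustrative.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def truncate_to_pairs(conversations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only complete pairs, warning when a trailing turn is dropped."""
    even_len = len(conversations) // 2 * 2
    if even_len != len(conversations):
        logger.warning(
            "Dropping trailing unpaired turn from %r",
            conversations[-1].get("from"),
        )
    return conversations[:even_len]
```

The returned list is identical to what the original slice produces; the only difference is that the dropped turn leaves a trace in the logs.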
How many hidden assumptions does FastChat have?
CodeSea found 11 assumptions FastChat relies on but never validates, 3 of them critical, spanning Shape, Domain, Scale, Contract, Ordering, Environment, and Resource. Most are routine; the analysis flags the three most likely to actually bite.
What is a hidden assumption?
Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.