Hidden Assumptions in FastChat

11 assumptions this code never checks · 3 critical · spanning Shape, Domain, Scale, Contract, Ordering, Environment, Resource

Every codebase relies on things it never checks, and most of them are routine. CodeSea looked at lm-sys/fastchat and picked out the 3 most likely to cause trouble here. The rest are minor; they're under "Show everything".

Worth your attention first

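The conversations list is assumed to hold human-assistant pairs, and the code enforces an even length by truncating with `conversations = conversations[: len(conversations) // 2 * 2]`. If the original conversation ends with a human message (odd length), that final turn is silently dropped, with no warning or logging, and its context may be lost.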
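
A minimal sketch of how the `len(conversations) // 2 * 2` truncation could be made visible, assuming each conversation is a list of turn dicts. The helper name `truncate_to_pairs` and the logger name are hypothetical and not part of FastChat.

```python
import logging

logger = logging.getLogger("split_sketch")  # hypothetical logger name


def truncate_to_pairs(conversations):
    """Keep only complete (human, assistant) pairs and report what was dropped."""
    even_len = len(conversations) // 2 * 2
    dropped = conversations[even_len:]
    if dropped:
        # Surface the loss instead of discarding the trailing turn silently.
        logger.warning("dropping %d trailing turn(s)", len(dropped))
    return conversations[:even_len], dropped
```

Returning the dropped turn alongside the kept ones lets the caller decide whether to log it, count it, or keep it for a later pass.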

If ShareGPT changes its HTML format or uses a different copy indicator, the regex `code_lang_pattern` will stop matching. Malformed code blocks then stay in the training data, where they could teach the model incorrect markdown formatting.
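
One way to notice this kind of drift is to count how many fenced blocks the pattern actually rewrites. The sketch below uses a hypothetical pattern written for illustration; it is not copied from FastChat, and the assumed input layout is fence, optional language, the "Copy code" widget text, then the code body.

```python
import re

FENCE = "`" * 3  # three backticks

# Hypothetical pattern (an assumption, not FastChat's own regex).
CODE_LANG_PATTERN = re.compile(
    re.escape(FENCE) + r"\s*(.*?)(?:Copy code)+(.+?)\s*?" + re.escape(FENCE),
    re.DOTALL,
)


def reformat_with_stats(text):
    """Return (new_text, blocks_rewritten, blocks_seen)."""
    new_text, rewritten = CODE_LANG_PATTERN.subn(
        lambda m: f"{FENCE}{m.group(1)}\n{m.group(2)}\n{FENCE}", text
    )
    # A fenced block needs an opening and a closing fence.
    return new_text, rewritten, text.count(FENCE) // 2
```

Blocks without the widget text are legitimately left alone, so a single zero means little. A rewrite ratio that suddenly drops across a whole dump is the signal that the upstream format may have changed.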

If the function is called before the tokenizer is set up, or in a different process context, it raises a NameError or AttributeError and the entire conversation-splitting pipeline crashes.
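
A hedged sketch of one alternative: pass the tokenizer in explicitly and fail with a clear message when it is missing. It assumes the tokenizer is a callable whose result has an `input_ids` attribute; `count_tokens` and `toy_tokenizer` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def count_tokens(text, tokenizer=None):
    """Token count for one turn; fails loudly if no tokenizer was supplied."""
    if tokenizer is None:
        raise RuntimeError("tokenizer is not set up; load it before splitting")
    return len(tokenizer(text).input_ids)


def toy_tokenizer(text):
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return SimpleNamespace(input_ids=text.split())
```

An explicit argument also travels cleanly into worker processes, where a module-level global set in the parent may not exist.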

Show everything (8 more)

Scale

Each conversation turn adds exactly 6 extra tokens beyond the tokenized content (`length = len(tokenizer(c['value']).input_ids) + 6`), presumably for conversation formatting tokens

If this fails: Models that use a different number of special tokens for conversation formatting get inaccurate splits. Chunks may exceed max_length for models that need more tokens, or be unnecessarily short for models that need fewer.

fastchat/data/split_long_conversation.py:split_one_sample
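
The fixed `+ 6` could instead be a parameter. The sketch below assumes turns are dicts with a `value` key, as in the expression quoted above; `turn_lengths`, `per_turn_overhead`, and the whitespace tokenizer are hypothetical names for illustration.

```python
from types import SimpleNamespace


def turn_lengths(conversations, tokenizer, per_turn_overhead=6):
    """Token length of each turn plus a configurable formatting overhead."""
    return [
        len(tokenizer(c["value"]).input_ids) + per_turn_overhead
        for c in conversations
    ]


def whitespace_tokenizer(text):
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word.
    return SimpleNamespace(input_ids=text.split())
```

The default keeps the current behaviour, while a model with a different chat template can supply its own overhead.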

Contract

Model adapters will correctly map each model architecture to its specific SeparatorStyle enum value, with no overlap or ambiguity between styles

If this fails: When a model is mapped to the wrong separator style, or two models that need different formats end up sharing one, conversations are formatted incorrectly. Model performance then degrades because the prompt no longer matches the format the model was trained on.

fastchat/conversation.py:SeparatorStyle

Ordering

Conversation splitting maintains perfect alternation between human and assistant messages, with the assertion `assert (end_idx - start_idx) % 2 == 0` enforcing even-length chunks

If this fails: Conversation data with consecutive messages from the same speaker (such as several human messages in a row) can still pass the even-length check, yet it produces malformed training samples whose roles don't alternate, confusing the model during training.

fastchat/data/split_long_conversation.py:make_sample
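
An even length says nothing about who is speaking, so a separate role check may be worth having. This sketch assumes each turn is a dict with a `from` key and that the roles are named `human` and `gpt`; both the key and the role names are assumptions about the data, and `roles_alternate` is a hypothetical helper.

```python
def roles_alternate(conversations, first="human", second="gpt"):
    """True when turns strictly alternate first, second, first, second, ..."""
    expected = (first, second)
    return all(
        turn["from"] == expected[i % 2] for i, turn in enumerate(conversations)
    )
```

Samples that fail the check could be skipped or repaired before they reach the splitting step.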

Domain

ShareGPT conversations contain HTML that can be safely processed by markdownify and bs4, with no malicious or deeply nested content that could cause parsing failures

If this fails: Pathological HTML (extremely deep nesting, malformed tags, or adversarial content) could make parsing hang, consume excessive memory, or crash, blocking the entire data-cleaning pipeline.

fastchat/data/clean_sharegpt.py
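
A cheap pre-check can reject oversized or very deeply nested HTML before it reaches the heavier converters. The sketch below uses only the standard library's `html.parser`; the thresholds and the names `DepthProbe` and `looks_safe` are assumptions chosen for illustration, not values taken from FastChat.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not add depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthProbe(HTMLParser):
    """Tracks the deepest open-tag nesting seen while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def looks_safe(html, max_bytes=1_000_000, max_depth=200):
    """Reject input that is too large or nested too deeply (assumed limits)."""
    if len(html.encode("utf-8")) > max_bytes:
        return False
    probe = DepthProbe()
    probe.feed(html)
    probe.close()
    # Unclosed tags inflate the count, which errs on the cautious side.
    return probe.max_depth <= max_depth
```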

Scale

The `max_length` global variable matches the model's actual context window and leaves room for everything that consumes it, including special tokens and any model-specific overhead

If this fails: When max_length ignores model-specific overhead, or model variants have different effective context limits, split conversations can still exceed the model's true capacity during training, causing OOM errors or truncation.

fastchat/data/split_long_conversation.py:split_one_sample
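
One way to hedge against an optimistic `max_length` is to reserve a margin before splitting. The following is a simplified greedy splitter written for illustration, not FastChat's `split_one_sample`; `split_pairs` and `reserved` are hypothetical names, and `lengths` is assumed to hold per-turn token counts in human, assistant order.

```python
def split_pairs(lengths, max_length, reserved=0):
    """Greedily group (human, assistant) pairs into chunks within a budget.

    Returns (start, end) index ranges over `lengths`, with `end` exclusive.
    """
    budget = max_length - reserved
    if budget <= 0:
        raise ValueError("reserved margin leaves no room for content")
    chunks, start, current = [], 0, 0
    for i in range(0, len(lengths) - 1, 2):
        pair = lengths[i] + lengths[i + 1]
        # Close the running chunk when the next pair would overflow it.
        if current and current + pair > budget:
            chunks.append((start, i))
            start, current = i, 0
        current += pair
    end = len(lengths) // 2 * 2  # ignore an unpaired trailing turn
    if start < end:
        chunks.append((start, end))
    return chunks
```

A single pair that is larger than the budget still comes out as its own chunk here, so a real pipeline would need a separate rule for that case.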

Contract

All code blocks in ShareGPT follow valid markdown syntax after cleaning, with language identifiers that are recognized by markdown parsers

If this fails: Language identifiers that contain special characters, spaces, or invalid syntax make the resulting markdown render incorrectly in documentation tools, and could confuse models trained on this data about proper code formatting.

fastchat/data/clean_sharegpt.py:reformat_code
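
A small validity check on the extracted identifier is one possible guard. The allowed character set below is an assumption chosen to cover names such as `python`, `c++`, `c#`, and `objective-c`; it is not taken from any markdown specification, and `is_plain_lang_id` is a hypothetical helper.

```python
import re

# Assumed whitelist: letters, digits, and a few punctuation marks, no spaces.
LANG_ID = re.compile(r"[A-Za-z0-9_+#.\-]*")


def is_plain_lang_id(lang):
    """True for an empty tag or one made only of the allowed characters."""
    return LANG_ID.fullmatch(lang) is not None
```

Identifiers that fail the check could be replaced with an empty tag so the block still renders as plain code.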

Resource

The ProcessPoolExecutor has sufficient memory to process multiple ShareGPT files concurrently, with each worker able to load and process entire JSON files in memory

If this fails: When ShareGPT files are very large (multi-GB) and several workers process them at once, the system can run out of memory, so the cleaning process crashes or swaps heavily and becomes extremely slow.

fastchat/data/clean_sharegpt.py
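
A rough way to bound concurrency is to size the pool from the largest input file. The sketch below is an assumption-laden estimate: `expansion_factor` (how much larger parsed JSON is in memory than on disk) is a guess to be measured, and `pick_worker_count` is a hypothetical helper whose result would be passed as `max_workers` to `ProcessPoolExecutor`.

```python
import os


def pick_worker_count(file_sizes, memory_budget_bytes,
                      expansion_factor=5, cpu_count=None):
    """Estimate how many workers fit in memory in the worst case."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if not file_sizes:
        return 1
    # Worst case: every worker holds the largest file, fully parsed.
    per_worker = max(file_sizes) * expansion_factor
    if per_worker <= 0:
        return cpu_count
    return max(1, min(cpu_count, memory_budget_bytes // per_worker))
```

The function always returns at least one worker, so a file that exceeds the budget still runs serially rather than not at all.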

Domain

All unwanted 'Copy code' text in ShareGPT appears within code blocks delimited by triple backticks, never in regular conversation text where it might be legitimately part of the discussion

If this fails: When users discuss copying code and write 'Copy code' outside an actual code block, that legitimate text is removed too, potentially leaving nonsensical conversations in the training data.

fastchat/data/clean_sharegpt.py:copy_code_pattern
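
One possible way to limit the removal to code blocks is to split the text on fences and touch only the segments inside them. This is a sketch of that idea rather than FastChat's `copy_code_pattern`; `strip_copy_code_in_fences` is a hypothetical helper, and replacing the marker with a newline assumes the layout language, marker, code.

```python
FENCE = "`" * 3  # three backticks


def strip_copy_code_in_fences(text, marker="Copy code"):
    """Remove the widget text only inside fenced code blocks."""
    parts = text.split(FENCE)
    # Odd-indexed segments sit between an opening and a closing fence.
    for i in range(1, len(parts), 2):
        # A newline keeps the language tag on its own line (assumed layout).
        parts[i] = parts[i].replace(marker, "\n")
    return FENCE.join(parts)
```

An unclosed fence makes everything after it count as code, so input with an odd number of fences may deserve a separate check.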

See the full structural analysis of FastChat: the pipeline, data models, and system behavior that put these assumptions in context.

Frequently Asked Questions

What does FastChat assume that could break in production?

The one most likely to cause trouble: the conversations list is assumed to have an even number of elements (human-assistant pairs), and the code enforces this by truncating odd-length conversations with `conversations = conversations[: len(conversations) // 2 * 2]`. If the original conversation ends with a human message (odd length), that final turn is silently dropped without any warning or logging, potentially losing important context.

How many hidden assumptions does FastChat have?

CodeSea found 11 assumptions FastChat relies on but never validates, 3 of them critical, spanning Shape, Domain, Scale, Contract, Ordering, Environment, Resource. Most are routine; the analysis flags the two or three most likely to actually bite.

What is a hidden assumption?

Something the code depends on but never checks: a data shape, an ordering, an environment condition, a scale limit, or a contract with another service. It holds until the world it runs in changes, then fails silently.