lm-sys/fastchat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

39,452 stars · Python · 9 components

Trains, serves, and evaluates LLM-based chatbots through multi-model serving systems

Under the hood, the system relies on 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 9-component ML training system. 148 files analyzed. Data flows through 6 distinct pipeline stages.

How Data Flows Through the System

Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard.

  1. Clean ShareGPT conversations — ShareGPTCleaner processes raw JSON files by converting HTML to markdown using markdownify, removing 'Copy code' artifacts with regex patterns, fixing malformed code blocks, and filtering out conversations with wrong formatting [ShareGPTConversation → ShareGPTConversation]
  2. Split long conversations — ConversationSplitter tokenizes each conversation turn using the target model's tokenizer, accumulates token counts until max_length is reached, then creates new training samples with even number of turns (human-assistant pairs) [ShareGPTConversation → ShareGPTConversation]
  3. Register model worker — BaseModelWorker loads model using transformers library, creates conversation template for the model type, registers with controller by sending worker info including supported models and capabilities
  4. Route chat request — Controller receives ChatCompletionRequest from API server, validates model exists in registry, selects worker with lowest queue_length from available workers, forwards request to selected worker [ChatCompletionRequest → ChatCompletionRequest]
  5. Generate model response — BaseModelWorker receives request, loads conversation template for the model, formats messages into prompt string using separator style, runs model inference with generation parameters, streams or returns complete response [ChatCompletionRequest → ChatCompletionResponse]
  6. Judge model outputs — LLMJudge takes model responses and judge configuration, formats evaluation prompt with response text, sends to judge model (GPT-4, Claude, etc.), extracts numeric scores from judge response using regex patterns [Judge → judgment scores]
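Step 2 above can be sketched as follows. This is a simplified reconstruction, not the repo's exact code: `tokenize` is a whitespace stand-in for the model's real tokenizer, and the `+ 6` per-turn overhead mirrors the heuristic discussed under Hidden Assumptions.

```python
def tokenize(text):
    # Stand-in for the model tokenizer: whitespace tokens, not subword ids.
    return text.split()

def split_one_sample(sample, max_length):
    """Greedily pack whole human-assistant pairs into chunks whose
    estimated token count stays under max_length."""
    convs = sample["conversations"]
    convs = convs[: len(convs) // 2 * 2]  # drop a trailing unpaired turn
    chunks, cur, cur_len = [], [], 0
    for i in range(0, len(convs), 2):  # one human-assistant pair at a time
        pair = convs[i:i + 2]
        # +6 approximates per-turn formatting tokens (see Hidden Assumptions)
        pair_len = sum(len(tokenize(t["value"])) + 6 for t in pair)
        if cur and cur_len + pair_len > max_length:
            chunks.append({"id": f"{sample['id']}_{len(chunks)}",
                           "conversations": cur})
            cur, cur_len = [], 0
        cur.extend(pair)
        cur_len += pair_len
    if cur:
        chunks.append({"id": f"{sample['id']}_{len(chunks)}",
                       "conversations": cur})
    return chunks
```

Note that a single pair longer than max_length is still emitted whole, since chunks always contain complete pairs.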

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ChatCompletionRequest fastchat/protocol/api_protocol.py
Pydantic model with model: str, messages: List[Dict[str, str]], temperature: float=0.7, top_p: float=1.0, max_tokens: Optional[int]=None, stream: bool=False, stop: Optional[List[str]]=None
Created from HTTP request body, validated against schema, routed to appropriate model worker, and consumed to generate response
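The shape of that contract can be approximated with a stdlib dataclass; the real class is a Pydantic model, which additionally validates field types when constructed from the HTTP body:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ChatCompletionRequest:
    # Stdlib approximation of the Pydantic schema in api_protocol.py.
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 0.7
    top_p: float = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False
    stop: Optional[List[str]] = None

# A minimal request as the API server might build it from a request body.
req = ChatCompletionRequest(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello"}],
)
```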
Conversation fastchat/conversation.py
dataclass with name: str, system_template: str, roles: List[str], messages: List[List[str]], offset: int, sep_style: SeparatorStyle, sep: str, sep2: str=None
Template loaded by model name, messages appended during chat, formatted into prompt string using separator style, and passed to model for generation
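A stripped-down sketch of how a template turns messages into a prompt string. Only one separator style is shown; the real class supports many styles plus sep2 and offsets:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Tuple

class SeparatorStyle(Enum):
    ADD_COLON_SINGLE = auto()  # one of many styles in fastchat/conversation.py

@dataclass
class Conversation:
    system: str
    roles: Tuple[str, str]
    messages: List[List[str]] = field(default_factory=list)
    sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_SINGLE
    sep: str = "\n### "

    def append_message(self, role, message):
        self.messages.append([role, message])

    def get_prompt(self):
        # ADD_COLON_SINGLE style: system text, then "<sep><role>: <message>"
        # per turn; a turn with message=None leaves an open slot that cues
        # the model to generate the assistant reply.
        ret = self.system
        for role, message in self.messages:
            ret += self.sep + role + ": " + (message or "")
        return ret

conv = Conversation(system="A chat between a user and an assistant.",
                    roles=("Human", "Assistant"))
conv.append_message(conv.roles[0], "Hi")
conv.append_message(conv.roles[1], None)
```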
ShareGPTConversation fastchat/data/clean_sharegpt.py
dict with id: str, conversations: List[dict] where each dict has from: str ('human'|'gpt'), value: str
Loaded from JSON files, cleaned of HTML/markdown artifacts, filtered by language/format, split into training chunks, and converted to model training format
ModelCard fastchat/protocol/api_protocol.py
Pydantic model with id: str, object: str='model', created: int, owned_by: str='fastchat', permission: List[ModelPermission]
Created when model worker registers with controller, stored in controller's model registry, and returned in /v1/models API endpoint
WorkerInfo fastchat/serve/controller.py
dict with model_names: List[str], speed: int, queue_length: int, check_heart_beat: bool, last_heart_beat: float
Created when worker registers, updated on heartbeats with queue length and speed metrics, used by controller to select optimal worker for requests
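Shortest-queue dispatch over this registry reduces to a `min` over `queue_length`. A sketch with hypothetical worker addresses; the real controller also offers a lottery dispatch mode:

```python
def select_worker(workers, model_name):
    """Pick the registered worker serving model_name with the fewest
    pending requests; return None if no worker serves the model."""
    candidates = {addr: info for addr, info in workers.items()
                  if model_name in info["model_names"]}
    if not candidates:
        return None
    return min(candidates, key=lambda addr: candidates[addr]["queue_length"])

# Hypothetical registry shaped like the controller's WorkerInfo dicts.
workers = {
    "http://worker-a:21002": {"model_names": ["vicuna-7b"], "speed": 1,
                              "queue_length": 3, "check_heart_beat": True,
                              "last_heart_beat": 0.0},
    "http://worker-b:21002": {"model_names": ["vicuna-7b"], "speed": 1,
                              "queue_length": 1, "check_heart_beat": True,
                              "last_heart_beat": 0.0},
}
```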
Judge fastchat/llm_judge/common.py
dataclass with model_name: str, prompt_template: dict, ref_based: bool=False, multi_turn: bool=False
Loaded from judge configs, instantiated with specific prompt templates for evaluation tasks, used to generate judgments on model responses
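Score extraction from a judgment is a regex over a bracketed rating. The pattern and sentinel below are illustrative assumptions, not guaranteed to match the repo's exact code:

```python
import re

# Assumed convention: the judge is prompted to end with "Rating: [[n]]".
score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")

def extract_score(judgment):
    m = score_pattern.search(judgment)
    return float(m.group(1)) if m else -1.0  # -1 flags an unparsable judgment

judgment = "The answer is accurate and well structured. Rating: [[8]]"
```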

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Shape weakly guarded

The conversations list is expected to have an even number of elements (human-assistant pairs); the code enforces this by truncating odd-length conversations with `conversations = conversations[: len(conversations) // 2 * 2]`

If this fails: If the original conversation ends with a human message (odd length), that final turn gets silently dropped without any warning or logging, potentially losing important context

fastchat/data/split_long_conversation.py:split_one_sample
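The truncation is a one-liner, which is exactly why the drop is easy to miss; a minimal repro:

```python
conversations = [
    {"from": "human", "value": "q1"},
    {"from": "gpt", "value": "a1"},
    {"from": "human", "value": "q2"},  # trailing human turn with no reply
]
conversations = conversations[: len(conversations) // 2 * 2]
# The final human message is gone; nothing is logged.
```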
critical Domain unguarded

Code blocks in ShareGPT HTML follow the exact pattern `$<language>Copy code$<exact_code_here>` with the literal text 'Copy code' appearing between language and code content

If this fails: If ShareGPT changes their HTML format or uses different copy indicators, the regex pattern `code_lang_pattern` will fail to match, leaving malformed code blocks in the training data that could teach the model incorrect markdown formatting

fastchat/data/clean_sharegpt.py:reformat_code
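A hedged sketch of the kind of rewrite involved; this pattern is illustrative and not the repo's actual `code_lang_pattern`:

```python
import re

FENCE = "`" * 3  # markdown code fence

# Illustrative: "<language>Copy code<code>" collapsed from ShareGPT HTML.
copy_code_pattern = re.compile(r"(\w+)Copy code(.*)$", re.DOTALL)

def reformat_code(text):
    m = copy_code_pattern.search(text)
    if m is None:
        return text  # pattern miss: a malformed block passes through silently
    lang, body = m.group(1), m.group(2)
    return text[: m.start()] + FENCE + lang + "\n" + body.strip() + "\n" + FENCE
```

The silent pass-through on a pattern miss is what makes this assumption fragile: a format change upstream produces no error, only dirtier training data.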
warning Scale unguarded

Each conversation turn adds exactly 6 extra tokens beyond the tokenized content (`length = len(tokenizer(c['value']).input_ids) + 6`), presumably for conversation formatting tokens

If this fails: If different models use different numbers of special tokens for conversation formatting, the splitting will be inaccurate - conversations might exceed max_length for models needing more tokens or be unnecessarily short for models needing fewer

fastchat/data/split_long_conversation.py:split_one_sample
warning Contract unguarded

Model adapters will correctly map each model architecture to its specific SeparatorStyle enum value, with no overlap or ambiguity between styles

If this fails: If two different models accidentally use the same separator style, or a model gets mapped to the wrong style, conversation formatting will be incorrect, leading to degraded model performance as the prompt format won't match what the model was trained on

fastchat/conversation.py:SeparatorStyle
warning Ordering guarded

Conversation splitting maintains perfect alternation between human and assistant messages, with the assertion `assert (end_idx - start_idx) % 2 == 0` enforcing even-length chunks

If this fails: If the conversation data has consecutive messages from the same speaker (like multiple human messages in a row), the splitting logic will create malformed training samples where roles don't alternate properly, confusing the model during training

fastchat/data/split_long_conversation.py:make_sample
critical Environment unguarded

The global `tokenizer` variable is properly initialized and available when `split_one_sample` function is called, despite being used inside the function without parameter passing

If this fails: If the function is called before the tokenizer is set up or in a different process context, it will raise a NameError or AttributeError, causing the entire conversation splitting pipeline to crash

fastchat/data/split_long_conversation.py:split_one_sample
warning Domain unguarded

ShareGPT conversations contain HTML that can be safely processed by markdownify and bs4, with no malicious or deeply nested content that could cause parsing failures

If this fails: If ShareGPT data contains pathological HTML (extremely deep nesting, malformed tags, or adversarial content), the HTML parsing could hang, consume excessive memory, or crash, blocking the entire data cleaning pipeline

fastchat/data/clean_sharegpt.py
warning Scale unguarded

The `max_length` global variable represents the model's actual context window limit and accounts for all tokens including special tokens, position embeddings, and any model-specific overhead

If this fails: If max_length doesn't account for model-specific overhead or if different model variants have different effective context limits, split conversations might still exceed the model's true capacity during training, causing OOM errors or truncation

fastchat/data/split_long_conversation.py:split_one_sample
info Contract unguarded

All code blocks in ShareGPT follow valid markdown syntax after cleaning, with language identifiers that are recognized by markdown parsers

If this fails: If the extracted language identifiers contain special characters, spaces, or invalid syntax, the resulting markdown will render incorrectly in documentation tools and could confuse models trained on this data about proper code formatting

fastchat/data/clean_sharegpt.py:reformat_code
warning Resource unguarded

The ProcessPoolExecutor has sufficient memory to process multiple ShareGPT files concurrently, with each worker able to load and process entire JSON files in memory

If this fails: If ShareGPT files are very large (multi-GB) and multiple workers try to process them simultaneously, the system could run out of memory, causing the cleaning process to crash or swap heavily and become extremely slow

fastchat/data/clean_sharegpt.py

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Controller Worker Registry (in-memory)
Maps model names to available workers with health status, queue lengths, and capabilities
Conversation Templates (registry)
Static registry mapping model names to conversation formatting templates with separator styles and system prompts
Model Checkpoints (file-store)
HuggingFace model weights and tokenizers loaded by workers, cached in memory after first load
Training Datasets (file-store)
Processed conversation JSON files ready for fine-tuning, created by data cleaning pipeline

Technology Stack

PyTorch (framework)
Core ML framework for loading and running transformer models during training and inference
Transformers (library)
HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs
FastAPI (framework)
Web framework powering the OpenAI-compatible API server with async request handling
Gradio (library)
Creates the web UI for chatting with models and running Arena battles
Uvicorn (runtime)
ASGI server running FastAPI applications for API endpoints
Accelerate (library)
Distributed training and inference library for multi-GPU model loading
Pydantic (serialization)
Data validation for API request/response schemas and configuration models
TikToken (library)
Tokenization library for counting tokens and splitting conversations

Frequently Asked Questions

What is FastChat used for?

lm-sys/fastchat trains, serves, and evaluates LLM-based chatbots through multi-model serving systems. It is a 9-component ML training system written in Python. Data flows through 6 distinct pipeline stages, and the codebase contains 148 files.

How is FastChat architected?

FastChat is organized into 6 architecture layers: Data Processing, Training, Model Workers, Serving Infrastructure, and 2 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through FastChat?

Data moves through 6 stages: Clean ShareGPT conversations → Split long conversations → Register model worker → Route chat request → Generate model response → .... Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard. This pipeline design reflects a complex multi-stage processing system.

What technologies does FastChat use?

The core stack includes PyTorch (Core ML framework for loading and running transformer models during training and inference), Transformers (HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs), FastAPI (Web framework powering the OpenAI-compatible API server with async request handling), Gradio (Creates the web UI for chatting with models and running Arena battles), Uvicorn (ASGI server running FastAPI applications for API endpoints), Accelerate (Distributed training and inference library for multi-GPU model loading), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does FastChat have?

FastChat exhibits 4 data pools (Controller Worker Registry, Conversation Templates), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does FastChat use?

5 design patterns detected: Model Adapter Pattern, Worker Pool, Pipeline Pattern, Plugin Architecture, Protocol Translation.

Analyzed on April 20, 2026 by CodeSea.