lm-sys/fastchat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Trains, serves, and evaluates LLM-based chatbots through a multi-model serving system
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML training system. 148 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard.
- Clean ShareGPT conversations — ShareGPTCleaner processes raw JSON files by converting HTML to markdown using markdownify, removing 'Copy code' artifacts with regex patterns, fixing malformed code blocks, and filtering out conversations with wrong formatting [ShareGPTConversation → ShareGPTConversation]
- Split long conversations — ConversationSplitter tokenizes each conversation turn using the target model's tokenizer, accumulates token counts until max_length is reached, then creates new training samples with an even number of turns (human-assistant pairs); a sketch of this step follows the list [ShareGPTConversation → ShareGPTConversation]
- Register model worker — BaseModelWorker loads model using transformers library, creates conversation template for the model type, registers with controller by sending worker info including supported models and capabilities
- Route chat request — Controller receives ChatCompletionRequest from API server, validates model exists in registry, selects worker with lowest queue_length from available workers, forwards request to selected worker [ChatCompletionRequest → ChatCompletionRequest]
- Generate model response — BaseModelWorker receives request, loads conversation template for the model, formats messages into prompt string using separator style, runs model inference with generation parameters, streams or returns complete response [ChatCompletionRequest → ChatCompletionResponse]
- Judge model outputs — LLMJudge takes model responses and judge configuration, formats evaluation prompt with response text, sends to judge model (GPT-4, Claude, etc.), extracts numeric scores from judge response using regex patterns [Judge → judgment scores]
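The "Split long conversations" step is the most delicate of the training-data stages, because it must respect both the tokenizer and the human/assistant turn structure. The sketch below is a simplified reconstruction of that logic, not the exact upstream code: the `split_one_sample`/`make_sample` names, the module-level `tokenizer` and `max_length`, the +6 per-turn token allowance, and the even-length truncation all come from the pipeline and Hidden Assumptions descriptions in this document; the loop body itself is illustrative.

```python
from transformers import AutoTokenizer

# Module-level globals, mirroring the pattern described in the Hidden Assumptions below.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative target model
max_length = 2048

def make_sample(sample, start_idx, end_idx):
    # Even-length chunks only: every training sample must end on an assistant turn.
    assert (end_idx - start_idx) % 2 == 0
    return {
        "id": f"{sample['id']}_{start_idx}",
        "conversations": sample["conversations"][start_idx:end_idx],
    }

def split_one_sample(sample):
    conversations = sample["conversations"]
    # Drop a trailing unpaired human turn so turns stay in human/assistant pairs.
    conversations = conversations[: len(conversations) // 2 * 2]

    tokenized_lens = []
    for c in conversations:
        # +6 is an assumed allowance for role/separator tokens added at training time.
        tokenized_lens.append(len(tokenizer(c["value"]).input_ids) + 6)

    new_samples = []
    start_idx, cur_len = 0, 0
    for i in range(0, len(conversations), 2):  # walk in human/assistant pairs
        pair_len = tokenized_lens[i] + tokenized_lens[i + 1]
        if cur_len + pair_len > max_length and i > start_idx:
            new_samples.append(make_sample(sample, start_idx, i))
            start_idx, cur_len = i, 0
        cur_len += pair_len
    if start_idx < len(conversations):
        new_samples.append(make_sample(sample, start_idx, len(conversations)))
    return new_samples
```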
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ChatCompletionRequest (fastchat/protocol/api_protocol.py)
Pydantic model with model: str, messages: List[Dict[str, str]], temperature: float=0.7, top_p: float=1.0, max_tokens: int=None, stream: bool=False, stop: List[str]=None
Created from HTTP request body, validated against schema, routed to appropriate model worker, and consumed to generate response
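A minimal reconstruction of that request schema from the fields listed above; the real class in fastchat/protocol/api_protocol.py carries more fields and validators, so treat this as the shape described here rather than the upstream definition:

```python
from typing import Dict, List, Optional
from pydantic import BaseModel

class ChatCompletionRequest(BaseModel):
    # Shape reconstructed from the field list above; not the full upstream class.
    model: str
    messages: List[Dict[str, str]]      # e.g. [{"role": "user", "content": "..."}]
    temperature: float = 0.7
    top_p: float = 1.0
    max_tokens: Optional[int] = None    # None = no explicit cap
    stream: bool = False
    stop: Optional[List[str]] = None

# Parsed straight from the HTTP request body by the API server, then routed to a worker.
req = ChatCompletionRequest(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
```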
Conversation template (fastchat/conversation.py)
dataclass with name: str, system_template: str, roles: List[str], messages: List[List[str]], offset: int, sep_style: SeparatorStyle, sep: str, sep2: str=None
Template loaded by model name, messages appended during chat, formatted into prompt string using separator style, and passed to model for generation
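To make "formatted into prompt string using separator style" concrete, here is a stripped-down stand-in showing the mechanism. It is an illustration only, not the actual Conversation dataclass, which supports many SeparatorStyle variants; the class and field names below are invented for the example:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MiniConversation:
    # Simplified stand-in for the template described above (one separator style only).
    system: str
    roles: List[str]                    # e.g. ["USER", "ASSISTANT"]
    sep: str                            # separator appended after each completed turn
    messages: List[List[Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        # Colon-style formatting: "ROLE: text<sep>"; a trailing empty assistant
        # turn leaves "ASSISTANT:" open for the model to complete.
        out = self.system + self.sep
        for role, message in self.messages:
            out += f"{role}: {message}{self.sep}" if message else f"{role}:"
        return out

conv = MiniConversation(system="A chat between a user and an assistant.",
                        roles=["USER", "ASSISTANT"], sep="\n")
conv.append_message("USER", "What is FastChat?")
conv.append_message("ASSISTANT", None)   # leave open for generation
print(conv.get_prompt())
```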
ShareGPTConversation (fastchat/data/clean_sharegpt.py)
dict with id: str, conversations: List[dict] where each dict has from: str ('human'|'gpt'), value: str
Loaded from JSON files, cleaned of HTML/markdown artifacts, filtered by language/format, split into training chunks, and converted to model training format
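For reference, a record in that format looks roughly like this (the content is invented for illustration):

```python
sharegpt_record = {
    "id": "abc123",
    "conversations": [
        {"from": "human", "value": "How do I reverse a list in Python?"},
        {"from": "gpt", "value": "Use `my_list[::-1]` or `list(reversed(my_list))`."},
    ],
}
```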
ModelCard (fastchat/protocol/api_protocol.py)
Pydantic model with id: str, object: str='model', created: int, owned_by: str='fastchat', permission: List[ModelPermission]
Created when model worker registers with controller, stored in controller's model registry, and returned in /v1/models API endpoint
Worker registry entry (fastchat/serve/controller.py)
dict with model_names: List[str], speed: int, queue_length: int, check_heart_beat: bool, last_heart_beat: float
Created when worker registers, updated on heartbeats with queue length and speed metrics, used by controller to select optimal worker for requests
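A sketch of the selection rule implied by those fields, assuming the controller keeps the registry as a dict keyed by worker address; the real controller in fastchat/serve/controller.py also supports other dispatch methods and heartbeat expiry, so this shows only the shortest-queue rule described here:

```python
from typing import Dict, Optional

def select_worker(workers: Dict[str, dict], model_name: str) -> Optional[str]:
    """Pick the address of the registered worker serving `model_name`
    with the shortest queue. Returns None if no worker serves the model."""
    candidates = {
        addr: info for addr, info in workers.items()
        if model_name in info["model_names"]
    }
    if not candidates:
        return None
    return min(candidates, key=lambda addr: candidates[addr]["queue_length"])

workers = {
    "http://10.0.0.1:31000": {"model_names": ["vicuna-7b-v1.5"], "queue_length": 3},
    "http://10.0.0.2:31000": {"model_names": ["vicuna-7b-v1.5"], "queue_length": 0},
}
assert select_worker(workers, "vicuna-7b-v1.5") == "http://10.0.0.2:31000"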
Judge (fastchat/llm_judge/common.py)
dataclass with model_name: str, prompt_template: dict, ref_based: bool=False, multi_turn: bool=False
Loaded from judge configs, instantiated with specific prompt templates for evaluation tasks, used to generate judgments on model responses
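The judgment parsing mentioned in the pipeline ("extracts numeric scores from judge response using regex patterns") amounts to something like the following. The `[[rating]]` convention and the exact pattern are assumptions for this sketch; the actual patterns live in fastchat/llm_judge/common.py:

```python
import re
from typing import Optional

# Assumed convention: the judge is prompted to end with a bracketed score, e.g. "Rating: [[8]]".
SCORE_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def extract_score(judgment: str) -> Optional[float]:
    match = SCORE_PATTERN.search(judgment)
    return float(match.group(1)) if match else None  # None signals an unparseable judgment

print(extract_score("The answer is accurate and well structured. Rating: [[8.5]]"))  # 8.5
```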
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The conversations list has an even number of elements (human-assistant pairs); the splitter enforces this by truncating odd-length conversations with `conversations = conversations[: len(conversations) // 2 * 2]`
If this fails: If the original conversation ends with a human message (odd length), that final turn gets silently dropped without any warning or logging, potentially losing important context
fastchat/data/split_long_conversation.py:split_one_sample
Code blocks in ShareGPT HTML follow the exact pattern `<language>Copy code<exact_code_here>` inside a triple-backtick fence, with the literal text 'Copy code' appearing between the language identifier and the code content
If this fails: If ShareGPT changes their HTML format or uses different copy indicators, the regex pattern `code_lang_pattern` will fail to match, leaving malformed code blocks in the training data that could teach the model incorrect markdown formatting
fastchat/data/clean_sharegpt.py:reformat_code
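A sketch of that rewrite under the assumption stated above; the exact pattern and replacement in fastchat/data/clean_sharegpt.py may differ in detail:

```python
import re

# Assumed shape of a ShareGPT code block after HTML-to-markdown conversion:
# ```<language>Copy code<code>``` with no newline after the language tag.
code_lang_pattern = re.compile(r"```\s*(.*?)(?:Copy code)+(.+?)\s*?```", re.DOTALL)

def reformat_code(text: str) -> str:
    # Drop the 'Copy code' artifact and restore "```lang\ncode\n```" formatting.
    return code_lang_pattern.sub(r"```\1\n\2\n```", text)

print(reformat_code("```pythonCopy codeprint('hi')```"))
```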
Each conversation turn adds exactly 6 extra tokens beyond the tokenized content (`length = len(tokenizer(c['value']).input_ids) + 6`), presumably for conversation formatting tokens
If this fails: If different models use different numbers of special tokens for conversation formatting, the splitting will be inaccurate - conversations might exceed max_length for models needing more tokens or be unnecessarily short for models needing fewer
fastchat/data/split_long_conversation.py:split_one_sample
Model adapters will correctly map each model architecture to its specific SeparatorStyle enum value, with no overlap or ambiguity between styles
If this fails: If two different models accidentally use the same separator style, or a model gets mapped to the wrong style, conversation formatting will be incorrect, leading to degraded model performance as the prompt format won't match what the model was trained on
fastchat/conversation.py:SeparatorStyle
Conversation splitting maintains perfect alternation between human and assistant messages, with the assertion `assert (end_idx - start_idx) % 2 == 0` enforcing even-length chunks
If this fails: If the conversation data has consecutive messages from the same speaker (like multiple human messages in a row), the splitting logic will create malformed training samples where roles don't alternate properly, confusing the model during training
fastchat/data/split_long_conversation.py:make_sample
The global `tokenizer` variable is properly initialized and available when `split_one_sample` function is called, despite being used inside the function without parameter passing
If this fails: If the function is called before the tokenizer is set up or in a different process context, it will raise a NameError or AttributeError, causing the entire conversation splitting pipeline to crash
fastchat/data/split_long_conversation.py:split_one_sample
ShareGPT conversations contain HTML that can be safely processed by markdownify and bs4, with no malicious or deeply nested content that could cause parsing failures
If this fails: If ShareGPT data contains pathological HTML (extremely deep nesting, malformed tags, or adversarial content), the HTML parsing could hang, consume excessive memory, or crash, blocking the entire data cleaning pipeline
fastchat/data/clean_sharegpt.py
The `max_length` global variable represents the model's actual context window limit and accounts for all tokens including special tokens, position embeddings, and any model-specific overhead
If this fails: If max_length doesn't account for model-specific overhead or if different model variants have different effective context limits, split conversations might still exceed the model's true capacity during training, causing OOM errors or truncation
fastchat/data/split_long_conversation.py:split_one_sample
All code blocks in ShareGPT follow valid markdown syntax after cleaning, with language identifiers that are recognized by markdown parsers
If this fails: If the extracted language identifiers contain special characters, spaces, or invalid syntax, the resulting markdown will render incorrectly in documentation tools and could confuse models trained on this data about proper code formatting
fastchat/data/clean_sharegpt.py:reformat_code
The ProcessPoolExecutor has sufficient memory to process multiple ShareGPT files concurrently, with each worker able to load and process entire JSON files in memory
If this fails: If ShareGPT files are very large (multi-GB) and multiple workers try to process them simultaneously, the system could run out of memory, causing the cleaning process to crash or swap heavily and become extremely slow
fastchat/data/clean_sharegpt.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Controller Worker Registry — Maps model names to available workers with health status, queue lengths, and capabilities
- Conversation Templates — Static registry mapping model names to conversation formatting templates with separator styles and system prompts
- Loaded Model Weights — HuggingFace model weights and tokenizers loaded by workers, cached in memory after first load
- Cleaned Training Data — Processed conversation JSON files ready for fine-tuning, created by data cleaning pipeline
Feedback Loops
- Worker Health Monitoring (polling, balancing) — Trigger: Controller startup. Action: Workers send periodic heartbeats with queue length and processing speed to controller. Exit: Worker disconnection or shutdown.
- Load Balancing (auto-scale, balancing) — Trigger: New chat request. Action: Controller selects worker with shortest queue, updates worker load metrics after request completion. Exit: Request processed.
- Arena Battle Loop (polling, reinforcing) — Trigger: User votes on model comparison. Action: Collect preference data, update Elo ratings, regenerate leaderboard rankings. Exit: Continuous operation.
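As a concrete reference for the rating update in the Arena Battle Loop, here is a minimal online Elo step. The production leaderboard computation is more involved (bootstrapped or Bradley-Terry-style fits over the full battle log), so treat this as the idea rather than the implementation:

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0) -> tuple[float, float]:
    """One online Elo step for a single arena battle between models A and B.
    `winner` is 'A', 'B', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1000.0, 1000.0, "A"))  # (1016.0, 984.0)
```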
Delays
- Model Loading (warmup, ~30-120 seconds) — Worker unavailable until model loaded into GPU memory and conversation template initialized
- First Token Latency (compilation, ~1-3 seconds) — Initial delay before streaming response starts due to prompt processing and first token generation
- Judgment Generation (async-processing, ~5-30 seconds) — LLM judges process evaluation prompts asynchronously, creating delay between model response and evaluation score
Control Points
- Model Selection (runtime-toggle) — Controls: Which models are available for serving based on registered workers
- Temperature (hyperparameter) — Controls: Randomness in model generation, higher values produce more diverse outputs. Default: 0.7
- Max Tokens (threshold) — Controls: Maximum response length, prevents runaway generation and controls costs. Default: null (unlimited)
- Conversation Template (architecture-switch) — Controls: How multi-turn conversations are formatted for each model architecture. Default: model-specific
- Judge Model (runtime-toggle) — Controls: Which LLM is used for evaluation judgments (GPT-4, Claude, etc.)
Technology Stack
- PyTorch — Core ML framework for loading and running transformer models during training and inference
- Transformers — HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs
- FastAPI — Web framework powering the OpenAI-compatible API server with async request handling
- Gradio — Creates the web UI for chatting with models and running Arena battles
- Uvicorn — ASGI server running FastAPI applications for API endpoints
- Accelerate — Distributed training and inference library for multi-GPU model loading
- Pydantic — Data validation for API request/response schemas and configuration models
- Tokenization library for counting tokens and splitting conversations
Key Components
- Controller (orchestrator) — Central coordinator that maintains registry of available model workers, handles load balancing by selecting workers with shortest queues, and provides health monitoring through heartbeat mechanism (fastchat/serve/controller.py)
- BaseModelWorker (processor) — Loads a specific model into memory, handles tokenization and text generation, manages conversation state using templates, and exposes HTTP API for inference requests (fastchat/serve/model_worker.py)
- FastChatModel (adapter) — Adapts different model architectures (Llama, Vicuna, ChatGLM, etc.) to unified interface, handling model-specific tokenization, conversation formatting, and generation parameters (fastchat/model/model_adapter.py)
- ConversationTemplate (formatter) — Formats multi-turn conversations into model-specific prompt strings, applying separator styles and system prompts according to each model's training format (fastchat/conversation.py)
- ShareGPTCleaner (processor) — Cleans raw ShareGPT HTML data by converting to markdown, removing artifacts like 'Copy code' text, fixing malformed code blocks, and deduplicating conversations (fastchat/data/clean_sharegpt.py)
- ConversationSplitter (processor) — Splits conversations exceeding model context length into multiple training samples, tokenizing each turn and creating chunks that fit within max_length constraints (fastchat/data/split_long_conversation.py)
- GradioWebServer (gateway) — Provides web UI for chatting with models through Gradio interface, handles user sessions, manages conversation history, and routes requests to controller (fastchat/serve/gradio_web_server.py)
- OpenAIAPIServer (gateway) — Exposes OpenAI-compatible REST API endpoints for chat completions, embeddings, and model listing, translating between OpenAI format and FastChat internal protocols; a usage sketch follows this list (fastchat/serve/openai_api_server.py)
- LLMJudge (evaluator) — Uses LLMs as judges to evaluate model responses, supporting single-answer judgment, pairwise comparison, and MT-bench evaluation with configurable judge models (fastchat/llm_judge/gen_judgment.py)
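Once the controller, a model worker, and the OpenAI API server are running, the whole serving stack is exercised over plain HTTP. A minimal client call is shown below; the host, port, and model name are assumptions for illustration, and /v1/chat/completions is the standard OpenAI-compatible route that the API server exposes:

```python
import requests

# Assumes the OpenAI-compatible API server is listening on localhost:8000
# and a worker serving "vicuna-7b-v1.5" has registered with the controller.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```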
Frequently Asked Questions
What is FastChat used for?
lm-sys/fastchat trains, serves, and evaluates LLM-based chatbots through a multi-model serving system. It is a 9-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 148 files.
How is FastChat architected?
FastChat is organized into 6 architecture layers: Data Processing, Training, Model Workers, Serving Infrastructure, and 2 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through FastChat?
Data moves through 6 stages: Clean ShareGPT conversations → Split long conversations → Register model worker → Route chat request → Generate model response → Judge model outputs. Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard. This pipeline design reflects a complex multi-stage processing system.
What technologies does FastChat use?
The core stack includes PyTorch (Core ML framework for loading and running transformer models during training and inference), Transformers (HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs), FastAPI (Web framework powering the OpenAI-compatible API server with async request handling), Gradio (Creates the web UI for chatting with models and running Arena battles), Uvicorn (ASGI server running FastAPI applications for API endpoints), Accelerate (Distributed training and inference library for multi-GPU model loading), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does FastChat have?
FastChat exhibits 4 data pools (Controller Worker Registry, Conversation Templates), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does FastChat use?
5 design patterns detected: Model Adapter Pattern, Worker Pool, Pipeline Pattern, Plugin Architecture, Protocol Translation.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.