lm-sys/fastchat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Trains, serves, and evaluates LLM-based chatbots through a multi-model serving system
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML training system. 148 files analyzed. Data flows through 6 distinct pipeline stages.
How Data Flows Through the System
Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard.
- Clean ShareGPT conversations — ShareGPTCleaner processes raw JSON files by converting HTML to markdown using markdownify, removing 'Copy code' artifacts with regex patterns, fixing malformed code blocks, and filtering out conversations with wrong formatting [ShareGPTConversation → ShareGPTConversation]
- Split long conversations — ConversationSplitter tokenizes each conversation turn using the target model's tokenizer, accumulates token counts until max_length is reached, then creates new training samples with an even number of turns (human-assistant pairs); a sketch of this step follows the list [ShareGPTConversation → ShareGPTConversation]
- Register model worker — BaseModelWorker loads model using transformers library, creates conversation template for the model type, registers with controller by sending worker info including supported models and capabilities
- Route chat request — Controller receives ChatCompletionRequest from API server, validates model exists in registry, selects worker with lowest queue_length from available workers, forwards request to selected worker [ChatCompletionRequest → ChatCompletionRequest]
- Generate model response — BaseModelWorker receives request, loads conversation template for the model, formats messages into prompt string using separator style, runs model inference with generation parameters, streams or returns complete response [ChatCompletionRequest → ChatCompletionResponse]
- Judge model outputs — LLMJudge takes model responses and judge configuration, formats evaluation prompt with response text, sends to judge model (GPT-4, Claude, etc.), extracts numeric scores from judge response using regex patterns [Judge → judgment scores]
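The "Split long conversations" step is the most delicate of the training-data stages, because it must respect both the tokenizer and the human/assistant turn structure. The sketch below is a simplified reconstruction of that logic, not the exact upstream code: the `split_one_sample`/`make_sample` names, the module-level `tokenizer` and `max_length`, the +6 per-turn token allowance, and the even-length truncation all come from the pipeline and Hidden Assumptions descriptions in this document; the loop body itself is illustrative.

```python
from transformers import AutoTokenizer

# Module-level globals, mirroring the pattern described in the Hidden Assumptions below.
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative target model
max_length = 2048

def make_sample(sample, start_idx, end_idx):
    # Even-length chunks only: every training sample must end on an assistant turn.
    assert (end_idx - start_idx) % 2 == 0
    return {
        "id": f"{sample['id']}_{start_idx}",
        "conversations": sample["conversations"][start_idx:end_idx],
    }

def split_one_sample(sample):
    conversations = sample["conversations"]
    # Drop a trailing unpaired human turn so turns stay in human/assistant pairs.
    conversations = conversations[: len(conversations) // 2 * 2]

    tokenized_lens = []
    for c in conversations:
        # +6 is an assumed allowance for role/separator tokens added at training time.
        tokenized_lens.append(len(tokenizer(c["value"]).input_ids) + 6)

    new_samples = []
    start_idx, cur_len = 0, 0
    for i in range(0, len(conversations), 2):  # walk in human/assistant pairs
        pair_len = tokenized_lens[i] + tokenized_lens[i + 1]
        if cur_len + pair_len > max_length and i > start_idx:
            new_samples.append(make_sample(sample, start_idx, i))
            start_idx, cur_len = i, 0
        cur_len += pair_len
    if start_idx < len(conversations):
        new_samples.append(make_sample(sample, start_idx, len(conversations)))
    return new_samples
```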
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ChatCompletionRequest (fastchat/protocol/api_protocol.py)
Pydantic model with model: str, messages: List[Dict[str, str]], temperature: float=0.7, top_p: float=1.0, max_tokens: int=None, stream: bool=False, stop: List[str]=None
Created from HTTP request body, validated against schema, routed to appropriate model worker, and consumed to generate response
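A minimal reconstruction of that request schema from the fields listed above; the real class in fastchat/protocol/api_protocol.py carries more fields and validators, so treat this as the shape described here rather than the upstream definition:

```python
from typing import Dict, List, Optional
from pydantic import BaseModel

class ChatCompletionRequest(BaseModel):
    # Shape reconstructed from the field list above; not the full upstream class.
    model: str
    messages: List[Dict[str, str]]      # e.g. [{"role": "user", "content": "..."}]
    temperature: float = 0.7
    top_p: float = 1.0
    max_tokens: Optional[int] = None    # None = no explicit cap
    stream: bool = False
    stop: Optional[List[str]] = None

# Parsed straight from the HTTP request body by the API server, then routed to a worker.
req = ChatCompletionRequest(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello!"}],
)
```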
Conversation template (fastchat/conversation.py)
dataclass with name: str, system_template: str, roles: List[str], messages: List[List[str]], offset: int, sep_style: SeparatorStyle, sep: str, sep2: str=None
Template loaded by model name, messages appended during chat, formatted into prompt string using separator style, and passed to model for generation
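To make "formatted into prompt string using separator style" concrete, here is a stripped-down stand-in showing the mechanism. It is an illustration only, not the actual Conversation dataclass, which supports many SeparatorStyle variants; the class and field names below are invented for the example:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MiniConversation:
    # Simplified stand-in for the template described above (one separator style only).
    system: str
    roles: List[str]                    # e.g. ["USER", "ASSISTANT"]
    sep: str                            # separator appended after each completed turn
    messages: List[List[Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        # Colon-style formatting: "ROLE: text<sep>"; a trailing empty assistant
        # turn leaves "ASSISTANT:" open for the model to complete.
        out = self.system + self.sep
        for role, message in self.messages:
            out += f"{role}: {message}{self.sep}" if message else f"{role}:"
        return out

conv = MiniConversation(system="A chat between a user and an assistant.",
                        roles=["USER", "ASSISTANT"], sep="\n")
conv.append_message("USER", "What is FastChat?")
conv.append_message("ASSISTANT", None)   # leave open for generation
print(conv.get_prompt())
```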
ShareGPTConversation (fastchat/data/clean_sharegpt.py)
dict with id: str, conversations: List[dict] where each dict has from: str ('human'|'gpt'), value: str
Loaded from JSON files, cleaned of HTML/markdown artifacts, filtered by language/format, split into training chunks, and converted to model training format
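For reference, a record in that format looks roughly like this (the content is invented for illustration):

```python
sharegpt_record = {
    "id": "abc123",
    "conversations": [
        {"from": "human", "value": "How do I reverse a list in Python?"},
        {"from": "gpt", "value": "Use `my_list[::-1]` or `list(reversed(my_list))`."},
    ],
}
```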
ModelCard (fastchat/protocol/api_protocol.py)
Pydantic model with id: str, object: str='model', created: int, owned_by: str='fastchat', permission: List[ModelPermission]
Created when model worker registers with controller, stored in controller's model registry, and returned in /v1/models API endpoint
Worker registry entry (fastchat/serve/controller.py)
dict with model_names: List[str], speed: int, queue_length: int, check_heart_beat: bool, last_heart_beat: float
Created when worker registers, updated on heartbeats with queue length and speed metrics, used by controller to select optimal worker for requests
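A sketch of the selection rule implied by those fields, assuming the controller keeps the registry as a dict keyed by worker address; the real controller in fastchat/serve/controller.py also supports other dispatch methods and heartbeat expiry, so this shows only the shortest-queue rule described here:

```python
from typing import Dict, Optional

def select_worker(workers: Dict[str, dict], model_name: str) -> Optional[str]:
    """Pick the address of the registered worker serving `model_name`
    with the shortest queue. Returns None if no worker serves the model."""
    candidates = {
        addr: info for addr, info in workers.items()
        if model_name in info["model_names"]
    }
    if not candidates:
        return None
    return min(candidates, key=lambda addr: candidates[addr]["queue_length"])

workers = {
    "http://10.0.0.1:31000": {"model_names": ["vicuna-7b-v1.5"], "queue_length": 3},
    "http://10.0.0.2:31000": {"model_names": ["vicuna-7b-v1.5"], "queue_length": 0},
}
assert select_worker(workers, "vicuna-7b-v1.5") == "http://10.0.0.2:31000"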
Judge (fastchat/llm_judge/common.py)
dataclass with model_name: str, prompt_template: dict, ref_based: bool=False, multi_turn: bool=False
Loaded from judge configs, instantiated with specific prompt templates for evaluation tasks, used to generate judgments on model responses
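The judgment parsing mentioned in the pipeline ("extracts numeric scores from judge response using regex patterns") amounts to something like the following. The `[[rating]]` convention and the exact pattern are assumptions for this sketch; the actual patterns live in fastchat/llm_judge/common.py:

```python
import re
from typing import Optional

# Assumed convention: the judge is prompted to end with a bracketed score, e.g. "Rating: [[8]]".
SCORE_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def extract_score(judgment: str) -> Optional[float]:
    match = SCORE_PATTERN.search(judgment)
    return float(match.group(1)) if match else None  # None signals an unparseable judgment

print(extract_score("The answer is accurate and well structured. Rating: [[8.5]]"))  # 8.5
```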
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The conversations list has an even number of elements (human-assistant pairs); the splitter enforces this by truncating odd-length conversations with `conversations = conversations[: len(conversations) // 2 * 2]`
If this fails: If the original conversation ends with a human message (odd length), that final turn gets silently dropped without any warning or logging, potentially losing important context
fastchat/data/split_long_conversation.py:split_one_sample
Code blocks in ShareGPT HTML follow the exact pattern `<language>Copy code<exact_code_here>` inside a triple-backtick fence, with the literal text 'Copy code' appearing between the language identifier and the code content
If this fails: If ShareGPT changes their HTML format or uses different copy indicators, the regex pattern `code_lang_pattern` will fail to match, leaving malformed code blocks in the training data that could teach the model incorrect markdown formatting
fastchat/data/clean_sharegpt.py:reformat_code
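A sketch of that rewrite under the assumption stated above; the exact pattern and replacement in fastchat/data/clean_sharegpt.py may differ in detail:

```python
import re

# Assumed shape of a ShareGPT code block after HTML-to-markdown conversion:
# ```<language>Copy code<code>``` with no newline after the language tag.
code_lang_pattern = re.compile(r"```\s*(.*?)(?:Copy code)+(.+?)\s*?```", re.DOTALL)

def reformat_code(text: str) -> str:
    # Drop the 'Copy code' artifact and restore "```lang\ncode\n```" formatting.
    return code_lang_pattern.sub(r"```\1\n\2\n```", text)

print(reformat_code("```pythonCopy codeprint('hi')```"))
```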
Each conversation turn adds exactly 6 extra tokens beyond the tokenized content (`length = len(tokenizer(c['value']).input_ids) + 6`), presumably for conversation formatting tokens
If this fails: If different models use different numbers of special tokens for conversation formatting, the splitting will be inaccurate - conversations might exceed max_length for models needing more tokens or be unnecessarily short for models needing fewer
fastchat/data/split_long_conversation.py:split_one_sample
Model adapters will correctly map each model architecture to its specific SeparatorStyle enum value, with no overlap or ambiguity between styles
If this fails: If two different models accidentally use the same separator style, or a model gets mapped to the wrong style, conversation formatting will be incorrect, leading to degraded model performance as the prompt format won't match what the model was trained on
fastchat/conversation.py:SeparatorStyle
Conversation splitting maintains perfect alternation between human and assistant messages, with the assertion `assert (end_idx - start_idx) % 2 == 0` enforcing even-length chunks
If this fails: If the conversation data has consecutive messages from the same speaker (like multiple human messages in a row), the splitting logic will create malformed training samples where roles don't alternate properly, confusing the model during training
fastchat/data/split_long_conversation.py:make_sample
The global `tokenizer` variable is properly initialized and available when `split_one_sample` function is called, despite being used inside the function without parameter passing
If this fails: If the function is called before the tokenizer is set up or in a different process context, it will raise a NameError or AttributeError, causing the entire conversation splitting pipeline to crash
fastchat/data/split_long_conversation.py:split_one_sample
ShareGPT conversations contain HTML that can be safely processed by markdownify and bs4, with no malicious or deeply nested content that could cause parsing failures
If this fails: If ShareGPT data contains pathological HTML (extremely deep nesting, malformed tags, or adversarial content), the HTML parsing could hang, consume excessive memory, or crash, blocking the entire data cleaning pipeline
fastchat/data/clean_sharegpt.py
The `max_length` global variable represents the model's actual context window limit and accounts for all tokens including special tokens, position embeddings, and any model-specific overhead
If this fails: If max_length doesn't account for model-specific overhead or if different model variants have different effective context limits, split conversations might still exceed the model's true capacity during training, causing OOM errors or truncation
fastchat/data/split_long_conversation.py:split_one_sample
All code blocks in ShareGPT follow valid markdown syntax after cleaning, with language identifiers that are recognized by markdown parsers
If this fails: If the extracted language identifiers contain special characters, spaces, or invalid syntax, the resulting markdown will render incorrectly in documentation tools and could confuse models trained on this data about proper code formatting
fastchat/data/clean_sharegpt.py:reformat_code
The ProcessPoolExecutor has sufficient memory to process multiple ShareGPT files concurrently, with each worker able to load and process entire JSON files in memory
If this fails: If ShareGPT files are very large (multi-GB) and multiple workers try to process them simultaneously, the system could run out of memory, causing the cleaning process to crash or swap heavily and become extremely slow
fastchat/data/clean_sharegpt.py
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Controller Worker Registry — Maps model names to available workers with health status, queue lengths, and capabilities
- Conversation Templates — Static registry mapping model names to conversation formatting templates with separator styles and system prompts
- Loaded Model Weights — HuggingFace model weights and tokenizers loaded by workers, cached in memory after first load
- Cleaned Training Data — Processed conversation JSON files ready for fine-tuning, created by data cleaning pipeline
Feedback Loops
- Worker Health Monitoring (polling, balancing) — Trigger: Controller startup. Action: Workers send periodic heartbeats with queue length and processing speed to controller. Exit: Worker disconnection or shutdown.
- Load Balancing (auto-scale, balancing) — Trigger: New chat request. Action: Controller selects worker with shortest queue, updates worker load metrics after request completion. Exit: Request processed.
- Arena Battle Loop (polling, reinforcing) — Trigger: User votes on model comparison. Action: Collect preference data, update Elo ratings, regenerate leaderboard rankings. Exit: Continuous operation.
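As a concrete reference for the rating update in the Arena Battle Loop, here is a minimal online Elo step. The production leaderboard computation is more involved (bootstrapped or Bradley-Terry-style fits over the full battle log), so treat this as the idea rather than the implementation:

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0) -> tuple[float, float]:
    """One online Elo step for a single arena battle between models A and B.
    `winner` is 'A', 'B', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1000.0, 1000.0, "A"))  # (1016.0, 984.0)
```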
Delays
- Model Loading (warmup, ~30-120 seconds) — Worker unavailable until model loaded into GPU memory and conversation template initialized
- First Token Latency (compilation, ~1-3 seconds) — Initial delay before streaming response starts due to prompt processing and first token generation
- Judgment Generation (async-processing, ~5-30 seconds) — LLM judges process evaluation prompts asynchronously, creating delay between model response and evaluation score
Control Points
- Model Selection (runtime-toggle) — Controls: Which models are available for serving based on registered workers
- Temperature (hyperparameter) — Controls: Randomness in model generation, higher values produce more diverse outputs. Default: 0.7
- Max Tokens (threshold) — Controls: Maximum response length, prevents runaway generation and controls costs. Default: null (unlimited)
- Conversation Template (architecture-switch) — Controls: How multi-turn conversations are formatted for each model architecture. Default: model-specific
- Judge Model (runtime-toggle) — Controls: Which LLM is used for evaluation judgments (GPT-4, Claude, etc.)
Technology Stack
- PyTorch — Core ML framework for loading and running transformer models during training and inference
- Transformers — HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs
- FastAPI — Web framework powering the OpenAI-compatible API server with async request handling
- Gradio — Creates the web UI for chatting with models and running Arena battles
- Uvicorn — ASGI server running FastAPI applications for API endpoints
- Accelerate — Distributed training and inference library for multi-GPU model loading
- Pydantic — Data validation for API request/response schemas and configuration models
- Tokenization library for counting tokens and splitting conversations
Key Components
- Controller (orchestrator) — Central coordinator that maintains registry of available model workers, handles load balancing by selecting workers with shortest queues, and provides health monitoring through heartbeat mechanism (fastchat/serve/controller.py)
- BaseModelWorker (processor) — Loads a specific model into memory, handles tokenization and text generation, manages conversation state using templates, and exposes HTTP API for inference requests (fastchat/serve/model_worker.py)
- FastChatModel (adapter) — Adapts different model architectures (Llama, Vicuna, ChatGLM, etc.) to unified interface, handling model-specific tokenization, conversation formatting, and generation parameters (fastchat/model/model_adapter.py)
- ConversationTemplate (formatter) — Formats multi-turn conversations into model-specific prompt strings, applying separator styles and system prompts according to each model's training format (fastchat/conversation.py)
- ShareGPTCleaner (processor) — Cleans raw ShareGPT HTML data by converting to markdown, removing artifacts like 'Copy code' text, fixing malformed code blocks, and deduplicating conversations (fastchat/data/clean_sharegpt.py)
- ConversationSplitter (processor) — Splits conversations exceeding model context length into multiple training samples, tokenizing each turn and creating chunks that fit within max_length constraints (fastchat/data/split_long_conversation.py)
- GradioWebServer (gateway) — Provides web UI for chatting with models through Gradio interface, handles user sessions, manages conversation history, and routes requests to controller (fastchat/serve/gradio_web_server.py)
- OpenAIAPIServer (gateway) — Exposes OpenAI-compatible REST API endpoints for chat completions, embeddings, and model listing, translating between OpenAI format and FastChat internal protocols; a usage sketch follows this list (fastchat/serve/openai_api_server.py)
- LLMJudge (evaluator) — Uses LLMs as judges to evaluate model responses, supporting single-answer judgment, pairwise comparison, and MT-bench evaluation with configurable judge models (fastchat/llm_judge/gen_judgment.py)
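Once the controller, a model worker, and the OpenAI API server are running, the whole serving stack is exercised over plain HTTP. A minimal client call is shown below; the host, port, and model name are assumptions for illustration, and /v1/chat/completions is the standard OpenAI-compatible route that the API server exposes:

```python
import requests

# Assumes the OpenAI-compatible API server is listening on localhost:8000
# and a worker serving "vicuna-7b-v1.5" has registered with the controller.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7,
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```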
Frequently Asked Questions
What is FastChat used for?
lm-sys/fastchat trains, serves, and evaluates LLM-based chatbots through a multi-model serving system. It is a 9-component ML training system written in Python. Data flows through 6 distinct pipeline stages. The codebase contains 148 files.
How is FastChat architected?
FastChat is organized into 6 architecture layers: Data Processing, Training, Model Workers, Serving Infrastructure, and 2 more. Data flows through 6 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through FastChat?
Data moves through 6 stages: Clean ShareGPT conversations → Split long conversations → Register model worker → Route chat request → Generate model response → Judge model outputs. Data enters FastChat through three main paths: training data flows from raw ShareGPT conversations through cleaning and splitting pipelines to create fine-tuning datasets; serving requests flow from HTTP clients through API servers and controllers to model workers that generate responses; evaluation data flows through LLM judges that score model outputs and compile rankings for the leaderboard. This pipeline design reflects a complex multi-stage processing system.
What technologies does FastChat use?
The core stack includes PyTorch (Core ML framework for loading and running transformer models during training and inference), Transformers (HuggingFace library providing model architectures, tokenizers, and training utilities for LLMs), FastAPI (Web framework powering the OpenAI-compatible API server with async request handling), Gradio (Creates the web UI for chatting with models and running Arena battles), Uvicorn (ASGI server running FastAPI applications for API endpoints), Accelerate (Distributed training and inference library for multi-GPU model loading), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does FastChat have?
FastChat exhibits 4 data pools (Controller Worker Registry, Conversation Templates), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does FastChat use?
5 design patterns detected: Model Adapter Pattern, Worker Pool, Pipeline Pattern, Plugin Architecture, Protocol Translation.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.