mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
Compiles and runs large language models efficiently across diverse hardware platforms
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML inference engine. 508 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses.
- Parse API request — FastAPIServer receives HTTP POST to /v1/chat/completions, deserializes JSON body into ChatCompletionRequest using Pydantic validation, checking required fields like messages and model name
- Initialize request stream — RequestStreamManager creates a unique request ID, sets up streaming response channel if needed, and queues the request for processing by the engine [ChatCompletionRequest → GenerationRequest]
- Tokenize input messages — TokenizerManager processes the conversation history through SentencePiece or tiktoken, converting text to token IDs and adding special tokens for chat format [ChatCompletionMessage → TokenSequence]
- Execute model inference — ServeEngine loads the compiled model binary, allocates KV-cache blocks, runs forward pass through transformer layers using TVM-optimized kernels, generating logits for next token prediction [GenerationRequest → TokenSequence]
- Stream response tokens — Generated tokens are decoded back to text, wrapped in OpenAI-compatible response format with usage statistics, and streamed as Server-Sent Events or returned as complete response [TokenSequence → ChatCompletionResponse]
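To make the stages concrete, here is a minimal client-side sketch of the first and last steps: posting an OpenAI-style request and consuming the streamed tokens. The server address and model name are assumptions for illustration; the /v1/chat/completions route, JSON body, and Server-Sent Events framing come from the pipeline described above.

```python
# Minimal sketch of the request/stream round trip from the client side.
# The host, port, and model id below are placeholders.
import json
import requests

payload = {
    "model": "Llama-3-8B-Instruct-q4f16_1-MLC",   # hypothetical model id
    "messages": [{"role": "user", "content": "What is MLC LLM?"}],
    "stream": True,        # ask for Server-Sent Events instead of one JSON body
    "max_tokens": 128,
}

with requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",  # assumed local server address
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":                     # SSE stream terminator
            break
        delta = json.loads(chunk)["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```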
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ChatCompletionRequest (python/mlc_llm/protocol/openai_api_protocol.py)
Pydantic model with messages: List[ChatCompletionMessage], model: str, temperature: float, max_tokens: int, stream: bool, response_format: dict
Created from HTTP request JSON, validated against schema, passed to engine for processing, then discarded after response generation
ChatCompletionMessage (python/mlc_llm/protocol/openai_api_protocol.py)
Pydantic model with role: str ('user'|'assistant'|'system'), content: Union[str, List[Dict]], tool_calls: List[ChatToolCall]
Extracted from chat request, converted to token sequences by tokenizer, used to build conversation context for generation
GenerationRequest (cpp/serve/request.h)
C++ struct with request_id: string, input_ids: vector<int>, generation_config: GenerationConfig, stream: bool
Created from API request, queued in request pool, processed by generation engine, results streamed back to client
Compiled model artifacts (python/mlc_llm/compile)
Directory structure with model_lib.so (compiled kernels), params/ (quantized weights), mlc-chat-config.json (model metadata)
Generated during compilation from source model, stored on disk, loaded by runtime engine for inference
TokenSequence (cpp/tokenizers)
vector<int32_t> representing token IDs with special tokens for BOS, EOS, and padding
Created by tokenizer from text input, fed to model for generation, decoded back to text in streaming response
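Below is a simplified sketch of the two Pydantic protocol models described above, reconstructed from the field lists. The real definitions in openai_api_protocol.py carry more fields, validators, and defaults than shown here; the defaults in this sketch are arbitrary.

```python
# Simplified sketch of the protocol models; not the actual contents of
# python/mlc_llm/protocol/openai_api_protocol.py.
from typing import Dict, List, Optional, Union
from pydantic import BaseModel


class ChatCompletionMessage(BaseModel):
    role: str                                # 'user' | 'assistant' | 'system'
    content: Union[str, List[Dict]]          # plain text or multimodal parts
    tool_calls: Optional[List[Dict]] = None  # sketch; the real model uses ChatToolCall


class ChatCompletionRequest(BaseModel):
    messages: List[ChatCompletionMessage]
    model: str
    temperature: float = 1.0                 # default shown here is arbitrary
    max_tokens: Optional[int] = None
    stream: bool = False
    response_format: Optional[Dict] = None


# Validation happens at construction time, mirroring the "validated against
# schema" step in the lifecycle notes above.
req = ChatCompletionRequest(
    model="demo-model",
    messages=[{"role": "user", "content": "hello"}],
)
```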
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
assumes image URIs passed to MessageData always point to valid, readable image files that can be decoded by BitmapFactory
If this fails: a URI that points to a corrupted file, a network location, or non-image content makes BitmapFactory.decodeStream() return null, causing a NullPointerException when the bitmap is converted to a base64 string
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:imageUri parameter
assumes external storage directory exists and is writable, and that model files have been pre-installed in the exact expected path structure
If this fails: when external storage is unavailable, full, or the model directory is missing, MLCEngine initialization fails with cryptic native library errors instead of a clear file-not-found error
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelPath creation
assumes all image bitmaps can be compressed to JPEG format and the resulting byte array fits in memory for base64 encoding
If this fails: large images (>50MB uncompressed) cause OutOfMemoryError during bitmap.compress() or Base64.encode(), crashing the chat session
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:Base64 encoding
assumes adb is installed, in PATH, device is connected via USB with debugging enabled, and device has sufficient storage space at /data/local/tmp/
If this fails: script fails silently or with cryptic errors if adb not found, device disconnected, or device storage full during weight file push
android/MLCChat/bundle_weight.py:adb commands
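A defensive sketch of the pre-flight checks such a bundling script could run before pushing weights. The function name, device directory, and error messages are illustrative and do not reflect the actual contents of bundle_weight.py.

```python
# Illustrative pre-flight checks for an adb-based weight push; a sketch,
# not the real bundle_weight.py.
import shutil
import subprocess
import sys


def push_weights(local_dir: str, device_dir: str = "/data/local/tmp/mlc-weights") -> None:
    # 1. adb must exist in PATH
    if shutil.which("adb") is None:
        sys.exit("adb not found in PATH; install Android platform-tools first")

    # 2. at least one device must be connected with USB debugging enabled
    devices = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout.splitlines()[1:]
    if not any(line.strip().endswith("device") for line in devices):
        sys.exit("no device connected with USB debugging enabled")

    # 3. check=True surfaces a non-zero exit (e.g. device storage full)
    #    instead of failing silently
    subprocess.run(["adb", "push", local_dir, device_dir], check=True)
```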
assumes the hardcoded model library name 'phi_msft_q4f16_1_686d8979c6ebf05d142d9081f1b87162' exactly matches the compiled .so file name in the app bundle
If this fails: when the model is recompiled with a different hash or quantization settings, native library loading fails with UnsatisfiedLinkError and gives no indication of the name mismatch
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelLib hardcoded string
assumes image URIs returned by ActivityResultContracts.GetContent() remain valid and accessible for the lifetime of the chat session
If this fails: when the user deletes the image file or URI permissions are revoked after selection, subsequent access for message display or model input fails with a SecurityException
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/MainActivity.kt:pickImageLauncher
assumes coroutine operations for chat completion can be cancelled cleanly when ViewModel is destroyed, without leaving native inference threads running
If this fails: when the user rapidly switches between chat screens or force-closes the app during generation, native MLCEngine threads may keep consuming GPU resources until the process dies
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:viewModelScope.launch usage
assumes Android app has storage permissions and the hardcoded path '/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/' is writable on all Android versions
If this fails: on Android 11+ with scoped storage, app may not have write access to this path causing weight file installation to fail silently
android/MLCChat/bundle_weight.py:device_weight_dir path
assumes UUID.randomUUID() generates unique identifiers across all chat sessions and app restarts for message tracking
If this fails: while UUID collision is extremely unlikely, if it occurs, message updates could overwrite wrong messages in UI causing user confusion
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:UUID generation for message IDs
assumes GlobalScope coroutines are appropriate for MLCEngine operations and will not prevent proper app lifecycle cleanup
If this fails: GlobalScope coroutines survive activity destruction, potentially keeping MLCEngine references alive and preventing proper GPU memory cleanup
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:GlobalScope.launch usage
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- ModelRegistry: Maps model names to compiled binary paths and configuration metadata for runtime loading
- KVCachePool: Pre-allocated memory blocks for transformer attention key-value pairs, reused across requests with paging
- Request queue: FIFO queue of pending generation requests with priority handling and batching logic
- Stream connections: Active streaming connections with backpressure control and cleanup on client disconnect
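The KVCachePool entry describes pre-allocated blocks reused across requests with paging. Here is a conceptual sketch of such a block pool; it illustrates the idea only and is not the actual cpp/serve/kv_cache.cc implementation.

```python
# Conceptual sketch of a paged KV-cache block pool: fixed-size blocks are
# pre-allocated once, handed out per request, and returned for reuse.
from typing import Dict, List


class KVCacheBlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.owned: Dict[str, List[int]] = {}      # request_id -> block ids

    def allocate(self, request_id: str, n: int) -> List[int]:
        if len(self.free_blocks) < n:
            raise MemoryError("cache exhausted; caller should trigger eviction")
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # return all blocks owned by a finished request to the free list
        self.free_blocks.extend(self.owned.pop(request_id, []))
```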
Feedback Loops
- Generation Loop (recursive, reinforcing) — Trigger: New token generation completes. Action: ServeEngine feeds generated token back as input for next prediction step, updating KV-cache state. Exit: EOS token generated or max_tokens reached.
- Memory Pressure Adaptation (backpressure, balancing) — Trigger: KV-cache memory usage exceeds threshold. Action: RequestManager delays new request acceptance and triggers cache eviction of oldest unused blocks. Exit: Memory usage drops below safe threshold.
- Stream Backpressure (backpressure, balancing) — Trigger: Client stops consuming response stream. Action: RequestStreamManager pauses token generation for that request to prevent memory buildup. Exit: Client resumes consumption or disconnects.
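The Generation Loop above is a standard autoregressive decode cycle. A minimal sketch, assuming hypothetical `prefill` and `decode_step` helpers rather than the engine's real API:

```python
# Sketch of the generation feedback loop: each new token is fed back as
# input until EOS or max_tokens. `model` and `kv_cache` are hypothetical
# stand-ins for the engine's real objects.
def generate(model, kv_cache, prompt_ids, eos_id, max_tokens):
    output_ids = []
    token = model.prefill(prompt_ids, kv_cache)       # process the prompt once
    while len(output_ids) < max_tokens:               # exit condition: max_tokens
        output_ids.append(token)
        if token == eos_id:                           # exit condition: EOS token
            break
        token = model.decode_step(token, kv_cache)    # feed the token back in
    return output_ids
```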
Delays
- Model Loading (warmup, ~5-30 seconds) — First request to a model waits while binary is loaded and initialized in GPU memory
- Compilation Cache (compilation, ~10-300 seconds) — Model must be compiled for target platform before first use, cached for subsequent runs
- Token Generation (async-processing, ~50-500ms per token) — Each response token requires a full forward pass through the model, streamed incrementally
Control Points
- Model Architecture Selection (architecture-switch) — Controls: Which model family to compile (Llama, GPT, ChatGLM) affecting kernel generation and memory layout
- Quantization Mode (precision-mode) — Controls: Bit precision for weights (fp16, int4, int8) trading accuracy for memory and speed
- Backend Selection (device-selection) — Controls: Hardware backend (CUDA, ROCm, Metal, Vulkan, OpenCL) determining kernel compilation target
- Max Concurrent Requests (rate-limit) — Controls: Maximum number of parallel chat completion requests to prevent memory exhaustion
- KV Cache Size (threshold) — Controls: Maximum memory allocated for attention cache affecting context length and throughput
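One way to picture these five control points together is as a single configuration object. The field names and defaults below are purely illustrative; they do not match the real mlc-chat-config.json schema or engine options.

```python
# Hypothetical grouping of the five control points into one config object.
from dataclasses import dataclass


@dataclass
class EngineKnobs:
    model_architecture: str = "llama"     # architecture-switch
    quantization: str = "int4"            # precision-mode: fp16 | int8 | int4
    device: str = "cuda"                  # device-selection: cuda | rocm | metal | vulkan | opencl
    max_concurrent_requests: int = 8      # rate-limit
    kv_cache_bytes: int = 4 << 30         # threshold: memory budget for attention cache
```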
Technology Stack
- TVM: Compiles model operations into optimized kernels for different hardware backends
- FastAPI: Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings
- Pydantic: Validates and serializes API request/response schemas following OpenAI specification
- CMake: Builds cross-platform C++ inference engine with hardware-specific optimizations
- Android NDK: Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration
- SentencePiece: Tokenizes text input using vocabulary from trained language models
- Loads model weights and configurations from HuggingFace format during compilation
- Manages asynchronous streaming responses in Android applications
Key Components
- JSONFFIEngine (adapter) — Bridges JSON-based API requests to the native C++ inference engine, handling serialization and background processing
  cpp/json_ffi/json_ffi.cc
- ServeEngine (orchestrator) — Coordinates model loading, request queuing, memory management, and response generation across multiple requests
  cpp/serve/engine.cc
- RequestStreamManager (dispatcher) — Manages streaming responses for concurrent chat completion requests with backpressure control and cleanup
  python/mlc_llm/serve/engine/request_stream_manager.py
- ModelCompiler (transformer) — Converts HuggingFace models into TVM-compiled binaries with quantization and platform-specific optimization
  python/mlc_llm/compile/compile.py
- TokenizerManager (encoder) — Handles text-to-token conversion using SentencePiece or tiktoken backends with vocabulary management
  cpp/tokenizers/tokenizer.cc
- KVCacheManager (allocator) — Manages key-value cache memory allocation and paging for transformer attention across multiple requests
  cpp/serve/kv_cache.cc
- FastAPIServer (gateway) — Exposes OpenAI-compatible HTTP endpoints for chat completions, embeddings, and model management
  python/mlc_llm/serve/server.py
- MLCEngine (adapter) — Kotlin wrapper for Android apps that provides coroutine-based access to the native inference engine
  android/mlc4j/src/main/java/ai/mlc/mlcllm/MLCEngine.kt
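Because FastAPIServer exposes OpenAI-compatible routes, a standard OpenAI client can talk to it directly. A sketch assuming the openai Python package and a locally running server; the base URL and model name are placeholders.

```python
# Sketch: pointing the standard openai client at a locally running
# FastAPIServer instance.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",   # placeholder model id
    messages=[{"role": "user", "content": "Summarize what MLC LLM does."}],
)
print(response.choices[0].message.content)
```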
Frequently Asked Questions
What is mlc-llm used for?
mlc-ai/mlc-llm compiles and runs large language models efficiently across diverse hardware platforms. It is an 8-component ML inference engine written in Python, with data flowing through 5 distinct pipeline stages across a codebase of 508 files.
How is mlc-llm architected?
mlc-llm is organized into 4 architecture layers: Model Compilation, Runtime Engine, Platform Bindings, Protocol Layer. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through mlc-llm?
Data moves through 5 stages: Parse API request → Initialize request stream → Tokenize input messages → Execute model inference → Stream response tokens. Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses. This pipeline design reflects a complex multi-stage processing system.
What technologies does mlc-llm use?
The core stack includes TVM (Compiles model operations into optimized kernels for different hardware backends), FastAPI (Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings), Pydantic (Validates and serializes API request/response schemas following OpenAI specification), CMake (Builds cross-platform C++ inference engine with hardware-specific optimizations), Android NDK (Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration), SentencePiece (Tokenizes text input using vocabulary from trained language models), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does mlc-llm have?
mlc-llm exhibits 4 data pools (including ModelRegistry and KVCachePool), 3 feedback loops, 5 control points, and 3 delays. The feedback loops cover recursive generation and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does mlc-llm use?
4 design patterns detected: Compilation-Runtime Split, Multi-Platform Abstraction, Memory Pool Management, Streaming Pipeline.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.