mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
Compiles and runs large language models efficiently across diverse hardware platforms
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML inference engine. 508 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses.
- Parse API request — FastAPIServer receives HTTP POST to /v1/chat/completions, deserializes JSON body into ChatCompletionRequest using Pydantic validation, checking required fields like messages and model name
- Initialize request stream — RequestStreamManager creates a unique request ID, sets up streaming response channel if needed, and queues the request for processing by the engine [ChatCompletionRequest → GenerationRequest]
- Tokenize input messages — TokenizerManager processes the conversation history through SentencePiece or tiktoken, converting text to token IDs and adding special tokens for chat format [ChatCompletionMessage → TokenSequence]
- Execute model inference — ServeEngine loads the compiled model binary, allocates KV-cache blocks, runs forward pass through transformer layers using TVM-optimized kernels, generating logits for next token prediction [GenerationRequest → TokenSequence]
- Stream response tokens — Generated tokens are decoded back to text, wrapped in OpenAI-compatible response format with usage statistics, and streamed as Server-Sent Events or returned as complete response [TokenSequence → ChatCompletionResponse]
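To make the stages concrete, here is a minimal client-side sketch of the first and last steps: posting an OpenAI-style request and consuming the streamed tokens. The server address and model name are assumptions for illustration; the /v1/chat/completions route, JSON body, and Server-Sent Events framing come from the pipeline described above.

```python
# Minimal sketch of the request/stream round trip from the client side.
# The host, port, and model id below are placeholders.
import json
import requests

payload = {
    "model": "Llama-3-8B-Instruct-q4f16_1-MLC",   # hypothetical model id
    "messages": [{"role": "user", "content": "What is MLC LLM?"}],
    "stream": True,        # ask for Server-Sent Events instead of one JSON body
    "max_tokens": 128,
}

with requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",  # assumed local server address
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":                     # SSE stream terminator
            break
        delta = json.loads(chunk)["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
```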
Data Models
The data structures that flow between stages — the contracts that hold the system together.
ChatCompletionRequest (python/mlc_llm/protocol/openai_api_protocol.py)
Pydantic model with messages: List[ChatCompletionMessage], model: str, temperature: float, max_tokens: int, stream: bool, response_format: dict
Created from HTTP request JSON, validated against schema, passed to engine for processing, then discarded after response generation
ChatCompletionMessage (python/mlc_llm/protocol/openai_api_protocol.py)
Pydantic model with role: str ('user'|'assistant'|'system'), content: Union[str, List[Dict]], tool_calls: List[ChatToolCall]
Extracted from chat request, converted to token sequences by tokenizer, used to build conversation context for generation
GenerationRequest (cpp/serve/request.h)
C++ struct with request_id: string, input_ids: vector<int>, generation_config: GenerationConfig, stream: bool
Created from API request, queued in request pool, processed by generation engine, results streamed back to client
Compiled model artifacts (python/mlc_llm/compile)
Directory structure with model_lib.so (compiled kernels), params/ (quantized weights), mlc-chat-config.json (model metadata)
Generated during compilation from source model, stored on disk, loaded by runtime engine for inference
TokenSequence (cpp/tokenizers)
vector<int32_t> representing token IDs with special tokens for BOS, EOS, and padding
Created by tokenizer from text input, fed to model for generation, decoded back to text in streaming response
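Below is a simplified sketch of the two Pydantic protocol models described above, reconstructed from the field lists. The real definitions in openai_api_protocol.py carry more fields, validators, and defaults than shown here; the defaults in this sketch are arbitrary.

```python
# Simplified sketch of the protocol models; not the actual contents of
# python/mlc_llm/protocol/openai_api_protocol.py.
from typing import Dict, List, Optional, Union
from pydantic import BaseModel


class ChatCompletionMessage(BaseModel):
    role: str                                # 'user' | 'assistant' | 'system'
    content: Union[str, List[Dict]]          # plain text or multimodal parts
    tool_calls: Optional[List[Dict]] = None  # sketch; the real model uses ChatToolCall


class ChatCompletionRequest(BaseModel):
    messages: List[ChatCompletionMessage]
    model: str
    temperature: float = 1.0                 # default shown here is arbitrary
    max_tokens: Optional[int] = None
    stream: bool = False
    response_format: Optional[Dict] = None


# Validation happens at construction time, mirroring the "validated against
# schema" step in the lifecycle notes above.
req = ChatCompletionRequest(
    model="demo-model",
    messages=[{"role": "user", "content": "hello"}],
)
```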
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
assumes image URIs passed to MessageData always point to valid, readable image files that can be decoded by BitmapFactory
If this fails: a URI that points to a corrupted file, a network location, or non-image content makes BitmapFactory.decodeStream() return null, causing a NullPointerException when the bitmap is converted to a base64 string
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:imageUri parameter
assumes external storage directory exists and is writable, and that model files have been pre-installed in the exact expected path structure
If this fails: when external storage is unavailable, full, or the model directory is missing, MLCEngine initialization fails with cryptic native library errors instead of a clear file-not-found error
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelPath creation
assumes all image bitmaps can be compressed to JPEG format and the resulting byte array fits in memory for base64 encoding
If this fails: large images (>50MB uncompressed) cause OutOfMemoryError during bitmap.compress() or Base64.encode(), crashing the chat session
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:Base64 encoding
assumes adb is installed, in PATH, device is connected via USB with debugging enabled, and device has sufficient storage space at /data/local/tmp/
If this fails: script fails silently or with cryptic errors if adb not found, device disconnected, or device storage full during weight file push
android/MLCChat/bundle_weight.py:adb commands
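A defensive sketch of the pre-flight checks such a bundling script could run before pushing weights. The function name, device directory, and error messages are illustrative and do not reflect the actual contents of bundle_weight.py.

```python
# Illustrative pre-flight checks for an adb-based weight push; a sketch,
# not the real bundle_weight.py.
import shutil
import subprocess
import sys


def push_weights(local_dir: str, device_dir: str = "/data/local/tmp/mlc-weights") -> None:
    # 1. adb must exist in PATH
    if shutil.which("adb") is None:
        sys.exit("adb not found in PATH; install Android platform-tools first")

    # 2. at least one device must be connected with USB debugging enabled
    devices = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    ).stdout.splitlines()[1:]
    if not any(line.strip().endswith("device") for line in devices):
        sys.exit("no device connected with USB debugging enabled")

    # 3. check=True surfaces a non-zero exit (e.g. device storage full)
    #    instead of failing silently
    subprocess.run(["adb", "push", local_dir, device_dir], check=True)
```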
assumes the hardcoded model library name 'phi_msft_q4f16_1_686d8979c6ebf05d142d9081f1b87162' exactly matches the compiled .so file name in the app bundle
If this fails: when the model is recompiled with a different hash or quantization settings, native library loading fails with UnsatisfiedLinkError and gives no indication of the name mismatch
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelLib hardcoded string
assumes image URIs returned by ActivityResultContracts.GetContent() remain valid and accessible for the lifetime of the chat session
If this fails: when the user deletes the image file or URI permissions are revoked after selection, subsequent access for message display or model input fails with a SecurityException
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/MainActivity.kt:pickImageLauncher
assumes coroutine operations for chat completion can be cancelled cleanly when ViewModel is destroyed, without leaving native inference threads running
If this fails: when the user rapidly switches between chat screens or force-closes the app during generation, native MLCEngine threads may keep consuming GPU resources until the process dies
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:viewModelScope.launch usage
assumes Android app has storage permissions and the hardcoded path '/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/' is writable on all Android versions
If this fails: on Android 11+ with scoped storage, app may not have write access to this path causing weight file installation to fail silently
android/MLCChat/bundle_weight.py:device_weight_dir path
assumes UUID.randomUUID() generates unique identifiers across all chat sessions and app restarts for message tracking
If this fails: while UUID collision is extremely unlikely, if it occurs, message updates could overwrite wrong messages in UI causing user confusion
android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:UUID generation for message IDs
assumes GlobalScope coroutines are appropriate for MLCEngine operations and will not prevent proper app lifecycle cleanup
If this fails: GlobalScope coroutines survive activity destruction, potentially keeping MLCEngine references alive and preventing proper GPU memory cleanup
android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:GlobalScope.launch usage
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- ModelRegistry: Maps model names to compiled binary paths and configuration metadata for runtime loading
- KVCachePool: Pre-allocated memory blocks for transformer attention key-value pairs, reused across requests with paging
- Request queue: FIFO queue of pending generation requests with priority handling and batching logic
- Stream connections: Active streaming connections with backpressure control and cleanup on client disconnect
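The KVCachePool entry describes pre-allocated blocks reused across requests with paging. Here is a conceptual sketch of such a block pool; it illustrates the idea only and is not the actual cpp/serve/kv_cache.cc implementation.

```python
# Conceptual sketch of a paged KV-cache block pool: fixed-size blocks are
# pre-allocated once, handed out per request, and returned for reuse.
from typing import Dict, List


class KVCacheBlockPool:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.owned: Dict[str, List[int]] = {}      # request_id -> block ids

    def allocate(self, request_id: str, n: int) -> List[int]:
        if len(self.free_blocks) < n:
            raise MemoryError("cache exhausted; caller should trigger eviction")
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # return all blocks owned by a finished request to the free list
        self.free_blocks.extend(self.owned.pop(request_id, []))
```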
Feedback Loops
- Generation Loop (recursive, reinforcing) — Trigger: New token generation completes. Action: ServeEngine feeds generated token back as input for next prediction step, updating KV-cache state. Exit: EOS token generated or max_tokens reached.
- Memory Pressure Adaptation (backpressure, balancing) — Trigger: KV-cache memory usage exceeds threshold. Action: RequestManager delays new request acceptance and triggers cache eviction of oldest unused blocks. Exit: Memory usage drops below safe threshold.
- Stream Backpressure (backpressure, balancing) — Trigger: Client stops consuming response stream. Action: RequestStreamManager pauses token generation for that request to prevent memory buildup. Exit: Client resumes consumption or disconnects.
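The Generation Loop above is a standard autoregressive decode cycle. A minimal sketch, assuming hypothetical `prefill` and `decode_step` helpers rather than the engine's real API:

```python
# Sketch of the generation feedback loop: each new token is fed back as
# input until EOS or max_tokens. `model` and `kv_cache` are hypothetical
# stand-ins for the engine's real objects.
def generate(model, kv_cache, prompt_ids, eos_id, max_tokens):
    output_ids = []
    token = model.prefill(prompt_ids, kv_cache)       # process the prompt once
    while len(output_ids) < max_tokens:               # exit condition: max_tokens
        output_ids.append(token)
        if token == eos_id:                           # exit condition: EOS token
            break
        token = model.decode_step(token, kv_cache)    # feed the token back in
    return output_ids
```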
Delays
- Model Loading (warmup, ~5-30 seconds) — First request to a model waits while binary is loaded and initialized in GPU memory
- Compilation Cache (compilation, ~10-300 seconds) — Model must be compiled for target platform before first use, cached for subsequent runs
- Token Generation (async-processing, ~50-500ms per token) — Each response token requires a full forward pass through the model, streamed incrementally
Control Points
- Model Architecture Selection (architecture-switch) — Controls: Which model family to compile (Llama, GPT, ChatGLM) affecting kernel generation and memory layout
- Quantization Mode (precision-mode) — Controls: Bit precision for weights (fp16, int4, int8) trading accuracy for memory and speed
- Backend Selection (device-selection) — Controls: Hardware backend (CUDA, ROCm, Metal, Vulkan, OpenCL) determining kernel compilation target
- Max Concurrent Requests (rate-limit) — Controls: Maximum number of parallel chat completion requests to prevent memory exhaustion
- KV Cache Size (threshold) — Controls: Maximum memory allocated for attention cache affecting context length and throughput
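One way to picture these five control points together is as a single configuration object. The field names and defaults below are purely illustrative; they do not match the real mlc-chat-config.json schema or engine options.

```python
# Hypothetical grouping of the five control points into one config object.
from dataclasses import dataclass


@dataclass
class EngineKnobs:
    model_architecture: str = "llama"     # architecture-switch
    quantization: str = "int4"            # precision-mode: fp16 | int8 | int4
    device: str = "cuda"                  # device-selection: cuda | rocm | metal | vulkan | opencl
    max_concurrent_requests: int = 8      # rate-limit
    kv_cache_bytes: int = 4 << 30         # threshold: memory budget for attention cache
```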
Technology Stack
- TVM: Compiles model operations into optimized kernels for different hardware backends
- FastAPI: Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings
- Pydantic: Validates and serializes API request/response schemas following OpenAI specification
- CMake: Builds cross-platform C++ inference engine with hardware-specific optimizations
- Android NDK: Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration
- SentencePiece: Tokenizes text input using vocabulary from trained language models
- Loads model weights and configurations from HuggingFace format during compilation
- Manages asynchronous streaming responses in Android applications
Key Components
- JSONFFIEngine (adapter) — Bridges JSON-based API requests to the native C++ inference engine, handling serialization and background processing
  cpp/json_ffi/json_ffi.cc
- ServeEngine (orchestrator) — Coordinates model loading, request queuing, memory management, and response generation across multiple requests
  cpp/serve/engine.cc
- RequestStreamManager (dispatcher) — Manages streaming responses for concurrent chat completion requests with backpressure control and cleanup
  python/mlc_llm/serve/engine/request_stream_manager.py
- ModelCompiler (transformer) — Converts HuggingFace models into TVM-compiled binaries with quantization and platform-specific optimization
  python/mlc_llm/compile/compile.py
- TokenizerManager (encoder) — Handles text-to-token conversion using SentencePiece or tiktoken backends with vocabulary management
  cpp/tokenizers/tokenizer.cc
- KVCacheManager (allocator) — Manages key-value cache memory allocation and paging for transformer attention across multiple requests
  cpp/serve/kv_cache.cc
- FastAPIServer (gateway) — Exposes OpenAI-compatible HTTP endpoints for chat completions, embeddings, and model management
  python/mlc_llm/serve/server.py
- MLCEngine (adapter) — Kotlin wrapper for Android apps that provides coroutine-based access to the native inference engine
  android/mlc4j/src/main/java/ai/mlc/mlcllm/MLCEngine.kt
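Because FastAPIServer exposes OpenAI-compatible routes, a standard OpenAI client can talk to it directly. A sketch assuming the openai Python package and a locally running server; the base URL and model name are placeholders.

```python
# Sketch: pointing the standard openai client at a locally running
# FastAPIServer instance.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Llama-3-8B-Instruct-q4f16_1-MLC",   # placeholder model id
    messages=[{"role": "user", "content": "Summarize what MLC LLM does."}],
)
print(response.choices[0].message.content)
```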
Frequently Asked Questions
What is mlc-llm used for?
mlc-ai/mlc-llm compiles and runs large language models efficiently across diverse hardware platforms. It is an 8-component ML inference engine written in Python, with data flowing through 5 distinct pipeline stages across a codebase of 508 files.
How is mlc-llm architected?
mlc-llm is organized into 4 architecture layers: Model Compilation, Runtime Engine, Platform Bindings, Protocol Layer. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through mlc-llm?
Data moves through 5 stages: Parse API request → Initialize request stream → Tokenize input messages → Execute model inference → Stream response tokens. Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses. This pipeline design reflects a complex multi-stage processing system.
What technologies does mlc-llm use?
The core stack includes TVM (Compiles model operations into optimized kernels for different hardware backends), FastAPI (Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings), Pydantic (Validates and serializes API request/response schemas following OpenAI specification), CMake (Builds cross-platform C++ inference engine with hardware-specific optimizations), Android NDK (Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration), SentencePiece (Tokenizes text input using vocabulary from trained language models), and 2 more. A focused set of dependencies that keeps the build manageable.
What system dynamics does mlc-llm have?
mlc-llm exhibits 4 data pools (including ModelRegistry and KVCachePool), 3 feedback loops, 5 control points, and 3 delays. The feedback loops cover recursive generation and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does mlc-llm use?
4 design patterns detected: Compilation-Runtime Split, Multi-Platform Abstraction, Memory Pool Management, Streaming Pipeline.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.