mlc-ai/mlc-llm

Universal LLM Deployment Engine with ML Compilation

22,494 stars · Python · 8 components

Compiles and runs large language models efficiently across diverse hardware platforms

Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

An 8-component ML inference system. 508 files analyzed. Data flows through 5 distinct pipeline stages.

How Data Flows Through the System

  1. Parse API request — FastAPIServer receives HTTP POST to /v1/chat/completions, deserializes JSON body into ChatCompletionRequest using Pydantic validation, checking required fields like messages and model name
  2. Initialize request stream — RequestStreamManager creates a unique request ID, sets up streaming response channel if needed, and queues the request for processing by the engine [ChatCompletionRequest → GenerationRequest]
  3. Tokenize input messages — TokenizerManager processes the conversation history through SentencePiece or tiktoken, converting text to token IDs and adding special tokens for chat format [ChatCompletionMessage → TokenSequence]
  4. Execute model inference — ServeEngine loads the compiled model binary, allocates KV-cache blocks, runs forward pass through transformer layers using TVM-optimized kernels, generating logits for next token prediction [GenerationRequest → TokenSequence]
  5. Stream response tokens — Generated tokens are decoded back to text, wrapped in OpenAI-compatible response format with usage statistics, and streamed as Server-Sent Events or returned as complete response [TokenSequence → ChatCompletionResponse]
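The five stages above can be exercised end-to-end from the client side. Below is a minimal sketch that builds and sends an OpenAI-style chat completion request, assuming an MLC-LLM server is already running locally on port 8000; the model name and base URL are placeholders, not values from the repo.

```python
# Minimal client for the pipeline above: stage 1 parses exactly this JSON body,
# and stage 5 returns the ChatCompletionResponse we unwrap at the end.
import json
import urllib.request

def build_chat_request(messages, model, stream=False):
    """Build the JSON body for POST /v1/chat/completions (OpenAI schema)."""
    return {"model": model, "messages": messages, "stream": stream}

def chat_completion(messages, model, base_url="http://127.0.0.1:8000"):
    """Send the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(messages, model)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

Setting `"stream": True` instead switches the server to Server-Sent Events, which a client consumes line by line rather than as one JSON document.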

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ChatCompletionRequest python/mlc_llm/protocol/openai_api_protocol.py
Pydantic model with messages: List[ChatCompletionMessage], model: str, temperature: float, max_tokens: int, stream: bool, response_format: dict
Created from HTTP request JSON, validated against schema, passed to engine for processing, then discarded after response generation
ChatCompletionMessage python/mlc_llm/protocol/openai_api_protocol.py
Pydantic model with role: str ('user'|'assistant'|'system'), content: Union[str, List[Dict]], tool_calls: List[ChatToolCall]
Extracted from chat request, converted to token sequences by tokenizer, used to build conversation context for generation
GenerationRequest cpp/serve/request.h
C++ struct with request_id: string, input_ids: vector<int>, generation_config: GenerationConfig, stream: bool
Created from API request, queued in request pool, processed by generation engine, results streamed back to client
CompiledModel python/mlc_llm/compile
Directory structure with model_lib.so (compiled kernels), params/ (quantized weights), mlc-chat-config.json (model metadata)
Generated during compilation from source model, stored on disk, loaded by runtime engine for inference
TokenSequence cpp/tokenizers
vector<int32_t> representing token IDs with special tokens for BOS, EOS, padding
Created by tokenizer from text input, fed to model for generation, decoded back to text in streaming response
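The request/message contracts above can be sketched with stdlib dataclasses. The real code uses Pydantic models in python/mlc_llm/protocol/openai_api_protocol.py; the defaults and validation rules here are illustrative assumptions, not the repo's actual values.

```python
# Simplified sketch of ChatCompletionRequest / ChatCompletionMessage.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChatCompletionMessage:
    role: str        # 'user' | 'assistant' | 'system'
    content: str

@dataclass
class ChatCompletionRequest:
    messages: List[ChatCompletionMessage]
    model: str
    temperature: float = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False

    def __post_init__(self):
        # Mirror the validation step: required fields must be present.
        if not self.messages:
            raise ValueError("messages must be non-empty")
        if not self.model:
            raise ValueError("model name is required")
```

The request object lives only for the duration of one generation, matching the "discarded after response generation" lifecycle noted above.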

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Domain unguarded

assumes image URIs passed to MessageData always point to valid, readable image files that can be decoded by BitmapFactory

If this fails: if URI points to corrupted file, network location, or non-image content, BitmapFactory.decodeStream() returns null causing NullPointerException when converting to base64 string

android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:imageUri parameter
critical Resource unguarded

assumes external storage directory exists and is writable, and that model files have been pre-installed in the exact expected path structure

If this fails: if external storage is unavailable, full, or model directory missing, MLCEngine initialization fails with cryptic native library errors instead of clear file not found errors

android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelPath creation
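One way to close this gap is to validate the compiled-model directory before handing it to the engine, so a missing file fails with a clear error instead of a cryptic native one. This is a sketch of such a guard, not code from the repo; the file names follow the CompiledModel layout described above, and the exact check is an assumption.

```python
# Fail fast with a readable error if the model directory is incomplete.
from pathlib import Path

REQUIRED_FILES = ("mlc-chat-config.json",)  # the model library name varies per build

def validate_model_dir(model_dir: str) -> Path:
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"model directory missing: {root}")
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not (root / "params").is_dir():
        missing.append("params/")
    if missing:
        raise FileNotFoundError(f"{root} is missing: {', '.join(missing)}")
    return root
```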
critical Contract unguarded

assumes all image bitmaps can be compressed to JPEG format and the resulting byte array fits in memory for base64 encoding

If this fails: large images (>50MB uncompressed) cause OutOfMemoryError during bitmap.compress() or Base64.encode(), crashing the chat session

android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:Base64 encoding
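The missing guard is a size check before encoding. The affected code is Kotlin in AppViewModel.kt; this Python sketch only illustrates the shape of the fix, and the 10 MB limit is an arbitrary illustrative value, not one from the app.

```python
# Reject oversized image payloads before base64 encoding instead of crashing.
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # illustrative cap, tune for the device

def encode_image(data: bytes) -> str:
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image too large to encode: {len(data)} bytes")
    return base64.b64encode(data).decode("ascii")
```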
critical Environment weakly guarded

assumes adb is installed, in PATH, device is connected via USB with debugging enabled, and device has sufficient storage space at /data/local/tmp/

If this fails: script fails silently or with cryptic errors if adb not found, device disconnected, or device storage full during weight file push

android/MLCChat/bundle_weight.py:adb commands
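A sketch of the guard the script skips: check that adb exists on PATH before pushing, and surface non-zero exit codes as real errors. The function names here are illustrative, not from bundle_weight.py.

```python
# Fail fast if adb is missing and report push failures with stderr attached.
import shutil
import subprocess

def build_push_cmd(src: str, dst: str) -> list:
    return ["adb", "push", src, dst]

def adb_push(src: str, dst: str) -> None:
    if shutil.which("adb") is None:
        raise EnvironmentError("adb not found on PATH; install Android platform-tools")
    result = subprocess.run(build_push_cmd(src, dst), capture_output=True, text=True)
    if result.returncode != 0:
        # A full device ("No space left on device") surfaces here instead of silently.
        raise RuntimeError(f"adb push failed: {result.stderr.strip()}")
```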
warning Scale unguarded

assumes the hardcoded model library name 'phi_msft_q4f16_1_686d8979c6ebf05d142d9081f1b87162' exactly matches the compiled .so file name in the app bundle

If this fails: if model is recompiled with different hash or quantization settings, native library loading fails with UnsatisfiedLinkError but no indication of name mismatch

android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:modelLib hardcoded string
warning Temporal unguarded

assumes image URIs returned by ActivityResultContracts.GetContent() remain valid and accessible for the lifetime of the chat session

If this fails: if user deletes image file or URI permissions are revoked after selection, subsequent access for message display or model input fails with SecurityException

android/MLCChat/app/src/main/java/ai/mlc/mlcchat/MainActivity.kt:pickImageLauncher
warning Ordering weakly guarded

assumes coroutine operations for chat completion can be cancelled cleanly when ViewModel is destroyed, without leaving native inference threads running

If this fails: if user rapidly switches between chat screens or force-closes app during generation, native MLCEngine threads may continue consuming GPU resources until process death

android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:viewModelScope.launch usage
warning Resource unguarded

assumes Android app has storage permissions and the hardcoded path '/storage/emulated/0/Android/data/ai.mlc.mlcchat/files/' is writable on all Android versions

If this fails: on Android 11+ with scoped storage, app may not have write access to this path causing weight file installation to fail silently

android/MLCChat/bundle_weight.py:device_weight_dir path
info Domain unguarded

assumes UUID.randomUUID() generates unique identifiers across all chat sessions and app restarts for message tracking

If this fails: while UUID collision is extremely unlikely, if it occurs, message updates could overwrite wrong messages in UI causing user confusion

android/MLCChat/app/src/main/java/ai/mlc/mlcchat/AppViewModel.kt:UUID generation for message IDs
info Environment unguarded

assumes GlobalScope coroutines are appropriate for MLCEngine operations and will not prevent proper app lifecycle cleanup

If this fails: GlobalScope coroutines survive activity destruction, potentially keeping MLCEngine references alive and preventing proper GPU memory cleanup

android/MLCEngineExample/app/src/main/java/ai/mlc/mlcengineexample/MainActivity.kt:GlobalScope.launch usage

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

ModelRegistry (registry)
Maps model names to compiled binary paths and configuration metadata for runtime loading
KVCachePool (buffer)
Pre-allocated memory blocks for transformer attention key-value pairs, reused across requests with paging
RequestQueue (queue)
FIFO queue of pending generation requests with priority handling and batching logic
ResponseStreamPool (in-memory)
Active streaming connections with backpressure control and cleanup on client disconnect
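The KVCachePool above can be pictured as a free list of pre-allocated block IDs handed out per request and recycled on release. Real paged KV caches track per-layer key/value tensors; this toy sketch models only the free-list bookkeeping and is an assumption about the pattern, not the engine's implementation.

```python
# Toy paged KV-cache pool: fixed blocks, allocated per request, reused after.
class KVCachePool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pre-allocated block IDs
        self.owned = {}                       # request_id -> [block IDs]

    def allocate(self, request_id: str, n: int) -> list:
        if n > len(self.free):
            raise MemoryError("KV cache exhausted; request must wait or evict")
        blocks = [self.free.pop() for _ in range(n)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release(self, request_id: str) -> None:
        # Return all of a finished request's blocks to the free list.
        self.free.extend(self.owned.pop(request_id, []))
```

The `MemoryError` branch is where the RequestQueue's batching logic would hold a request back until blocks are freed.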

Feedback Loops

Delays

Control Points

Technology Stack

TVM (compute)
Compiles model operations into optimized kernels for different hardware backends
FastAPI (framework)
Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings
Pydantic (serialization)
Validates and serializes API request/response schemas following OpenAI specification
CMake (build)
Builds cross-platform C++ inference engine with hardware-specific optimizations
Android NDK (runtime)
Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration
SentencePiece (library)
Tokenizes text input using vocabulary from trained language models
Transformers (library)
Loads model weights and configurations from HuggingFace format during compilation
Kotlin Coroutines (runtime)
Manages asynchronous streaming responses in Android applications
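The tokenization stage in the stack above (SentencePiece/tiktoken) can be illustrated with a toy round-trip: text to token IDs with BOS/EOS special tokens, then back. The vocabulary here is a made-up word map, nothing like a real subword vocabulary.

```python
# Toy tokenizer round-trip: illustrates BOS/EOS framing, not real subword splitting.
BOS, EOS = 1, 2
VOCAB = {"hello": 10, "world": 11}
INV = {v: k for k, v in VOCAB.items()}

def encode(text: str) -> list:
    """Text -> token IDs, framed with BOS/EOS as in the TokenSequence model."""
    return [BOS] + [VOCAB[w] for w in text.split()] + [EOS]

def decode(ids: list) -> str:
    """Token IDs -> text, dropping special tokens."""
    return " ".join(INV[i] for i in ids if i not in (BOS, EOS))
```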


Frequently Asked Questions

What is mlc-llm used for?

mlc-ai/mlc-llm compiles and runs large language models efficiently across diverse hardware platforms. It is an 8-component ML inference system written in Python. Data flows through 5 distinct pipeline stages, and the codebase contains 508 files.

How is mlc-llm architected?

mlc-llm is organized into 4 architecture layers: Model Compilation, Runtime Engine, Platform Bindings, Protocol Layer. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through mlc-llm?

Data moves through 5 stages: Parse API request → Initialize request stream → Tokenize input messages → Execute model inference → Stream response tokens. Requests enter through HTTP API or platform bindings, get parsed into OpenAI protocol messages, tokenized into integer sequences, processed by the compiled model engine with KV-cache management, and streamed back as generated text tokens that are decoded and formatted as JSON responses. This pipeline design reflects a complex multi-stage processing system.

What technologies does mlc-llm use?

The core stack includes TVM (Compiles model operations into optimized kernels for different hardware backends), FastAPI (Provides OpenAI-compatible HTTP API endpoints for chat completions and embeddings), Pydantic (Validates and serializes API request/response schemas following OpenAI specification), CMake (Builds cross-platform C++ inference engine with hardware-specific optimizations), Android NDK (Compiles C++ engine for Android ARM64 with OpenCL GPU acceleration), SentencePiece (Tokenizes text input using vocabulary from trained language models), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does mlc-llm have?

mlc-llm exhibits 4 data pools (ModelRegistry, KVCachePool), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle recursive generation and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does mlc-llm use?

4 design patterns detected: Compilation-Runtime Split, Multi-Platform Abstraction, Memory Pool Management, Streaming Pipeline.

Analyzed on April 20, 2026 by CodeSea.