predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading
Under the hood, the system uses 3 feedback loops, 4 data pools, and 4 control points to manage its runtime behavior.
A 6-component ML inference system. 218 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses.
- Accept HTTP request — The router receives HTTP POST requests at the /generate endpoint, deserializing JSON into Request objects with input text, generation parameters, and optional adapter_id/adapter_source fields; an example request is sketched after this list (config: max_input_length, max_total_tokens)
- Validate and load adapter — AdapterLoader checks if the specified adapter exists in cache, downloads from HuggingFace Hub/S3/filesystem if missing, and triggers preload on inference servers via gRPC DownloadAdapter call [Request → LoraConfig] (config: adapter_source, max_active_adapters)
- Schedule heterogeneous batch — Scheduler groups multiple requests with potentially different adapters into a single Batch, optimizing for max_batch_total_tokens and waiting_served_ratio constraints while tracking adapter indices [Request → Batch] (config: max_batch_total_tokens, waiting_served_ratio, max_waiting_tokens)
- Apply multi-LoRA inference — FlashCausalLM processes the batch by running base model forward pass and using LoraLinearLayer components to dynamically apply different adapter weights based on AdapterBatchData indices, leveraging SGMV kernels for efficiency [Batch → Generation] (config: use_sgmv, r, lora_alpha)
- Stream response tokens — Generated tokens are streamed back via gRPC Generation messages, converted to HTTP Server-Sent Events or JSON responses based on the stream parameter in original request [Generation] (config: stream)
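To make the entry point concrete, a request to the first stage might look like the sketch below. The host, sampling parameters, and adapter name are placeholders; the endpoint path and field names (inputs, parameters, adapter_id, adapter_source, stream) follow the description above.

```python
import requests

# Placeholder host and adapter name; field names follow the /generate
# request shape described in the pipeline stages above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize: LoRAX serves many adapters on one GPU.",
        "parameters": {"max_new_tokens": 64},
        "adapter_id": "my-org/my-lora-adapter",  # hypothetical adapter
        "adapter_source": "hub",
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```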
Data Models
The data structures that flow between stages — the contracts that hold the system together.
Request — clients/python/lorax/types.py — Pydantic model with inputs: str, parameters: Optional[Parameters], stream: bool, adapter_id: Optional[str], adapter_source: Optional[str], merged_adapters: Optional[MergedAdapters] (sketched in code after this section)
Created from HTTP request JSON, validated by router, queued for batching, sent to inference server via gRPC
Batch — router/client/src/pb.rs — gRPC message with id: u64, requests: Vec<Request>, size: u32, max_tokens: u32, containing multiple requests with their adapter configurations
Assembled by scheduler from queued requests, sent to server for parallel processing, results distributed back to individual requests
LoraConfig — server/lorax_server/adapters/lora.py — dataclass with r: int (rank), target_modules: List[str], lora_alpha: int, use_rslora: bool defining adapter architecture
Loaded from adapter config files, used to configure model layer modifications, cached for reuse across requests
server/lorax_server/utils/adapter.py — dataclass with adapter_source: str, adapter_index: int mapping requests to their specific adapters within a batch
Created during batch formation to track which adapter each request uses, consumed by model layers to apply correct LoRA weights
Generation — router/client/src/pb.rs — gRPC message with request_id: u64, prefill_tokens: NextTokens, tokens: NextTokens, generated_text: Optional[str], finish_reason: FinishReason
Produced by inference server for each request in a batch, contains generated tokens and metadata, streamed back to HTTP clients
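For the Python-side structures above, a minimal sketch of their shapes, assuming only the fields listed; the defaults, stubbed sub-models, and module layout are illustrative rather than the repository's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Optional
from pydantic import BaseModel

class MergedAdapters(BaseModel):     # stub of the merged-adapter payload
    ids: List[str]
    weights: List[float]

class Parameters(BaseModel):         # stub; the real model has many more fields
    max_new_tokens: int = 20
    temperature: Optional[float] = None

class Request(BaseModel):            # client-side request, fields as listed above
    inputs: str
    parameters: Optional[Parameters] = None
    stream: bool = False
    adapter_id: Optional[str] = None
    adapter_source: Optional[str] = None
    merged_adapters: Optional[MergedAdapters] = None

@dataclass
class LoraConfig:                    # adapter architecture, fields as listed above
    r: int                           # adapter rank
    target_modules: List[str]        # which linear layers the adapter touches
    lora_alpha: int                  # scaling numerator (effective scale = lora_alpha / r)
    use_rslora: bool = False         # rank-stabilized LoRA scaling variant
```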
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The model's config.json file contains a 'model_type' field that maps to a known model architecture, but there's no validation that the model_type is supported by the inference engine
If this fails: If a model has an unknown or unsupported model_type, the router will accept requests but the inference server may fail silently or produce wrong outputs when trying to load adapter layers for incompatible architectures
router/src/main.rs:get_model_info
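A guard that would surface this assumption early could look like the sketch below; the supported set is illustrative, not the engine's real registry.

```python
import json

# Illustrative set only; the real engine's supported architectures are
# defined by its model registry, not by this list.
SUPPORTED_MODEL_TYPES = {"llama", "mistral", "qwen2", "gpt_neox"}

def check_model_type(config_path: str) -> str:
    with open(config_path) as f:
        model_type = json.load(f).get("model_type")
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f"model_type {model_type!r} is not supported; "
            f"expected one of {sorted(SUPPORTED_MODEL_TYPES)}"
        )
    return model_type
```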
The 'ids' and 'weights' lists in MergedAdapters have the same length, validated only at request creation but not when adapters are actually merged in the inference server
If this fails: If the validation in the client is bypassed or the data is corrupted in transit, the inference server will try to merge adapters with mismatched weights arrays, leading to silent wrong outputs or crashes during tensor operations
clients/python/lorax/types.py:MergedAdapters
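A server-side re-check of the same invariant could look like this sketch, assuming the merged-adapter payload arrives as parallel ids and weights lists as described above.

```python
from typing import Sequence

def validate_merged_adapters(ids: Sequence[str], weights: Sequence[float]) -> None:
    # Re-check on the server what the client-side Pydantic model already
    # enforces, so corrupted or hand-crafted payloads fail loudly instead
    # of producing silently wrong merges.
    if not ids:
        raise ValueError("merged adapters list is empty")
    if len(ids) != len(weights):
        raise ValueError(
            f"merged adapters mismatch: {len(ids)} ids vs {len(weights)} weights"
        )
```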
LoRA adapter files downloaded from HuggingFace Hub, S3, or local filesystem have consistent internal structure (adapter_config.json + adapter_model.bin/safetensors) but no validation of file integrity or compatibility with the base model
If this fails: Corrupted downloads, mismatched adapter architectures, or adapters trained for different base models will be loaded and applied, producing silent wrong inference results instead of failing fast with clear errors
router/src/loader.rs:download_adapter
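A download-time sanity check might verify the expected files and the base model the adapter was trained against. The helper below is hypothetical; the base_model_name_or_path key is the usual PEFT convention and is assumed here.

```python
import json
import os

def check_adapter_dir(adapter_dir: str, expected_base_model: str) -> None:
    """Hypothetical integrity check run after download, before loading."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"missing adapter_config.json in {adapter_dir}")

    # Either safetensors or pickle weights must be present.
    if not any(
        os.path.exists(os.path.join(adapter_dir, name))
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    ):
        raise FileNotFoundError(f"missing adapter weights in {adapter_dir}")

    with open(config_path) as f:
        config = json.load(f)
    # base_model_name_or_path is the conventional PEFT field (assumed here).
    base = config.get("base_model_name_or_path")
    if base and base != expected_base_model:
        raise ValueError(
            f"adapter targets base model {base!r}, server runs {expected_base_model!r}"
        )
```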
The max_batch_total_tokens limit accurately reflects available GPU memory across all active adapters, but the calculation doesn't account for dynamic memory usage of different adapter ranks or the base model's varying memory footprint
If this fails: Batches may be accepted that exceed actual GPU memory when multiple high-rank adapters are active simultaneously, causing OOM crashes during inference instead of graceful batch size reduction
router/src/main.rs:max_batch_total_tokens
The 2-second adapter cycling timer is sufficient to detect which adapters are popular and should remain in GPU memory, but this assumes request patterns are stable over short time windows
If this fails: Bursty traffic patterns or adapters with infrequent but regular usage will be incorrectly evicted from GPU memory, causing unnecessary reloading delays and degraded performance for legitimate use cases
router/src/loader.rs:adapter_cycle_time_s
Requests within a batch can be processed in any order since they use different adapters, but the code assumes adapter indices in AdapterBatchData correspond to request order in the batch
If this fails: If batch ordering is modified during processing or adapter indices become misaligned, responses will be returned to the wrong requests, causing users to receive outputs generated with incorrect adapters
router/src/batch.rs:Entry
128 active adapters can fit in GPU memory simultaneously, but this number is hardcoded and doesn't account for varying adapter sizes (rank), GPU memory capacity, or base model size
If this fails: On smaller GPUs or with high-rank adapters, the system will attempt to load more adapters than memory allows, causing OOM crashes. On larger GPUs, memory is underutilized by artificially limiting to 128 adapters
router/src/main.rs:max_active_adapters
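Back-of-the-envelope sizing shows why a fixed cap is fragile. The sketch below estimates per-adapter memory from rank and the shapes of the targeted projections; the dimensions and layer count are illustrative, not measured from any particular model.

```python
def lora_adapter_bytes(rank, target_shapes, bytes_per_param=2):
    """Rough size of one layer's LoRA weights: each targeted projection of
    shape (out_features, in_features) adds A (rank x in) and B (out x rank)."""
    return bytes_per_param * sum(rank * (d_in + d_out) for d_out, d_in in target_shapes)

# Illustrative numbers: q/k/v/o projections of a 4096-dim model, fp16 weights.
per_layer_bytes = lora_adapter_bytes(rank=16, target_shapes=[(4096, 4096)] * 4)
per_adapter_bytes = per_layer_bytes * 32   # assume 32 transformer layers
print(f"~{per_adapter_bytes / 1e6:.0f} MB per adapter, "
      f"~{128 * per_adapter_bytes / 1e9:.1f} GB for 128 adapters")
```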
The HTTP client assumes network connections are reliable and retries are handled transparently, but streaming responses can be interrupted without proper detection of partial generation
If this fails: Network interruptions during streaming inference will leave requests in an inconsistent state - users may receive partial outputs that they interpret as complete, leading to downstream application errors
clients/python/lorax/client.py:requests.post
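Callers can partially protect themselves by treating a stream that ends without a finish reason as incomplete. The sketch below assumes a Server-Sent Events stream with token and details fields in each event; the exact endpoint and payload shape are assumptions.

```python
import json
import requests

def stream_generation(url, payload, timeout=60):
    """Consume an SSE stream and fail loudly if it ends without a finish
    reason; the event payload shape here is an assumption."""
    finished = False
    pieces = []
    with requests.post(url, json=payload, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            event = json.loads(line[len(b"data:"):])
            token = (event.get("token") or {}).get("text")
            if token:
                pieces.append(token)
            if (event.get("details") or {}).get("finish_reason"):
                finished = True
    if not finished:
        raise RuntimeError("stream ended without a finish_reason; output may be partial")
    return "".join(pieces)
```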
gRPC health checks accurately reflect the inference server's ability to process requests, but checks don't validate that adapter loading mechanisms or GPU memory management are functioning correctly
If this fails: The router will continue sending requests to inference servers that report healthy but have broken adapter loading, causing requests to hang or fail with cryptic errors instead of routing to healthy servers
router/src/infer.rs:health_check
Adapter sources 'hub', 'local', 's3', 'pbase' are mutually exclusive and have consistent authentication/access patterns, but the code doesn't validate that adapter_id format matches the specified source
If this fails: Requests with mismatched adapter_id formats (e.g., HuggingFace path used with 's3' source) will either fail with confusing errors or attempt to download from wrong locations, wasting time and potentially exposing authentication tokens
clients/python/lorax/types.py:ADAPTER_SOURCES
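A lightweight consistency check between adapter_source and adapter_id could look like the heuristics below; the rules are illustrative, not the client's actual validation.

```python
def check_adapter_reference(adapter_id: str, adapter_source: str) -> None:
    # Illustrative heuristics only; the recognized sources ('hub', 'local',
    # 's3', 'pbase') come from the assumption above.
    if adapter_source == "s3" and not adapter_id.startswith("s3://"):
        raise ValueError("s3 source expects an s3:// URI")
    if adapter_source == "hub" and adapter_id.startswith(("s3://", "/", "./")):
        raise ValueError("hub source expects an org/name repository id")
    if adapter_source == "local" and not adapter_id.startswith(("/", "./")):
        raise ValueError("local source expects a filesystem path")
```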
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Adapter cache — Downloads and caches LoRA adapter weights from various sources to avoid re-downloading on subsequent requests
- Request queue — Buffers incoming requests waiting to be batched, implementing backpressure when the system is overloaded
- KV cache — Stores attention key-value tensors for each sequence to enable efficient autoregressive generation without recomputing past tokens
- Prefix cache — Radix tree storing common prompt prefixes to share computation across requests with similar beginnings
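The prefix-cache idea can be pictured with a toy longest-prefix lookup (a plain dict here rather than a radix tree): reuse whatever the longest already-cached prefix covers and only compute the remainder.

```python
class PrefixCache:
    """Toy stand-in for the router's radix tree: map cached token prefixes
    to reusable cache handles and return the longest hit."""

    def __init__(self):
        self._prefixes = {}   # tuple of token ids -> opaque handle

    def insert(self, tokens, handle):
        self._prefixes[tuple(tokens)] = handle

    def longest_match(self, tokens):
        # Walk from the longest candidate prefix down to the shortest.
        for end in range(len(tokens), 0, -1):
            handle = self._prefixes.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-block-0")
print(cache.longest_match([1, 2, 3, 4, 5]))   # (3, 'kv-block-0')
```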
Feedback Loops
- Adapter prefetch cycle (polling, balancing) — Trigger: adapter_cycle_time_s timer expiration. Action: AdapterLoader prefetches popular adapters to GPU memory and offloads unused ones to CPU. Exit: All active adapters fit in GPU memory.
- Backpressure regulation (backpressure, balancing) — Trigger: Queue length exceeds capacity. Action: Router returns HTTP 503 errors and reduces batch sizes. Exit: Queue drains below threshold.
- Health check retry (retry, balancing) — Trigger: gRPC health check failure. Action: Infer retries connection to inference server with exponential backoff. Exit: Server responds healthy or max retries exceeded.
Delays
- Adapter download (async-processing, ~varies by adapter size) — First request for an adapter blocks until weights are downloaded and cached
- Batch accumulation (batch-window, ~waiting_served_ratio * processing_time) — Requests wait for optimal batch size before processing begins
- GPU warmup (warmup, ~model loading time) — First inference request experiences higher latency while model loads onto GPU
Control Points
- max_active_adapters (threshold) — Controls: How many LoRA adapters can be kept in GPU memory simultaneously before offloading to CPU. Default: 128
- use_sgmv (architecture-switch) — Controls: Whether to use SGMV kernels for efficient multi-adapter batch processing or fall back to sequential application. Default: true
- max_batch_total_tokens (threshold) — Controls: Maximum total tokens across all requests in a single batch, limiting GPU memory usage. Default: varies by model
- adapter_cycle_time_s (rate-limit) — Controls: How frequently the system prefetches popular adapters and offloads unused ones. Default: 2
Technology Stack
- Rust — Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety
- Python — Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation
- gRPC — Enables high-throughput communication between the Rust router and Python inference servers with structured message passing
- PyTorch — Loads transformer models and executes neural network forward passes with GPU acceleration
- CUDA — Custom kernels (flash-attention, SGMV, paged attention) optimize GPU memory usage and computation for multi-adapter inference
- HuggingFace Hub — Downloads base models and LoRA adapters from the community model repository
- Rust web framework — Handles HTTP REST API endpoints and Server-Sent Event streaming responses
- Pydantic — Validates and serializes request/response data structures in the Python client and type definitions
Key Components
- Infer (orchestrator) — Coordinates the entire inference pipeline by managing gRPC communication with model shards, handling health checks, and distributing requests across multiple GPU processes — router/src/infer.rs
- Scheduler (scheduler) — Implements heterogeneous continuous batching by grouping requests with different adapters into the same batch while optimizing for throughput and latency constraints — router/src/scheduler.rs
- AdapterLoader (loader) — Downloads LoRA adapter weights from various sources (HuggingFace Hub, S3, local filesystem), caches them, and triggers preloading on inference servers — router/src/loader.rs
- FlashCausalLM (processor) — Main inference engine that loads base transformer models, dynamically applies LoRA adapters per request, and runs batched forward passes with optimized attention kernels — server/lorax_server/models/flash_causal_lm.py
- LoraLinearLayer (adapter) — Wraps transformer linear layers to dynamically apply different LoRA weights based on batch adapter indices, using SGMV kernels for efficient multi-adapter computation (a conceptual sketch follows this list) — server/lorax_server/layers/linear.py
- RadixStateMachine (optimizer) — Manages prefix caching using a radix tree structure to share common prompt prefixes across requests, reducing redundant computation for similar inputs — router/src/radix.rs
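The per-request adapter application in LoraLinearLayer can be pictured with the fallback (non-SGMV) path: group batch rows by adapter index and add each adapter's low-rank delta on top of the shared base projection. This is a conceptual PyTorch sketch, not the repository's kernel code.

```python
import torch

def multi_lora_linear(x, base_weight, adapters, adapter_indices):
    """Conceptual fallback path for a multi-LoRA linear layer.

    x:               (batch, in_features) input rows
    base_weight:     (out_features, in_features) shared base projection
    adapters:        dict adapter_index -> (A, B, scaling) with
                     A: (r, in_features), B: (out_features, r)
    adapter_indices: (batch,) tensor mapping each row to its adapter
    """
    y = x @ base_weight.t()                        # shared base projection
    for idx, (A, B, scaling) in adapters.items():
        mask = adapter_indices == idx              # rows served by this adapter
        if mask.any():
            y[mask] += scaling * (x[mask] @ A.t()) @ B.t()
    return y

# Tiny illustrative shapes, not a real model.
x = torch.randn(4, 8)
w = torch.randn(16, 8)
adapters = {0: (torch.randn(2, 8), torch.randn(16, 2), 0.5),
            1: (torch.randn(4, 8), torch.randn(16, 4), 0.25)}
out = multi_lora_linear(x, w, adapters, torch.tensor([0, 1, 0, 1]))
print(out.shape)   # torch.Size([4, 16])
```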
Frequently Asked Questions
What is lorax used for?
lorax serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading. predibase/lorax is a 6-component ML inference system written in Python and Rust. Data flows through 5 distinct pipeline stages. The codebase contains 218 files.
How is lorax architected?
lorax is organized into 5 architecture layers: HTTP/gRPC Router, Inference Engine, Adapter Management, Batch Scheduler, and 1 more. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through lorax?
Data moves through 5 stages: Accept HTTP request → Validate and load adapter → Schedule heterogeneous batch → Apply multi-LoRA inference → Stream response tokens. HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses. This pipeline design reflects a complex multi-stage processing system.
What technologies does lorax use?
The core stack includes Rust (Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety), Python (Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation), gRPC (Enables high-throughput communication between the Rust router and Python inference servers with structured message passing), PyTorch (Loads transformer models and executes neural network forward passes with GPU acceleration), CUDA (Custom kernels for flash-attention, SGMV, and paged attention that optimize GPU memory usage and computation for multi-adapter inference), HuggingFace Hub (Downloads base models and LoRA adapters from the community model repository), and 2 more. A focused set of dependencies keeps the build manageable.
What system dynamics does lorax have?
lorax exhibits 4 data pools (adapter cache, request queue, KV cache, prefix cache), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle polling and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does lorax use?
4 design patterns detected: Heterogeneous batching, Just-in-time adapter loading, Multi-process orchestration, Prefix caching.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.