predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading
Under the hood, the system uses 3 feedback loops, 4 data pools, and 4 control points to manage its runtime behavior.
A 6-component ML inference system. 218 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses.
- Accept HTTP request — The router receives HTTP POST requests at the /generate endpoint, deserializing JSON into Request objects with input text, generation parameters, and optional adapter_id/adapter_source fields; an example request is sketched after this list (config: max_input_length, max_total_tokens)
- Validate and load adapter — AdapterLoader checks if the specified adapter exists in cache, downloads from HuggingFace Hub/S3/filesystem if missing, and triggers preload on inference servers via gRPC DownloadAdapter call [Request → LoraConfig] (config: adapter_source, max_active_adapters)
- Schedule heterogeneous batch — Scheduler groups multiple requests with potentially different adapters into a single Batch, optimizing for max_batch_total_tokens and waiting_served_ratio constraints while tracking adapter indices [Request → Batch] (config: max_batch_total_tokens, waiting_served_ratio, max_waiting_tokens)
- Apply multi-LoRA inference — FlashCausalLM processes the batch by running base model forward pass and using LoraLinearLayer components to dynamically apply different adapter weights based on AdapterBatchData indices, leveraging SGMV kernels for efficiency [Batch → Generation] (config: use_sgmv, r, lora_alpha)
- Stream response tokens — Generated tokens are streamed back via gRPC Generation messages, converted to HTTP Server-Sent Events or JSON responses based on the stream parameter in original request [Generation] (config: stream)
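To make the entry point concrete, a request to the first stage might look like the sketch below. The host, sampling parameters, and adapter name are placeholders; the endpoint path and field names (inputs, parameters, adapter_id, adapter_source, stream) follow the description above.

```python
import requests

# Placeholder host and adapter name; field names follow the /generate
# request shape described in the pipeline stages above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize: LoRAX serves many adapters on one GPU.",
        "parameters": {"max_new_tokens": 64},
        "adapter_id": "my-org/my-lora-adapter",  # hypothetical adapter
        "adapter_source": "hub",
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```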
Data Models
The data structures that flow between stages — the contracts that hold the system together.
Request — clients/python/lorax/types.py — Pydantic model with inputs: str, parameters: Optional[Parameters], stream: bool, adapter_id: Optional[str], adapter_source: Optional[str], merged_adapters: Optional[MergedAdapters] (sketched in code after this section)
Created from HTTP request JSON, validated by router, queued for batching, sent to inference server via gRPC
Batch — router/client/src/pb.rs — gRPC message with id: u64, requests: Vec<Request>, size: u32, max_tokens: u32, containing multiple requests with their adapter configurations
Assembled by scheduler from queued requests, sent to server for parallel processing, results distributed back to individual requests
LoraConfig — server/lorax_server/adapters/lora.py — dataclass with r: int (rank), target_modules: List[str], lora_alpha: int, use_rslora: bool defining adapter architecture
Loaded from adapter config files, used to configure model layer modifications, cached for reuse across requests
server/lorax_server/utils/adapter.py — dataclass with adapter_source: str, adapter_index: int mapping requests to their specific adapters within a batch
Created during batch formation to track which adapter each request uses, consumed by model layers to apply correct LoRA weights
Generation — router/client/src/pb.rs — gRPC message with request_id: u64, prefill_tokens: NextTokens, tokens: NextTokens, generated_text: Optional[str], finish_reason: FinishReason
Produced by inference server for each request in a batch, contains generated tokens and metadata, streamed back to HTTP clients
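For the Python-side structures above, a minimal sketch of their shapes, assuming only the fields listed; the defaults, stubbed sub-models, and module layout are illustrative rather than the repository's actual definitions.

```python
from dataclasses import dataclass
from typing import List, Optional
from pydantic import BaseModel

class MergedAdapters(BaseModel):     # stub of the merged-adapter payload
    ids: List[str]
    weights: List[float]

class Parameters(BaseModel):         # stub; the real model has many more fields
    max_new_tokens: int = 20
    temperature: Optional[float] = None

class Request(BaseModel):            # client-side request, fields as listed above
    inputs: str
    parameters: Optional[Parameters] = None
    stream: bool = False
    adapter_id: Optional[str] = None
    adapter_source: Optional[str] = None
    merged_adapters: Optional[MergedAdapters] = None

@dataclass
class LoraConfig:                    # adapter architecture, fields as listed above
    r: int                           # adapter rank
    target_modules: List[str]        # which linear layers the adapter touches
    lora_alpha: int                  # scaling numerator (effective scale = lora_alpha / r)
    use_rslora: bool = False         # rank-stabilized LoRA scaling variant
```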
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The model's config.json file contains a 'model_type' field that maps to a known model architecture, but there's no validation that the model_type is supported by the inference engine
If this fails: If a model has an unknown or unsupported model_type, the router will accept requests but the inference server may fail silently or produce wrong outputs when trying to load adapter layers for incompatible architectures
router/src/main.rs:get_model_info
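A guard that would surface this assumption early could look like the sketch below; the supported set is illustrative, not the engine's real registry.

```python
import json

# Illustrative set only; the real engine's supported architectures are
# defined by its model registry, not by this list.
SUPPORTED_MODEL_TYPES = {"llama", "mistral", "qwen2", "gpt_neox"}

def check_model_type(config_path: str) -> str:
    with open(config_path) as f:
        model_type = json.load(f).get("model_type")
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(
            f"model_type {model_type!r} is not supported; "
            f"expected one of {sorted(SUPPORTED_MODEL_TYPES)}"
        )
    return model_type
```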
The 'ids' and 'weights' lists in MergedAdapters have the same length, validated only at request creation but not when adapters are actually merged in the inference server
If this fails: If the validation in the client is bypassed or the data is corrupted in transit, the inference server will try to merge adapters with mismatched weights arrays, leading to silent wrong outputs or crashes during tensor operations
clients/python/lorax/types.py:MergedAdapters
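A server-side re-check of the same invariant could look like this sketch, assuming the merged-adapter payload arrives as parallel ids and weights lists as described above.

```python
from typing import Sequence

def validate_merged_adapters(ids: Sequence[str], weights: Sequence[float]) -> None:
    # Re-check on the server what the client-side Pydantic model already
    # enforces, so corrupted or hand-crafted payloads fail loudly instead
    # of producing silently wrong merges.
    if not ids:
        raise ValueError("merged adapters list is empty")
    if len(ids) != len(weights):
        raise ValueError(
            f"merged adapters mismatch: {len(ids)} ids vs {len(weights)} weights"
        )
```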
LoRA adapter files downloaded from HuggingFace Hub, S3, or local filesystem have consistent internal structure (adapter_config.json + adapter_model.bin/safetensors) but no validation of file integrity or compatibility with the base model
If this fails: Corrupted downloads, mismatched adapter architectures, or adapters trained for different base models will be loaded and applied, producing silent wrong inference results instead of failing fast with clear errors
router/src/loader.rs:download_adapter
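A download-time sanity check might verify the expected files and the base model the adapter was trained against. The helper below is hypothetical; the base_model_name_or_path key is the usual PEFT convention and is assumed here.

```python
import json
import os

def check_adapter_dir(adapter_dir: str, expected_base_model: str) -> None:
    """Hypothetical integrity check run after download, before loading."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"missing adapter_config.json in {adapter_dir}")

    # Either safetensors or pickle weights must be present.
    if not any(
        os.path.exists(os.path.join(adapter_dir, name))
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    ):
        raise FileNotFoundError(f"missing adapter weights in {adapter_dir}")

    with open(config_path) as f:
        config = json.load(f)
    # base_model_name_or_path is the conventional PEFT field (assumed here).
    base = config.get("base_model_name_or_path")
    if base and base != expected_base_model:
        raise ValueError(
            f"adapter targets base model {base!r}, server runs {expected_base_model!r}"
        )
```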
The max_batch_total_tokens limit accurately reflects available GPU memory across all active adapters, but the calculation doesn't account for dynamic memory usage of different adapter ranks or the base model's varying memory footprint
If this fails: Batches may be accepted that exceed actual GPU memory when multiple high-rank adapters are active simultaneously, causing OOM crashes during inference instead of graceful batch size reduction
router/src/main.rs:max_batch_total_tokens
The 2-second adapter cycling timer is sufficient to detect which adapters are popular and should remain in GPU memory, but this assumes request patterns are stable over short time windows
If this fails: Bursty traffic patterns or adapters with infrequent but regular usage will be incorrectly evicted from GPU memory, causing unnecessary reloading delays and degraded performance for legitimate use cases
router/src/loader.rs:adapter_cycle_time_s
Requests within a batch can be processed in any order since they use different adapters, but the code assumes adapter indices in AdapterBatchData correspond to request order in the batch
If this fails: If batch ordering is modified during processing or adapter indices become misaligned, responses will be returned to the wrong requests, causing users to receive outputs generated with incorrect adapters
router/src/batch.rs:Entry
128 active adapters can fit in GPU memory simultaneously, but this number is hardcoded and doesn't account for varying adapter sizes (rank), GPU memory capacity, or base model size
If this fails: On smaller GPUs or with high-rank adapters, the system will attempt to load more adapters than memory allows, causing OOM crashes. On larger GPUs, memory is underutilized by artificially limiting to 128 adapters
router/src/main.rs:max_active_adapters
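Back-of-the-envelope sizing shows why a fixed cap is fragile. The sketch below estimates per-adapter memory from rank and the shapes of the targeted projections; the dimensions and layer count are illustrative, not measured from any particular model.

```python
def lora_adapter_bytes(rank, target_shapes, bytes_per_param=2):
    """Rough size of one layer's LoRA weights: each targeted projection of
    shape (out_features, in_features) adds A (rank x in) and B (out x rank)."""
    return bytes_per_param * sum(rank * (d_in + d_out) for d_out, d_in in target_shapes)

# Illustrative numbers: q/k/v/o projections of a 4096-dim model, fp16 weights.
per_layer_bytes = lora_adapter_bytes(rank=16, target_shapes=[(4096, 4096)] * 4)
per_adapter_bytes = per_layer_bytes * 32   # assume 32 transformer layers
print(f"~{per_adapter_bytes / 1e6:.0f} MB per adapter, "
      f"~{128 * per_adapter_bytes / 1e9:.1f} GB for 128 adapters")
```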
The HTTP client assumes network connections are reliable and retries are handled transparently, but streaming responses can be interrupted without proper detection of partial generation
If this fails: Network interruptions during streaming inference will leave requests in an inconsistent state - users may receive partial outputs that they interpret as complete, leading to downstream application errors
clients/python/lorax/client.py:requests.post
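Callers can partially protect themselves by treating a stream that ends without a finish reason as incomplete. The sketch below assumes a Server-Sent Events stream with token and details fields in each event; the exact endpoint and payload shape are assumptions.

```python
import json
import requests

def stream_generation(url, payload, timeout=60):
    """Consume an SSE stream and fail loudly if it ends without a finish
    reason; the event payload shape here is an assumption."""
    finished = False
    pieces = []
    with requests.post(url, json=payload, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            event = json.loads(line[len(b"data:"):])
            token = (event.get("token") or {}).get("text")
            if token:
                pieces.append(token)
            if (event.get("details") or {}).get("finish_reason"):
                finished = True
    if not finished:
        raise RuntimeError("stream ended without a finish_reason; output may be partial")
    return "".join(pieces)
```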
gRPC health checks accurately reflect the inference server's ability to process requests, but checks don't validate that adapter loading mechanisms or GPU memory management are functioning correctly
If this fails: The router will continue sending requests to inference servers that report healthy but have broken adapter loading, causing requests to hang or fail with cryptic errors instead of routing to healthy servers
router/src/infer.rs:health_check
Adapter sources 'hub', 'local', 's3', 'pbase' are mutually exclusive and have consistent authentication/access patterns, but the code doesn't validate that adapter_id format matches the specified source
If this fails: Requests with mismatched adapter_id formats (e.g., HuggingFace path used with 's3' source) will either fail with confusing errors or attempt to download from wrong locations, wasting time and potentially exposing authentication tokens
clients/python/lorax/types.py:ADAPTER_SOURCES
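A lightweight consistency check between adapter_source and adapter_id could look like the heuristics below; the rules are illustrative, not the client's actual validation.

```python
def check_adapter_reference(adapter_id: str, adapter_source: str) -> None:
    # Illustrative heuristics only; the recognized sources ('hub', 'local',
    # 's3', 'pbase') come from the assumption above.
    if adapter_source == "s3" and not adapter_id.startswith("s3://"):
        raise ValueError("s3 source expects an s3:// URI")
    if adapter_source == "hub" and adapter_id.startswith(("s3://", "/", "./")):
        raise ValueError("hub source expects an org/name repository id")
    if adapter_source == "local" and not adapter_id.startswith(("/", "./")):
        raise ValueError("local source expects a filesystem path")
```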
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Adapter cache — Downloads and caches LoRA adapter weights from various sources to avoid re-downloading on subsequent requests
- Request queue — Buffers incoming requests waiting to be batched, implementing backpressure when the system is overloaded
- KV cache — Stores attention key-value tensors for each sequence to enable efficient autoregressive generation without recomputing past tokens
- Prefix cache — Radix tree storing common prompt prefixes to share computation across requests with similar beginnings
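The prefix-cache idea can be pictured with a toy longest-prefix lookup (a plain dict here rather than a radix tree): reuse whatever the longest already-cached prefix covers and only compute the remainder.

```python
class PrefixCache:
    """Toy stand-in for the router's radix tree: map cached token prefixes
    to reusable cache handles and return the longest hit."""

    def __init__(self):
        self._prefixes = {}   # tuple of token ids -> opaque handle

    def insert(self, tokens, handle):
        self._prefixes[tuple(tokens)] = handle

    def longest_match(self, tokens):
        # Walk from the longest candidate prefix down to the shortest.
        for end in range(len(tokens), 0, -1):
            handle = self._prefixes.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-block-0")
print(cache.longest_match([1, 2, 3, 4, 5]))   # (3, 'kv-block-0')
```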
Feedback Loops
- Adapter prefetch cycle (polling, balancing) — Trigger: adapter_cycle_time_s timer expiration. Action: AdapterLoader prefetches popular adapters to GPU memory and offloads unused ones to CPU. Exit: All active adapters fit in GPU memory.
- Backpressure regulation (backpressure, balancing) — Trigger: Queue length exceeds capacity. Action: Router returns HTTP 503 errors and reduces batch sizes. Exit: Queue drains below threshold.
- Health check retry (retry, balancing) — Trigger: gRPC health check failure. Action: Infer retries connection to inference server with exponential backoff. Exit: Server responds healthy or max retries exceeded.
Delays
- Adapter download (async-processing, ~varies by adapter size) — First request for an adapter blocks until weights are downloaded and cached
- Batch accumulation (batch-window, ~waiting_served_ratio * processing_time) — Requests wait for optimal batch size before processing begins
- GPU warmup (warmup, ~model loading time) — First inference request experiences higher latency while model loads onto GPU
Control Points
- max_active_adapters (threshold) — Controls: How many LoRA adapters can be kept in GPU memory simultaneously before offloading to CPU. Default: 128
- use_sgmv (architecture-switch) — Controls: Whether to use SGMV kernels for efficient multi-adapter batch processing or fall back to sequential application. Default: true
- max_batch_total_tokens (threshold) — Controls: Maximum total tokens across all requests in a single batch, limiting GPU memory usage. Default: varies by model
- adapter_cycle_time_s (rate-limit) — Controls: How frequently the system prefetches popular adapters and offloads unused ones. Default: 2
Technology Stack
- Rust — Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety
- Python — Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation
- gRPC — Enables high-throughput communication between the Rust router and Python inference servers with structured message passing
- PyTorch — Loads transformer models and executes neural network forward passes with GPU acceleration
- CUDA — Custom kernels (flash-attention, SGMV, paged attention) optimize GPU memory usage and computation for multi-adapter inference
- HuggingFace Hub — Downloads base models and LoRA adapters from the community model repository
- Rust web framework — Handles HTTP REST API endpoints and Server-Sent Event streaming responses
- Pydantic — Validates and serializes request/response data structures in the Python client and type definitions
Key Components
- Infer (orchestrator) — Coordinates the entire inference pipeline by managing gRPC communication with model shards, handling health checks, and distributing requests across multiple GPU processes — router/src/infer.rs
- Scheduler (scheduler) — Implements heterogeneous continuous batching by grouping requests with different adapters into the same batch while optimizing for throughput and latency constraints — router/src/scheduler.rs
- AdapterLoader (loader) — Downloads LoRA adapter weights from various sources (HuggingFace Hub, S3, local filesystem), caches them, and triggers preloading on inference servers — router/src/loader.rs
- FlashCausalLM (processor) — Main inference engine that loads base transformer models, dynamically applies LoRA adapters per request, and runs batched forward passes with optimized attention kernels — server/lorax_server/models/flash_causal_lm.py
- LoraLinearLayer (adapter) — Wraps transformer linear layers to dynamically apply different LoRA weights based on batch adapter indices, using SGMV kernels for efficient multi-adapter computation (a conceptual sketch follows this list) — server/lorax_server/layers/linear.py
- RadixStateMachine (optimizer) — Manages prefix caching using a radix tree structure to share common prompt prefixes across requests, reducing redundant computation for similar inputs — router/src/radix.rs
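The per-request adapter application in LoraLinearLayer can be pictured with the fallback (non-SGMV) path: group batch rows by adapter index and add each adapter's low-rank delta on top of the shared base projection. This is a conceptual PyTorch sketch, not the repository's kernel code.

```python
import torch

def multi_lora_linear(x, base_weight, adapters, adapter_indices):
    """Conceptual fallback path for a multi-LoRA linear layer.

    x:               (batch, in_features) input rows
    base_weight:     (out_features, in_features) shared base projection
    adapters:        dict adapter_index -> (A, B, scaling) with
                     A: (r, in_features), B: (out_features, r)
    adapter_indices: (batch,) tensor mapping each row to its adapter
    """
    y = x @ base_weight.t()                        # shared base projection
    for idx, (A, B, scaling) in adapters.items():
        mask = adapter_indices == idx              # rows served by this adapter
        if mask.any():
            y[mask] += scaling * (x[mask] @ A.t()) @ B.t()
    return y

# Tiny illustrative shapes, not a real model.
x = torch.randn(4, 8)
w = torch.randn(16, 8)
adapters = {0: (torch.randn(2, 8), torch.randn(16, 2), 0.5),
            1: (torch.randn(4, 8), torch.randn(16, 4), 0.25)}
out = multi_lora_linear(x, w, adapters, torch.tensor([0, 1, 0, 1]))
print(out.shape)   # torch.Size([4, 16])
```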
Frequently Asked Questions
What is lorax used for?
lorax serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading. predibase/lorax is a 6-component ML inference system written in Python and Rust. Data flows through 5 distinct pipeline stages. The codebase contains 218 files.
How is lorax architected?
lorax is organized into 5 architecture layers: HTTP/gRPC Router, Inference Engine, Adapter Management, Batch Scheduler, and 1 more. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through lorax?
Data moves through 5 stages: Accept HTTP request → Validate and load adapter → Schedule heterogeneous batch → Apply multi-LoRA inference → Stream response tokens. HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses. This pipeline design reflects a complex multi-stage processing system.
What technologies does lorax use?
The core stack includes Rust (Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety), Python (Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation), gRPC (Enables high-throughput communication between the Rust router and Python inference servers with structured message passing), PyTorch (Loads transformer models and executes neural network forward passes with GPU acceleration), CUDA (Custom kernels for flash-attention, SGMV, and paged attention that optimize GPU memory usage and computation for multi-adapter inference), HuggingFace Hub (Downloads base models and LoRA adapters from the community model repository), and 2 more. A focused set of dependencies keeps the build manageable.
What system dynamics does lorax have?
lorax exhibits 4 data pools (adapter cache, request queue, KV cache, prefix cache), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle polling and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does lorax use?
4 design patterns detected: Heterogeneous batching, Just-in-time adapter loading, Multi-process orchestration, Prefix caching.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.