predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

3,756 stars · Python · 6 components

Serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading

HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 4 control points to manage its runtime behavior.

A 6-component ML inference system. 218 files analyzed. Data flows through 5 distinct pipeline stages.

How Data Flows Through the System

  1. Accept HTTP request — The router receives HTTP POST requests at /generate endpoint, deserializing JSON into Request objects with input text, generation parameters, and optional adapter_id/adapter_source fields (config: max_input_length, max_total_tokens)
  2. Validate and load adapter — AdapterLoader checks if the specified adapter exists in cache, downloads from HuggingFace Hub/S3/filesystem if missing, and triggers preload on inference servers via gRPC DownloadAdapter call [Request → LoraConfig] (config: adapter_source, max_active_adapters)
  3. Schedule heterogeneous batch — Scheduler groups multiple requests with potentially different adapters into a single Batch, optimizing for max_batch_total_tokens and waiting_served_ratio constraints while tracking adapter indices [Request → Batch] (config: max_batch_total_tokens, waiting_served_ratio, max_waiting_tokens)
  4. Apply multi-LoRA inference — FlashCausalLM processes the batch by running base model forward pass and using LoraLinearLayer components to dynamically apply different adapter weights based on AdapterBatchData indices, leveraging SGMV kernels for efficiency [Batch → Generation] (config: use_sgmv, r, lora_alpha)
  5. Stream response tokens — Generated tokens are streamed back via gRPC Generation messages, converted to HTTP Server-Sent Events or JSON responses based on the stream parameter in original request [Generation] (config: stream)
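
Concretely, a client kicks off stage 1 with a JSON body shaped like the Request model. The sketch below builds such a payload; the parameter defaults and the choice of the 'hub' source are illustrative assumptions, not lorax's defaults:

```python
import json

def build_generate_payload(prompt, adapter_id=None, max_new_tokens=64, stream=False):
    """Assemble a JSON body for POST /generate. Field names follow the
    Request model; the defaults here are illustrative, not lorax's."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
        "stream": stream,
    }
    if adapter_id is not None:
        payload["adapter_id"] = adapter_id      # e.g. a HuggingFace Hub repo id
        payload["adapter_source"] = "hub"       # hub | local | s3 | pbase
    return json.dumps(payload)

body = build_generate_payload("Summarize:", adapter_id="org/my-lora")
```

Omitting adapter_id falls through to the base model, which is why a single endpoint can serve both plain and adapter-specific traffic.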

Data Models

The data structures that flow between stages — the contracts that hold the system together.

Request clients/python/lorax/types.py
Pydantic model with inputs: str, parameters: Optional[Parameters], stream: bool, adapter_id: Optional[str], adapter_source: Optional[str], merged_adapters: Optional[MergedAdapters]
Created from HTTP request JSON, validated by router, queued for batching, sent to inference server via gRPC
Batch router/client/src/pb.rs
gRPC message with id: u64, requests: Vec<Request>, size: u32, max_tokens: u32, containing multiple requests with their adapter configurations
Assembled by scheduler from queued requests, sent to server for parallel processing, results distributed back to individual requests
LoraConfig server/lorax_server/adapters/lora.py
dataclass with r: int (rank), target_modules: List[str], lora_alpha: int, use_rslora: bool defining adapter architecture
Loaded from adapter config files, used to configure model layer modifications, cached for reuse across requests
AdapterBatchData server/lorax_server/utils/adapter.py
dataclass with adapter_source: str, adapter_index: int mapping requests to their specific adapters within a batch
Created during batch formation to track which adapter each request uses, consumed by model layers to apply correct LoRA weights
Generation router/client/src/pb.rs
gRPC message with request_id: u64, prefill_tokens: NextTokens, tokens: NextTokens, generated_text: Optional[str], finish_reason: FinishReason
Produced by inference server for each request in a batch, contains generated tokens and metadata, streamed back to HTTP clients
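
The fields of LoraConfig map directly onto the LoRA update rule. The sketch below shows why a small rank r makes thousands of adapters feasible, and the scaling that use_rslora selects; the 4096-wide projection is an illustrative dimension, not taken from any particular model:

```python
import math

def lora_param_count(d_in, d_out, r):
    """A rank-r adapter stores two small matrices A (d_in x r) and
    B (r x d_out) instead of a full d_in x d_out weight update."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Standard LoRA scales the low-rank update by alpha/r;
    rsLoRA (use_rslora=True) scales by alpha/sqrt(r)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# For one 4096x4096 projection, r=16 shrinks the update ~128x:
full_params = 4096 * 4096                         # 16,777,216
lora_params = lora_param_count(4096, 4096, 16)    # 131,072
```

That ~128x reduction per target module is what lets one GPU hold the base model once plus many resident adapters.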

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Contract unguarded

The model's config.json file contains a 'model_type' field that maps to a known model architecture, but there's no validation that the model_type is supported by the inference engine

If this fails: If a model has an unknown or unsupported model_type, the router will accept requests but the inference server may fail silently or produce wrong outputs when trying to load adapter layers for incompatible architectures

router/src/main.rs:get_model_info
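
A guard that would close this gap might look like the following; the allow-list is hypothetical, since the supported set depends on the engine build:

```python
import json

# Hypothetical allow-list -- the real set depends on the inference engine.
SUPPORTED_MODEL_TYPES = {"llama", "mistral", "qwen2", "gpt_neox"}

def check_model_type(config_json: str) -> str:
    """Fail fast if config.json names an architecture the engine can't serve,
    instead of accepting requests that later fail during adapter loading."""
    model_type = json.loads(config_json).get("model_type")
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model_type: {model_type!r}")
    return model_type
```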
critical Shape weakly guarded

The 'ids' and 'weights' lists in MergedAdapters are assumed to have the same length, but this is validated only at request creation, not when adapters are actually merged in the inference server

If this fails: If the validation in the client is bypassed or the data is corrupted in transit, the inference server will try to merge adapters with mismatched weights arrays, leading to silent wrong outputs or crashes during tensor operations

clients/python/lorax/types.py:MergedAdapters
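
A server-side re-check is cheap. The sketch below also enforces weight normalization, which is an extra assumption here rather than a documented lorax invariant:

```python
def validate_merged_adapters(ids, weights):
    """Re-check the ids/weights pairing on the server, not just in the client,
    so corrupted or hand-crafted requests fail loudly before tensor ops."""
    if len(ids) != len(weights):
        raise ValueError(f"{len(ids)} adapter ids but {len(weights)} weights")
    # Assumption: merge weights form a convex combination summing to 1.0.
    if weights and abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("merge weights should sum to 1.0")
    return list(zip(ids, weights))
```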
critical Domain unguarded

LoRA adapter files downloaded from HuggingFace Hub, S3, or local filesystem have consistent internal structure (adapter_config.json + adapter_model.bin/safetensors) but no validation of file integrity or compatibility with the base model

If this fails: Corrupted downloads, mismatched adapter architectures, or adapters trained for different base models will be loaded and applied, producing silent wrong inference results instead of failing fast with clear errors

router/src/loader.rs:download_adapter
critical Scale weakly guarded

The max_batch_total_tokens limit accurately reflects available GPU memory across all active adapters, but the calculation doesn't account for dynamic memory usage of different adapter ranks or the base model's varying memory footprint

If this fails: Batches may be accepted that exceed actual GPU memory when multiple high-rank adapters are active simultaneously, causing OOM crashes during inference instead of graceful batch size reduction

router/src/main.rs:max_batch_total_tokens
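
A minimal sketch of the budget check the scheduler performs, assuming per-request token counts are known up front; lorax's real accounting would also need to reserve memory per active adapter rank, which this toy version ignores:

```python
def take_batch(queue, max_batch_total_tokens):
    """Greedy sketch: admit queued requests until the shared token budget
    (prompt tokens + max_new_tokens per request) would be exceeded."""
    batch, used = [], 0
    for req in queue:
        need = req["input_tokens"] + req["max_new_tokens"]
        if used + need > max_batch_total_tokens:
            break
        batch.append(req)
        used += need
    return batch, used
```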
warning Temporal unguarded

The 2-second adapter cycling timer is sufficient to detect which adapters are popular and should remain in GPU memory, but this assumes request patterns are stable over short time windows

If this fails: Bursty traffic patterns or adapters with infrequent but regular usage will be incorrectly evicted from GPU memory, causing unnecessary reloading delays and degraded performance for legitimate use cases

router/src/loader.rs:adapter_cycle_time_s
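
The cycling policy can be sketched as a popularity snapshot taken each cycle; the Counter-based ranking below is illustrative, not lorax's exact implementation:

```python
from collections import Counter

def pick_active_adapters(recent_requests, max_active_adapters):
    """Each cycle (~2s in lorax), keep the most-requested adapters resident
    and evict the rest. A snapshot like this is blind to bursty or
    slow-but-regular traffic -- exactly the failure mode described above."""
    counts = Counter(r["adapter_id"] for r in recent_requests)
    return [adapter for adapter, _ in counts.most_common(max_active_adapters)]
```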
critical Ordering unguarded

Requests within a batch can be processed in any order since they use different adapters, but the code assumes adapter indices in AdapterBatchData correspond to request order in the batch

If this fails: If batch ordering is modified during processing or adapter indices become misaligned, responses will be returned to the wrong requests, causing users to receive outputs generated with incorrect adapters

router/src/batch.rs:Entry
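
One way to remove the positional assumption is to key responses by request_id rather than by batch order; this sketch uses hypothetical dict-shaped requests and generations:

```python
def route_generations(batch_requests, generations):
    """Match outputs to requests by request_id instead of by position, so a
    reordered batch can't hand a user text generated with someone else's adapter."""
    by_id = {req["id"]: req for req in batch_requests}
    routed = {}
    for gen in generations:
        req = by_id.get(gen["request_id"])
        if req is None:
            raise KeyError(f"generation for unknown request {gen['request_id']}")
        routed[req["id"]] = gen["text"]
    return routed
```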
warning Resource unguarded

128 active adapters can fit in GPU memory simultaneously, but this number is hardcoded and doesn't account for varying adapter sizes (rank), GPU memory capacity, or base model size

If this fails: On smaller GPUs or with high-rank adapters, the system will attempt to load more adapters than memory allows, causing OOM crashes. On larger GPUs, memory is underutilized by artificially limiting to 128 adapters

router/src/main.rs:max_active_adapters
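
An alternative to a hardcoded limit is to derive capacity from free GPU memory and adapter geometry. Every number and the safety factor below are illustrative assumptions:

```python
def estimate_max_adapters(free_gpu_bytes, d_model, n_target_modules, r,
                          bytes_per_param=2, safety=0.8):
    """Back-of-envelope capacity estimate: each target module holds
    A (d_model x r) and B (r x d_model), so one adapter costs roughly
    n_target_modules * 2 * d_model * r parameters. fp16 assumed."""
    per_adapter = n_target_modules * 2 * d_model * r * bytes_per_param
    return int(free_gpu_bytes * safety // per_adapter)

# With 8 GiB free, d_model=4096, 4 target modules, r=16:
capacity = estimate_max_adapters(8 * 1024**3, 4096, 4, 16)
```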
warning Environment weakly guarded

The HTTP client assumes network connections are reliable and retries are handled transparently, but streaming responses can be interrupted without proper detection of partial generation

If this fails: Network interruptions during streaming inference will leave requests in an inconsistent state - users may receive partial outputs that they interpret as complete, leading to downstream application errors

clients/python/lorax/client.py:requests.post
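
A defensive client can refuse to treat a stream as complete until a finish_reason arrives. The SSE framing and event shape below are assumptions modeled on the Generation message, not the exact wire format:

```python
import json

def collect_stream(sse_lines):
    """Parse Server-Sent Events from /generate (stream=true) and raise if the
    stream ends without a finish_reason, so partial output is never mistaken
    for a complete generation."""
    pieces, finished = [], False
    for line in sse_lines:
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):])
        pieces.append(event.get("token", {}).get("text", ""))
        if event.get("details", {}).get("finish_reason"):
            finished = True
    if not finished:
        raise ConnectionError("stream ended without finish_reason -- partial output")
    return "".join(pieces)
```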
warning Contract weakly guarded

gRPC health checks accurately reflect the inference server's ability to process requests, but checks don't validate that adapter loading mechanisms or GPU memory management are functioning correctly

If this fails: The router will continue sending requests to inference servers that report healthy but have broken adapter loading, causing requests to hang or fail with cryptic errors instead of routing to healthy servers

router/src/infer.rs:health_check
info Domain unguarded

Adapter sources 'hub', 'local', 's3', 'pbase' are mutually exclusive and have consistent authentication/access patterns, but the code doesn't validate that adapter_id format matches the specified source

If this fails: Requests with mismatched adapter_id formats (e.g., HuggingFace path used with 's3' source) will either fail with confusing errors or attempt to download from wrong locations, wasting time and potentially exposing authentication tokens

clients/python/lorax/types.py:ADAPTER_SOURCES
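
A format check like the following would fail fast on mismatches; the patterns are hypothetical guesses at each source's id shape, not the loader's real rules:

```python
import re

# Hypothetical format rules per source -- the real ones live in the loader.
ADAPTER_ID_PATTERNS = {
    "hub":   re.compile(r"^[\w.-]+/[\w.-]+$"),  # org/repo
    "s3":    re.compile(r"^s3://"),             # s3://bucket/key
    "local": re.compile(r"^/"),                 # absolute filesystem path
    "pbase": re.compile(r"^[\w.-]+/[\w.-]+$"),
}

def check_adapter_id(adapter_id, adapter_source):
    """Reject requests whose adapter_id format doesn't match its source
    before attempting any download."""
    pattern = ADAPTER_ID_PATTERNS.get(adapter_source)
    if pattern is None:
        raise ValueError(f"unknown adapter_source: {adapter_source!r}")
    if not pattern.match(adapter_id):
        raise ValueError(f"{adapter_id!r} does not look like a {adapter_source} id")
    return adapter_id
```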

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Adapter cache (file-store)
Downloads and caches LoRA adapter weights from various sources to avoid re-downloading on subsequent requests
Request queue (in-memory)
Buffers incoming requests waiting to be batched, implementing backpressure when system is overloaded
KV cache (buffer)
Stores attention key-value tensors for each sequence to enable efficient autoregressive generation without recomputing past tokens
Prefix cache (cache)
Radix tree storing common prompt prefixes to share computation across requests with similar beginnings
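
The prefix cache's payoff can be illustrated with a toy longest-common-prefix lookup over token lists (the real radix tree avoids scanning every cached sequence):

```python
def longest_cached_prefix(prompt_tokens, cached_sequences):
    """Toy stand-in for the radix tree: find the longest cached token prefix
    whose KV entries can be reused instead of recomputed."""
    best = 0
    for cached in cached_sequences:
        shared = 0
        for a, b in zip(prompt_tokens, cached):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best

cache = [[1, 2, 3, 4], [1, 2, 9]]
# prompt [1, 2, 3, 7] shares its first 3 tokens with the first cached entry
```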


Technology Stack

Rust (runtime)
Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety
Python (runtime)
Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation
gRPC (framework)
Enables high-throughput communication between Rust router and Python inference servers with structured message passing
PyTorch (library)
Loads transformer models and executes neural network forward passes with GPU acceleration
CUDA (compute)
Custom kernels (flash-attention, SGMV, paged attention) optimize GPU memory usage and computation for multi-adapter inference
HuggingFace Hub (library)
Downloads base models and LoRA adapters from the community model repository
Axum (framework)
Rust web framework handling HTTP REST API endpoints and WebSocket streaming responses
Pydantic (serialization)
Validates and serializes request/response data structures in Python client and type definitions


Frequently Asked Questions

What is lorax used for?

lorax serves thousands of fine-tuned LoRA adapters on a single GPU with dynamic loading. predibase/lorax is a 6-component ML inference system written in Python. Data flows through 5 distinct pipeline stages. The codebase contains 218 files.

How is lorax architected?

lorax is organized into 5 architecture layers: HTTP/gRPC Router, Inference Engine, Adapter Management, Batch Scheduler, and 1 more. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through lorax?

Data moves through 5 stages: Accept HTTP request → Validate and load adapter → Schedule heterogeneous batch → Apply multi-LoRA inference → Stream response tokens. HTTP requests with adapter specifications flow through a Rust router that validates them, downloads any missing adapters, and batches requests together regardless of which adapters they use. The batches are sent via gRPC to Python inference servers that load the base model once and dynamically apply different LoRA adapters per request within the same batch, generating text that flows back through the router to HTTP responses. This pipeline design reflects a complex multi-stage processing system.

What technologies does lorax use?

The core stack includes Rust (Powers the high-performance HTTP/gRPC router, request scheduling, and adapter management with memory safety), Python (Runs the actual ML inference using PyTorch with custom CUDA kernels for optimized transformer computation), gRPC (Enables high-throughput communication between Rust router and Python inference servers with structured message passing), PyTorch (Loads transformer models and executes neural network forward passes with GPU acceleration), CUDA (Custom kernels (flash-attention, SGMV, paged attention) optimize GPU memory usage and computation for multi-adapter inference), HuggingFace Hub (Downloads base models and LoRA adapters from the community model repository), and 2 more. A focused set of dependencies that keeps the build manageable.

What system dynamics does lorax have?

lorax exhibits 4 data pools (Adapter cache, Request queue), 3 feedback loops, 4 control points, and 3 delays. The feedback loops handle polling and backpressure. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does lorax use?

4 design patterns detected: Heterogeneous batching, Just-in-time adapter loading, Multi-process orchestration, Prefix caching.

Analyzed on April 20, 2026 by CodeSea.