predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
Structural Verdict
An 11-component ML training system with 17 connections. 218 files analyzed. Highly interconnected — components depend on each other heavily.
How Data Flows Through the System
Requests flow from clients through the router to the inference server, where models and adapters are dynamically loaded and applied during generation
- Request Reception — HTTP/gRPC requests received with adapter specifications (config: adapter_id, adapter_source, merged_adapters)
- Validation — Tokenization and parameter validation by ValidationWorker (config: max_input_length, max_new_tokens)
- Adapter Loading — Dynamic loading of LoRA adapters from hub, local, or S3 sources (config: adapter_source, api_token)
- Batch Formation — Heterogeneous batching of requests across different adapters (config: max_batch_total_tokens, max_concurrent_requests)
- Inference — Model forward pass with SGMV LoRA application and flash attention (config: do_sample, temperature, top_k, +1 more)
- Generation — Token generation with stopping criteria and streaming support (config: stop, max_new_tokens, stream)
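To make the flow concrete, here is a hedged sketch of a generation request payload. The field names follow the Parameters type in clients/python/lorax/types.py; the adapter name and prompt are made up for illustration:

```python
import json

# Hypothetical request payload; field names follow the Parameters type
# in clients/python/lorax/types.py (adapter_id, adapter_source, etc.).
payload = {
    "inputs": "Explain LoRA in one sentence.",
    "parameters": {
        "adapter_id": "my-org/my-lora-adapter",  # hypothetical adapter name
        "adapter_source": "hub",                 # hub, local, or s3
        "max_new_tokens": 64,
        "do_sample": True,
        "temperature": 0.7,
        "stop": ["\n\n"],
    },
    "stream": False,
}

body = json.dumps(payload)
print(json.loads(body)["parameters"]["adapter_id"])
```

A client would POST this body to the router's generate endpoint; the router then validates it, loads the adapter if needed, and batches it with other requests.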
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
In-memory cache of loaded LoRA adapters with LRU eviction
Pending requests waiting for batching
CUDA memory buffers for model weights and KV cache
LoRA adapter parameters stored on GPU/CPU
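The adapter cache above uses LRU eviction. A minimal in-memory sketch (the real cache manages GPU tensors and offloads to CPU rather than dropping entries; the 128-entry bound mirrors the max_active_adapters default):

```python
from collections import OrderedDict

class AdapterCache:
    """Minimal sketch of an LRU adapter cache; illustrative only."""

    def __init__(self, max_active_adapters: int = 128):
        self.max_active = max_active_adapters
        self._cache: "OrderedDict[str, object]" = OrderedDict()

    def get(self, adapter_id: str):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as most recently used
            return self._cache[adapter_id]
        return None

    def put(self, adapter_id: str, weights: object) -> None:
        self._cache[adapter_id] = weights
        self._cache.move_to_end(adapter_id)
        if len(self._cache) > self.max_active:
            # Evict the least recently used adapter (offloaded to CPU in practice)
            self._cache.popitem(last=False)

cache = AdapterCache(max_active_adapters=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes most recently used
cache.put("c", 3)    # evicts "b", the least recently used entry
print(list(cache._cache))  # ['a', 'c']
```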
Feedback Loops
- Adapter Scheduling (polling, balancing) — Trigger: adapter_cycle_time_s timer. Action: Check adapter usage and prefetch popular adapters. Exit: Server shutdown.
- Dynamic Batching (auto-scale, balancing) — Trigger: Queue depth and waiting time thresholds. Action: Adjust batch size based on current load. Exit: Optimal batch size reached.
- Memory Management (cache-invalidation, balancing) — Trigger: GPU memory pressure. Action: Offload least recently used adapters to CPU. Exit: Memory usage within limits.
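The memory-management loop can be sketched as a simple offload-until-under-limit routine; the adapter names and byte sizes here are illustrative:

```python
def enforce_memory_limit(adapters_lru, gpu_bytes_used, gpu_bytes_limit, size_of):
    """Toy version of the memory-management loop: while GPU usage exceeds
    the limit, offload the least recently used adapter to CPU.
    adapters_lru is ordered from least to most recently used."""
    offloaded = []
    while gpu_bytes_used > gpu_bytes_limit and adapters_lru:
        victim = adapters_lru.pop(0)       # least recently used adapter
        gpu_bytes_used -= size_of[victim]  # free its GPU memory
        offloaded.append(victim)           # now resident on CPU
    return offloaded, gpu_bytes_used

sizes = {"a": 400, "b": 300, "c": 300}
offloaded, used = enforce_memory_limit(["a", "b", "c"], 1000, 500, sizes)
print(offloaded, used)  # ['a', 'b'] 300
```

The loop's exit condition matches the description above: it stops as soon as memory usage is back within limits.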
Delays & Async Processing
- Adapter Loading (async-processing, ~variable (depends on adapter size and source)) — First request to new adapter has higher latency
- Batch Formation (batch-window, ~waiting_served_ratio * average_serving_time) — Requests wait for optimal batch formation
- Model Prefill (async-processing, ~proportional to input length) — Initial tokens take longer than decode phase
- Adapter Exchange (scheduled-job, ~adapter_cycle_time_s) — Periodic adapter prefetching and offloading
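The batch-window delay is governed by waiting_served_ratio. One toy interpretation of that heuristic (not the router's exact logic) is to hold off pulling waiting requests into the running batch until there are enough of them relative to requests already being served:

```python
def should_wait_for_more_requests(n_waiting, n_served, waiting_served_ratio=0.3):
    """Toy interpretation of the waiting_served_ratio heuristic: keep
    waiting if the queue is small relative to the in-flight batch."""
    if n_served == 0:
        return False  # nothing running; start a batch immediately
    return n_waiting < waiting_served_ratio * n_served

print(should_wait_for_more_requests(1, 10))  # True: 1 < 0.3 * 10
print(should_wait_for_more_requests(5, 10))  # False: 5 >= 3
```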
Control Points
- Max Active Adapters (threshold) — Controls: Number of simultaneously loaded adapters. Default: 128
- Waiting Served Ratio (threshold) — Controls: Balance between latency and throughput in batching. Default: 0.3
- Max Batch Total Tokens (threshold) — Controls: Maximum tokens processed in a single batch. Default: 4096
- Adapter Cycle Time (runtime-toggle) — Controls: Frequency of adapter prefetching and scheduling. Default: 2s
- Use SGMV (feature-flag) — Controls: Whether to use SGMV kernels for LoRA operations. Default: true
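Collected as a config sketch, with defaults taken from the list above (the dataclass and its field names are illustrative; the real server exposes these as launcher settings):

```python
from dataclasses import dataclass

@dataclass
class RuntimeControls:
    """Control points and their defaults, as listed above (illustrative)."""
    max_active_adapters: int = 128      # simultaneously loaded adapters
    waiting_served_ratio: float = 0.3   # latency vs. throughput in batching
    max_batch_total_tokens: int = 4096  # tokens per batch
    adapter_cycle_time_s: float = 2.0   # adapter prefetch/schedule frequency
    use_sgmv: bool = True               # SGMV kernels for LoRA operations

controls = RuntimeControls()
print(controls.max_active_adapters, controls.use_sgmv)  # 128 True
```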
Technology Stack
Rust — HTTP server and request routing
Python — Inference engine and model loading
PyTorch — Deep learning framework
Axum — Rust web framework
Tonic — gRPC implementation
HuggingFace Hub — Model and adapter repository
Flash attention — Optimized attention kernels
CUDA — GPU acceleration kernels
Pydantic — Data validation and serialization
Docker — Containerization
Key Components
- Client (class) — Main Python client for making requests to LoRAX server with adapter support (clients/python/lorax/client.py)
- ShardedClient (class) — gRPC client for communicating with distributed inference shards (router/client/src/sharded_client.rs)
- Infer (service) — Core inference orchestration handling request batching and adapter scheduling (router/src/infer.rs)
- AdapterLoader (service) — Manages dynamic loading and caching of LoRA adapters from various sources (router/src/loader.rs)
- FlashCausalLM (class) — Optimized causal language model implementation with flash attention and LoRA support (server/lorax_server/models/flash_causal_lm.py)
- LoraLinear (class) — Linear layer with dynamic LoRA adapter application using SGMV kernels (server/lorax_server/layers/lora.py)
- BatchingQueue (class) — Queue management for heterogeneous batching across different adapters (router/src/queue.rs)
- Scheduler (service) — Schedules adapter prefetching and request batching for optimal throughput (router/src/scheduler.rs)
- MedusaModel (class) — Speculative decoding implementation for faster inference with LoRA adapters (server/lorax_server/models/medusa.py)
- ValidationWorker (class) — Validates and preprocesses incoming requests with tokenization (router/src/validation.rs)
- Parameters (type-def) — Configuration parameters for generation including adapter settings (clients/python/lorax/types.py)
Sub-Modules
Python library for interacting with LoRAX server
Server process management and configuration
End-to-end testing with Docker containers
Configuration
clients/python/lorax/types.py (python-pydantic)
- ids (List[str])
- weights (List[float])
- merge_strategy (Optional[str]) — default: None
- density (float)
- majority_sign_method (Optional[str]) — default: None
clients/python/lorax/types.py (python-pydantic)
- adapter_id (Optional[str]) — default: None
- adapter_source (Optional[str]) — default: None
- merged_adapters (Optional[MergedAdapters]) — default: None
- api_token (Optional[str]) — default: None
- do_sample (bool) — default: False
- max_new_tokens (Optional[int]) — default: None
- ignore_eos_token (bool) — default: False
- repetition_penalty (Optional[float]) — default: None
- +14 more parameters
clients/python/lorax/types.py (python-pydantic)
- inputs (str)
- parameters (Optional[Parameters]) — default: None
- stream (bool) — default: False
clients/python/lorax/types.py (python-pydantic)
- inputs (List[str])
- parameters (Optional[Parameters]) — default: None
- stream (bool) — default: False
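A plain-dataclass approximation of these pydantic models, with defaults following the listings above (the Request class name and the density default are assumptions; the real models live in clients/python/lorax/types.py and use pydantic validation):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MergedAdapters:
    ids: List[str]
    weights: List[float]
    merge_strategy: Optional[str] = None
    density: float = 0.0  # assumed default; the listing gives none
    majority_sign_method: Optional[str] = None

@dataclass
class Parameters:
    adapter_id: Optional[str] = None
    adapter_source: Optional[str] = None
    merged_adapters: Optional[MergedAdapters] = None
    api_token: Optional[str] = None
    do_sample: bool = False
    max_new_tokens: Optional[int] = None

@dataclass
class Request:  # illustrative name for the single-prompt request model
    inputs: str
    parameters: Optional[Parameters] = None
    stream: bool = False

req = Request(inputs="hello", parameters=Parameters(adapter_id="org/adapter"))
print(req.parameters.adapter_id)  # org/adapter
```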
Science Pipeline
- Tokenization — Convert text to token IDs using HuggingFace tokenizer [string → (sequence_length,)] (router/src/validation.rs)
- Embedding Lookup — Token IDs to embeddings via embedding layer [(batch_size, sequence_length) → (batch_size, sequence_length, hidden_size)] (server/lorax_server/models/flash_causal_lm.py)
- LoRA Application — Apply adapter weights using SGMV kernel: x + (x @ A @ B) * scaling [(batch_size, sequence_length, hidden_size) → (batch_size, sequence_length, hidden_size)] (server/lorax_server/layers/lora.py)
- Attention — Multi-head attention with flash attention optimization [(batch_size, sequence_length, hidden_size) → (batch_size, sequence_length, hidden_size)] (server/lorax_server/models/flash_causal_lm.py)
- Token Generation — Logits to next token via sampling or greedy selection [(batch_size, vocab_size) → (batch_size, 1)] (server/lorax_server/models/flash_causal_lm.py)
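The LoRA Application step computes x + (x @ A @ B) * scaling. A tiny pure-Python illustration for a single token with hidden_size=2 and rank r=1 (the real code runs this as a batched SGMV kernel on GPU tensors):

```python
def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_forward(x, A, B, scaling):
    """x + (x @ A @ B) * scaling: the low-rank LoRA update added to the
    base layer output, as in the pipeline step above."""
    delta = matmul(matmul(x, A), B)  # low-rank update, rank = cols of A
    return [[xi + scaling * di for xi, di in zip(row_x, row_d)]
            for row_x, row_d in zip(x, delta)]

x = [[1.0, 2.0]]    # one token, hidden_size = 2
A = [[1.0], [0.0]]  # hidden_size x r, with rank r = 1
B = [[0.5, 0.5]]    # r x hidden_size
print(lora_forward(x, A, B, scaling=2.0))  # [[2.0, 3.0]]
```

Because r is much smaller than hidden_size, the A and B factors are cheap to load and swap per request, which is what makes serving thousands of adapters practical.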
Assumptions & Constraints
- [warning] Assumes input tensor is (batch_size, sequence_length, hidden_size) but no explicit shape validation (shape)
- [critical] Assumes all tensors are on CUDA device without device checking (device)
- [info] LoRA rank (r) assumed to be positive integer but no validation enforced (value-range)
- [critical] SGMV kernels expect specific tensor layouts and dtypes (float16/bfloat16) without runtime checks (format)
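A hedged sketch of the runtime guards these assumptions call for, written over plain metadata rather than real tensors (in actual code these would be torch.Tensor attributes like .shape, .dtype, and .device):

```python
def check_sgmv_inputs(shape, dtype, device):
    """Validate tensor metadata before an SGMV call, covering the shape,
    dtype, and device assumptions flagged above. Illustrative only."""
    if len(shape) != 3:
        raise ValueError(f"expected (batch, seq, hidden), got {shape}")
    if dtype not in ("float16", "bfloat16"):
        raise TypeError(f"SGMV kernels require float16/bfloat16, got {dtype}")
    if not device.startswith("cuda"):
        raise RuntimeError(f"tensor must be on a CUDA device, got {device}")
    return True

print(check_sgmv_inputs((4, 128, 4096), "float16", "cuda:0"))  # True
```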
Frequently Asked Questions
What is lorax used for?
predibase/lorax is a multi-LoRA inference server for serving thousands of fine-tuned LLMs, built as an 11-component ML training system written in Python. Highly interconnected — components depend on each other heavily. The codebase contains 218 files.
How is lorax architected?
lorax is organized into 4 architecture layers: Client Layer, Router Layer, Inference Engine, System Layer. Highly interconnected — components depend on each other heavily. This layered structure enables tight integration between components.
How does data flow through lorax?
Data moves through 6 stages: Request Reception → Validation → Adapter Loading → Batch Formation → Inference → Generation. Requests flow from clients through the router to the inference server, where models and adapters are dynamically loaded and applied during generation. This pipeline design reflects a complex multi-stage processing system.
What technologies does lorax use?
The core stack includes Rust (HTTP server and request routing), Python (Inference engine and model loading), PyTorch (Deep learning framework), Axum (Rust web framework), Tonic (gRPC implementation), HuggingFace Hub (Model and adapter repository), and 4 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does lorax have?
lorax exhibits 4 data pools (Adapter Cache, Request Queue), 3 feedback loops, 5 control points, and 4 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does lorax use?
4 design patterns detected: Dynamic Adapter Loading, Heterogeneous Batching, SGMV Kernel Optimization, Adapter Exchange.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.