predibase/lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

3,741 stars · Python · 11 components · 17 connections

Requests flow from clients through the router to the inference server, where models and adapters are dynamically loaded and applied during generation.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

Structural Verdict

An 11-component ML inference system with 17 connections; 218 files analyzed. Highly interconnected: components depend on each other heavily.

How Data Flows Through the System

  1. Request Reception — HTTP/gRPC requests received with adapter specifications (config: adapter_id, adapter_source, merged_adapters)
  2. Validation — Tokenization and parameter validation by ValidationWorker (config: max_input_length, max_new_tokens)
  3. Adapter Loading — Dynamic loading of LoRA adapters from hub, local, or S3 sources (config: adapter_source, api_token)
  4. Batch Formation — Heterogeneous batching of requests across different adapters (config: max_batch_total_tokens, max_concurrent_requests)
  5. Inference — Model forward pass with SGMV LoRA application and flash attention (config: do_sample, temperature, top_k, and 1 more)
  6. Generation — Token generation with stopping criteria and streaming support (config: stop, max_new_tokens, stream)
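
Step 4 above, heterogeneous batching, can be sketched in a few lines of Python. This is an illustrative model, not LoRAX's actual scheduler: the `Request` shape and `form_batch` function are hypothetical, and the budget simply charges each request its worst-case tokens against `max_batch_total_tokens`.

```python
from dataclasses import dataclass

@dataclass
class Request:
    """A pending generation request (hypothetical shape)."""
    request_id: int
    adapter_id: str      # requests for different adapters may share one batch
    input_tokens: int
    max_new_tokens: int

def form_batch(queue, max_batch_total_tokens, max_concurrent_requests):
    """Greedily admit queued requests until a token or concurrency budget is hit.

    Each request is charged its worst case (prompt + max_new_tokens), one
    simple way a max_batch_total_tokens cap can be enforced.
    """
    batch, total_tokens = [], 0
    for req in queue:
        cost = req.input_tokens + req.max_new_tokens
        if len(batch) >= max_concurrent_requests:
            break
        if total_tokens + cost > max_batch_total_tokens:
            break
        batch.append(req)
        total_tokens += cost
    return batch, total_tokens

queue = [
    Request(1, "adapter-a", input_tokens=100, max_new_tokens=50),
    Request(2, "adapter-b", input_tokens=200, max_new_tokens=100),
    Request(3, "adapter-a", input_tokens=900, max_new_tokens=200),
]
batch, used = form_batch(queue, max_batch_total_tokens=600, max_concurrent_requests=8)
# Requests 1 and 2 fit (150 + 300 = 450 tokens); request 3 would exceed the
# budget, so it waits for a later batch even though its adapter is already loaded.
```

Note that requests for different adapters land in the same batch: that is the heterogeneous part, and it is what the SGMV kernel in the inference stage exploits.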

System Behavior

How the system actually operates at runtime — where data accumulates, what loops run, what introduces delays, and what controls what.

Data Pools

Adapter Cache (cache)
In-memory cache of loaded LoRA adapters with LRU eviction
Request Queue (queue)
Pending requests waiting for batching
GPU Memory Pool (buffer)
CUDA memory buffers for model weights and KV cache
Adapter Weights (state-store)
LoRA adapter parameters stored on GPU/CPU
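
As a rough illustration of the Adapter Cache pool, here is a minimal LRU cache built on `collections.OrderedDict`. The `AdapterCache` class is hypothetical: the real cache also coordinates GPU/CPU placement and in-flight references, which this sketch ignores.

```python
from collections import OrderedDict

class AdapterCache:
    """Minimal LRU cache for loaded adapters (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()          # adapter_id -> weights

    def get(self, adapter_id):
        if adapter_id not in self._entries:
            return None                        # caller must load from hub/local/S3
        self._entries.move_to_end(adapter_id)  # mark as most recently used
        return self._entries[adapter_id]

    def put(self, adapter_id, weights):
        if adapter_id in self._entries:
            self._entries.move_to_end(adapter_id)
        self._entries[adapter_id] = weights
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

cache = AdapterCache(capacity=2)
cache.put("adapter-a", {"rank": 8})
cache.put("adapter-b", {"rank": 16})
cache.get("adapter-a")                # touch a: b becomes least recently used
cache.put("adapter-c", {"rank": 8})   # evicts adapter-b
```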

Feedback Loops

Delays & Async Processing

Control Points

Technology Stack

Rust (framework)
HTTP server and request routing
Python (framework)
Inference engine and model loading
PyTorch (library)
Deep learning framework
Axum (framework)
Rust web framework
Tonic (library)
gRPC implementation
HuggingFace Hub (library)
Model and adapter repository
Flash Attention (library)
Optimized attention kernels
CUDA (infra)
GPU acceleration kernels
Pydantic (library)
Data validation and serialization
Docker (infra)
Containerization

Key Components

Sub-Modules

Python Client SDK (independence: high)
Python library for interacting with LoRAX server
Launcher (independence: high)
Server process management and configuration
Integration Tests (independence: medium)
End-to-end testing with Docker containers

Configuration

clients/python/lorax/types.py (python-pydantic)
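
The analysis flags `clients/python/lorax/types.py` as Pydantic-based configuration. As an illustration only — not the actual contents of that file — a request-parameter model covering the config keys named in the pipeline above might look like:

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError

class GenerateParameters(BaseModel):
    """Illustrative request parameters; field names follow the config keys
    in the pipeline above, not necessarily the real types.py definitions."""
    max_new_tokens: int = Field(20, gt=0)
    temperature: Optional[float] = Field(None, gt=0)
    top_k: Optional[int] = Field(None, gt=0)
    do_sample: bool = False
    stop: List[str] = Field(default_factory=list)

params = GenerateParameters(max_new_tokens=64, temperature=0.7, stop=["</s>"])

rejected = False
try:
    GenerateParameters(temperature=-1.0)   # violates the gt=0 constraint
except ValidationError:
    rejected = True
```

Validating at the client boundary like this lets malformed requests fail fast, before they reach the router or occupy a slot in the request queue.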

Science Pipeline

  1. Tokenization — Convert text to token IDs using HuggingFace tokenizer [string → (sequence_length,)] router/src/validation.rs
  2. Embedding Lookup — Token IDs to embeddings via embedding layer [(batch_size, sequence_length) → (batch_size, sequence_length, hidden_size)] server/lorax_server/models/flash_causal_lm.py
  3. LoRA Application — Apply adapter weights using SGMV kernel: x + (x @ A @ B) * scaling [(batch_size, sequence_length, hidden_size) → (batch_size, sequence_length, hidden_size)] server/lorax_server/layers/lora.py
  4. Attention — Multi-head attention with flash attention optimization [(batch_size, sequence_length, hidden_size) → (batch_size, sequence_length, hidden_size)] server/lorax_server/models/flash_causal_lm.py
  5. Token Generation — Logits to next token via sampling or greedy selection [(batch_size, vocab_size) → (batch_size, 1)] server/lorax_server/models/flash_causal_lm.py
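
Step 3's formula, `x + (x @ A @ B) * scaling`, can be checked with plain NumPy. A dense matmul stands in for the batched SGMV kernel here, and the shapes, init scales, and `alpha` value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_len, hidden, rank = 2, 4, 16, 8
alpha = 16                     # LoRA alpha hyperparameter (assumed value)
scaling = alpha / rank         # the usual LoRA scaling factor

x = rng.standard_normal((batch_size, seq_len, hidden))
A = rng.standard_normal((hidden, rank)) * 0.01   # down-projection
B = np.zeros((rank, hidden))                     # up-projection, zero-initialized

# With B = 0 (LoRA's standard init for the up-projection) the adapter is a no-op:
y = x + (x @ A @ B) * scaling
assert np.allclose(y, x)

# A trained adapter (nonzero B) adds a low-rank update while preserving shape:
B = rng.standard_normal((rank, hidden)) * 0.01
y = x + (x @ A @ B) * scaling
assert y.shape == (batch_size, seq_len, hidden)
```

The low-rank structure is the point: per adapter, only `hidden * rank * 2` parameters ride on top of the frozen base weights, which is what makes caching thousands of adapters feasible.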

Assumptions & Constraints

Frequently Asked Questions

What is lorax used for?

predibase/lorax is a multi-LoRA inference server for serving thousands of fine-tuned LLMs. It is an 11-component ML system written primarily in Python, and its 218-file codebase is highly interconnected: components depend on each other heavily.

How is lorax architected?

lorax is organized into 4 architecture layers: Client Layer, Router Layer, Inference Engine, and System Layer. The components are highly interconnected, and this layered structure enables tight integration between them.

How does data flow through lorax?

Data moves through 6 stages: Request Reception → Validation → Adapter Loading → Batch Formation → Inference → Generation. Requests flow from clients through the router to the inference server, where models and adapters are dynamically loaded and applied during generation. This pipeline design reflects a complex multi-stage processing system.

What technologies does lorax use?

The core stack includes Rust (HTTP server and request routing), Python (Inference engine and model loading), PyTorch (Deep learning framework), Axum (Rust web framework), Tonic (gRPC implementation), HuggingFace Hub (Model and adapter repository), and 4 more. This broad technology surface reflects a mature project with many integration points.

What system dynamics does lorax have?

lorax exhibits 4 data pools (Adapter Cache, Request Queue, GPU Memory Pool, Adapter Weights), 3 feedback loops, 5 control points, and 4 delays. The feedback loops handle polling and auto-scaling. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does lorax use?

4 design patterns detected: Dynamic Adapter Loading, Heterogeneous Batching, SGMV Kernel Optimization, Adapter Exchange.

Analyzed on March 31, 2026 by CodeSea.