berriai/litellm
Python SDK and Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing, and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]
Routes 100+ LLM API calls through a unified gateway with cost tracking and access controls
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior. It is a 7-component repository (5,169 files analyzed), and data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Requests enter through the proxy server's HTTP endpoints, get authenticated against the database, then pass through the router which selects an available LLM provider. The core completion function transforms the request to the provider's format, makes the API call, normalizes the response back to OpenAI format, and returns it through middleware that handles logging, caching, and cost tracking. Enterprise hooks can intercept at multiple points for content moderation and compliance.
- HTTP request ingestion — FastAPI proxy server receives OpenAI-compatible requests at /chat/completions and other endpoints, parsing JSON body into request objects
- Authentication and authorization — Proxy extracts API key from Authorization header, queries PrismaClient to validate key and load UserAPIKeyAuth with permissions and budget limits [ChatCompletionRequest → UserAPIKeyAuth] (config: general_settings.master_key)
- Router model selection — Router examines model field in request, applies routing strategy (round-robin, least-latency, etc.) to select from available deployments in model_list [ChatCompletionRequest → selected provider config] (config: model_list, litellm_settings.routing_strategy)
- Provider API transformation — LLMProvider subclass converts OpenAI-format request to provider's native format, handling authentication headers and parameter mapping [ChatCompletionRequest → provider-specific request] (config: litellm_settings.drop_params)
- LLM API call execution — HTTP client makes actual API call to selected LLM provider (OpenAI, Anthropic, etc.) with transformed request [provider-specific request → provider-specific response] (config: litellm_settings.request_timeout, litellm_settings.num_retries)
- Response normalization — Provider adapter converts response back to OpenAI ModelResponse format, standardizing field names and structures across all providers [provider-specific response → ModelResponse]
- Response middleware — CustomLogger callbacks process the response for cost tracking, usage logging, caching, and enterprise guardrails before returning to the client [ModelResponse → logged ModelResponse] (config: litellm_settings.success_callback, litellm_settings.cache). A minimal end-to-end sketch follows this list.
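Taken together, these stages collapse into a single SDK call. A minimal sketch, assuming an OPENAI_API_KEY is set in the environment (the model name is illustrative; any configured provider follows the same path):

```python
import litellm

# One unified entry point: the provider is inferred from the model name,
# the request is transformed to the provider's native format, and the
# response comes back normalized to the OpenAI shape.
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)  # normalized across all providers
print(response.usage)                       # token counts that drive cost tracking
```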
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- ChatCompletionRequest (litellm/types/utils.py) — OpenAI-compatible dict with model: str, messages: List[dict], temperature: float, max_tokens: int, plus provider-specific parameters. Created from the incoming HTTP request, normalized to OpenAI format, then transformed to the provider-specific format before the API call.
- ModelResponse (litellm/types/utils.py) — Standardized response with choices: List[Choice], usage: Usage dict, model: str, and a created: int timestamp. Generated from the provider API response, normalized to OpenAI format, then passed through logging and caching middleware.
- UserAPIKeyAuth (litellm/proxy/_types.py) — Pydantic model with api_key: str, user_id: str, team_id: str, permissions: dict, plus budget limits and usage tracking fields. Loaded from the database during auth, cached in memory, and updated with usage tracking after each request.
- Router configuration (litellm/router_utils/router_config.py) — Dict with model_list: List[ModelConfig], routing_strategy: str, fallback_models: List[str], retry_policy: dict. Loaded from the YAML config file at startup and parsed into router data structures for request routing decisions.
- Proxy configuration (litellm/proxy/_types.py) — Configuration object with model_list, general_settings, litellm_settings, and environment-specific parameters from YAML files. Parsed from proxy_server_config.yaml at startup; drives all proxy server behavior and feature enablement.
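To make the key-auth contract concrete, here is a simplified, hypothetical Pydantic sketch in the spirit of UserAPIKeyAuth (the real class in litellm/proxy/_types.py carries many more fields):

```python
from typing import Optional
from pydantic import BaseModel

class UserAPIKeyAuthSketch(BaseModel):
    """Simplified stand-in for the UserAPIKeyAuth contract described above."""
    api_key: str
    user_id: Optional[str] = None
    team_id: Optional[str] = None
    max_budget: Optional[float] = None   # spend ceiling for this key
    spend: float = 0.0                   # accumulated spend, updated per request

    def over_budget(self) -> bool:
        return self.max_budget is not None and self.spend >= self.max_budget
```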
Hidden Assumptions
Things this code relies on but never validates. These are what cause silent failures when the system changes.
The core completion function assumes all provider-specific LLMProvider classes implement the same interface for request transformation and response normalization, but there's no abstract base class or validation to enforce this contract
If this fails: When new providers are added with missing or incorrectly named methods, requests silently fail with AttributeError or return malformed responses that break downstream consumers
litellm/main.py:completion
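One way to enforce that contract, sketched here as a hypothetical abstract base class (litellm does not currently ship this exact hierarchy):

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseProvider(ABC):
    """Hypothetical contract every provider adapter would have to satisfy."""

    @abstractmethod
    def transform_request(self, openai_request: dict) -> dict:
        """Convert an OpenAI-format request into the provider's native format."""

    @abstractmethod
    def transform_response(self, provider_response: Any) -> dict:
        """Normalize the provider's native response back to OpenAI format."""

# A subclass missing either method now fails loudly at instantiation time
# (TypeError) instead of raising AttributeError deep inside a live request.
```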
Router assumes model_list configuration contains deployments with consistent structure (model_name, litellm_params, etc.) but only validates top-level dict existence, not required nested fields
If this fails: Missing required fields like api_key or api_base in deployment configs cause KeyError crashes during actual API calls, not during configuration validation
litellm/router.py:Router
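A hedged sketch of load-time validation for the nested deployment fields (the field names mirror litellm's config conventions; the validator itself is hypothetical):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class LiteLLMParams(BaseModel):
    model: str
    api_key: str                      # required here, so absence fails at load time
    api_base: Optional[str] = None

class Deployment(BaseModel):
    model_name: str
    litellm_params: LiteLLMParams

try:
    Deployment(model_name="gpt-4", litellm_params={"model": "openai/gpt-4"})
except ValidationError as e:
    # The missing api_key is reported at config load, not as a KeyError mid-call.
    print(e)
```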
Authentication middleware assumes API keys in Authorization header follow 'Bearer sk-...' or 'sk-...' format but doesn't validate the actual key structure or length before database queries
If this fails: Malformed API keys cause expensive database scans or SQL injection vulnerabilities if the key contains special characters that aren't properly escaped
litellm/proxy/proxy_server.py:authentication
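A cheap structural pre-check before any database round-trip might look like this (the exact key grammar is an assumption; litellm keys conventionally start with sk-):

```python
import re

# Hypothetical pre-filter: reject obviously malformed keys before any DB query.
_KEY_PATTERN = re.compile(r"^(Bearer\s+)?sk-[A-Za-z0-9_-]{10,128}$")

def looks_like_api_key(header_value: str) -> bool:
    return bool(_KEY_PATTERN.match(header_value.strip()))

assert looks_like_api_key("Bearer sk-abc123def456")
assert not looks_like_api_key("'; DROP TABLE keys; --")
```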
Cache system assumes cached ModelResponse objects remain valid and compatible with current response schema, but doesn't version cache entries or validate schema on retrieval
If this fails: When ModelResponse structure changes between versions, clients receive cached responses with missing or wrongly-typed fields, causing silent data corruption
litellm/caching/caching.py:DualCache
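A common mitigation, sketched under the assumption that cache keys are plain strings: embed a schema version in the key, so entries written by an older release simply miss after an upgrade and age out via TTL:

```python
import hashlib
import json

SCHEMA_VERSION = "v2"  # bump whenever the ModelResponse shape changes

def cache_key(request: dict) -> str:
    payload = json.dumps(request, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # Entries written under an older schema version are never read again.
    return f"{SCHEMA_VERSION}:{digest}"
```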
Model health tracking assumes in-memory success/failure counters won't overflow or consume unbounded memory, but doesn't implement cleanup for inactive models or cap the number of tracked deployments
If this fails: Long-running proxy servers with many model deployments experience memory leaks as health metrics accumulate indefinitely, eventually causing OOM crashes
litellm/router.py:health_tracking
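A bounded alternative, sketched with an LRU cap (this counter structure is hypothetical, not litellm's internal one):

```python
from collections import OrderedDict

class BoundedHealthStats:
    """Per-deployment success/failure counts with an LRU eviction cap."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._stats: OrderedDict = OrderedDict()

    def record(self, deployment_id: str, success: bool) -> None:
        entry = self._stats.pop(deployment_id, {"success": 0, "failure": 0})
        entry["success" if success else "failure"] += 1
        self._stats[deployment_id] = entry          # move to most-recent position
        if len(self._stats) > self.max_entries:
            self._stats.popitem(last=False)         # evict least recently used
```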
Database operations assume UserAPIKeyAuth records are updated atomically for usage tracking, but concurrent requests can race to update the same user's budget/usage counters
If this fails: Multiple simultaneous requests from the same API key can cause budget enforcement to fail, allowing users to exceed spending limits until the next database sync
litellm/proxy/utils.py:PrismaClient
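The classic fix is to push the increment into the database instead of doing read-modify-write in Python. A sketch using SQLite for illustration (the table and column names are assumptions, not litellm's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keys (api_key TEXT PRIMARY KEY, spend REAL, max_budget REAL)")
conn.execute("INSERT INTO keys VALUES ('sk-demo', 0.0, 10.0)")

def record_spend(api_key: str, cost: float) -> None:
    # Atomic in-database increment: concurrent requests cannot lose updates,
    # unlike SELECT spend -> add in Python -> UPDATE.
    conn.execute("UPDATE keys SET spend = spend + ? WHERE api_key = ?", (cost, api_key))
    conn.commit()

record_spend("sk-demo", 0.25)
print(conn.execute("SELECT spend FROM keys").fetchone())  # (0.25,)
```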
Request transformation assumes message content and parameters fit within reasonable size limits, but doesn't validate total request payload size before sending to LLM providers
If this fails: Extremely large requests (multi-MB prompts) get sent to providers that reject them with cryptic errors, wasting API quota and causing confusing timeouts
litellm/main.py:completion
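A pre-flight size guard takes only a few lines to sketch (the 1 MB threshold is an arbitrary illustration, not a litellm default):

```python
import json

MAX_REQUEST_BYTES = 1_000_000  # illustrative threshold, not a real default

def check_payload_size(request: dict) -> None:
    size = len(json.dumps(request).encode("utf-8"))
    if size > MAX_REQUEST_BYTES:
        # Fail fast with a clear error instead of burning provider quota.
        raise ValueError(f"Request payload is {size} bytes, exceeds {MAX_REQUEST_BYTES}")
```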
Provider-specific classes assume environment variables and API keys are available when making requests, but don't validate credentials are valid or have sufficient permissions until the actual API call
If this fails: Invalid or expired provider API keys cause authentication failures that surface as generic HTTP 401/403 errors, making it hard to diagnose which specific provider credential is broken
litellm/llms/*/LLMProvider
Success callbacks in CustomLogger assume the ModelResponse object passed to them is complete and immutable, but there's no enforcement preventing callbacks from modifying the response object
If this fails: Poorly written logging callbacks can accidentally modify response data, causing later callbacks or the final client response to contain corrupted usage metrics or response content
litellm/integrations/custom_logger.py:CustomLogger
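A defensive sketch: hand each callback a deep copy, so a misbehaving logger cannot corrupt the response the client ultimately receives (litellm's actual dispatch may differ):

```python
import copy

def run_success_callbacks(response: dict, callbacks: list) -> dict:
    for cb in callbacks:
        try:
            # Each callback sees an isolated copy; mutations are discarded.
            cb(copy.deepcopy(response))
        except Exception as exc:
            # A broken logger should never break the client response.
            print(f"callback {cb.__name__} failed: {exc}")
    return response
```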
Failover logic assumes fallback models in the configuration are always available and healthy when primary models fail, but doesn't validate fallback model health before attempting the retry
If this fails: When primary and all fallback models are simultaneously unhealthy, requests fail with the last fallback's error message instead of a clear 'all models unavailable' error
litellm/router.py:fallback_models
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Response cache — DualCache stores LLM responses with a TTL to avoid repeat API calls for identical requests
- User database — PostgreSQL/SQLite stores API keys, user permissions, team configurations, and usage metrics
- Model health metrics — The Router maintains success/failure rates and latency metrics for each model deployment
- Configuration state — The loaded YAML config drives all proxy behavior, including model routing and feature flags
Feedback Loops
- Model health feedback (self-correction, balancing) — Trigger: API call failure or high latency. Action: Router decreases priority score for failing model, increases priority for healthy alternatives. Exit: Model recovery detected by successful calls.
- Retry with backoff (retry, balancing) — Trigger: LLM API call returns a 5xx error or rate limit. Action: Exponential backoff delay, then retry with the same or a fallback model. Exit: Success response or max retries exceeded. (A minimal sketch follows this list.)
- Usage budget enforcement (circuit-breaker, balancing) — Trigger: User exceeds configured spend limits. Action: Block further requests from that API key until budget reset. Exit: Admin increases budget or billing cycle resets.
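The retry loop described above reduces to a few lines. A minimal sketch with exponential backoff and jitter (delays and retry counts are illustrative; the real system reads them from configuration such as litellm_settings.num_retries):

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries:
                raise  # exit condition: max retries exceeded
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```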
Delays
- LLM API latency (async-processing, ~100ms-30s depending on model and prompt length) — Client waits for streaming or complete response from external LLM provider
- Database query delay (async-processing, ~1-100ms) — Authentication check blocks request processing until user permissions retrieved
- Cache TTL expiration (cache-ttl, configurable, typically 1-60 minutes) — Cached responses expire, and the next identical request hits the LLM API instead of the cache
Control Points
- Routing strategy (architecture-switch) — Controls: How requests are distributed across model deployments (round-robin, least-latency, usage-based). Default: usage-based-routing-v2
- Drop params (feature-flag) — Controls: Whether to remove unsupported parameters before sending to LLM providers. Default: true
- Request timeout (threshold) — Controls: Maximum time to wait for an LLM API response before timing out. Default: 600 seconds
- Master key (env-var) — Controls: Admin authentication bypass for proxy server management. Default: sk-1234
- Telemetry (feature-flag) — Controls: Whether to send anonymous usage data to LiteLLM team. Default: false
Technology Stack
- FastAPI — HTTP server framework for the proxy gateway, handling OpenAI-compatible REST endpoints
- Prisma — Database ORM for user management, API key storage, and usage tracking in the proxy server
- Redis — L2 cache backend in the DualCache system for storing LLM responses and configuration data
- httpx — Async HTTP client for making API calls to 100+ LLM providers with retry and timeout handling
- Pydantic — Data validation and serialization for request/response models and configuration schemas
- Docker — Containerization for proxy server deployment with hardened security configurations
- PostgreSQL — Primary database backend for multi-tenant proxy deployments with full ACID compliance
Key Components
- completion (adapter) — litellm/main.py — Core function that normalizes any LLM API call to OpenAI format: handles provider detection, request transformation, and response normalization
- Router (orchestrator) — litellm/router.py — Load balances requests across multiple LLM deployments, handles failovers, tracks model health, and applies routing strategies
- ProxyServer (gateway) — litellm/proxy/proxy_server.py — FastAPI server that exposes OpenAI-compatible endpoints and handles authentication, logging, caching, and team management
- LLMProvider (adapter) — litellm/llms/ — Provider-specific implementations that translate between OpenAI format and each LLM's native API format (hundreds of classes)
- CustomLogger (processor) — litellm/integrations/custom_logger.py — Base class for logging callbacks that track requests, responses, costs, and errors across multiple destinations
- DualCache (store) — litellm/caching/caching.py — Two-tier caching system (in-memory + Redis) that caches LLM responses and configuration data to reduce API calls
- PrismaClient (store) — litellm/proxy/utils.py — Database client that manages user authentication, API keys, usage tracking, team permissions, and audit logs
Package Structure
- Core SDK and proxy server implementation that handles LLM API unification, routing, and middleware.
- Enterprise-specific hooks and guardrails for advanced security and compliance features.
- Database utilities and deployment helpers for the proxy server infrastructure.
Frequently Asked Questions
What is litellm used for?
litellm routes 100+ LLM API calls through a unified gateway with cost tracking and access controls. berriai/litellm is a 7-component repository written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 5,169 files.
How is litellm architected?
litellm is organized into 4 architecture layers: LLM Interface Layer, Proxy Gateway Layer, Router & Load Balancer, and Enterprise Security. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through litellm?
Data moves through 7 stages: HTTP request ingestion → Authentication and authorization → Router model selection → Provider API transformation → LLM API call execution → Response normalization → Response middleware. Requests are authenticated against the database, routed to an available provider, and transformed to that provider's native format; the response is normalized back to OpenAI format before passing through logging, caching, and cost-tracking middleware. This pipeline design reflects a complex multi-stage processing system.
What technologies does litellm use?
The core stack includes FastAPI (HTTP server framework for the proxy gateway, handling OpenAI-compatible REST endpoints), Prisma (database ORM for user management, API key storage, and usage tracking), Redis (L2 cache backend in the DualCache system for LLM responses and configuration data), httpx (async HTTP client for calling 100+ LLM providers with retry and timeout handling), Pydantic (data validation and serialization for request/response models and configuration schemas), Docker (containerization for proxy server deployment with hardened security configurations), and PostgreSQL (primary database backend for multi-tenant deployments). A focused set of dependencies that keeps the build manageable.
What system dynamics does litellm have?
litellm exhibits 4 data pools (including the response cache and user database), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle self-correction and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does litellm use?
4 design patterns detected: Provider Adapter Pattern, Plugin Hook System, Multi-tier Caching, Config-driven Architecture.
How does litellm compare to alternatives?
CodeSea has side-by-side architecture comparisons of litellm with vllm, showing tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.