berriai/litellm
Python SDK and Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing, and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]
Routes 100+ LLM API calls through a unified gateway with cost tracking and access controls
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior. It is a 7-component repository (5,169 files analyzed), and data flows through 7 distinct pipeline stages.
How Data Flows Through the System
Requests enter through the proxy server's HTTP endpoints, get authenticated against the database, then pass through the router which selects an available LLM provider. The core completion function transforms the request to the provider's format, makes the API call, normalizes the response back to OpenAI format, and returns it through middleware that handles logging, caching, and cost tracking. Enterprise hooks can intercept at multiple points for content moderation and compliance.
- HTTP request ingestion — FastAPI proxy server receives OpenAI-compatible requests at /chat/completions and other endpoints, parsing JSON body into request objects
- Authentication and authorization — Proxy extracts API key from Authorization header, queries PrismaClient to validate key and load UserAPIKeyAuth with permissions and budget limits [ChatCompletionRequest → UserAPIKeyAuth] (config: general_settings.master_key)
- Router model selection — Router examines model field in request, applies routing strategy (round-robin, least-latency, etc.) to select from available deployments in model_list [ChatCompletionRequest → selected provider config] (config: model_list, litellm_settings.routing_strategy)
- Provider API transformation — LLMProvider subclass converts OpenAI-format request to provider's native format, handling authentication headers and parameter mapping [ChatCompletionRequest → provider-specific request] (config: litellm_settings.drop_params)
- LLM API call execution — HTTP client makes actual API call to selected LLM provider (OpenAI, Anthropic, etc.) with transformed request [provider-specific request → provider-specific response] (config: litellm_settings.request_timeout, litellm_settings.num_retries)
- Response normalization — Provider adapter converts response back to OpenAI ModelResponse format, standardizing field names and structures across all providers [provider-specific response → ModelResponse]
- Response middleware — CustomLogger callbacks process the response for cost tracking, usage logging, caching, and enterprise guardrails before returning to the client [ModelResponse → logged ModelResponse] (config: litellm_settings.success_callback, litellm_settings.cache). A minimal end-to-end sketch follows this list.
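Taken together, these stages collapse into a single SDK call. A minimal sketch, assuming an OPENAI_API_KEY is set in the environment (the model name is illustrative; any configured provider follows the same path):

```python
import litellm

# One unified entry point: the provider is inferred from the model name,
# the request is transformed to the provider's native format, and the
# response comes back normalized to the OpenAI shape.
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)  # normalized across all providers
print(response.usage)                       # token counts that drive cost tracking
```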
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- ChatCompletionRequest (litellm/types/utils.py) — OpenAI-compatible dict with model: str, messages: List[dict], temperature: float, max_tokens: int, plus provider-specific parameters. Created from the incoming HTTP request, normalized to OpenAI format, then transformed to the provider-specific format before the API call.
- ModelResponse (litellm/types/utils.py) — Standardized response with choices: List[Choice], usage: Usage dict, model: str, and a created: int timestamp. Generated from the provider API response, normalized to OpenAI format, then passed through logging and caching middleware.
- UserAPIKeyAuth (litellm/proxy/_types.py) — Pydantic model with api_key: str, user_id: str, team_id: str, permissions: dict, plus budget limits and usage tracking fields. Loaded from the database during auth, cached in memory, and updated with usage tracking after each request.
- Router configuration (litellm/router_utils/router_config.py) — Dict with model_list: List[ModelConfig], routing_strategy: str, fallback_models: List[str], retry_policy: dict. Loaded from the YAML config file at startup and parsed into router data structures for request routing decisions.
- Proxy configuration (litellm/proxy/_types.py) — Configuration object with model_list, general_settings, litellm_settings, and environment-specific parameters from YAML files. Parsed from proxy_server_config.yaml at startup; drives all proxy server behavior and feature enablement.
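To make the key-auth contract concrete, here is a simplified, hypothetical Pydantic sketch in the spirit of UserAPIKeyAuth (the real class in litellm/proxy/_types.py carries many more fields):

```python
from typing import Optional
from pydantic import BaseModel

class UserAPIKeyAuthSketch(BaseModel):
    """Simplified stand-in for the UserAPIKeyAuth contract described above."""
    api_key: str
    user_id: Optional[str] = None
    team_id: Optional[str] = None
    max_budget: Optional[float] = None   # spend ceiling for this key
    spend: float = 0.0                   # accumulated spend, updated per request

    def over_budget(self) -> bool:
        return self.max_budget is not None and self.spend >= self.max_budget
```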
Hidden Assumptions
Things this code relies on but never validates. These are what cause silent failures when the system changes.
The core completion function assumes all provider-specific LLMProvider classes implement the same interface for request transformation and response normalization, but there's no abstract base class or validation to enforce this contract
If this fails: When new providers are added with missing or incorrectly named methods, requests silently fail with AttributeError or return malformed responses that break downstream consumers
litellm/main.py:completion
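One way to enforce that contract, sketched here as a hypothetical abstract base class (litellm does not currently ship this exact hierarchy):

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseProvider(ABC):
    """Hypothetical contract every provider adapter would have to satisfy."""

    @abstractmethod
    def transform_request(self, openai_request: dict) -> dict:
        """Convert an OpenAI-format request into the provider's native format."""

    @abstractmethod
    def transform_response(self, provider_response: Any) -> dict:
        """Normalize the provider's native response back to OpenAI format."""

# A subclass missing either method now fails loudly at instantiation time
# (TypeError) instead of raising AttributeError deep inside a live request.
```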
Router assumes model_list configuration contains deployments with consistent structure (model_name, litellm_params, etc.) but only validates top-level dict existence, not required nested fields
If this fails: Missing required fields like api_key or api_base in deployment configs cause KeyError crashes during actual API calls, not during configuration validation
litellm/router.py:Router
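A hedged sketch of load-time validation for the nested deployment fields (the field names mirror litellm's config conventions; the validator itself is hypothetical):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class LiteLLMParams(BaseModel):
    model: str
    api_key: str                      # required here, so absence fails at load time
    api_base: Optional[str] = None

class Deployment(BaseModel):
    model_name: str
    litellm_params: LiteLLMParams

try:
    Deployment(model_name="gpt-4", litellm_params={"model": "openai/gpt-4"})
except ValidationError as e:
    # The missing api_key is reported at config load, not as a KeyError mid-call.
    print(e)
```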
Authentication middleware assumes API keys in Authorization header follow 'Bearer sk-...' or 'sk-...' format but doesn't validate the actual key structure or length before database queries
If this fails: Malformed API keys cause expensive database scans or SQL injection vulnerabilities if the key contains special characters that aren't properly escaped
litellm/proxy/proxy_server.py:authentication
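A cheap structural pre-check before any database round-trip might look like this (the exact key grammar is an assumption; litellm keys conventionally start with sk-):

```python
import re

# Hypothetical pre-filter: reject obviously malformed keys before any DB query.
_KEY_PATTERN = re.compile(r"^(Bearer\s+)?sk-[A-Za-z0-9_-]{10,128}$")

def looks_like_api_key(header_value: str) -> bool:
    return bool(_KEY_PATTERN.match(header_value.strip()))

assert looks_like_api_key("Bearer sk-abc123def456")
assert not looks_like_api_key("'; DROP TABLE keys; --")
```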
Cache system assumes cached ModelResponse objects remain valid and compatible with current response schema, but doesn't version cache entries or validate schema on retrieval
If this fails: When ModelResponse structure changes between versions, clients receive cached responses with missing or wrongly-typed fields, causing silent data corruption
litellm/caching/caching.py:DualCache
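A common mitigation, sketched under the assumption that cache keys are plain strings: embed a schema version in the key, so entries written by an older release simply miss after an upgrade and age out via TTL:

```python
import hashlib
import json

SCHEMA_VERSION = "v2"  # bump whenever the ModelResponse shape changes

def cache_key(request: dict) -> str:
    payload = json.dumps(request, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # Entries written under an older schema version are never read again.
    return f"{SCHEMA_VERSION}:{digest}"
```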
Model health tracking assumes in-memory success/failure counters won't overflow or consume unbounded memory, but doesn't implement cleanup for inactive models or cap the number of tracked deployments
If this fails: Long-running proxy servers with many model deployments experience memory leaks as health metrics accumulate indefinitely, eventually causing OOM crashes
litellm/router.py:health_tracking
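A bounded alternative, sketched with an LRU cap (this counter structure is hypothetical, not litellm's internal one):

```python
from collections import OrderedDict

class BoundedHealthStats:
    """Per-deployment success/failure counts with an LRU eviction cap."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._stats: OrderedDict = OrderedDict()

    def record(self, deployment_id: str, success: bool) -> None:
        entry = self._stats.pop(deployment_id, {"success": 0, "failure": 0})
        entry["success" if success else "failure"] += 1
        self._stats[deployment_id] = entry          # move to most-recent position
        if len(self._stats) > self.max_entries:
            self._stats.popitem(last=False)         # evict least recently used
```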
Database operations assume UserAPIKeyAuth records are updated atomically for usage tracking, but concurrent requests can race to update the same user's budget/usage counters
If this fails: Multiple simultaneous requests from the same API key can cause budget enforcement to fail, allowing users to exceed spending limits until the next database sync
litellm/proxy/utils.py:PrismaClient
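The classic fix is to push the increment into the database instead of doing read-modify-write in Python. A sketch using SQLite for illustration (the table and column names are assumptions, not litellm's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keys (api_key TEXT PRIMARY KEY, spend REAL, max_budget REAL)")
conn.execute("INSERT INTO keys VALUES ('sk-demo', 0.0, 10.0)")

def record_spend(api_key: str, cost: float) -> None:
    # Atomic in-database increment: concurrent requests cannot lose updates,
    # unlike SELECT spend -> add in Python -> UPDATE.
    conn.execute("UPDATE keys SET spend = spend + ? WHERE api_key = ?", (cost, api_key))
    conn.commit()

record_spend("sk-demo", 0.25)
print(conn.execute("SELECT spend FROM keys").fetchone())  # (0.25,)
```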
Request transformation assumes message content and parameters fit within reasonable size limits, but doesn't validate total request payload size before sending to LLM providers
If this fails: Extremely large requests (multi-MB prompts) get sent to providers that reject them with cryptic errors, wasting API quota and causing confusing timeouts
litellm/main.py:completion
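A pre-flight size guard takes only a few lines to sketch (the 1 MB threshold is an arbitrary illustration, not a litellm default):

```python
import json

MAX_REQUEST_BYTES = 1_000_000  # illustrative threshold, not a real default

def check_payload_size(request: dict) -> None:
    size = len(json.dumps(request).encode("utf-8"))
    if size > MAX_REQUEST_BYTES:
        # Fail fast with a clear error instead of burning provider quota.
        raise ValueError(f"Request payload is {size} bytes, exceeds {MAX_REQUEST_BYTES}")
```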
Provider-specific classes assume environment variables and API keys are available when making requests, but don't validate credentials are valid or have sufficient permissions until the actual API call
If this fails: Invalid or expired provider API keys cause authentication failures that surface as generic HTTP 401/403 errors, making it hard to diagnose which specific provider credential is broken
litellm/llms/*/LLMProvider
Success callbacks in CustomLogger assume the ModelResponse object passed to them is complete and immutable, but there's no enforcement preventing callbacks from modifying the response object
If this fails: Poorly written logging callbacks can accidentally modify response data, causing later callbacks or the final client response to contain corrupted usage metrics or response content
litellm/integrations/custom_logger.py:CustomLogger
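A defensive sketch: hand each callback a deep copy, so a misbehaving logger cannot corrupt the response the client ultimately receives (litellm's actual dispatch may differ):

```python
import copy

def run_success_callbacks(response: dict, callbacks: list) -> dict:
    for cb in callbacks:
        try:
            # Each callback sees an isolated copy; mutations are discarded.
            cb(copy.deepcopy(response))
        except Exception as exc:
            # A broken logger should never break the client response.
            print(f"callback {cb.__name__} failed: {exc}")
    return response
```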
Failover logic assumes fallback models in the configuration are always available and healthy when primary models fail, but doesn't validate fallback model health before attempting the retry
If this fails: When primary and all fallback models are simultaneously unhealthy, requests fail with the last fallback's error message instead of a clear 'all models unavailable' error
litellm/router.py:fallback_models
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Response cache — DualCache stores LLM responses with a TTL to avoid repeat API calls for identical requests
- User database — PostgreSQL/SQLite stores API keys, user permissions, team configurations, and usage metrics
- Model health metrics — The Router maintains success/failure rates and latency metrics for each model deployment
- Configuration state — The loaded YAML config drives all proxy behavior, including model routing and feature flags
Feedback Loops
- Model health feedback (self-correction, balancing) — Trigger: API call failure or high latency. Action: Router decreases priority score for failing model, increases priority for healthy alternatives. Exit: Model recovery detected by successful calls.
- Retry with backoff (retry, balancing) — Trigger: LLM API call returns a 5xx error or rate limit. Action: Exponential backoff delay, then retry with the same or a fallback model. Exit: Success response or max retries exceeded. (A minimal sketch follows this list.)
- Usage budget enforcement (circuit-breaker, balancing) — Trigger: User exceeds configured spend limits. Action: Block further requests from that API key until budget reset. Exit: Admin increases budget or billing cycle resets.
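The retry loop described above reduces to a few lines. A minimal sketch with exponential backoff and jitter (delays and retry counts are illustrative; the real system reads them from configuration such as litellm_settings.num_retries):

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 3, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries:
                raise  # exit condition: max retries exceeded
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```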
Delays
- LLM API latency (async-processing, ~100ms-30s depending on model and prompt length) — Client waits for streaming or complete response from external LLM provider
- Database query delay (async-processing, ~1-100ms) — Authentication check blocks request processing until user permissions retrieved
- Cache TTL expiration (cache-ttl, configurable, typically 1-60 minutes) — Cached responses expire, and the next identical request hits the LLM API instead of the cache
Control Points
- Routing strategy (architecture-switch) — Controls: How requests are distributed across model deployments (round-robin, least-latency, usage-based). Default: usage-based-routing-v2
- Drop params (feature-flag) — Controls: Whether to remove unsupported parameters before sending to LLM providers. Default: true
- Request timeout (threshold) — Controls: Maximum time to wait for an LLM API response before timing out. Default: 600 seconds
- Master key (env-var) — Controls: Admin authentication bypass for proxy server management. Default: sk-1234
- Telemetry (feature-flag) — Controls: Whether to send anonymous usage data to LiteLLM team. Default: false
Technology Stack
- FastAPI — HTTP server framework for the proxy gateway, handling OpenAI-compatible REST endpoints
- Prisma — Database ORM for user management, API key storage, and usage tracking in the proxy server
- Redis — L2 cache backend in the DualCache system for storing LLM responses and configuration data
- httpx — Async HTTP client for making API calls to 100+ LLM providers with retry and timeout handling
- Pydantic — Data validation and serialization for request/response models and configuration schemas
- Docker — Containerization for proxy server deployment with hardened security configurations
- PostgreSQL — Primary database backend for multi-tenant proxy deployments with full ACID compliance
Key Components
- completion (adapter) — litellm/main.py — Core function that normalizes any LLM API call to OpenAI format: handles provider detection, request transformation, and response normalization
- Router (orchestrator) — litellm/router.py — Load balances requests across multiple LLM deployments, handles failovers, tracks model health, and applies routing strategies
- ProxyServer (gateway) — litellm/proxy/proxy_server.py — FastAPI server that exposes OpenAI-compatible endpoints and handles authentication, logging, caching, and team management
- LLMProvider (adapter) — litellm/llms/ — Provider-specific implementations that translate between OpenAI format and each LLM's native API format (hundreds of classes)
- CustomLogger (processor) — litellm/integrations/custom_logger.py — Base class for logging callbacks that track requests, responses, costs, and errors across multiple destinations
- DualCache (store) — litellm/caching/caching.py — Two-tier caching system (in-memory + Redis) that caches LLM responses and configuration data to reduce API calls
- PrismaClient (store) — litellm/proxy/utils.py — Database client that manages user authentication, API keys, usage tracking, team permissions, and audit logs
Package Structure
- Core SDK and proxy server implementation that handles LLM API unification, routing, and middleware.
- Enterprise-specific hooks and guardrails for advanced security and compliance features.
- Database utilities and deployment helpers for the proxy server infrastructure.
Frequently Asked Questions
What is litellm used for?
litellm routes 100+ LLM API calls through a unified gateway with cost tracking and access controls. berriai/litellm is a 7-component repository written in Python; data flows through 7 distinct pipeline stages, and the codebase contains 5,169 files.
How is litellm architected?
litellm is organized into 4 architecture layers: LLM Interface Layer, Proxy Gateway Layer, Router & Load Balancer, and Enterprise Security. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through litellm?
Data moves through 7 stages: HTTP request ingestion → Authentication and authorization → Router model selection → Provider API transformation → LLM API call execution → Response normalization → Response middleware. Requests are authenticated against the database, routed to an available provider, and transformed to that provider's native format; the response is normalized back to OpenAI format before passing through logging, caching, and cost-tracking middleware. This pipeline design reflects a complex multi-stage processing system.
What technologies does litellm use?
The core stack includes FastAPI (HTTP server framework for the proxy gateway, handling OpenAI-compatible REST endpoints), Prisma (database ORM for user management, API key storage, and usage tracking), Redis (L2 cache backend in the DualCache system for LLM responses and configuration data), httpx (async HTTP client for calling 100+ LLM providers with retry and timeout handling), Pydantic (data validation and serialization for request/response models and configuration schemas), Docker (containerization for proxy server deployment with hardened security configurations), and PostgreSQL (primary database backend for multi-tenant deployments). A focused set of dependencies that keeps the build manageable.
What system dynamics does litellm have?
litellm exhibits 4 data pools (including the response cache and user database), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle self-correction and retry. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does litellm use?
4 design patterns detected: Provider Adapter Pattern, Plugin Hook System, Multi-tier Caching, Config-driven Architecture.
How does litellm compare to alternatives?
CodeSea has side-by-side architecture comparisons of litellm with vllm, showing tech stack differences, pipeline design, system behavior, and code patterns. See the comparison pages for detailed analysis.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.