oobabooga/textgen

The original local LLM interface. Text, vision, tool-calling, training. UI + API, 100% offline and private.

46,796 stars · Python · 8 components

Runs large language models locally with web UI and API for text generation

User requests enter through the web UI or API endpoints, get validated and converted to internal formats, then flow through the text generation pipeline where models process prompts using configured sampling parameters. Extensions can modify inputs and outputs at each stage, while vector extensions inject relevant context from document stores. Generated text streams back to users with optional post-processing like TTS or translation.

Under the hood, the system uses 4 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

An 8-component fullstack with no detected inter-component connections. 110 files analyzed. Minimal connections — components operate mostly in isolation.

How Data Flows Through the System


  1. Request ingestion and validation — server.py receives HTTP requests and routes them to the appropriate handlers — chat requests go to /v1/chat/completions, text completions to /v1/completions. Pydantic models in modules/api/typing.py validate request schemas and extract parameters like prompt, temperature, and max_tokens [HTTP requests → GenerationRequest]
  2. Extension input processing — ExtensionManager calls input_modifier hooks on active extensions — google_translate may translate the prompt to English, send_pictures processes attached images with BLIP captioning, superbooga injects relevant document chunks from vector search [GenerationRequest → Modified prompts] (config: activate, language string, chunk_count)
  3. Context building and formatting — ChatHandler in modules/chat.py applies character personas and chat templates, builds conversation history, and formats the final prompt according to the model's expected format (ChatML, Alpaca, etc.) [ChatMessage → Formatted prompts]
  4. Model inference and sampling — TextGenerationWebModel coordinates with the loaded backend (llama.cpp, Transformers) to generate tokens. LogitsProcessor extensions modify probability distributions, sampling parameters like temperature and top_p control randomness, and the model produces token streams [Formatted prompts → Generated text streams] (config: temperature, top_p, top_k +4)
  5. Extension output processing — output_modifier hooks transform generated text — google_translate converts back to the target language, coqui_tts generates speech audio, perplexity_colors adds HTML markup for token probability visualization [Generated text streams → Processed responses] (config: language string, voice, speaker +1)
  6. Response serialization and streaming — APIHandler formats responses according to OpenAI API spec with usage statistics, choice objects, and finish_reason. Streaming responses use Server-Sent Events (SSE) to deliver tokens incrementally to the client [Processed responses → API responses]
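The input_modifier/output_modifier hooks in stages 2 and 5 follow a simple convention: each extension script exposes a `params` dict plus optional hook functions, and the manager chains every active extension's hooks in order. A minimal sketch of that pattern (the `prefix` param and the chaining helper are illustrative, not the project's exact code):

```python
import types

# Hypothetical extension-level configuration, mirroring the 'params' dict
# convention described above.
params = {"activate": True, "prefix": "[processed] "}

def input_modifier(text, state=None):
    """Runs on the prompt before it reaches the model."""
    if not params["activate"]:
        return text
    return text.strip()

def output_modifier(text, state=None):
    """Runs on the generated text before it is returned to the user."""
    if not params["activate"]:
        return text
    return params["prefix"] + text

def apply_hooks(text, extensions, hook_name):
    """How a manager might chain one hook across all active extensions."""
    for ext in extensions:
        hook = getattr(ext, hook_name, None)
        if hook is not None:  # extensions without this hook are skipped
            text = hook(text)
    return text

# Chaining example: wrap this module's hooks in a namespace object.
ext = types.SimpleNamespace(input_modifier=input_modifier)
```

Because hooks are looked up by name and skipped when absent, an extension only implements the stages it cares about.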

Data Models

The data structures that flow between stages — the contracts that hold the system together.

GenerationRequest modules/api/typing.py
Pydantic model with prompt: str, sampling parameters (temperature, top_p, max_tokens, etc.), model selection fields, and optional tool definitions for function calling
Created from API requests, validated against schema, passed to generation engine with sampling configuration
ChatMessage modules/api/typing.py
dict with role: str ('user'|'assistant'|'system'), content: str or list (for multimodal), optional tool_calls and function_call fields
Received in API requests, stored in chat history, converted to model-specific prompt format during generation
ModelParameters modules/shared.py
Global args namespace containing model_name: str, loader: str, device settings, quantization options, and 40+ sampling parameters like temperature, dynatemp_range, min_p
Initialized from command line args and settings files, modified by API requests, used throughout generation pipeline
ExtensionState extensions/*/script.py
dict called 'params' with extension-specific configuration like activate: bool, model settings, and feature flags (e.g. TTS voice, translation language)
Loaded from extension scripts, modified through web UI, applied during text input/output processing
VectorChunk extensions/superboogav2/chromadb.py
Document chunk with text content, embedding vector, metadata dict containing source file and chunk ID, stored in ChromaDB collection
Created by splitting documents into chunks, embedded using sentence transformers, stored in vector database, retrieved for RAG context
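The GenerationRequest contract above can be sketched with stdlib dataclasses (the real project uses Pydantic; the exact range checks and any field defaults shown here are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GenerationRequest:
    """Stdlib stand-in for the Pydantic request schema described above."""
    prompt: str
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 512
    model: Optional[str] = None
    tools: List[dict] = field(default_factory=list)  # optional tool definitions

    def __post_init__(self):
        # Mirror the kind of bounds a validation schema would enforce.
        if not 0.0 <= self.temperature <= 5.0:
            raise ValueError("temperature out of range")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")
```

Rejecting out-of-range sampling parameters at the schema boundary keeps bad values from reaching the generation engine.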

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Domain unguarded

Assumes Hugging Face model repositories follow standard file naming patterns and contain valid model files, but only validates HTTP responses exist without checking file formats or model compatibility

If this fails: Downloads corrupt files or incompatible model formats that fail silently during loading, wasting bandwidth and storage

download-model.py:ModelDownloader
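One way to close this gap is to sniff the file header before trusting a download. GGUF files begin with the ASCII magic `b"GGUF"`, and a safetensors file begins with an 8-byte little-endian JSON-header length followed by `{`. A sketch of such checks (not the project's actual validation):

```python
import struct

def looks_like_gguf(head: bytes) -> bool:
    """GGUF files open with the 4-byte ASCII magic 'GGUF'."""
    return head[:4] == b"GGUF"

def looks_like_safetensors(head: bytes) -> bool:
    """Safetensors files open with a u64-LE header length, then a JSON '{'."""
    if len(head) < 9:
        return False
    (header_len,) = struct.unpack("<Q", head[:8])
    # A sane header length and a JSON opening brace; an HTML error page
    # returned by a broken download fails both checks.
    return 0 < header_len < 100_000_000 and head[8:9] == b"{"
```

Reading only the first few bytes means a corrupt or mislabeled download can be rejected before it wastes bandwidth on the full file.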
critical Resource weakly guarded

Assumes CUDA device has sufficient VRAM for TTS model loading based on torch.cuda.is_available() check, but never measures actual memory requirements or available capacity

If this fails: TTS model loading fails with CUDA OOM errors on GPUs with limited VRAM, causing extension to crash without fallback to CPU

extensions/coqui_tts/script.py:TTS
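A defensive pattern here is to pick the device from measured free memory rather than a bare availability check (PyTorch exposes `torch.cuda.mem_get_info()` for the free/total figures). The policy below keeps the probe injected so it stays testable; the byte thresholds are illustrative:

```python
def choose_device(free_vram_bytes, required_bytes):
    """Fall back to CPU when the GPU cannot fit the model.

    free_vram_bytes: measured free VRAM, or None when no CUDA device exists
    (e.g. the first element of torch.cuda.mem_get_info()).
    """
    if free_vram_bytes is None:
        return "cpu"  # no CUDA device at all
    if free_vram_bytes < required_bytes:
        return "cpu"  # device exists but cannot hold the model
    return "cuda"
```

With this check the TTS model degrades to CPU inference instead of crashing the extension with an OOM error.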
critical Environment unguarded

Assumes Google Translate API is always accessible and responsive without network timeout handling or offline detection

If this fails: Extension blocks indefinitely on translate calls when network is down, causing the entire generation pipeline to hang

extensions/google_translate/script.py:GoogleTranslator
critical Contract weakly guarded

Expects PIL Image objects from image uploads but only checks .convert('RGB') method exists, not validating image format or dimensions

If this fails: Processing corrupted image files or unsupported formats causes BlipProcessor to raise exceptions, crashing the chat interface

extensions/send_pictures/script.py:caption_image
critical Ordering unguarded

Assumes input_ids tensor grows monotonically during generation and can safely index with [-1] for last token, but streaming or batch processing may violate this

If this fails: IndexError when processing empty or malformed input_ids tensors, causing generation to fail with cryptic error messages

extensions/perplexity_colors/script.py:PerplexityLogits.__call__
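Guarding the `[-1]` index means treating an empty sequence as a signal to skip scoring rather than raising deep inside generation. A sketch of the defensive version (list-based scores stand in for the logits tensor):

```python
def last_token_id(input_ids):
    """Return the final token id, or None for an empty/malformed sequence."""
    if input_ids is None or len(input_ids) == 0:
        return None
    return input_ids[-1]

def apply_penalty(input_ids, scores, penalised_id, penalty=-1e9):
    """Only touch the logits when there is a real last token to inspect."""
    last = last_token_id(input_ids)
    if last is not None and last == penalised_id:
        scores = list(scores)        # copy before mutating
        scores[penalised_id] = penalty
    return scores                    # empty input: distribution untouched
```

The same guard applies to any logits processor that inspects recent tokens: check the sequence length first, then index.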
warning Scale unguarded

Hardcodes newline token ID from tokenizer.encode('\n')[-1] assuming single token output, but different tokenizers may encode newlines as multiple tokens

If this fails: Wrong token gets suppressed, failing to enforce minimum length constraints and potentially biasing generation toward unexpected tokens

extensions/long_replies/script.py:MyLogits.__call__
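A safer version suppresses every token the tokenizer produces for `"\n"`, not just the last one. The stub tokenizer below is illustrative; real tokenizers expose a comparable `encode()`:

```python
def newline_token_ids(tokenizer):
    """All ids the tokenizer emits for a newline — may be more than one
    (some tokenizers prepend a BOS token or split the character)."""
    return set(tokenizer.encode("\n"))

def suppress_tokens(scores, token_ids, floor=-1e9):
    """Push every listed token's logit to the floor."""
    scores = list(scores)
    for tid in token_ids:
        scores[tid] = floor
    return scores

class StubTokenizer:
    """Hypothetical tokenizer that encodes '\\n' as two tokens (BOS + newline)."""
    def encode(self, s):
        return [2, 13] if s == "\n" else [0]
```

Suppressing the whole set is conservative: it may over-suppress a shared BOS id, so a production version would filter out special tokens first.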
warning Temporal unguarded

Assumes Stable Diffusion API server at hardcoded address 'http://127.0.0.1:7860' is always running and responsive without health checks

If this fails: Extension fails silently when SD server is down, leaving users with no image generation feedback or error indication

extensions/sd_api_pictures/script.py:params
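A cheap health probe before calling the image API turns a down server into a visible message instead of a silent no-op. The URL mirrors the hardcoded default above; the `fetch` callable is injected for testability (a real probe might use `urllib.request.urlopen` with a short timeout), and the `/sdapi/v1/txt2img` path is the usual AUTOMATIC1111 endpoint, assumed here:

```python
SD_URL = "http://127.0.0.1:7860"

def sd_available(fetch, url=SD_URL, timeout_s=2.0):
    """True only when the server answers the probe without raising."""
    try:
        fetch(url, timeout_s)
        return True
    except Exception:
        return False

def generate_picture(fetch, prompt):
    """Return (image_bytes, error): exactly one of the two is None."""
    if not sd_available(fetch):
        return None, "Stable Diffusion server is unreachable at " + SD_URL
    return fetch(SD_URL + "/sdapi/v1/txt2img", 30.0), None
```

Surfacing the error string in the chat output gives users an actionable hint ("start the SD server") instead of a missing image.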
warning Resource weakly guarded

Assumes DOM structure remains stable with specific element IDs ('gallery-extension', 'chat-mode') existing, but dynamic UI changes could break element queries

If this fails: JavaScript errors when elements are missing, breaking gallery visibility controls and potentially crashing the web interface

extensions/gallery/script.js:extensions_block
warning Domain unguarded

Creates bias_options.txt with hardcoded emotional state examples assuming these strings are valid bias patterns, but never validates format or model compatibility

If this fails: Bias strings may not match model's training format, causing unexpected generation behavior or no effect at all

extensions/character_bias/script.py:bias_file
info Environment unguarded

Sets COQUI_TOS_AGREED environment variable assuming this bypasses TOS prompts permanently, but library updates might change this behavior

If this fails: Future Coqui TTS versions might ignore this flag, causing interactive TOS prompts to block automated generation

extensions/coqui_tts/script.py:os.environ

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model cache (file-store)
Downloaded model files (GGUF, safetensors) stored locally with metadata and checksums for integrity verification
Chat history (in-memory)
Active conversation context maintained per session with message history, character state, and generation parameters
Vector database (database)
ChromaDB collection storing document embeddings for retrieval-augmented generation with semantic search capabilities
Extension state (in-memory)
Global parameters and extension configurations that persist across requests and control system behavior
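The vector database pool is fed by a chunking step: documents are split into overlapping windows so retrieved context isn't cut mid-thought, then each chunk is embedded and stored with its metadata. A sketch of that splitting (the sizes are illustrative defaults, not the project's actual settings):

```python
def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks with positional metadata,
    ready to be embedded and stored in a vector collection."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # each window starts `step` chars after the last
    chunks = []
    for i, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({"id": i, "start": start, "text": piece})
    return chunks
```

The `overlap` ensures a sentence straddling a chunk boundary appears whole in at least one chunk, which keeps semantic search from missing it.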

Feedback Loops

Delays

Control Points

Technology Stack

Flask (framework)
Web server framework providing HTTP endpoints, static file serving, and WebSocket support for the UI and API
PyTorch/Transformers (compute)
Primary ML backend for loading and running transformer models with GPU acceleration and quantization
llama.cpp (runtime)
C++ inference engine for GGUF models providing CPU-optimized execution and reduced memory usage
ChromaDB (database)
Vector database for storing document embeddings and performing semantic search in RAG extensions
Gradio (framework)
UI component library generating the web interface for model interaction, parameter control, and extension management
Pydantic (library)
Data validation and serialization for API requests, ensuring type safety and OpenAI compatibility


Frequently Asked Questions

What is textgen used for?

oobabooga/textgen runs large language models locally with a web UI and API for text generation. It is an 8-component fullstack written in Python with minimal connections — components operate mostly in isolation. The codebase contains 110 files.

How is textgen architected?

textgen is organized into 4 architecture layers: Web Server & API, Model Management, Text Generation Engine, Extension Framework. Minimal connections — components operate mostly in isolation. This layered structure keeps concerns separated and modules independent.

How does data flow through textgen?

Data moves through 6 stages: Request ingestion and validation → Extension input processing → Context building and formatting → Model inference and sampling → Extension output processing → Response serialization and streaming. Requests are validated and converted to internal formats, enriched by extension hooks and vector-store context, formatted into model-specific prompts, sampled by the loaded backend, post-processed (e.g. TTS or translation), and streamed back to the client. This pipeline design reflects a complex multi-stage processing system.

What technologies does textgen use?

The core stack includes Flask (Web server framework providing HTTP endpoints, static file serving, and WebSocket support for the UI and API), PyTorch/Transformers (Primary ML backend for loading and running transformer models with GPU acceleration and quantization), llama.cpp (C++ inference engine for GGUF models providing CPU-optimized execution and reduced memory usage), ChromaDB (Vector database for storing document embeddings and performing semantic search in RAG extensions), Gradio (UI component library generating the web interface for model interaction, parameter control, and extension management), Pydantic (Data validation and serialization for API requests, ensuring type safety and OpenAI compatibility). A focused set of dependencies that keeps the build manageable.

What system dynamics does textgen have?

textgen exhibits 4 data pools (Model cache, Chat history, Vector database, Extension state), 4 feedback loops, 5 control points, and 4 delays. The feedback loops handle retry and training-loop behavior. These runtime dynamics shape how the system responds to load, failures, and configuration changes.

What design patterns does textgen use?

4 design patterns detected: Extension Hook System, Backend Abstraction, Streaming Generation, Plugin Configuration.

Analyzed on April 20, 2026 by CodeSea.