assafelovic/gpt-researcher
An autonomous agent that conducts deep research on any data using any LLM provider
Orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML inference system. 284 files analyzed. Data flows through 8 distinct pipeline stages.
How Data Flows Through the System
Research requests enter through the web interface or API, get converted into structured queries with subquestions, then trigger parallel web scraping across multiple sources. Retrieved content is processed and analyzed by LLMs, with findings accumulated into a research state that gets synthesized into the final report while streaming progress updates to connected clients.
- Accept research query — FastAPI server receives POST request with research task, validates ResearchRequest schema including task description, report type, source preferences, and tone settings
- Initialize GPTResearcher — Creates GPTResearcher instance with validated parameters, loads configuration from environment variables for LLM provider, API keys, and research settings [ResearchRequest → ResearchState] (config: OPENAI_API_KEY, TAVILY_API_KEY, LOGGING_LEVEL)
- Generate subqueries — LLM analyzes main research task and breaks it into 3-5 specific subqueries using create_chat_completion, each targeting different aspects of the research topic [ResearchState → List of subqueries]
- Parallel web retrieval — RetrieverFactory spawns multiple scrapers (Tavily, Google, DuckDuckGo) in parallel using asyncio.gather, each executing subqueries and collecting web sources with content extraction [List of subqueries → SourceData] (config: GOOGLE_API_KEY, IMAGE_GENERATION_ENABLED)
- Process and analyze sources — DocumentProcessor cleans scraped content, extracts text from PDFs/HTML, while LLM analyzes each source for relevance and extracts key insights using configurable analysis prompts [SourceData → Analyzed content]
- Synthesize research findings — ReportGenerator uses LLM to synthesize all analyzed content into coherent report sections following configured template structure and tone settings from ResearchRequest [Analyzed content → ResearchState]
- Generate final report — LLM creates formatted markdown report with introduction, main sections, conclusion, and citations, then converts to PDF/DOCX using write_md_to_pdf and write_md_to_word utilities [ResearchState → Final report]
- Stream research progress — WebSocketManager broadcasts real-time updates throughout the pipeline using JSON messages with type, content, and metadata fields to connected frontend clients [ChatMessage → Stream updates]
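The "parallel web retrieval" step above can be sketched in a few lines. This is an illustrative simplification, not the project's actual code: the retriever names and the `run_retriever`/`gather_sources` helpers are made up, but the concurrency pattern (asyncio.gather over every retriever x subquery pair, with failures isolated) matches the behavior described.

```python
import asyncio

async def run_retriever(name: str, subquery: str) -> dict:
    # Stand-in for a real scraper call (Tavily, Google, DuckDuckGo, ...)
    await asyncio.sleep(0)  # simulate network I/O
    return {"retriever": name, "query": subquery, "results": []}

async def gather_sources(subqueries: list[str], retrievers: list[str]) -> list[dict]:
    tasks = [run_retriever(r, q) for q in subqueries for r in retrievers]
    # return_exceptions=True keeps one failed scraper from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

sources = asyncio.run(gather_sources(
    ["history of X", "recent work on X"],
    ["tavily", "duckduckgo"],
))
```

With 2 subqueries and 2 retrievers, the batch yields 4 source payloads, and any scraper that raises is simply filtered out of the results.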
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- backend/server/app.py — Pydantic model with task: str, report_type: str, report_source: str, tone: str, headers: dict, repo_name: str, branch_name: str, generate_in_background: bool. Created from incoming HTTP requests, validated by Pydantic, then passed to the research orchestrator for task execution.
- frontend/nextjs/types/data.ts — TypeScript interface with role: 'user'|'assistant'|'system', content: string, timestamp?: number, metadata?: any. Generated during research streaming to show progress updates, stored in frontend state, and displayed in the chat interface.
- backend/memory/research.py — TypedDict with task: dict, initial_research: str, sections: List[str], research_data: List[dict], title: str, headers: dict, date: str, table_of_contents: str, introduction: str, conclusion: str, sources: List[str], report: str. Accumulates research findings across the pipeline, starting with the task definition and building up sections until final report synthesis.
- frontend/nextjs/components/ResearchBlocks/Sources.tsx — Object with name: string, url: string representing scraped web sources with extracted content. Created during the web retrieval phase, validated for accessibility, then displayed in the frontend with domain extraction and link formatting.
- frontend/nextjs/types/data.ts — Interface with report_type: string, report_source: string, tone: string, domains: string[], defaultReportType: string, layoutType: string, mcp_enabled: boolean, mcp_configs: MCPConfig[], mcp_strategy?: string. Configured by the user in frontend settings, passed to the backend to control research behavior like source selection and LLM tone.
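The research-state contract above can be sketched directly with the standard library. The field names below come from the backend/memory/research.py entry; the sample values are invented for illustration, and `total=False` is an assumption reflecting how the state accumulates fields as the pipeline progresses rather than being fully populated up front.

```python
from typing import TypedDict, List

class ResearchState(TypedDict, total=False):
    task: dict
    initial_research: str
    sections: List[str]
    research_data: List[dict]
    title: str
    headers: dict
    date: str
    table_of_contents: str
    introduction: str
    conclusion: str
    sources: List[str]
    report: str

# The state starts nearly empty and accumulates keys stage by stage.
state: ResearchState = {"task": {"query": "example topic"}, "sections": []}
state["sections"].append("Background")
```

Because each pipeline stage only adds keys, downstream stages must treat absent keys (like `report` before synthesis) as "not yet produced" rather than as errors.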
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The 'headers' field in ResearchRequest model is optional (dict | None), but downstream code expects it to contain authentication tokens or API keys for external services without validation
If this fails: If headers dict is missing required API keys for configured retrievers (like 'Authorization' for custom APIs), the research pipeline fails silently or produces empty results without clear error messages
backend/server/app.py:ResearchRequest
The progress callback function expects progress object to have specific attributes (current_depth, total_depth, current_breadth, total_breadth, completed_queries, total_queries, current_query) in a particular sequence
If this fails: If GPTResearcher's progress tracking changes its data structure or adds new required fields, the callback crashes with AttributeError during deep research execution
backend/report_type/deep_research/main.py:on_progress
Discord messages are chunked into 1500-character pieces and sent sequentially without regard for Discord's rate limits (50 requests per second per guild)
If this fails: Large research reports cause the bot to exceed Discord's rate limits, resulting in HTTP 429 errors and incomplete message delivery to users
docs/discord-bot/index.js:splitMessage
The cooldown mechanism uses Date.now(), assumes the system clock is monotonic, and doesn't account for server restarts clearing the in-memory cooldowns object
If this fails: After bot restarts, all cooldown timers reset, allowing spam in help forums immediately instead of respecting the 30-minute intervals
docs/discord-bot/index.js:cooldowns
WebSocket protocol determination relies on simple string matching of 'https' in host URL, but doesn't handle edge cases like custom ports or IP addresses with SSL
If this fails: Connecting to HTTPS endpoints on non-standard ports or SSL-enabled IP addresses fails with incorrect protocol selection (ws:// vs wss://)
docs/npm/index.js:initializeWebSocket
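Expressed in Python terms (the original code is JavaScript), a more robust approach parses the URL scheme instead of substring-matching 'https', which handles custom ports and IP-address hosts correctly. The `ws_url` helper below is a hypothetical sketch, not code from the repository:

```python
from urllib.parse import urlparse

def ws_url(http_url: str) -> str:
    """Map an HTTP(S) endpoint to its WebSocket equivalent by parsing the
    URL rather than checking whether 'https' appears anywhere in the string."""
    parsed = urlparse(http_url)
    scheme = "wss" if parsed.scheme == "https" else "ws"
    # netloc preserves host:port exactly, including IP addresses
    return f"{scheme}://{parsed.netloc}{parsed.path or ''}"

# An SSL endpoint on a non-standard port maps to wss:// as expected.
secure = ws_url("https://10.0.0.5:8443/api")
plain = ws_url("http://localhost:5000")
```

Substring matching would misclassify a URL like `http://example.com/?ref=https-guide`; scheme parsing does not.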
The InMemoryVectorStore holds all research context and chat history with no memory limit or cleanup strategy
If this fails: During long research sessions or multiple concurrent users, memory usage grows unbounded until the server runs out of RAM and crashes
backend/chat/chat.py:InMemoryVectorStore
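One common mitigation for this kind of unbounded growth is a capped store that evicts the oldest entries. The sketch below is an illustrative pattern, not the project's actual store, and the `BoundedContextStore` name and its eviction policy are assumptions:

```python
from collections import deque

class BoundedContextStore:
    """Illustrative capped store: limit retained chunks so long sessions
    cannot grow memory without bound; oldest chunks are evicted first."""

    def __init__(self, max_items: int = 1000):
        self._items = deque(maxlen=max_items)  # deque drops from the left when full

    def add(self, chunk: str) -> None:
        self._items.append(chunk)

    def all(self) -> list[str]:
        return list(self._items)

store = BoundedContextStore(max_items=3)
for i in range(5):
    store.add(f"chunk-{i}")
# Only the 3 most recent chunks survive eviction.
```

A real vector store would also need to evict the corresponding embeddings, but the principle (a hard cap plus a deterministic eviction order) is the same.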
The sendMessage function expects either 'task' parameter OR both 'query' and 'moreContext' parameters, with no validation to ensure exactly one pattern is used
If this fails: When both 'task' and 'query' are provided, the function creates malformed requests by concatenating query with moreContext and ignoring task, leading to unexpected research behavior
docs/npm/index.js:sendMessage
The sanitize_filename function (referenced but not shown) is assumed to properly handle all Unicode characters, path traversal attacks, and filesystem-specific restrictions across different operating systems
If this fails: Malicious research tasks with crafted filenames could write reports outside the outputs/ directory or crash the system on Windows with reserved names like 'CON' or 'NUL'
backend/server/app.py:sanitize_filename
The WebSocket response callbacks map uses hardcoded key 'current' and assumes only one active request per GPTResearcher instance at a time
If this fails: Concurrent research requests on the same GPTResearcher instance overwrite each other's callbacks, causing responses to be delivered to wrong handlers or lost entirely
docs/npm/index.js:responseCallbacks
The Express server listens on hardcoded port 5000 without checking whether the port is already in use and without making it configurable through environment variables
If this fails: In containerized environments or when port 5000 is occupied, the server fails to start with EADDRINUSE error, causing the Discord bot to become unreachable
docs/discord-bot/server.js:keepAlive
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- InMemoryVectorStore — holds research context and chat history as embeddings for similarity search during follow-up questions
- ResearchState accumulator — accumulates research findings, sources, sections, and metadata throughout the research pipeline until final report generation
- outputs/ directory — stores generated PDF, DOCX, and JSON report files with sanitized filenames for user download
- WebSocketManager connection registry — maintains active WebSocket connections for real-time progress streaming with connection lifecycle management
Feedback Loops
- Iterative Research Refinement (recursive, reinforcing) — Trigger: Deep research mode enabled in multi_agents/agent.py. Action: Analyzes current research quality, identifies gaps, generates additional targeted queries, and retrieves more sources. Exit: Research depth threshold met or maximum iterations reached.
- Source Validation Loop (retry, balancing) — Trigger: Web scraping failures or invalid URLs in retrieval phase. Action: Retries failed requests with exponential backoff, filters broken sources, and attempts alternative retrievers. Exit: Successful content extraction or retry limit exceeded.
- Real-time Progress Updates (polling, reinforcing) — Trigger: Active WebSocket connections during research execution. Action: Continuously broadcasts pipeline progress, current subquery, source count, and completion status to frontend clients. Exit: Research completed or client disconnection.
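The Source Validation Loop's retry-with-exponential-backoff behavior can be sketched as follows. This is a generic illustration of the pattern, not the project's implementation; `fetch_with_backoff` and its parameters are hypothetical:

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=0.5):
    """Retry a failing fetch with exponentially growing delays plus jitter;
    re-raise the last error once the retry budget is exhausted."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            # wait base_delay * 2^attempt (plus jitter) before the next try
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

# A flaky fetcher that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"content from {url}"

result = fetch_with_backoff(flaky, "https://example.com", base_delay=0.01)
```

The balancing nature of the loop comes from the exit condition: either the content is extracted or the retry limit is exceeded and the source is dropped.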
Delays
- Parallel Web Scraping (async-processing, ~5-30 seconds) — Multiple web sources scraped concurrently with asyncio.gather while streaming intermediate results to maintain user engagement
- LLM Response Generation (async-processing, ~2-10 seconds per call) — Analysis and report synthesis steps wait for LLM completions with streaming responses when supported by the provider
- Document Format Conversion (batch-window, ~1-3 seconds) — PDF and DOCX generation happens after report completion using md2pdf and python-docx libraries
Control Points
- LLM Provider Selection (env-var) — Controls: Which LLM service handles query analysis and report generation (OpenAI, Anthropic, Google, etc.). Default: openai
- Research Source Configuration (runtime-toggle) — Controls: Which web sources are enabled for retrieval (web, local, hybrid) affecting scope and speed. Default: web
- Report Type Strategy (runtime-toggle) — Controls: Research methodology (research_report, detailed_report, deep research) determining depth and iteration count. Default: research_report
- Image Generation Toggle (feature-flag) — Controls: Whether to generate and include images in research reports using external image generation services. Default: false
- Logging Verbosity (env-var) — Controls: Detail level of system logs affecting debugging visibility and performance monitoring. Default: INFO
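The env-var control points above might be read roughly like this. Only OPENAI_API_KEY, TAVILY_API_KEY, and LOGGING_LEVEL are named earlier in this document; the other variable names here (LLM_PROVIDER, REPORT_SOURCE, REPORT_TYPE) are assumptions for illustration:

```python
import os

def load_research_config() -> dict:
    """Sketch: each control point falls back to the default listed above
    when its environment variable is unset."""
    return {
        "llm_provider": os.environ.get("LLM_PROVIDER", "openai"),
        "report_source": os.environ.get("REPORT_SOURCE", "web"),
        "report_type": os.environ.get("REPORT_TYPE", "research_report"),
        "logging_level": os.environ.get("LOGGING_LEVEL", "INFO"),
    }

# Clear the variables so the documented defaults apply.
for key in ("LLM_PROVIDER", "REPORT_SOURCE", "REPORT_TYPE", "LOGGING_LEVEL"):
    os.environ.pop(key, None)

cfg = load_research_config()
```

Centralizing the defaults in one loader keeps the documented values and the code in one place, rather than scattering `os.environ.get` calls across modules.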
Technology Stack
- FastAPI — provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation
- Next.js — renders the interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access
- LangGraph — orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns
- LangChain — provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support
- BeautifulSoup4 — extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues
- Tavily — primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring
- Pydantic — validates API request/response schemas and configuration models with automatic type conversion and error reporting
- WebSockets — enables real-time bidirectional communication between frontend and backend for streaming research progress and chat interactions
- Docker — containerizes the application with multi-service orchestration for consistent deployment across development and production environments
Key Components
- GPTResearcher (orchestrator) — Main research coordinator that breaks queries into subqueries, manages parallel retrieval across multiple sources, and synthesizes findings into coherent reports using configurable LLM providers (gpt_researcher/__init__.py)
- WebSocketManager (gateway) — Manages real-time WebSocket connections for streaming research progress, handling client connections, message broadcasting, and graceful disconnection (backend/server/websocket_manager.py)
- ChatAgentWithMemory (processor) — Processes follow-up questions about research reports using RAG with an in-memory vector store, supporting conversation continuity and contextual search (backend/chat/chat.py)
- RetrieverFactory (factory) — Creates appropriate web scrapers (Tavily, Google, DuckDuckGo, ArXiv, etc.) based on research source configuration and manages parallel retrieval execution (gpt_researcher/retrievers/)
- LLMProvider (adapter) — Unified interface for multiple LLM providers (OpenAI, Anthropic, Google, etc.) with consistent chat completion and function calling capabilities (gpt_researcher/llm_provider/)
- DocumentProcessor (transformer) — Converts scraped web content into structured text, handles PDF extraction, and processes various document formats for downstream analysis (gpt_researcher/document/)
- ReportGenerator (processor) — Synthesizes research findings into formatted reports using configurable templates, tone settings, and output formats (markdown, PDF, DOCX) (gpt_researcher/skills/)
- MemoryManager (store) — Manages conversation history and research context using vector embeddings for similarity search and contextual retrieval during follow-up questions (gpt_researcher/memory/)
- MultiAgentOrchestrator (orchestrator) — Coordinates specialized research agents using LangGraph workflows, implementing complex research patterns like iterative refinement and multi-perspective analysis (multi_agents/agent.py)
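The RetrieverFactory pattern described above can be sketched as a simple name-to-class registry. The class names and `get_retriever` function here are hypothetical stand-ins, not the repository's actual API:

```python
class TavilyRetriever:
    """Placeholder for a real Tavily-backed retriever."""

class DuckDuckGoRetriever:
    """Placeholder for a real DuckDuckGo-backed retriever."""

# Registry mapping configured source names to retriever classes.
RETRIEVERS = {
    "tavily": TavilyRetriever,
    "duckduckgo": DuckDuckGoRetriever,
}

def get_retriever(name: str):
    """Look up a retriever class by its (case-insensitive) configured name."""
    try:
        return RETRIEVERS[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown retriever: {name}") from None
```

The factory keeps source selection a pure configuration concern: adding a new scraper means registering one more entry rather than editing the orchestration code.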
Frequently Asked Questions
What is gpt-researcher used for?
assafelovic/gpt-researcher orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports. It is a 9-component ML inference system written in Python. Data flows through 8 distinct pipeline stages, and the codebase contains 284 files.
How is gpt-researcher architected?
gpt-researcher is organized into 5 architecture layers: Frontend Interface, API Orchestration, Research Engine, Multi-Agent System, and 1 more. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through gpt-researcher?
Data moves through 8 stages: Accept research query → Initialize GPTResearcher → Generate subqueries → Parallel web retrieval → Process and analyze sources → .... Research requests enter through the web interface or API, get converted into structured queries with subquestions, then trigger parallel web scraping across multiple sources. Retrieved content is processed and analyzed by LLMs, with findings accumulated into a research state that gets synthesized into the final report while streaming progress updates to connected clients. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-researcher use?
The core stack includes FastAPI (Provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation), Next.js (Renders interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access), LangGraph (Orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns), LangChain (Provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support), BeautifulSoup4 (Extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues), Tavily (Primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does gpt-researcher have?
gpt-researcher exhibits 4 data pools (including a vector memory store and a research state accumulator), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle recursive refinement and retry behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-researcher use?
5 design patterns detected: Agent Orchestration, Retrieval-Augmented Generation, Progressive Enhancement, Stream Processing, Plugin Architecture.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.