assafelovic/gpt-researcher
An autonomous agent that conducts deep research on any data using any LLM provider
Orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports
Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.
A 9-component ML inference system. 284 files analyzed. Data flows through 8 distinct pipeline stages.
How Data Flows Through the System
Research requests enter through the web interface or API, get converted into structured queries with subquestions, then trigger parallel web scraping across multiple sources. Retrieved content is processed and analyzed by LLMs, with findings accumulated into a research state that gets synthesized into the final report while streaming progress updates to connected clients.
- Accept research query — FastAPI server receives POST request with research task, validates ResearchRequest schema including task description, report type, source preferences, and tone settings
- Initialize GPTResearcher — Creates GPTResearcher instance with validated parameters, loads configuration from environment variables for LLM provider, API keys, and research settings [ResearchRequest → ResearchState] (config: OPENAI_API_KEY, TAVILY_API_KEY, LOGGING_LEVEL)
- Generate subqueries — LLM analyzes main research task and breaks it into 3-5 specific subqueries using create_chat_completion, each targeting different aspects of the research topic [ResearchState → List of subqueries]
- Parallel web retrieval — RetrieverFactory spawns multiple scrapers (Tavily, Google, DuckDuckGo) in parallel using asyncio.gather, each executing subqueries and collecting web sources with content extraction [List of subqueries → SourceData] (config: GOOGLE_API_KEY, IMAGE_GENERATION_ENABLED)
- Process and analyze sources — DocumentProcessor cleans scraped content, extracts text from PDFs/HTML, while LLM analyzes each source for relevance and extracts key insights using configurable analysis prompts [SourceData → Analyzed content]
- Synthesize research findings — ReportGenerator uses LLM to synthesize all analyzed content into coherent report sections following configured template structure and tone settings from ResearchRequest [Analyzed content → ResearchState]
- Generate final report — LLM creates formatted markdown report with introduction, main sections, conclusion, and citations, then converts to PDF/DOCX using write_md_to_pdf and write_md_to_word utilities [ResearchState → Final report]
- Stream research progress — WebSocketManager broadcasts real-time updates throughout the pipeline using JSON messages with type, content, and metadata fields to connected frontend clients [ChatMessage → Stream updates]
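The "parallel web retrieval" step above can be sketched in a few lines. This is an illustrative simplification, not the project's actual code: the retriever names and the `run_retriever`/`gather_sources` helpers are made up, but the concurrency pattern (asyncio.gather over every retriever x subquery pair, with failures isolated) matches the behavior described.

```python
import asyncio

async def run_retriever(name: str, subquery: str) -> dict:
    # Stand-in for a real scraper call (Tavily, Google, DuckDuckGo, ...)
    await asyncio.sleep(0)  # simulate network I/O
    return {"retriever": name, "query": subquery, "results": []}

async def gather_sources(subqueries: list[str], retrievers: list[str]) -> list[dict]:
    tasks = [run_retriever(r, q) for q in subqueries for r in retrievers]
    # return_exceptions=True keeps one failed scraper from cancelling the rest
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

sources = asyncio.run(gather_sources(
    ["history of X", "recent work on X"],
    ["tavily", "duckduckgo"],
))
```

With 2 subqueries and 2 retrievers, the batch yields 4 source payloads, and any scraper that raises is simply filtered out of the results.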
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- backend/server/app.py — Pydantic model with task: str, report_type: str, report_source: str, tone: str, headers: dict, repo_name: str, branch_name: str, generate_in_background: bool. Created from incoming HTTP requests, validated by Pydantic, then passed to the research orchestrator for task execution.
- frontend/nextjs/types/data.ts — TypeScript interface with role: 'user'|'assistant'|'system', content: string, timestamp?: number, metadata?: any. Generated during research streaming to show progress updates, stored in frontend state, and displayed in the chat interface.
- backend/memory/research.py — TypedDict with task: dict, initial_research: str, sections: List[str], research_data: List[dict], title: str, headers: dict, date: str, table_of_contents: str, introduction: str, conclusion: str, sources: List[str], report: str. Accumulates research findings across the pipeline, starting with the task definition and building up sections until final report synthesis.
- frontend/nextjs/components/ResearchBlocks/Sources.tsx — Object with name: string, url: string representing scraped web sources with extracted content. Created during the web retrieval phase, validated for accessibility, then displayed in the frontend with domain extraction and link formatting.
- frontend/nextjs/types/data.ts — Interface with report_type: string, report_source: string, tone: string, domains: string[], defaultReportType: string, layoutType: string, mcp_enabled: boolean, mcp_configs: MCPConfig[], mcp_strategy?: string. Configured by the user in frontend settings, passed to the backend to control research behavior like source selection and LLM tone.
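The research-state contract above can be sketched directly with the standard library. The field names below come from the backend/memory/research.py entry; the sample values are invented for illustration, and `total=False` is an assumption reflecting how the state accumulates fields as the pipeline progresses rather than being fully populated up front.

```python
from typing import TypedDict, List

class ResearchState(TypedDict, total=False):
    task: dict
    initial_research: str
    sections: List[str]
    research_data: List[dict]
    title: str
    headers: dict
    date: str
    table_of_contents: str
    introduction: str
    conclusion: str
    sources: List[str]
    report: str

# The state starts nearly empty and accumulates keys stage by stage.
state: ResearchState = {"task": {"query": "example topic"}, "sections": []}
state["sections"].append("Background")
```

Because each pipeline stage only adds keys, downstream stages must treat absent keys (like `report` before synthesis) as "not yet produced" rather than as errors.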
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
The 'headers' field in ResearchRequest model is optional (dict | None), but downstream code expects it to contain authentication tokens or API keys for external services without validation
If this fails: If headers dict is missing required API keys for configured retrievers (like 'Authorization' for custom APIs), the research pipeline fails silently or produces empty results without clear error messages
backend/server/app.py:ResearchRequest
The progress callback function expects progress object to have specific attributes (current_depth, total_depth, current_breadth, total_breadth, completed_queries, total_queries, current_query) in a particular sequence
If this fails: If GPTResearcher's progress tracking changes its data structure or adds new required fields, the callback crashes with AttributeError during deep research execution
backend/report_type/deep_research/main.py:on_progress
Discord messages are chunked into 1500-character pieces and sent sequentially without regard for Discord's rate limits (50 requests per second per guild)
If this fails: Large research reports cause the bot to exceed Discord's rate limits, resulting in HTTP 429 errors and incomplete message delivery to users
docs/discord-bot/index.js:splitMessage
The cooldown mechanism uses Date.now(), assumes the system clock is monotonic, and doesn't account for server restarts clearing the in-memory cooldowns object
If this fails: After bot restarts, all cooldown timers reset, allowing spam in help forums immediately instead of respecting the 30-minute intervals
docs/discord-bot/index.js:cooldowns
WebSocket protocol determination relies on simple string matching of 'https' in host URL, but doesn't handle edge cases like custom ports or IP addresses with SSL
If this fails: Connecting to HTTPS endpoints on non-standard ports or SSL-enabled IP addresses fails with incorrect protocol selection (ws:// vs wss://)
docs/npm/index.js:initializeWebSocket
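Expressed in Python terms (the original code is JavaScript), a more robust approach parses the URL scheme instead of substring-matching 'https', which handles custom ports and IP-address hosts correctly. The `ws_url` helper below is a hypothetical sketch, not code from the repository:

```python
from urllib.parse import urlparse

def ws_url(http_url: str) -> str:
    """Map an HTTP(S) endpoint to its WebSocket equivalent by parsing the
    URL rather than checking whether 'https' appears anywhere in the string."""
    parsed = urlparse(http_url)
    scheme = "wss" if parsed.scheme == "https" else "ws"
    # netloc preserves host:port exactly, including IP addresses
    return f"{scheme}://{parsed.netloc}{parsed.path or ''}"

# An SSL endpoint on a non-standard port maps to wss:// as expected.
secure = ws_url("https://10.0.0.5:8443/api")
plain = ws_url("http://localhost:5000")
```

Substring matching would misclassify a URL like `http://example.com/?ref=https-guide`; scheme parsing does not.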
The InMemoryVectorStore holds all research context and chat history with no memory limit or cleanup strategy
If this fails: During long research sessions or multiple concurrent users, memory usage grows unbounded until the server runs out of RAM and crashes
backend/chat/chat.py:InMemoryVectorStore
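One common mitigation for this kind of unbounded growth is a capped store that evicts the oldest entries. The sketch below is an illustrative pattern, not the project's actual store, and the `BoundedContextStore` name and its eviction policy are assumptions:

```python
from collections import deque

class BoundedContextStore:
    """Illustrative capped store: limit retained chunks so long sessions
    cannot grow memory without bound; oldest chunks are evicted first."""

    def __init__(self, max_items: int = 1000):
        self._items = deque(maxlen=max_items)  # deque drops from the left when full

    def add(self, chunk: str) -> None:
        self._items.append(chunk)

    def all(self) -> list[str]:
        return list(self._items)

store = BoundedContextStore(max_items=3)
for i in range(5):
    store.add(f"chunk-{i}")
# Only the 3 most recent chunks survive eviction.
```

A real vector store would also need to evict the corresponding embeddings, but the principle (a hard cap plus a deterministic eviction order) is the same.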
The sendMessage function expects either 'task' parameter OR both 'query' and 'moreContext' parameters, with no validation to ensure exactly one pattern is used
If this fails: When both 'task' and 'query' are provided, the function creates malformed requests by concatenating query with moreContext and ignoring task, leading to unexpected research behavior
docs/npm/index.js:sendMessage
The sanitize_filename function (referenced but not shown) is assumed to properly handle all Unicode characters, path traversal attacks, and filesystem-specific restrictions across different operating systems
If this fails: Malicious research tasks with crafted filenames could write reports outside the outputs/ directory or crash the system on Windows with reserved names like 'CON' or 'NUL'
backend/server/app.py:sanitize_filename
The WebSocket response callbacks map uses hardcoded key 'current' and assumes only one active request per GPTResearcher instance at a time
If this fails: Concurrent research requests on the same GPTResearcher instance overwrite each other's callbacks, causing responses to be delivered to wrong handlers or lost entirely
docs/npm/index.js:responseCallbacks
The Express server listens on hardcoded port 5000 without checking whether the port is already in use and without making it configurable through environment variables
If this fails: In containerized environments or when port 5000 is occupied, the server fails to start with EADDRINUSE error, causing the Discord bot to become unreachable
docs/discord-bot/server.js:keepAlive
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- InMemoryVectorStore — holds research context and chat history as embeddings for similarity search during follow-up questions
- ResearchState accumulator — accumulates research findings, sources, sections, and metadata throughout the research pipeline until final report generation
- outputs/ directory — stores generated PDF, DOCX, and JSON report files with sanitized filenames for user download
- WebSocketManager connection registry — maintains active WebSocket connections for real-time progress streaming with connection lifecycle management
Feedback Loops
- Iterative Research Refinement (recursive, reinforcing) — Trigger: Deep research mode enabled in multi_agents/agent.py. Action: Analyzes current research quality, identifies gaps, generates additional targeted queries, and retrieves more sources. Exit: Research depth threshold met or maximum iterations reached.
- Source Validation Loop (retry, balancing) — Trigger: Web scraping failures or invalid URLs in retrieval phase. Action: Retries failed requests with exponential backoff, filters broken sources, and attempts alternative retrievers. Exit: Successful content extraction or retry limit exceeded.
- Real-time Progress Updates (polling, reinforcing) — Trigger: Active WebSocket connections during research execution. Action: Continuously broadcasts pipeline progress, current subquery, source count, and completion status to frontend clients. Exit: Research completed or client disconnection.
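The Source Validation Loop's retry-with-exponential-backoff behavior can be sketched as follows. This is a generic illustration of the pattern, not the project's implementation; `fetch_with_backoff` and its parameters are hypothetical:

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=0.5):
    """Retry a failing fetch with exponentially growing delays plus jitter;
    re-raise the last error once the retry budget is exhausted."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            # wait base_delay * 2^attempt (plus jitter) before the next try
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

# A flaky fetcher that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"content from {url}"

result = fetch_with_backoff(flaky, "https://example.com", base_delay=0.01)
```

The balancing nature of the loop comes from the exit condition: either the content is extracted or the retry limit is exceeded and the source is dropped.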
Delays
- Parallel Web Scraping (async-processing, ~5-30 seconds) — Multiple web sources scraped concurrently with asyncio.gather while streaming intermediate results to maintain user engagement
- LLM Response Generation (async-processing, ~2-10 seconds per call) — Analysis and report synthesis steps wait for LLM completions with streaming responses when supported by the provider
- Document Format Conversion (batch-window, ~1-3 seconds) — PDF and DOCX generation happens after report completion using md2pdf and python-docx libraries
Control Points
- LLM Provider Selection (env-var) — Controls: Which LLM service handles query analysis and report generation (OpenAI, Anthropic, Google, etc.). Default: openai
- Research Source Configuration (runtime-toggle) — Controls: Which web sources are enabled for retrieval (web, local, hybrid) affecting scope and speed. Default: web
- Report Type Strategy (runtime-toggle) — Controls: Research methodology (research_report, detailed_report, deep research) determining depth and iteration count. Default: research_report
- Image Generation Toggle (feature-flag) — Controls: Whether to generate and include images in research reports using external image generation services. Default: false
- Logging Verbosity (env-var) — Controls: Detail level of system logs affecting debugging visibility and performance monitoring. Default: INFO
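The env-var control points above might be read roughly like this. Only OPENAI_API_KEY, TAVILY_API_KEY, and LOGGING_LEVEL are named earlier in this document; the other variable names here (LLM_PROVIDER, REPORT_SOURCE, REPORT_TYPE) are assumptions for illustration:

```python
import os

def load_research_config() -> dict:
    """Sketch: each control point falls back to the default listed above
    when its environment variable is unset."""
    return {
        "llm_provider": os.environ.get("LLM_PROVIDER", "openai"),
        "report_source": os.environ.get("REPORT_SOURCE", "web"),
        "report_type": os.environ.get("REPORT_TYPE", "research_report"),
        "logging_level": os.environ.get("LOGGING_LEVEL", "INFO"),
    }

# Clear the variables so the documented defaults apply.
for key in ("LLM_PROVIDER", "REPORT_SOURCE", "REPORT_TYPE", "LOGGING_LEVEL"):
    os.environ.pop(key, None)

cfg = load_research_config()
```

Centralizing the defaults in one loader keeps the documented values and the code in one place, rather than scattering `os.environ.get` calls across modules.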
Technology Stack
- FastAPI — provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation
- Next.js — renders the interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access
- LangGraph — orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns
- LangChain — provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support
- BeautifulSoup4 — extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues
- Tavily — primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring
- Pydantic — validates API request/response schemas and configuration models with automatic type conversion and error reporting
- WebSockets — enables real-time bidirectional communication between frontend and backend for streaming research progress and chat interactions
- Docker — containerizes the application with multi-service orchestration for consistent deployment across development and production environments
Key Components
- GPTResearcher (orchestrator) — Main research coordinator that breaks queries into subqueries, manages parallel retrieval across multiple sources, and synthesizes findings into coherent reports using configurable LLM providers (gpt_researcher/__init__.py)
- WebSocketManager (gateway) — Manages real-time WebSocket connections for streaming research progress, handling client connections, message broadcasting, and graceful disconnection (backend/server/websocket_manager.py)
- ChatAgentWithMemory (processor) — Processes follow-up questions about research reports using RAG with an in-memory vector store, supporting conversation continuity and contextual search (backend/chat/chat.py)
- RetrieverFactory (factory) — Creates appropriate web scrapers (Tavily, Google, DuckDuckGo, ArXiv, etc.) based on research source configuration and manages parallel retrieval execution (gpt_researcher/retrievers/)
- LLMProvider (adapter) — Unified interface for multiple LLM providers (OpenAI, Anthropic, Google, etc.) with consistent chat completion and function calling capabilities (gpt_researcher/llm_provider/)
- DocumentProcessor (transformer) — Converts scraped web content into structured text, handles PDF extraction, and processes various document formats for downstream analysis (gpt_researcher/document/)
- ReportGenerator (processor) — Synthesizes research findings into formatted reports using configurable templates, tone settings, and output formats (markdown, PDF, DOCX) (gpt_researcher/skills/)
- MemoryManager (store) — Manages conversation history and research context using vector embeddings for similarity search and contextual retrieval during follow-up questions (gpt_researcher/memory/)
- MultiAgentOrchestrator (orchestrator) — Coordinates specialized research agents using LangGraph workflows, implementing complex research patterns like iterative refinement and multi-perspective analysis (multi_agents/agent.py)
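The RetrieverFactory pattern described above can be sketched as a simple name-to-class registry. The class names and `get_retriever` function here are hypothetical stand-ins, not the repository's actual API:

```python
class TavilyRetriever:
    """Placeholder for a real Tavily-backed retriever."""

class DuckDuckGoRetriever:
    """Placeholder for a real DuckDuckGo-backed retriever."""

# Registry mapping configured source names to retriever classes.
RETRIEVERS = {
    "tavily": TavilyRetriever,
    "duckduckgo": DuckDuckGoRetriever,
}

def get_retriever(name: str):
    """Look up a retriever class by its (case-insensitive) configured name."""
    try:
        return RETRIEVERS[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown retriever: {name}") from None
```

The factory keeps source selection a pure configuration concern: adding a new scraper means registering one more entry rather than editing the orchestration code.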
Frequently Asked Questions
What is gpt-researcher used for?
assafelovic/gpt-researcher orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports. It is a 9-component ML inference system written in Python. Data flows through 8 distinct pipeline stages, and the codebase contains 284 files.
How is gpt-researcher architected?
gpt-researcher is organized into 5 architecture layers: Frontend Interface, API Orchestration, Research Engine, Multi-Agent System, and 1 more. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through gpt-researcher?
Data moves through 8 stages: Accept research query → Initialize GPTResearcher → Generate subqueries → Parallel web retrieval → Process and analyze sources → .... Research requests enter through the web interface or API, get converted into structured queries with subquestions, then trigger parallel web scraping across multiple sources. Retrieved content is processed and analyzed by LLMs, with findings accumulated into a research state that gets synthesized into the final report while streaming progress updates to connected clients. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-researcher use?
The core stack includes FastAPI (Provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation), Next.js (Renders interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access), LangGraph (Orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns), LangChain (Provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support), BeautifulSoup4 (Extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues), Tavily (Primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does gpt-researcher have?
gpt-researcher exhibits 4 data pools (including a vector memory store and a research state accumulator), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle recursive refinement and retry behavior. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-researcher use?
5 design patterns detected: Agent Orchestration, Retrieval-Augmented Generation, Progressive Enhancement, Stream Processing, Plugin Architecture.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.