assafelovic/gpt-researcher

An autonomous agent that conducts deep research on any data using any LLM provider

26,567 stars Python 9 components

Orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports

Research requests enter through the web interface or API, get converted into structured queries with subquestions, then trigger parallel web scraping across multiple sources. Retrieved content is processed and analyzed by LLMs, with findings accumulated into a research state that gets synthesized into the final report while streaming progress updates to connected clients.

Under the hood, the system uses 3 feedback loops, 4 data pools, and 5 control points to manage its runtime behavior.

A 9-component ML inference system. 284 files analyzed. Data flows through 8 distinct pipeline stages.

How Data Flows Through the System


  1. Accept research query — FastAPI server receives POST request with research task, validates ResearchRequest schema including task description, report type, source preferences, and tone settings
  2. Initialize GPTResearcher — Creates GPTResearcher instance with validated parameters, loads configuration from environment variables for LLM provider, API keys, and research settings [ResearchRequest → ResearchState] (config: OPENAI_API_KEY, TAVILY_API_KEY, LOGGING_LEVEL)
  3. Generate subqueries — LLM analyzes main research task and breaks it into 3-5 specific subqueries using create_chat_completion, each targeting different aspects of the research topic [ResearchState → List of subqueries]
  4. Parallel web retrieval — RetrieverFactory spawns multiple scrapers (Tavily, Google, DuckDuckGo) in parallel using asyncio.gather, each executing subqueries and collecting web sources with content extraction [List of subqueries → SourceData] (config: GOOGLE_API_KEY, IMAGE_GENERATION_ENABLED)
  5. Process and analyze sources — DocumentProcessor cleans scraped content, extracts text from PDFs/HTML, while LLM analyzes each source for relevance and extracts key insights using configurable analysis prompts [SourceData → Analyzed content]
  6. Synthesize research findings — ReportGenerator uses LLM to synthesize all analyzed content into coherent report sections following configured template structure and tone settings from ResearchRequest [Analyzed content → ResearchState]
  7. Generate final report — LLM creates formatted markdown report with introduction, main sections, conclusion, and citations, then converts to PDF/DOCX using write_md_to_pdf and write_md_to_word utilities [ResearchState → Final report]
  8. Stream research progress — WebSocketManager broadcasts real-time updates throughout the pipeline using JSON messages with type, content, and metadata fields to connected frontend clients [ChatMessage → Stream updates]
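The fan-out in steps 3–4 can be sketched with asyncio.gather. The helper names and return shape below are illustrative, not the project's actual API:

```python
import asyncio

async def run_retriever(retriever: str, subquery: str) -> dict:
    # Stand-in for a real scraper call (Tavily, Google, DuckDuckGo, ...).
    await asyncio.sleep(0)  # simulate network I/O
    return {"retriever": retriever, "query": subquery, "content": "..."}

async def parallel_retrieval(subqueries: list[str], retrievers: list[str]) -> list[dict]:
    # Step 4: every (retriever, subquery) pair runs concurrently.
    tasks = [run_retriever(r, q) for q in subqueries for r in retrievers]
    return await asyncio.gather(*tasks)

results = asyncio.run(
    parallel_retrieval(["history of topic", "criticism of topic"], ["tavily", "duckduckgo"])
)
print(len(results))  # 4: one result per (retriever, subquery) pair
```

Because gather preserves argument order, downstream stages can attribute each result to its subquery without extra bookkeeping.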

Data Models

The data structures that flow between stages — the contracts that hold the system together.

ResearchRequest backend/server/app.py
Pydantic model with task: str, report_type: str, report_source: str, tone: str, headers: dict, repo_name: str, branch_name: str, generate_in_background: bool
Created from incoming HTTP requests, validated by Pydantic, then passed to research orchestrator for task execution
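A minimal Pydantic sketch of this schema. The field names come from the description above; the default values are assumptions for illustration:

```python
from typing import Optional

from pydantic import BaseModel

class ResearchRequest(BaseModel):
    # Field names mirror the description above; defaults are illustrative guesses.
    task: str
    report_type: str = "research_report"
    report_source: str = "web"
    tone: str = "objective"
    headers: Optional[dict] = None
    repo_name: str = ""
    branch_name: str = ""
    generate_in_background: bool = True

req = ResearchRequest(task="Summarize recent work on LLM agents")
print(req.report_type)  # research_report
```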
ChatMessage frontend/nextjs/types/data.ts
TypeScript interface with role: 'user'|'assistant'|'system', content: string, timestamp?: number, metadata?: any
Generated during research streaming to show progress updates, stored in frontend state, and displayed in chat interface
ResearchState backend/memory/research.py
TypedDict with task: dict, initial_research: str, sections: List[str], research_data: List[dict], title: str, headers: dict, date: str, table_of_contents: str, introduction: str, conclusion: str, sources: List[str], report: str
Accumulates research findings across the pipeline, starting with task definition and building up sections until final report synthesis
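As a TypedDict, the state can be sketched like this (field list taken from the description above; TypedDict implies no runtime validation):

```python
from typing import List, TypedDict

class ResearchState(TypedDict):
    task: dict
    initial_research: str
    sections: List[str]
    research_data: List[dict]
    title: str
    headers: dict
    date: str
    table_of_contents: str
    introduction: str
    conclusion: str
    sources: List[str]
    report: str

# The pipeline starts from the task definition and fills fields in over time.
state: ResearchState = {
    "task": {"query": "history of topic"}, "initial_research": "", "sections": [],
    "research_data": [], "title": "", "headers": {}, "date": "2026-04-20",
    "table_of_contents": "", "introduction": "", "conclusion": "",
    "sources": [], "report": "",
}
state["sections"].append("Background")
print(state["sections"])  # ['Background']
```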
SourceData frontend/nextjs/components/ResearchBlocks/Sources.tsx
Object with name: string, url: string representing scraped web sources with extracted content
Created during web retrieval phase, validated for accessibility, then displayed in frontend with domain extraction and link formatting
ChatBoxSettings frontend/nextjs/types/data.ts
Interface with report_type: string, report_source: string, tone: string, domains: string[], defaultReportType: string, layoutType: string, mcp_enabled: boolean, mcp_configs: MCPConfig[], mcp_strategy?: string
Configured by user in frontend settings, passed to backend to control research behavior like source selection and LLM tone

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Contract unguarded

The 'headers' field in ResearchRequest model is optional (dict | None), but downstream code expects it to contain authentication tokens or API keys for external services without validation

If this fails: If headers dict is missing required API keys for configured retrievers (like 'Authorization' for custom APIs), the research pipeline fails silently or produces empty results without clear error messages

backend/server/app.py:ResearchRequest
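One way to close this gap is a fail-fast check before the pipeline starts. The helper below is a hypothetical sketch, not code from the repository:

```python
def require_retriever_keys(headers, required):
    """Raise early if the headers dict lacks keys the configured retriever needs."""
    headers = headers or {}
    missing = set(required) - set(headers)
    if missing:
        raise ValueError(f"headers missing required keys: {sorted(missing)}")
    return headers

# A missing Authorization key now fails loudly instead of yielding empty results.
try:
    require_retriever_keys(None, {"Authorization"})
except ValueError as exc:
    print(exc)  # headers missing required keys: ['Authorization']
```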
critical Ordering unguarded

The progress callback function expects progress object to have specific attributes (current_depth, total_depth, current_breadth, total_breadth, completed_queries, total_queries, current_query) in a particular sequence

If this fails: If GPTResearcher's progress tracking changes its data structure or adds new required fields, the callback crashes with AttributeError during deep research execution

backend/report_type/deep_research/main.py:on_progress
critical Resource unguarded

Discord messages can be chunked into 1500-character pieces and sent sequentially without considering Discord's rate limits (50 requests per second per guild)

If this fails: Large research reports cause the bot to exceed Discord's rate limits, resulting in HTTP 429 errors and incomplete message delivery to users

docs/discord-bot/index.js:splitMessage
warning Temporal weakly guarded

The cooldown mechanism uses Date.now() and assumes system clock is monotonic and doesn't account for server restarts clearing the in-memory cooldowns object

If this fails: After bot restarts, all cooldown timers reset, allowing spam in help forums immediately instead of respecting the 30-minute intervals

docs/discord-bot/index.js:cooldowns
warning Environment weakly guarded

WebSocket protocol determination relies on simple string matching of 'https' in host URL, but doesn't handle edge cases like custom ports or IP addresses with SSL

If this fails: Connecting to HTTPS endpoints on non-standard ports or SSL-enabled IP addresses fails with incorrect protocol selection (ws:// vs wss://)

docs/npm/index.js:initializeWebSocket
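The actual client is JavaScript, but the safer pattern translates to any language: parse the URL and branch on its scheme rather than substring-matching 'https'. A Python sketch (the function name and `/ws` path are illustrative):

```python
from urllib.parse import urlsplit

def ws_url(host: str, path: str = "/ws") -> str:
    # Decide ws:// vs wss:// from the parsed scheme, not substring matching,
    # so custom ports and bare IP hosts are handled uniformly.
    parts = urlsplit(host if "://" in host else f"//{host}", scheme="http")
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}{path}"  # netloc keeps host:port intact

print(ws_url("https://10.0.0.5:8443"))  # wss://10.0.0.5:8443/ws
print(ws_url("example.com:5000"))       # ws://example.com:5000/ws
```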
critical Scale unguarded

The InMemoryVectorStore can hold all research context and chat history without memory limits or cleanup strategies

If this fails: During long research sessions or multiple concurrent users, memory usage grows unbounded until the server runs out of RAM and crashes

backend/chat/chat.py:InMemoryVectorStore
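A bounded store with oldest-first eviction is one mitigation. The class below is an illustrative sketch, not the project's InMemoryVectorStore API:

```python
from collections import OrderedDict

class BoundedContextStore:
    """Illustrative cap on stored context; evicts the oldest entry past max_entries."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def add(self, key: str, text: str) -> None:
        self._items[key] = text
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # drop the oldest entry

    def keys(self) -> list:
        return list(self._items)

store = BoundedContextStore(max_entries=2)
for i in range(3):
    store.add(f"doc{i}", "chunk")
print(store.keys())  # ['doc1', 'doc2']
```

An LRU or size-in-bytes policy would work equally well; the point is that memory growth has an upper bound.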
warning Contract unguarded

The sendMessage function expects either 'task' parameter OR both 'query' and 'moreContext' parameters, with no validation to ensure exactly one pattern is used

If this fails: When both 'task' and 'query' are provided, the function creates malformed requests by concatenating query with moreContext and ignoring task, leading to unexpected research behavior

docs/npm/index.js:sendMessage
critical Domain weakly guarded

The sanitize_filename function (referenced but not shown) is assumed to handle all Unicode characters, path traversal attempts, and filesystem-specific restrictions across different operating systems

If this fails: Malicious research tasks with crafted filenames could write reports outside the outputs/ directory or crash the system on Windows with reserved names like 'CON' or 'NUL'

backend/server/app.py:sanitize_filename
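A defensive version would whitelist characters, neutralize path separators, and guard Windows reserved names. This is a hypothetical sketch, not the repository's implementation:

```python
import re

_WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                     *(f"COM{i}" for i in range(1, 10)),
                     *(f"LPT{i}" for i in range(1, 10))}

def sanitize_filename(name: str, max_len: int = 100) -> str:
    # Keep a conservative character set; '/' and '\' become '_'.
    name = re.sub(r"[^A-Za-z0-9._ -]", "_", name)
    name = name.strip(" .") or "report"  # no leading/trailing dots or spaces
    if name.split(".")[0].upper() in _WINDOWS_RESERVED:
        name = f"_{name}"  # 'NUL.pdf' would still be reserved on Windows
    return name[:max_len]

print(sanitize_filename("../report.pdf"))  # _report.pdf
print(sanitize_filename("NUL"))            # _NUL
```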
warning Temporal unguarded

The WebSocket response callbacks map uses hardcoded key 'current' and assumes only one active request per GPTResearcher instance at a time

If this fails: Concurrent research requests on the same GPTResearcher instance overwrite each other's callbacks, causing responses to be delivered to wrong handlers or lost entirely

docs/npm/index.js:responseCallbacks
warning Environment unguarded

The Express server listens on hardcoded port 5000 without checking if the port is already in use or configurable through environment variables

If this fails: In containerized environments or when port 5000 is occupied, the server fails to start with EADDRINUSE error, causing the Discord bot to become unreachable

docs/discord-bot/server.js:keepAlive

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Vector Memory Store (in-memory)
InMemoryVectorStore holds research context and chat history as embeddings for similarity search during follow-up questions
Research State Accumulator (state-store)
Accumulates research findings, sources, sections, and metadata throughout the research pipeline until final report generation
Output Files Directory (file-store)
Stores generated PDF, DOCX, and JSON report files with sanitized filenames for user download
WebSocket Connection Pool (registry)
Maintains active WebSocket connections for real-time progress streaming with connection lifecycle management

Feedback Loops

Three loops detected, handling recursion and retries.

Delays

Three delay points detected in the pipeline.

Control Points

Five control points detected that govern runtime behavior.

Technology Stack

FastAPI (framework)
Provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation
Next.js (framework)
Renders interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access
LangGraph (framework)
Orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns
LangChain (library)
Provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support
BeautifulSoup4 (library)
Extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues
Tavily (library)
Primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring
Pydantic (library)
Validates API request/response schemas and configuration models with automatic type conversion and error reporting
WebSockets (library)
Enables real-time bidirectional communication between frontend and backend for streaming research progress and chat interactions
Docker (infra)
Containerizes the application with multi-service orchestration for consistent deployment across development and production environments


Frequently Asked Questions

What is gpt-researcher used for?

assafelovic/gpt-researcher orchestrates multi-agent LLM research by querying web sources and generating comprehensive reports. It is a 9-component ML inference system written in Python; data flows through 8 distinct pipeline stages across a codebase of 284 files.

How is gpt-researcher architected?

gpt-researcher is organized into 5 architecture layers: Frontend Interface, API Orchestration, Research Engine, Multi-Agent System, and 1 more. Data flows through 8 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through gpt-researcher?

Data moves through 8 stages: Accept research query → Initialize GPTResearcher → Generate subqueries → Parallel web retrieval → Process and analyze sources → .... Requests are converted into structured subqueries, scraped sources are analyzed by LLMs, and findings accumulate into a research state that is synthesized into the final report while progress streams to connected clients. This pipeline design reflects a complex multi-stage processing system.

What technologies does gpt-researcher use?

The core stack includes FastAPI (Provides HTTP/WebSocket API endpoints for research requests with automatic OpenAPI documentation and Pydantic validation), Next.js (Renders interactive research interface with real-time updates, markdown rendering, and responsive design for desktop/mobile access), LangGraph (Orchestrates multi-agent research workflows with state management, conditional routing, and cycle detection for complex research patterns), LangChain (Provides LLM abstractions, text splitting for RAG, and unified interfaces for different AI providers with function calling support), BeautifulSoup4 (Extracts and cleans text content from scraped HTML pages while handling malformed markup and character encoding issues), Tavily (Primary web search and scraping service for retrieving current information with built-in content extraction and relevance scoring), and 3 more. This broad technology surface reflects a mature project with many integration points.

What system dynamics does gpt-researcher have?

gpt-researcher exhibits 4 data pools (Vector Memory Store, Research State Accumulator), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle recursion and retries. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does gpt-researcher use?

5 design patterns detected: Agent Orchestration, Retrieval-Augmented Generation, Progressive Enhancement, Stream Processing, Plugin Architecture.

Analyzed on April 20, 2026 by CodeSea.