assafelovic/gpt-researcher
An autonomous agent that conducts deep research on any data using any LLM provider
Autonomous research agent that generates comprehensive reports with citations
Research query flows through sub-query generation, parallel source retrieval, content scraping, context synthesis, and report generation with real-time progress streaming
Under the hood, the system uses two feedback loops, three data pools, and four control points to manage its runtime behavior.
Structural Verdict
A 10-component ML inference system with 11 connections, across 281 analyzed files. Well-connected — clear data flow between components.
How Data Flows Through the System
- Query Processing — Break down research question into focused sub-queries (config: services.gpt-researcher.environment.OPENAI_API_KEY, services.gpt-researcher.environment.OPENAI_BASE_URL)
- Source Discovery — Search web using multiple retriever strategies in parallel (config: services.gpt-researcher.environment.TAVILY_API_KEY, services.gpt-researcher.environment.GOOGLE_API_KEY)
- Content Extraction — Scrape and clean content from discovered URLs
- Context Synthesis — Filter and consolidate relevant information from sources
- Report Generation — Generate structured report with citations using LLM (config: services.gpt-researcher.environment.OPENAI_API_KEY, tone)
- Output Formatting — Export report as PDF, DOCX, or JSON formats
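The six stages above can be sketched as a single async pipeline. This is a minimal illustration, not the repository's actual API: every function name and return shape here is hypothetical, and Context Synthesis plus Report Generation are collapsed into one step.

```python
import asyncio

# Hypothetical stage functions; names and signatures are illustrative only.
def generate_sub_queries(query: str) -> list[str]:
    # Query Processing: break the research question into focused sub-queries.
    return [f"{query} overview", f"{query} recent developments"]

async def search_sources(sub_query: str) -> list[str]:
    # Source Discovery: each retriever would return candidate URLs.
    return [f"https://example.com/{sub_query.replace(' ', '-')}"]

async def scrape(url: str) -> str:
    # Content Extraction: fetch and clean page content.
    return f"content from {url}"

async def run_pipeline(query: str) -> str:
    sub_queries = generate_sub_queries(query)
    # Source Discovery runs in parallel across sub-queries.
    url_lists = await asyncio.gather(*(search_sources(q) for q in sub_queries))
    urls = [u for batch in url_lists for u in batch]
    # Content Extraction also runs in parallel, one task per URL.
    pages = await asyncio.gather(*(scrape(u) for u in urls))
    # Context Synthesis + Report Generation, collapsed into one step here.
    context = "\n".join(pages)
    return f"# Report on {query}\n\n{context}"

report = asyncio.run(run_pipeline("quantum computing"))
```

The key structural point is the fan-out: sub-queries and URLs are gathered concurrently, which is what makes the real pipeline's "parallel source retrieval" stage fast.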
System Behavior
How the system actually operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Research Context Store — Accumulates research findings, sources, and intermediate results during a research session
- Vector Store — Stores document embeddings for similarity search and retrieval
- Generated reports in PDF, DOCX, and JSON formats
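A minimal sketch of how the second pool works: a vector store ranks stored items by cosine similarity to a query embedding. This toy version uses hand-written 2-D vectors; the real project produces embeddings with an LLM provider and uses a dedicated store, so everything below is illustrative.

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store; real embeddings come from an embedding model."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, embedding: list[float]) -> None:
        self.items.append((text, embedding))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query_embedding: list[float], k: int = 1) -> list[str]:
        # Rank all stored items by similarity to the query and keep the top k.
        ranked = sorted(
            self.items,
            key=lambda item: self._cosine(query_embedding, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("report on LLMs", [1.0, 0.0])
store.add("report on databases", [0.0, 1.0])
best = store.search([0.9, 0.1])  # closest to the "LLMs" embedding
```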
Feedback Loops
- Query Refinement Loop (recursive, balancing) — Trigger: Insufficient or low-quality search results. Action: Generate additional sub-queries and search different sources. Exit: Sufficient relevant content gathered or max iterations reached.
- Source Validation Loop (retry, balancing) — Trigger: Failed URL scraping or invalid content. Action: Try alternative sources or retriever strategies. Exit: Successful content extraction or source exhaustion.
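The Query Refinement Loop's trigger/action/exit structure can be sketched as a bounded loop. This is a schematic, assuming a hypothetical `search` callable; the real loop generates sub-queries with an LLM rather than string suffixes.

```python
def refine_until_sufficient(search, initial_queries, min_results=3, max_iterations=3):
    """Query Refinement Loop sketch: widen the search until enough results
    are gathered, or stop at the iteration cap (the balancing exit condition)."""
    queries = list(initial_queries)
    results = []
    for _ in range(max_iterations):
        for q in queries:
            results.extend(search(q))
        if len(results) >= min_results:
            break  # exit: sufficient relevant content gathered
        # trigger: insufficient results -> generate additional, more specific sub-queries
        queries = [f"{q} details" for q in queries]
    return results

# Fake retriever that only answers the refined, more specific queries.
def fake_search(q):
    return [q] if "details" in q else []

found = refine_until_sufficient(fake_search, ["topic A", "topic B"], min_results=2)
```

The cap on iterations is what makes the loop "balancing": it converges to either enough content or a hard stop, never an unbounded search.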
Delays & Async Processing
- API Rate Limiting (rate-limit, variable per provider) — Delays between search API calls to avoid hitting rate limits
- LLM Token Processing (async-processing, ~1-30 seconds) — Time required for LLM to process context and generate responses
- Web Scraping Delays (async-processing, ~1-10 seconds per URL) — Network latency and content parsing time for each scraped page
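The delay handling above can be combined in one pattern: cap concurrent fetches with a semaphore and pause inside each fetch, so search and scraping targets are not hammered. A sketch under stated assumptions (the fetch body is a stand-in for real network I/O):

```python
import asyncio

async def scrape_with_limits(urls, max_concurrent=2, delay=0.05):
    """Bound concurrency and insert per-request delays. Illustrative only;
    the real scraper's limits and timings differ per provider."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch(url):
        async with semaphore:  # at most max_concurrent fetches in flight
            await asyncio.sleep(delay)  # stand-in for latency / rate-limit pause
            return f"page:{url}"

    # gather preserves input order even though tasks finish out of order
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(scrape_with_limits([f"u{i}" for i in range(4)]))
```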
Control Points
- LLM Provider Selection (env-var) — Controls: Which LLM provider to use (OpenAI, Anthropic, Google, etc.). Default: OPENAI_API_KEY
- Search Provider Priority (env-var) — Controls: Primary search service (Tavily, Google, DuckDuckGo). Default: TAVILY_API_KEY
- Report Tone (runtime-toggle) — Controls: Writing style and tone of generated reports. Default: Reflective
- Research Depth (feature-flag) — Controls: Type of research conducted (basic, deep, custom). Default: research_report
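The four control points above amount to resolving configuration from the environment at startup. This sketch mirrors the table's defaults; the API-key variable names are documented in the compose file, but `TONE` and `REPORT_TYPE` are hypothetical names used here for illustration.

```python
import os

def resolve_settings(env=None):
    """Map environment variables to runtime behavior, per the control-point
    table: API keys select providers, tone and report type default as shown."""
    env = env if env is not None else os.environ
    return {
        # LLM Provider Selection: presence of a key selects the provider.
        "llm_provider": "openai" if env.get("OPENAI_API_KEY") else "unset",
        # Search Provider Priority: Tavily first, with a fallback retriever.
        "search_provider": "tavily" if env.get("TAVILY_API_KEY") else "duckduckgo",
        # Report Tone (runtime-toggle), default Reflective.
        "tone": env.get("TONE", "Reflective"),
        # Research Depth (feature-flag), default research_report.
        "report_type": env.get("REPORT_TYPE", "research_report"),
    }

settings = resolve_settings({"OPENAI_API_KEY": "sk-test", "TAVILY_API_KEY": "tvly-test"})
```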
Technology Stack
- FastAPI — Web server and API framework
- WebSockets — Real-time communication
- Next.js — Frontend React framework
- LangChain — LLM orchestration and document processing
- BeautifulSoup — HTML parsing and content extraction
- Tavily — Primary web search API
- LLM provider for text generation
- Database ORM
- Testing framework
Key Components
- GPTResearcher (class, gpt_researcher/__init__.py) — Main orchestrator class that conducts research and generates reports
- conduct_research (function, gpt_researcher/actions/conduct_research.py) — Executes the research workflow, including query generation and source gathering
- get_sub_queries (function, gpt_researcher/actions/query_processing.py) — Generates focused sub-queries from the main research question
- scrape_sites_by_query (function, gpt_researcher/actions/web_scraper.py) — Scrapes web content for each sub-query using multiple retrievers
- WebSocketManager (class, backend/server/websocket_manager.py) — Manages real-time communication between the frontend and the research engine
- run_agent (function, backend/server/websocket_manager.py) — Executes the research agent with progress streaming via WebSocket
- TavilySearch (class, gpt_researcher/retrievers/tavily/tavily_search.py) — Primary search retriever using the Tavily API for web search
- scrape_url (function, gpt_researcher/scraper/scraper.py) — Extracts and cleans content from web pages
- create_chat_completion (function, gpt_researcher/utils/llm.py) — Unified LLM interface supporting multiple providers (OpenAI, Anthropic, Google)
- write_report (function, gpt_researcher/actions/write_report.py) — Synthesizes research data into structured reports with citations
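To show how the orchestrator and the two workflow functions fit together, here is a toy stand-in that mirrors the conduct-then-write interface. `MiniResearcher` is not the real class; the actual `GPTResearcher` lives in gpt_researcher/__init__.py and delegates to the retrievers, scraper, and LLM interface listed above.

```python
import asyncio

class MiniResearcher:
    """Toy stand-in mirroring the orchestrator's two-phase interface:
    conduct_research gathers context, write_report synthesizes it."""

    def __init__(self, query: str, report_type: str = "research_report"):
        self.query = query
        self.report_type = report_type
        self.context: list[str] = []

    async def conduct_research(self) -> None:
        # Real implementation: sub-queries -> retrievers -> scraping -> synthesis.
        self.context.append(f"findings about {self.query}")

    async def write_report(self) -> str:
        # Real implementation: LLM call over the gathered context, with citations.
        return f"# {self.query}\n" + "\n".join(self.context)

async def main() -> str:
    researcher = MiniResearcher("solid-state batteries")
    await researcher.conduct_research()
    return await researcher.write_report()

report_text = asyncio.run(main())
```

The two-phase split matters for the WebSocket layer: progress events can stream during the long `conduct_research` phase before any report text exists.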
Sub-Modules
- Discord integration for research requests via bot commands
- LangGraph-based multi-agent research workflow with specialized roles
- JavaScript/Node.js SDK for GPT Researcher integration
Configuration
docker-compose.yml (yaml)
- services.gpt-researcher.pull_policy (string) — default: build
- services.gpt-researcher.image (string) — default: gptresearcher/gpt-researcher
- services.gpt-researcher.build (string) — default: ./
- services.gpt-researcher.environment.OPENAI_API_KEY (string) — default: ${OPENAI_API_KEY}
- services.gpt-researcher.environment.OPENAI_BASE_URL (string) — default: ${OPENAI_BASE_URL}
- services.gpt-researcher.environment.TAVILY_API_KEY (string) — default: ${TAVILY_API_KEY}
- services.gpt-researcher.environment.LANGCHAIN_API_KEY (string) — default: ${LANGCHAIN_API_KEY}
- services.gpt-researcher.environment.LOGGING_LEVEL (string) — default: INFO
- +36 more parameters
langgraph.json (json)
- python_version (string) — default: 3.11
- dependencies (array) — default: ./multi_agents
- graphs.agent (string) — default: ./multi_agents/agent.py:graph
- env (string) — default: .env
backend/server/app.py (python-pydantic)
- task (str)
- report_type (str)
- report_source (str)
- tone (str)
- repo_name (str)
- branch_name (str)
- generate_in_background (bool) — default: True
backend/server/app.py (python-pydantic)
- report (str)
- messages (List[Dict[str, Any]])
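The two backend/server/app.py schemas above are Pydantic models in the repository; here they are reconstructed with standard-library dataclasses so the sketch has no dependencies. The class names `ResearchRequest` and `ResearchResponse` are illustrative, not the project's actual model names.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ResearchRequest:
    """Dependency-free reconstruction of the request schema above."""
    task: str
    report_type: str
    report_source: str
    tone: str
    repo_name: str
    branch_name: str
    generate_in_background: bool = True  # matches the documented default

@dataclass
class ResearchResponse:
    """Reconstruction of the response schema: the report plus message history."""
    report: str
    messages: list[dict[str, Any]] = field(default_factory=list)

req = ResearchRequest(
    task="Summarize recent LLM evaluation methods",
    report_type="research_report",
    report_source="web",
    tone="Reflective",
    repo_name="",
    branch_name="",
)
```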
Frequently Asked Questions
What is gpt-researcher used for?
assafelovic/gpt-researcher is an autonomous research agent that generates comprehensive reports with citations. It is a 10-component ML inference system written in Python, well-connected with clear data flow between components. The codebase contains 281 files.
How is gpt-researcher architected?
gpt-researcher is organized into 5 architecture layers: Research Engine, Data Collection, Processing Layer, API Layer, and 1 more. Well-connected — clear data flow between components. This layered structure enables tight integration between components.
How does data flow through gpt-researcher?
Data moves through 6 stages: Query Processing → Source Discovery → Content Extraction → Context Synthesis → Report Generation → Output Formatting. The research query flows through sub-query generation, parallel source retrieval, content scraping, context synthesis, and report generation with real-time progress streaming. This pipeline design reflects a complex multi-stage processing system.
What technologies does gpt-researcher use?
The core stack includes FastAPI (Web server and API framework), WebSockets (Real-time communication), Next.js (Frontend React framework), LangChain (LLM orchestration and document processing), BeautifulSoup (HTML parsing and content extraction), Tavily (Primary web search API), and 3 more. This broad technology surface reflects a mature project with many integration points.
What system dynamics does gpt-researcher have?
gpt-researcher exhibits 3 data pools (including the Research Context Store and Vector Store), 2 feedback loops, 4 control points, and 3 delay sources. The feedback loops handle recursive query refinement and retrying failed sources. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does gpt-researcher use?
4 design patterns detected: Async Agent Pipeline, Strategy Pattern for Retrievers, WebSocket Progress Streaming, Multi-Provider LLM Support.
Analyzed on March 31, 2026 by CodeSea. Written by Karolina Sarna.