How DSPy Works
Prompt engineering is manual tuning — you adjust words until the output looks right. DSPy treats it as an optimization problem: define what you want (signatures), provide examples, and let the framework find the prompts and few-shot examples that actually work.
What dspy Does
Programs language models with declarative Python code and auto-optimizes prompts
DSPy is a framework for building modular AI systems by writing compositional Python code instead of brittle prompts. It automatically optimizes LM prompts and weights using algorithms like bootstrap few-shot learning and genetic prompt evolution. The core philosophy is 'programming, not prompting' — you define signatures (input/output specs) and modules (like ChainOfThought), then DSPy teaches the LM to deliver high-quality outputs.
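The "programming, not prompting" idea can be sketched in plain Python. This toy `Signature` dataclass and `predict` function are illustrative stand-ins, not the real dspy API: the point is that the contract is declared once, and the module validates calls against it instead of relying on hand-tuned prompt text.

```python
from dataclasses import dataclass

# Toy illustration of "programming, not prompting" (NOT the real dspy API):
# a signature declares typed inputs/outputs, and a module is a callable
# that must fulfil that contract.
@dataclass
class Signature:
    inputs: list[str]
    outputs: list[str]
    instructions: str = ""

qa = Signature(inputs=["question"], outputs=["answer"],
               instructions="Answer concisely.")

def predict(sig: Signature, **kwargs) -> dict:
    # Validate the call against the declared contract before any LM work.
    missing = [f for f in sig.inputs if f not in kwargs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    # A real module would format a prompt, call the LM, and parse the reply;
    # here we just echo placeholders to show the shape of the result.
    return {f: f"<{f} for {kwargs[sig.inputs[0]]}>" for f in sig.outputs}

result = predict(qa, question="What is DSPy?")
print(result["answer"])
```

The optimizer's job is then to improve how this contract is rendered into a prompt, without the user editing prompt strings by hand.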
Architecture Overview
dspy is organized into 6 layers spanning 10 key components.
How Data Flows Through dspy
Programs flow from signature definition through module execution to LM calls and back. Users define signatures specifying inputs/outputs, create modules like Predict or ChainOfThought that implement these signatures, then execute them with actual data. The adapter layer transforms signatures into LM-specific prompts (with field delimiters and formatting), sends them to language models via the client layer, then parses responses back into structured predictions. Optimizers like BootstrapFewShot and GEPA improve this pipeline by generating better examples and instructions.
1. Define signature contract
User creates a Signature class with typed input/output fields and optional instructions — DSPy validates field types, generates descriptions, and creates the execution contract that modules must fulfill
2. Create module instance
User instantiates a module (Predict, ChainOfThought, ReAct) with the signature — module validates compatibility and prepares for execution
3. Execute with input data
Module.__call__ receives input values, validates them against signature fields, and triggers the prediction pipeline
4. Format prompt through adapter
ChatAdapter.format_prompt combines signature, input values, few-shot examples, and conversation history into LM messages — uses field headers like [[## question ##]] to structure content and instructs LM on output format
5. Call language model
BaseLM.generate sends formatted messages to the LM API (via LiteLLM), handles retries and caching, returns raw text response with usage metadata
6. Parse structured response
Adapter.parse_response extracts field values from LM text using header patterns or JSON parsing, validates types against signature, handles errors with fallbacks
7. Return prediction result
Module returns Prediction object containing parsed field values, completion metadata, and conversation history — accessible via dot notation like result.answer
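Steps 4 through 7 can be sketched end to end with a stubbed LM. The field-header convention matches the `[[## name ##]]` delimiters described above; `fake_lm` and the regex parser are illustrative stand-ins, not dspy internals:

```python
import re

# Hedged sketch of steps 4-7: format fields with [[## name ##]] headers,
# "call" a stubbed LM, then parse the same headers back out of the reply.
def format_prompt(fields: dict) -> str:
    return "\n".join(f"[[## {k} ##]]\n{v}" for k, v in fields.items())

def fake_lm(prompt: str) -> str:
    # Stand-in for the real LM call (which goes out through LiteLLM).
    return "[[## answer ##]]\n42\n[[## completed ##]]"

def parse_response(text: str) -> dict:
    # Split on [[## header ##]] markers, pairing each header with its body.
    parts = re.split(r"\[\[## (\w+) ##\]\]", text)
    it = iter(parts[1:])
    return {name: body.strip() for name, body in zip(it, it)}

prompt = format_prompt({"question": "What is 6 x 7?"})
pred = parse_response(fake_lm(prompt))
print(pred["answer"])  # -> 42
```

The real adapter also injects instructions, few-shot demonstrations, and conversation history into the messages before the call; the header round-trip shown here is the core mechanism.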
System Dynamics
Beyond the pipeline, dspy has runtime behaviors that shape how it responds to load, failures, and configuration changes.
Data Pools
DSPY_CACHE
Disk-based cache for LM responses using diskcache — prevents duplicate API calls, persists across sessions
Type: cache
Settings registry
Global configuration store for active LM, adapter, and system settings — maintains context stack for nested configurations
Type: registry
Few-shot example store
Module-level storage for demonstration examples used in prompts — populated by optimizers, used during execution
Type: buffer
Conversation history
Per-session message history for multi-turn conversations — maintains context across interactions
Type: state-store
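The cache pool can be illustrated with a hash-keyed lookup. DSPy persists responses to disk via diskcache; the in-memory dict, model name, and response strings below are stand-ins for illustration:

```python
import hashlib
import json

# Illustrative sketch of an LM response cache: the key hashes the model
# name plus the full request, so identical calls never hit the API twice.
_cache: dict[str, str] = {}
calls = 0

def cache_key(model: str, messages: list[dict]) -> str:
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def generate(model: str, messages: list[dict]) -> str:
    global calls
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key]   # cache hit: no API call
    calls += 1               # cache miss: the provider would be called here
    _cache[key] = f"response-{calls}"
    return _cache[key]

msgs = [{"role": "user", "content": "hi"}]
print(generate("some-model", msgs), generate("some-model", msgs), calls)
```

Because the key covers the whole request, any change to the prompt, demonstrations, or model invalidates the entry automatically.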
Feedback Loops
Bootstrap few-shot learning
Trigger: BootstrapFewShot.compile() called → Run program on training examples, collect successful traces, add high-scoring examples as demonstrations (exits when: Max examples reached or no improvement)
Type: training-loop
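A minimal sketch of the bootstrap loop, assuming a program is just a callable and a metric is a boolean check. The function names below are illustrative, not the BootstrapFewShot API:

```python
# Sketch of the bootstrap idea: run the program over training examples and
# keep only the traces the metric scores as successful, up to a cap.
def bootstrap(program, trainset, metric, max_demos=4):
    demos = []
    for example in trainset:
        pred = program(example["question"])
        if metric(example, pred):            # keep only high-scoring traces
            demos.append({"question": example["question"], "answer": pred})
        if len(demos) >= max_demos:          # exit: max examples reached
            break
    return demos

program = lambda q: q.upper()                # stand-in "program"
trainset = [{"question": "a", "answer": "A"},
            {"question": "b", "answer": "X"}]
metric = lambda ex, pred: pred == ex["answer"]
print(bootstrap(program, trainset, metric))  # keeps only the first example
```

The collected demos then become the few-shot examples the adapter inserts into future prompts, which is why the loop only admits traces that passed the metric.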
GEPA genetic optimization
Trigger: GEPA.compile() called → Mutate instruction text, crossbreed variants, evaluate fitness, select survivors for next generation (exits when: Max generations reached or convergence)
Type: training-loop
Adapter format fallback
Trigger: ChatAdapter parsing fails → Fall back to JSONAdapter, attempt structured parsing again (exits when: Successful parse or final failure)
Type: retry
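The fallback can be sketched as a try/except over two parsers. The header regex and error handling below are illustrative, not the actual adapter code:

```python
import json
import re

# Sketch of the adapter fallback: try ChatAdapter-style header parsing
# first; if no field headers are found, attempt a JSON parse instead.
def parse_headers(text: str) -> dict:
    parts = re.split(r"\[\[## (\w+) ##\]\]", text)
    if len(parts) < 3:
        raise ValueError("no field headers found")
    it = iter(parts[1:])
    return {k: v.strip() for k, v in zip(it, it)}

def parse_with_fallback(text: str) -> dict:
    try:
        return parse_headers(text)
    except ValueError:
        return json.loads(text)   # fallback: structured JSON parse

print(parse_with_fallback('{"answer": "42"}'))  # JSON fallback path
```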
LM retry with backoff
Trigger: LM API call fails or rate limited → Wait with exponential backoff, retry API call (exits when: Success or max retries exceeded)
Type: retry
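The retry loop can be sketched as exponential backoff around a flaky call. DSPy delegates this to Tenacity; the hand-rolled version below is for illustration only:

```python
import time

# Sketch of the LM retry loop: exponential backoff between attempts,
# bounded by a maximum retry count.
def call_with_retry(fn, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise                           # exit: max retries exceeded
            time.sleep(base_delay * (2 ** attempt))

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("rate limited")      # simulated transient failure
    return "ok"

result = call_with_retry(flaky)
print(result)  # -> ok (after two simulated failures)
```

Combined with the response cache, this means a transient rate limit costs only the backoff delay, and a repeated request costs nothing at all.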
Control Points
LM provider selection
Adapter choice
Max tokens limit
Cache enabled
Few-shot count
Temperature setting
Delays
Cache lookup
Duration: immediate if hit
LM API generation
Duration: variable by model
Optimization compilation
Duration: minutes to hours
Response streaming
Duration: partial results available immediately
Technology Choices
dspy is built with 9 key technologies. Each serves a specific role in the system.
Key Components
- Signature (validator): Validates and structures input/output specifications for LM calls — ensures type consistency, generates field descriptions, provides the contract that modules must fulfill
- Predict (executor): Core module that executes signatures by formatting them through adapters, calling LMs, and parsing responses — the basic building block for all DSPy programs
- ChatAdapter (adapter): Transforms signatures into chat-formatted prompts using field header delimiters like [[## field_name ##]] and parses structured responses back into predictions
- BaseLM (gateway): Abstract interface for language model providers — standardizes generation, embedding, tool calling across different APIs while handling caching and usage tracking
- BootstrapFewShot (optimizer): Automatically generates effective few-shot examples by running programs on training data, selecting high-quality input-output traces, and including them in future prompts
- GEPA (optimizer): Genetic algorithm for prompt evolution — mutates instruction text, crossbreeds successful variants, and selects improvements based on evaluation metrics
- ReAct (orchestrator): Implements the Reasoning-Acting pattern for tool use — alternates between thought, action, and observation steps until reaching a final answer
- Example (store): Immutable container for training examples and demonstrations — stores input/output pairs with metadata like reasoning traces and quality scores
- Evaluate (processor): Evaluates program performance on datasets using custom metrics — runs programs on test cases, computes scores, provides optimization feedback
- Type (adapter): Base class for custom content types like Image, Audio, Code — provides format() method to convert structured data into LM-compatible content representations
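The ReAct pattern from the component list can be sketched with a scripted transcript standing in for the LM. The calculator tool and the script contents below are invented for illustration:

```python
# Toy sketch of the ReAct loop: alternate thought -> action -> observation
# until the "LM" (here a scripted transcript) emits a final answer.
def calculator(expr: str) -> str:
    return str(eval(expr))        # demo-only tool; never eval untrusted input

script = iter([
    ("think",  "I should compute 6 * 7."),
    ("act",    ("calculator", "6 * 7")),
    ("finish", "42"),
])

def react():
    trace = []
    for kind, payload in script:
        if kind == "act":
            tool, arg = payload
            obs = {"calculator": calculator}[tool](arg)
            trace.append(f"observation: {obs}")   # feed result back in
        elif kind == "finish":
            return payload, trace
        else:
            trace.append(f"thought: {payload}")

answer, trace = react()
print(answer)  # -> 42
```

In the real module, each iteration re-prompts the LM with the accumulated trace, so observations from tools shape the next thought.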
Who Should Read This
ML researchers and engineers who want to move beyond manual prompt engineering, or teams building complex LLM pipelines.
This analysis was generated by CodeSea from the stanfordnlp/dspy source code. For the full interactive visualization — including pipeline graph, architecture diagram, and system behavior map — see the complete analysis.
Explore Further
Full Analysis
Interactive architecture map for dspy
dspy vs langchain
Side-by-side architecture comparison
dspy vs llama_index
Side-by-side architecture comparison
dspy vs guidance
Side-by-side architecture comparison
How LangChain Works
ML Inference & Agents
How LlamaIndex Works
ML Inference & Agents
How vLLM Works
ML Inference & Agents
Frequently Asked Questions
What is dspy?
Programs language models with declarative Python code and auto-optimizes prompts
How does dspy's pipeline work?
dspy processes data through 7 stages: defining a signature contract, creating a module instance, executing with input data, formatting the prompt through an adapter, calling the language model, parsing the structured response, and returning the prediction result. Programs flow from signature definition through module execution to LM calls and back. Users define signatures specifying inputs/outputs, create modules like Predict or ChainOfThought that implement these signatures, then execute them with actual data. The adapter layer transforms signatures into LM-specific prompts (with field delimiters and formatting), sends them to language models via the client layer, then parses responses back into structured predictions. Optimizers like BootstrapFewShot and GEPA improve this pipeline by generating better examples and instructions.
What tech stack does dspy use?
dspy is built with LiteLLM (Unified API client for multiple language model providers — handles OpenAI, Anthropic, local models with consistent interface), Pydantic (Type validation and serialization for signatures, custom types, and configuration models), DiskCache (Persistent caching of language model responses to reduce API costs and latency), Tenacity (Retry logic with exponential backoff for resilient LM API calls), JSON Repair (Attempts to fix malformed JSON in LM responses before parsing), and 4 more technologies.
How does dspy handle errors and scaling?
dspy uses 4 feedback loops, 6 control points, and 4 data pools to manage its runtime behavior. These mechanisms handle error recovery, load distribution, and configuration changes.
How does dspy compare to langchain?
CodeSea has detailed side-by-side architecture comparisons of dspy with langchain, llama_index, and guidance. These cover tech stack differences, pipeline design, and system behavior.