Portable, model-agnostic deep research agent. Built on deepagents + LangGraph. Plans, asks clarifying questions when the request is ambiguous, spawns parallel sub-researchers, calls web search + MCP tools, runs code in an optional sandbox, applies on-disk skills, and writes a cited report — exposing a typed streaming-event protocol so any frontend can render the Claude/Gemini-style live research UI (clarification cards, search queries, website grid, MCP calls, skills, thinking, interleaved [n] citations).
Design goals
- No host-app dependency. Copy this directory (or `uv pip install -e .`) into any app. The only seam is `ResearchConfig` (env + per-run `configurable`). Zero imports from your backend.
- Not model-locked. Every model goes through an OpenAI-compatible `base_url` (OpenRouter by default). Models are organized as named price tiers defined in code (`MODEL_TIERS`: any OpenRouter slug, local vLLM, …); the runtime selects a tier by name (`DRA_MODEL_TIER=extra-low|low|mid|high`, see Model tiers).
- Replaceable parts. Search backend, MCP servers, skills, prompts, and the event emitter are all isolated modules.
This project is managed with uv (uv.lock is committed). Use uv — do not pip install into your base interpreter.
```bash
cp .env.example .env   # set OPENAI_API_KEY, TAVILY_API_KEY (+ optional DRA_MCP_*)
./run.sh               # sync deps (first run) + start the dev server on :2024
```

`run.sh` is a one-command dev bring-up. It loads `./.env` (so the script and server share config), syncs `./.venv` on first run, and starts the LangGraph server:
| Command | Does |
|---|---|
| `./run.sh` (or `./run.sh up`) | Sync deps if `./.venv` is missing, then start the dev server (API + docs at http://127.0.0.1:2024/docs). Warns if `OPENAI_API_KEY` / `TAVILY_API_KEY` are unset. |
| `./run.sh --sync` | Force `uv sync --extra dev`, then start the server. |
| `./run.sh ask "<question>"` | Stream one research run against an already-running server. |
| `./run.sh smoke` | `ask` a canned question against a running server. |
| `./run.sh test` | Sync, then run the offline pytest suite (no API keys / network). |
Host/port follow DRA_HOST (default 127.0.0.1) and PORT (default 2024). ask/smoke need the server up in another shell first. The equivalent manual commands:
```bash
uv sync --extra dev                                 # create ./.venv with all deps + the langgraph CLI
uv run langgraph dev --host 127.0.0.1 --port 2024
uv run python examples/client.py "What are the recent trends across the tracked entities, and where can I find supporting data?"
```

Graph id: `deep_research_agent`. Set this as your caller's `assistant_id`.
Tests live in tests/ (e.g. the deterministic report-hygiene guard — scrub_report + lint_citations / report_problems). Pure-Python, no API keys or network needed. pytest ships in the dev extra, so the suite runs inside ./.venv alongside the runtime deps:
```bash
./run.sh test             # sync + run the suite (equivalent to the two commands below)
uv sync --extra dev       # installs pytest + deepagents + the langgraph CLI into ./.venv
uv run pytest tests/ -q
```

Resolution order for every field: per-run `configurable` override → env var → default. `configurable` accepts both this package's native keys and compatibility aliases (`research_model`, `final_report_model`, `apiKeys`, `mcp_config`, `mcp_prompt`) so an existing caller can adopt the agent with zero backend changes. The per-model aliases are accepted but do not select models; they are ignored with a warning (see Model tiers).
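The resolution order can be pictured as a three-step lookup. The sketch below is illustrative only and is not the package's `config.py`: the helper name `resolve_field` and its parameters are hypothetical, and it does no type coercion (an env value stays a string).

```python
import os
from typing import Any, Mapping, Optional


def resolve_field(
    key: str,
    env_var: str,
    default: Any,
    configurable: Optional[Mapping[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Hypothetical helper: per-run override, then env var, then default."""
    configurable = configurable or {}
    environ = os.environ if environ is None else environ
    if configurable.get(key) is not None:  # 1. per-run `configurable` wins
        return configurable[key]
    if environ.get(env_var):               # 2. then a non-empty env var
        return environ[env_var]
    return default                         # 3. finally the built-in default


env = {"DRA_MAX_TOOL_CALLS": "150"}
# Per-run override present: the env var and the default are not consulted.
print(resolve_field("max_tool_calls", "DRA_MAX_TOOL_CALLS", 200, {"max_tool_calls": 50}, env))
# No override: the env var is used (still a string in this sketch).
print(resolve_field("max_tool_calls", "DRA_MAX_TOOL_CALLS", 200, {}, env))
# Neither set: the default applies.
print(resolve_field("max_tool_calls", "DRA_MAX_TOOL_CALLS", 200, {}, {}))
```

A real implementation would also parse and validate each value; that part is omitted here.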
| Env var | Default | Purpose |
|---|---|---|
| `OPENAI_API_KEY` (or `OPENROUTER_API_KEY`) | (none) | Key sent as Bearer to `OPENAI_BASE_URL` |
| `OPENAI_BASE_URL` | `https://openrouter.ai/api/v1` | OpenAI-compatible endpoint |
| `DRA_ALLOWED_BASE_URLS` | (none) | Comma-separated allowlist of extra base URLs a run may override to (key-exfiltration guard) |
| `TAVILY_API_KEY` | (none) | Web search; if unset, the `web_search` tool is omitted |
| `DRA_MODEL_TIER` | `extra-low` | Named model package: `extra-low` \| `low` \| `mid` \| `high` (see Model tiers below). The only model knob: individual models are chosen in code (`MODEL_TIERS`), never per env/run |
| `DRA_MCP_URL` | (none) | Single MCP server (bare host → `/mcp` appended) |
| `DRA_MCP_LABEL` | (none) | Friendly name for that server in the report's Sources |
| `DRA_MCP_SERVERS` | (none) | JSON list of `{label, url}` for multiple servers |
| `DRA_MCP_BEARER` | (none) | Bearer token attached to every MCP server lacking explicit auth |
| `DRA_MCP_MAX_CONCURRENCY` | `10` | Hard ceiling on simultaneous MCP calls across the whole run |
| `DRA_MCP_RATE_LIMIT_MAX_WAIT` | `120` | Per-call 429 backoff budget (seconds) before the call fails |
| `DRA_SKILLS_DIR` | `./skills` | Directory of agent skills (see below) |
| `DRA_STREAMING` | `true` | Token-by-token streaming; set `false` for models with off-spec streaming chunks |
| `DRA_STREAMING_DENYLIST` | `deepseek-v4-flash` | Comma-separated model-name substrings that force streaming off |
| `DRA_RECURSION_LIMIT` | `4500` | LangGraph super-step ceiling for the orchestrator loop (caps loops, not tool calls) |
| `DRA_MAX_TOOL_CALLS` | `200` | Cumulative tool-call ceiling per run (`BudgetMiddleware`) before a hard stop |
| `DRA_MAX_TOTAL_TOKENS` | `4000000` | Cumulative token ceiling per run; soft wrap-up nudge at 75%, hard stop at 100% |
| `DRA_MAX_RESULT_CHARS` | `60000` | Per-call MCP result size over which the result offloads to a file (or truncates, no sandbox) |
| `DRA_MAX_RESULT_ROWS` | `1000` | Per-call MCP result row count that triggers the same offload/truncate |
| `DRA_OFFLOAD_RESULTS` | `true` | Offload large MCP results to the sandbox filesystem instead of truncating them |
| `DRA_OFFLOAD_DIR` | `/workspace/data` | Directory (inside the sandbox) for offloaded result files |
| `LLM_SANDBOX_URL` | (none) | Code-execution sandbox sidecar; when set, the `execute` tool runs real shell/Python/JS |
| `LLM_SANDBOX_TOKEN` | (none) | Auth token; must match the sandbox service's `LLM_SANDBOX_TOKEN` |
| `LLM_SANDBOX_NETWORK` | `false` | Allow outbound network from inside the sandbox |
| `LLM_SANDBOX_SESSION_TIMEOUT` | `900` | Sandbox session timeout (seconds) |
Per-run configurable keys mirror these: model_tier, apiKeys.{OPENAI_API_KEY,TAVILY_API_KEY}, base_url (allowlisted only), temperature, search_max_results, max_concurrent_research_units, mcp_servers / mcp_config, mcp_prompt, mcp_max_concurrency, mcp_rate_limit_max_wait, skills_dir, streaming, streaming_denylist, recursion_limit, max_tool_calls, max_total_tokens, max_result_chars, max_result_rows, offload_results, offload_dir, sandbox_url, sandbox_token, sandbox_network, sandbox_session_timeout.
Models are chosen by name only: `DRA_MODEL_TIER=mid` (or per-run `configurable.model_tier`). Which models a name maps to is decided in code, in `MODEL_TIERS` in `config.py` (one reviewed place), and cannot be set per env var or per run; legacy per-model keys (`research_model`, `final_report_model`, `compression_model`, …) are ignored with a warning. When nothing is configured the default is `extra-low`, so a bare checkout can't silently burn money; opt up explicitly for real work. An unknown tier name warns and falls back to the default.

The table lists OpenRouter slugs, with prices in $ per million input/output tokens as of 2026-06:
| Tier | Research (orchestrator) | Sub-agent | Utility |
|---|---|---|---|
| `extra-low` | `deepseek/deepseek-v4-flash` (0.10/0.20) | `deepseek/deepseek-v4-flash` (0.10/0.20) | `qwen/qwen3-30b-a3b-instruct-2507` (0.05/0.19) |
| `low` | `deepseek/deepseek-v4-pro` (0.44/0.87) | `deepseek/deepseek-v4-flash` (0.10/0.20) | `deepseek/deepseek-v4-flash` |
| `mid` | `google/gemini-3.5-flash` (1.50/9) | `google/gemini-2.5-flash` (0.30/2.50) | `deepseek/deepseek-v4-flash` |
| `high` | `anthropic/claude-opus-4.8` (5/25) | `anthropic/claude-sonnet-4.6` (3/15) | `anthropic/claude-haiku-4.5` (1/5) |
`extra-low` is rock bottom: `deepseek-v4-flash` fills both tool-loop roles, so the sub-agent is never pricier than the orchestrator. It has proven reliable here as this tier's orchestrator and as the `low` tier's sub-agent, unlike the cheaper-but-flakier open-weight options that gave up mid-loop. Delegation still pays off via context isolation. Expect about $0.02 of orchestrator spend per medium run, with noticeably weaker planning and earlier give-ups than higher tiers; the force-completion / findings-gate / budget backstops keep runs honest, not great. Use it for demos, smoke tests, and high-volume low-stakes scheduled ticks, not for decisions.

`high` deliberately keeps sub-agent/utility at the Sonnet/Haiku tier: Opus plans and synthesizes only; an Opus sub-agent fleet would defeat the tiering.

To add your own packaging: add an entry to `MODEL_TIERS` (code), pick a name, and document it in this table. Callers then select it with `DRA_MODEL_TIER=<name>`. An unknown tier name is ignored with a warning (plain defaults apply).
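Selecting a tier by name with a warn-and-fall-back default might look like the sketch below. The slugs are copied from the table; the dictionary shape, the role keys (`research`, `subagent`, `utility`), and the `resolve_tier` helper are assumptions made for illustration, not the actual contents of `config.py`.

```python
import warnings
from typing import Dict, Optional

DEFAULT_TIER = "extra-low"

# Illustrative copy of the tier table; the real MODEL_TIERS may be shaped differently.
MODEL_TIERS: Dict[str, Dict[str, str]] = {
    "extra-low": {
        "research": "deepseek/deepseek-v4-flash",
        "subagent": "deepseek/deepseek-v4-flash",
        "utility": "qwen/qwen3-30b-a3b-instruct-2507",
    },
    "low": {
        "research": "deepseek/deepseek-v4-pro",
        "subagent": "deepseek/deepseek-v4-flash",
        "utility": "deepseek/deepseek-v4-flash",
    },
    "mid": {
        "research": "google/gemini-3.5-flash",
        "subagent": "google/gemini-2.5-flash",
        "utility": "deepseek/deepseek-v4-flash",
    },
    "high": {
        "research": "anthropic/claude-opus-4.8",
        "subagent": "anthropic/claude-sonnet-4.6",
        "utility": "anthropic/claude-haiku-4.5",
    },
}


def resolve_tier(name: Optional[str]) -> Dict[str, str]:
    """Return the model package for a tier name, falling back to the default."""
    if name in MODEL_TIERS:
        return MODEL_TIERS[name]
    if name:  # a name was given but is not defined: warn, then use the default
        warnings.warn(f"unknown model tier {name!r}; using {DEFAULT_TIER!r}")
    return MODEL_TIERS[DEFAULT_TIER]


print(resolve_tier("high")["research"])
print(resolve_tier(None)["utility"])
```

Keeping the mapping in one dictionary means adding a tier is a single reviewed code change, which matches the intent described above.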
Stream with stream_mode=["messages","updates","custom"] and stream_subgraphs=True. The custom channel carries protocol events (each a JSON object with type); the messages channel carries assistant thinking tokens for the collapsible pane.
| `type` | Key fields | Renders as |
|---|---|---|
| `clarification` | `questions[]` | Question card; input re-enabled (user replies on the same thread) |
| `search_query` | `id`, `query`, `source` | Globe row |
| `search_results` | `id`, `query`, `ok`, `count`, `results[].{title,url,domain,snippet}` | Favicon + title grid |
| `source` | `title`, `url`, `domain` | Live citation list entry |
| `mcp_call` | `id`, `tool`, `args` | MCP call row |
| `mcp_result` | `id`, `tool`, `ok`, `summary`; on failure `error_class` = `permanent` \| `transient` \| `unknown` (+ `repeated` when an identical failed call was answered locally) | MCP result row |
| `skill` | `name`, `path`, `state` | "Skill applied: `<name>`" indicator |
| `subagent_findings` | `unit`, `summary`, `findings[].{finding,evidence,source}`, `gaps[]` | Folded findings table (one per sub-agent); emitted when a sub-agent's findings validate |
| `report` | `markdown` | Final answer (also in state `final_report`) |
| `usage` | `tool_calls`, `total_tokens`, `model_calls`, `limits{}`, … | Per-run ledger at run end (no UI; logging / cost tracking) |
| `status` | `state` = `mcp_ready` \| `mcp_error` \| `budget_soft` \| `budget_halt` \| `revising` \| `done` | Lifecycle / errors |
status detail: mcp_ready carries tool_count + tools[]; mcp_error carries detail, server, label; budget_soft is the 75% wrap-up nudge and budget_halt the hard ceiling stop (see budgets below); revising fires when a gate bounces a deliverable back for one revision — reason: report_quality (final report) or reason: subagent_findings (a sub-agent's findings handoff); done fires when the report is finalized.
The usage event (from metering.py) reports orchestrator-level token counts plus global tool-call / result-size totals and the configured ceilings — emitted once at run end for logging and cost tracking.
Final thread state also exposes final_report (string) and sources ([{index,url,domain}]) — structured citations independent of the inline [n] markers the writer model produces.
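On the consumer side, the `custom` channel reduces to a dispatch on each event's `type`. The sketch below turns events into one-line log entries; it is a hedged illustration, not the bundled `examples/client.py`. The function name `render_event` is hypothetical and the sample payloads only use field names from the event table.

```python
from typing import Any, Dict, List


def render_event(event: Dict[str, Any]) -> str:
    """Turn one custom-channel event into a one-line log entry (illustrative)."""
    kind = event.get("type")
    if kind == "clarification":
        return "ask: " + " | ".join(event.get("questions", []))
    if kind == "search_query":
        return f"search: {event.get('query', '')}"
    if kind == "search_results":
        return f"results: {event.get('count', 0)} for {event.get('query', '')}"
    if kind == "mcp_call":
        return f"mcp call: {event.get('tool', '')}"
    if kind == "mcp_result":
        if event.get("ok"):
            return f"mcp ok: {event.get('tool', '')}"
        # Failures carry error_class = permanent | transient | unknown.
        return f"mcp failed ({event.get('error_class', 'unknown')}): {event.get('tool', '')}"
    if kind == "skill":
        return f"skill applied: {event.get('name', '')}"
    if kind == "status":
        return f"status: {event.get('state', '')}"
    if kind == "report":
        return "report ready"
    return f"unhandled: {kind}"  # e.g. usage, source, subagent_findings


# Sample payloads for illustration only.
sample: List[Dict[str, Any]] = [
    {"type": "search_query", "id": "q1", "query": "example topic", "source": "web"},
    {"type": "mcp_result", "id": "m1", "tool": "query", "ok": False, "error_class": "permanent"},
    {"type": "status", "state": "done"},
]
for item in sample:
    print(render_event(item))
```

A real frontend would map each branch to a UI component instead of a string; unknown types are passed through so that new event kinds do not break older clients.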
Async / background runs (Gemini-style "leave this chat"). LangGraph persists the thread, so a run survives client disconnect. Reconnect by joining the run stream or polling GET /threads/{id}/state for final_report.
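Polling for the finished report could be sketched as follows. The endpoint path comes from the paragraph above; everything else is an assumption: the helper names `fetch_state` and `wait_for_report` are hypothetical, the payload is assumed to nest graph state under a `values` key, and no auth headers are sent.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional


def fetch_state(base_url: str, thread_id: str) -> Dict[str, Any]:
    """GET /threads/{id}/state and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/threads/{thread_id}/state") as resp:
        return json.load(resp)


def wait_for_report(
    thread_id: str,
    fetch: Callable[[str, str], Dict[str, Any]] = fetch_state,
    base_url: str = "http://127.0.0.1:2024",
    interval: float = 5.0,
    max_polls: int = 120,
) -> Optional[str]:
    """Poll thread state until final_report is non-empty, or give up."""
    for _ in range(max_polls):
        state = fetch(base_url, thread_id)
        # Assumption: the graph state sits under the "values" key.
        report = (state.get("values") or {}).get("final_report")
        if report:
            return report
        time.sleep(interval)
    return None
```

Calling `wait_for_report("<thread id>")` against a running dev server returns the report string once the run has finished, or `None` after `max_polls` attempts. The `fetch` parameter is injectable so the loop can be exercised without a server.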
When a request is ambiguous (unclear scope, timeframe, entity, or goal) the orchestrator calls request_clarification up front, emits a clarification event, and stops. The user's reply lands on the same thread as the next message, so the agent then has the Q&A in context and proceeds to research. A deterministic fallback (ClarificationFallbackMiddleware) emits the same event if a model narrates questions in prose without calling the tool, so the card always appears regardless of model.
Skills are folders under ./skills/, each with a SKILL.md (progressive-disclosure instructions the agent reads on demand). They're mounted read-only at the virtual path /skills/; the agent reads them via read_file("/skills/<name>/SKILL.md") while its own scratch files stay in an ephemeral state backend. The first time a skill is read in a turn, a skill event fires ("Skill applied: <name>"). Point elsewhere with DRA_SKILLS_DIR / configurable.skills_dir; if the directory is absent the agent runs normally with no skills.
Add a deployment-specific tool without touching the generic codebase: drop a *.py file in ./custom_tools/ and restart. Each file subclasses CustomTool — set name / description, implement run — and the loader auto-discovers it, infers the arg schema from run's typed params, and gives it to the orchestrator and every sub-agent.
```python
# custom_tools/weather.py
from deep_research_agent.tools.custom import CustomTool


class WeatherNow(CustomTool):
    name = "weather_now"
    description = "Current weather for a city. Cite as 'OpenWeather'."

    async def run(self, city: str) -> str:  # sync def works too
        # self.cfg is the run config; return a string (hardcoded here as an example)
        return f"{city}: 21°C, clear skies, humidity 48%. Source: OpenWeather."
```

`run` may be sync or async; `self.cfg` is the live `ResearchConfig`. Return value: the model always sees a string. Return a `str` (JSON-encode structured data yourself), or return a list/dict and the framework JSON-encodes it for you; a large list of rows is offloaded to a file the `execute` tool reads back. Override `enabled(cls, cfg) -> bool` to load conditionally (e.g. only when an env var is set). Copy `custom_tools/_template.py` to start; for dynamic cases a `build_tools(cfg)` / `build_tool(cfg)` factory returning LangChain tools is also accepted. Point elsewhere with `DRA_CUSTOM_TOOLS_DIR`. Full guide: `docs/CUSTOM_TOOLS.md`.
It speaks the LangGraph HTTP/SSE API, so any consumer (the included examples/client.py, the JS @langchain/langgraph-sdk, or raw SSE) works. To wire it into an existing deployment:
- Run this graph (point your dev script / `langgraph.json` at it).
- Set `assistant_id` to `deep_research_agent`.
- Pass per-run config via `configurable` (see the Configuration table above).
- To get the rich live UI, have the frontend additionally consume the `custom` event channel above.
Who connects, and where the config comes from. The agent is always the MCP client — it opens the connection itself (at graph build, agent.py → load_mcp_tools) and the model calls the resulting tools during research. There is no separate connector process. What varies is where the server list (url + auth) is resolved from. Precedence (first non-empty wins, config.py):
1. `configurable.mcp_servers`: per-run request (native).
2. `configurable.mcp_config`: per-run request (compat alias). The normal host-app path: the backend injects url + `headers` (incl. auth) into every run, so the env vars below are never consulted.
3. `DRA_MCP_SERVERS`: env (JSON list).
4. `DRA_MCP_URL` (+ `DRA_MCP_LABEL`): env (single server).
So when a request arrives with MCP config, the agent connects using that (and its auth). When a bare run arrives without it — e.g. a Studio / langgraph dev trigger, or any caller that omits configurable.mcp_config — it falls back to the DRA_MCP_* env entry. The env entry is a standalone-run fallback, not the primary path. If that fallback has no auth, you get the failure below.
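The "first non-empty wins" precedence can be sketched as a chain of checks. This is a minimal illustration, not the logic in `config.py`: the function name `resolve_mcp_servers` is hypothetical, and `mcp_config` is assumed to be a single mapping with `url` and `headers`.

```python
import json
from typing import Any, Dict, List, Mapping, Tuple


def resolve_mcp_servers(
    configurable: Mapping[str, Any], environ: Mapping[str, str]
) -> Tuple[str, List[Dict[str, Any]]]:
    """Return (source, servers) using the first non-empty source."""
    if configurable.get("mcp_servers"):      # 1. per-run request, native key
        return "mcp_servers", list(configurable["mcp_servers"])
    if configurable.get("mcp_config"):       # 2. per-run request, compat alias
        # Assumption: mcp_config is one mapping carrying url + headers.
        return "mcp_config", [dict(configurable["mcp_config"])]
    if environ.get("DRA_MCP_SERVERS"):       # 3. env, JSON list of {label, url}
        return "DRA_MCP_SERVERS", json.loads(environ["DRA_MCP_SERVERS"])
    if environ.get("DRA_MCP_URL"):           # 4. env, single server
        server: Dict[str, Any] = {"url": environ["DRA_MCP_URL"]}
        if environ.get("DRA_MCP_LABEL"):
            server["label"] = environ["DRA_MCP_LABEL"]
        return "DRA_MCP_URL", [server]
    return "none", []


# A request that carries its own config never consults the env fallback.
source, servers = resolve_mcp_servers(
    {"mcp_config": {"url": "http://127.0.0.1:4000", "headers": {"Authorization": "Bearer …"}}},
    {"DRA_MCP_URL": "http://127.0.0.1:9000/mcp"},
)
print(source, servers[0]["url"])
```

Returning the source alongside the servers is a convenience of this sketch: the `/mcp` path rule differs by source, so downstream normalization needs to know which one won.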
Auth / 401 Unauthorized. A 401 means the connection reached the server and was rejected for missing/wrong credentials — the path is correct, so do not strip /mcp (that would give 404, a different error). Attach credentials instead:
- request-supplied servers: put them in `headers` (e.g. `{"Authorization": "Bearer …"}` or a server-specific header like `x-litellm-api-key`).
- env-supplied servers: set `DRA_MCP_BEARER=<token>`; it's attached as `Authorization: Bearer <token>` to every server that doesn't already carry explicit auth.
To keep bare local runs from attempting an auth-less connect at all, leave DRA_MCP_URL unset and rely on the backend to inject mcp_config.
/mcp path rule differs by source. Under mcp_config, url is treated as a base and /mcp is appended for you — pass the url without /mcp. Under DRA_MCP_URL / mcp_servers, the url is used as given except that a bare host gets /mcp appended; a url that already has a path is left untouched — so pass the full url with /mcp.
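The two path rules can be written down as one small function. This is a sketch under assumptions, not the package's code: `normalize_mcp_url` is a hypothetical name, `gateway` is a placeholder host, and under `mcp_config` the suffix is appended unconditionally here, which is exactly why that source expects a url without `/mcp`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_mcp_url(url: str, source: str) -> str:
    """Apply the /mcp path rule for the given config source (illustrative)."""
    parts = urlsplit(url)
    path = parts.path
    if source == "mcp_config":
        # The url is a base: always append /mcp (assumed unconditional).
        path = path.rstrip("/") + "/mcp"
    elif path in ("", "/"):
        # DRA_MCP_URL / mcp_servers: only a bare host gets /mcp appended.
        path = "/mcp"
    # Otherwise a url that already has a path is left untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(normalize_mcp_url("http://gateway:4000", "mcp_config"))
print(normalize_mcp_url("http://gateway:4000", "DRA_MCP_URL"))
print(normalize_mcp_url("http://gateway:4000/custom/mcp", "mcp_servers"))
```

Passing a url that already ends in `/mcp` under `mcp_config` would double the suffix in this sketch, which mirrors the guidance above to pass the base url only.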
Other guards.
- Connect to `127.0.0.1`, never `0.0.0.0` (a bind address; dialing it fails). Config normalizes `0.0.0.0` → loopback defensively.
- Each call is bounded by a shared semaphore (`mcp_max_concurrency`) so the agent's fan-out can't exhaust the server's file descriptors; 429s back off and retry within `mcp_rate_limit_max_wait` rather than failing immediately.
- SSRF guard: only `http(s)` schemes are allowed and link-local / cloud-metadata targets are refused. Loopback / private hosts are allowed (the internal gateway uses them).
- Connection failures emit `status: mcp_error` (with detail) instead of failing silently; one unreachable server does not take down the others or the run.
- A failed tool call never kills the run: the error is returned to the model as the tool result with retry guidance, classified `permanent` (validation / unknown names: fix the arguments, never retry) vs `transient` (one retry ok). Servers can tag explicitly by prefixing the error message with `[permanent]` / `[transient]`; an identical retry of a permanently-failed call is answered locally without hitting the server.
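The failure handling in the last guard can be illustrated with explicit tags plus a memo of permanently failed calls. This is a hedged sketch: `classify_error` and `FailureMemo` are hypothetical names, and only the server-supplied `[permanent]` / `[transient]` prefixes are handled; any heuristic classification of untagged errors is left out.

```python
import json
from typing import Any, Dict, Set


def classify_error(message: str) -> str:
    """Map an error message to permanent | transient | unknown via its prefix tag."""
    text = message.lstrip()
    if text.startswith("[permanent]"):
        return "permanent"
    if text.startswith("[transient]"):
        return "transient"
    return "unknown"


class FailureMemo:
    """Remember permanently failed calls so identical retries are answered locally."""

    def __init__(self) -> None:
        self._permanent: Set[str] = set()

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # Sorted keys make the signature independent of argument order.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def record(self, tool: str, args: Dict[str, Any], message: str) -> str:
        error_class = classify_error(message)
        if error_class == "permanent":
            self._permanent.add(self._key(tool, args))
        return error_class

    def is_repeat(self, tool: str, args: Dict[str, Any]) -> bool:
        return self._key(tool, args) in self._permanent


memo = FailureMemo()
print(memo.record("query", {"table": "missing"}, "[permanent] unknown table"))
print(memo.is_repeat("query", {"table": "missing"}))
print(memo.is_repeat("query", {"table": "other"}))
```

Only identical calls (same tool and same arguments) are short-circuited, so the model can still correct its arguments and try again.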
- Code execution. Set `LLM_SANDBOX_URL` (+ `LLM_SANDBOX_TOKEN`) to attach an llm-sandbox sidecar; deepagents' `execute` tool then runs real shell / Python / JS in the container, so the model computes aggregates and joins instead of doing arithmetic in its head. With no sandbox configured the agent falls back to an in-memory backend and execution is disabled; it degrades gracefully and says so rather than faking output.
- Large-result offload. When a single MCP result exceeds `DRA_MAX_RESULT_CHARS` / `DRA_MAX_RESULT_ROWS`, the full payload is written to a file under `DRA_OFFLOAD_DIR` and only a compact stub (path, row count, columns, head) enters context; the model reads the file back with `execute`. Without a sandbox these bounds become hard truncation caps instead. This is how a large cross-entity scan stays within the context window.
- Budgets. `BudgetMiddleware` enforces cumulative per-run ceilings (`DRA_MAX_TOOL_CALLS` and `DRA_MAX_TOTAL_TOKENS`), emitting a `budget_soft` wrap-up nudge at 75% and a `budget_halt` hard stop at 100%. `DRA_RECURSION_LIMIT` separately caps orchestrator super-steps. The `usage` event reports the run's spend against these ceilings at the end.
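The budget thresholds amount to comparing cumulative usage with each ceiling. The sketch below is not `BudgetMiddleware` itself: the `Budget` class is hypothetical, the defaults are taken from the env table, and it assumes the 75% nudge applies to whichever ceiling is closest to exhaustion and that both thresholds are inclusive.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_tool_calls: int = 200           # DRA_MAX_TOOL_CALLS default
    max_total_tokens: int = 4_000_000   # DRA_MAX_TOTAL_TOKENS default
    soft_fraction: float = 0.75         # wrap-up nudge threshold

    def state(self, tool_calls: int, total_tokens: int) -> str:
        """Return ok | budget_soft | budget_halt for the cumulative usage so far."""
        used = max(
            tool_calls / self.max_tool_calls,
            total_tokens / self.max_total_tokens,
        )
        if used >= 1.0:
            return "budget_halt"   # hard stop at 100% of either ceiling
        if used >= self.soft_fraction:
            return "budget_soft"   # nudge the model to wrap up
        return "ok"


budget = Budget()
print(budget.state(tool_calls=40, total_tokens=500_000))
print(budget.state(tool_calls=160, total_tokens=500_000))
print(budget.state(tool_calls=160, total_tokens=4_000_000))
```

Taking the maximum of the two ratios means a run that is cheap in tokens but heavy in tool calls still gets the nudge and the stop.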
```text
src/deep_research_agent/
  agent.py              make_graph(config) factory ← langgraph.json entrypoint
  config.py             env + per-run config (the portability seam)
  models.py             OpenAI-compatible model builder
  events.py             event protocol + tool instrumentation (mcp_call/mcp_result)
  prompts.py            orchestrator + subagent prompts (citation + MCP-source rules)
  citations.py          output middleware → final_report + sources[]
  completion.py         force-completion middleware (no premature ReAct termination)
  findings_gate.py      sub-agent findings gate: JSON contract, validator, bounce (report_gate's twin)
  budget.py             BudgetMiddleware: hard tool-call + token ceilings (soft nudge → hard stop)
  clarify_fallback.py   emits clarification event when a model narrates questions in prose
  skill_usage.py        emits a skill event the first time each skill is read in a turn
  turn.py               scopes thread messages to the current turn (multi-turn safety)
  report_hygiene.py     deterministic scrub + citation lint applied to the final report
  report_gate.py        report quality gate: bounces a report back once for fixable defects
  metering.py           per-run usage ledger → usage event + RESEARCH USAGE log
  sandbox.py            wires the execute / filesystem tools to the llm-sandbox sidecar
  tools/search.py       Tavily web_search, emits search events
  tools/mcp.py          MCP loader + per-call instrumentation, concurrency + 429 backoff
  tools/clarify.py      request_clarification tool → clarification event
  tools/report.py       submit_report tool: the single explicit deliverable → report event
  tools/custom.py       CustomTool base class + drop-in loader for custom_tools/
custom_tools/           drop-in deployment-specific tools (CustomTool subclasses), auto-loaded
skills/                 agent skills (each a folder with SKILL.md), mounted read-only at /skills/
examples/client.py      reference SSE consumer
```