Autonomous experiment loop for Claude Code. Try an idea, measure it, keep what works, discard what doesn't, repeat forever. Built in Ruby, targeted at Rails build optimization but works for any workload.
Not affiliated with Anthropic. Claude Code is a product of Anthropic, PBC. This plugin is an independent extension.
Ruby port of davebcn87/pi-autoresearch — the original pi extension by @davebcn87, inspired by karpathy/autoresearch. The design (session files, tool surface, keep/revert loop, confidence scoring) is his; the Claude Code integration and Ruby implementation are this project's.
A handful of rules shape every iteration:
- One atomic change per iteration. The agent edits, commits, then calls `run_experiment`. Each row in `autoresearch.jsonl` maps to exactly one commit.
- Mechanical verification only. Benchmarks emit `METRIC name=value` lines on stdout. No subjective "did it get better?" — the number decides.
- Failed runs auto-revert. `discard`, `crash`, and `checks_failed` reset HEAD to the pre-run commit and clean the working tree. Session files (`autoresearch.*`) are preserved.
- Correctness can gate `keep`. If `autoresearch.checks.sh` exists, it runs after each passing benchmark. A failure blocks `keep` unless explicitly forced.
- JSONL is the source of truth. Append-only, segment-aware, rebuilt on every tool call — resume-after-crash is free.
- Confidence is advisory. MAD-based scoring flags noisy improvements; it never auto-discards.
- Bounded by `maxIterations`. Loops stop at the cap (default 50). Set it in `autoresearch.config.json`.
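
Benchmarks talk to the loop only through those `METRIC` lines on stdout. As a minimal Ruby sketch of how they can be parsed — the plugin's actual parser may differ, and `parse_metrics` is an illustrative name:

```ruby
# Parse "METRIC name=value" lines from benchmark stdout into a hash of
# metric name => Float. Anything that isn't a well-formed METRIC line is
# ignored. Illustrative sketch only; the MCP server's real parser may differ.
def parse_metrics(stdout)
  stdout.each_line.with_object({}) do |line, metrics|
    if (m = line.match(/\AMETRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*\z/))
      metrics[m[1]] = Float(m[2])
    end
  end
end

parse_metrics("METRIC total_seconds=41.7\nMETRIC failures=0\n")
# yields one entry per METRIC line: "total_seconds" and "failures"
```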
| Component | What you get |
|---|---|
| MCP server | Typed tools: `init_experiment`, `run_experiment`, `log_experiment` |
| Skills | `autoresearch-create` (set up a session and start the loop), `autoresearch-finalize` (split kept experiments into clean branches) |
| Slash commands | `/autoresearch <text>`, `/autoresearch off`, `/autoresearch clear`, `/autoresearch export`, `/autoresearch dashboard` |
| Hooks | Auto-inject `autoresearch.md` into prompts, auto-resume the loop, session-start warnings |
| Statusline | Live run count, best metric, confidence score |
| Dashboard | Browser dashboard at http://localhost:8765 (Rack + Puma); set `AUTORESEARCH_HOST=0.0.0.0` to view from a phone on the same LAN |
Requirements: Ruby ≥ 3.1 with Bundler on your PATH.
From a Claude Code session:
```
/plugin marketplace add https://github.com/marcelopazzo/claude-code-autoresearch
/plugin install claude-code-autoresearch@claude-code-autoresearch
```
The first command registers this repo as a single-plugin marketplace; the second installs the plugin from it. The MCP server self-installs its Ruby gems on first launch — no manual bundle install needed. If self-install fails (e.g., no network on first run), mcp/install.sh inside the installed plugin is the manual fallback.
Claude Code doesn't let plugins ship a `statusLine` directly — add it yourself to `~/.claude/settings.json`:

```json
{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/plugins/claude-code-autoresearch/statusline/autoresearch.rb"
  }
}
```

(Adjust the path to match what `/plugin` reports as the install location.)
Copy the snippet from `keybindings.example.json` into `~/.claude/keybindings.json` to match the upstream pi-autoresearch keybindings.
Reload the session so the MCP server, hooks, and (if you added it) statusline are picked up.
```
/autoresearch optimize bundle exec rake test runtime, monitor test pass rate
```
The `autoresearch-create` skill asks a few questions (or infers from context), writes `autoresearch.md` + `autoresearch.sh`, runs the baseline, and starts looping. Example `autoresearch.sh` for a Rails test suite:

```bash
#!/bin/bash
set -euo pipefail
START=$(date +%s.%N)
# "|| true" keeps the script alive under set -e when tests fail,
# so the failures metric below can still be reported
bundle exec rake test > /tmp/ar_test.log 2>&1 || true
END=$(date +%s.%N)
echo "METRIC total_seconds=$(echo "$END - $START" | bc)"
echo "METRIC failures=$(grep -c 'Failure:' /tmp/ar_test.log || true)"
```

The loop runs autonomously: edit → commit → run_experiment → log_experiment → keep or revert → repeat. Interrupt any time with Esc.
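
The revert side of that loop (reset to the pre-run commit, clean the tree, preserve `autoresearch.*` session files) can be sketched in Ruby. This is an illustrative approximation, not the plugin's actual code; `revert_failed_run` is a hypothetical name:

```ruby
# Sketch of the auto-revert step: hard-reset to the commit recorded before
# the run, then remove untracked files while excluding autoresearch.*
# session files. Illustrative only; the plugin's implementation may differ.
def revert_failed_run(pre_run_sha)
  system("git", "reset", "--hard", pre_run_sha) or raise "git reset failed"
  # git clean -e adds an exclude pattern, so session files survive
  system("git", "clean", "-fd", "-e", "autoresearch.*") or raise "git clean failed"
end
```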
```
/autoresearch-finalize
```
Groups kept commits into independent branches from the merge-base. Each branch touches a disjoint set of files and can be reviewed and merged on its own.
| File | Purpose |
|---|---|
| `autoresearch.md` | Session document — objective, metrics, files in scope, what's been tried. A fresh agent resumes from this alone. |
| `autoresearch.sh` | Benchmark script — pre-checks, runs the workload, prints `METRIC name=number` lines. |
| `autoresearch.jsonl` | Append-only log of every run. Human readable, branch-aware, survives restarts. |
| `autoresearch.checks.sh` | (optional) Correctness checks (tests, types, lint). Runs after each passing benchmark. Failures block `keep`. |
| `autoresearch.ideas.md` | (optional) Backlog of promising ideas not yet tried. |
| `autoresearch.config.json` | (optional) `{ "maxIterations": 50, "workingDir": "..." }` |
After 3+ experiments in a session, autoresearch computes a confidence score — how the best improvement compares to the session's noise floor (Median Absolute Deviation).
| Confidence | Meaning |
|---|---|
| ≥ 2.0× | Improvement is likely real |
| 1.0–2.0× | Above noise but marginal |
| < 1.0× | Within noise — consider re-running |
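
For intuition, here is a minimal Ruby sketch of MAD-based scoring, assuming confidence is the best improvement over baseline expressed as a multiple of the session's median absolute deviation — the plugin's exact formula may differ:

```ruby
# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical confidence score: best improvement over baseline divided by
# the MAD noise floor of all runs. Assumes a lower-is-better metric.
def confidence(baseline, runs)
  med = median(runs)
  mad = median(runs.map { |v| (v - med).abs })
  return Float::INFINITY if mad.zero?  # no measurable noise at all
  (baseline - runs.min) / mad
end

confidence(41.7, [40.1, 39.8, 35.2, 39.9, 40.3])
# one run is far below the noise floor, so the score comes out well above 2.0
```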
Shown in the statusline and on every log_experiment response. Advisory only — never auto-discards.
Autoresearch loops run autonomously and can burn tokens:
- API key limits — set budgets in your Anthropic console.
- `maxIterations` — cap experiments per session in `autoresearch.config.json`.
- @davebcn87 — original pi-autoresearch design and reference implementation.
- Andrej Karpathy — the autoresearch idea this builds on.
- Tobi Lütke — co-author of upstream pi-autoresearch.
MIT. See LICENSE.
