Autonomous experiment loop for Claude Code. Try an idea, measure it, keep what works, discard what doesn't, repeat forever. Built in Ruby, targeted at Rails build optimization but works for any workload.
Not affiliated with Anthropic. Claude Code is a product of Anthropic, PBC. This plugin is an independent extension.
Ruby port of davebcn87/pi-autoresearch — the original pi extension by @davebcn87, inspired by karpathy/autoresearch. The design (session files, tool surface, keep/revert loop, confidence scoring) is his; the Claude Code integration and Ruby implementation are this project's.
A handful of rules shape every iteration:
- One atomic change per iteration. The agent edits, commits, then calls `run_experiment`. Each row in `autoresearch.jsonl` maps to exactly one commit.
- Mechanical verification only. Benchmarks emit `METRIC name=value` lines on stdout. No subjective "did it get better?" — the number decides.
- Failed runs auto-revert. `discard`, `crash`, and `checks_failed` reset HEAD to the pre-run commit and clean the working tree. Session files (`autoresearch.*`) are preserved.
- Correctness can gate `keep`. If `autoresearch.checks.sh` exists, it runs after each passing benchmark. A failure blocks `keep` unless explicitly forced.
- JSONL is the source of truth. Append-only, segment-aware, rebuilt on every tool call — resume-after-crash is free.
- Confidence is advisory. MAD-based scoring flags noisy improvements; it never auto-discards.
- Bounded by `maxIterations`. Loops stop at the cap (default 50). Set it in `autoresearch.config.json`.
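
Benchmarks talk to the loop only through those `METRIC` lines on stdout. As a minimal Ruby sketch of how they can be parsed — the plugin's actual parser may differ, and `parse_metrics` is an illustrative name:

```ruby
# Parse "METRIC name=value" lines from benchmark stdout into a hash of
# metric name => Float. Anything that isn't a well-formed METRIC line is
# ignored. Illustrative sketch only; the MCP server's real parser may differ.
def parse_metrics(stdout)
  stdout.each_line.with_object({}) do |line, metrics|
    if (m = line.match(/\AMETRIC\s+(\w+)=(-?\d+(?:\.\d+)?)\s*\z/))
      metrics[m[1]] = Float(m[2])
    end
  end
end

parse_metrics("METRIC total_seconds=41.7\nMETRIC failures=0\n")
# yields one entry per METRIC line: "total_seconds" and "failures"
```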
| Component | What you get |
|---|---|
| MCP server | Typed tools: `init_experiment`, `run_experiment`, `log_experiment` |
| Skills | `autoresearch-create` (set up a session and start the loop), `autoresearch-finalize` (split kept experiments into clean branches) |
| Slash commands | `/autoresearch <text>`, `/autoresearch off`, `/autoresearch clear`, `/autoresearch export`, `/autoresearch dashboard` |
| Hooks | Auto-inject `autoresearch.md` into prompts, auto-resume the loop, session-start warnings |
| Statusline | Live run count, best metric, confidence score |
| Dashboard | Browser dashboard at http://localhost:8765 (Rack + Puma); set `AUTORESEARCH_HOST=0.0.0.0` to view from a phone on the same LAN |
Requirements: Ruby ≥ 3.1 with Bundler on your PATH.
From a Claude Code session:
```
/plugin marketplace add https://github.com/marcelopazzo/claude-code-autoresearch
/plugin install claude-code-autoresearch@claude-code-autoresearch
```
The first command registers this repo as a single-plugin marketplace; the second installs the plugin from it. The MCP server self-installs its Ruby gems on first launch — no manual bundle install needed. If self-install fails (e.g., no network on first run), mcp/install.sh inside the installed plugin is the manual fallback.
Claude Code doesn't let plugins ship a `statusLine` directly — add it yourself to `~/.claude/settings.json`:

```json
{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/plugins/claude-code-autoresearch/statusline/autoresearch.rb"
  }
}
```

(Adjust the path to match what `/plugin` reports as the install location.)
Copy the snippet from `keybindings.example.json` into `~/.claude/keybindings.json` to match the upstream pi-autoresearch keybindings.
Reload the session so the MCP server, hooks, and (if you added it) statusline are picked up.
```
/autoresearch optimize bundle exec rake test runtime, monitor test pass rate
```
The `autoresearch-create` skill asks a few questions (or infers from context), writes `autoresearch.md` + `autoresearch.sh`, runs the baseline, and starts looping. Example `autoresearch.sh` for a Rails test suite:

```bash
#!/bin/bash
set -euo pipefail
START=$(date +%s.%N)
# "|| true" keeps the script alive under set -e when tests fail,
# so the failures metric below can still be reported
bundle exec rake test > /tmp/ar_test.log 2>&1 || true
END=$(date +%s.%N)
echo "METRIC total_seconds=$(echo "$END - $START" | bc)"
echo "METRIC failures=$(grep -c 'Failure:' /tmp/ar_test.log || true)"
```

The loop runs autonomously: edit → commit → run_experiment → log_experiment → keep or revert → repeat. Interrupt any time with Esc.
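
The revert side of that loop (reset to the pre-run commit, clean the tree, preserve `autoresearch.*` session files) can be sketched in Ruby. This is an illustrative approximation, not the plugin's actual code; `revert_failed_run` is a hypothetical name:

```ruby
# Sketch of the auto-revert step: hard-reset to the commit recorded before
# the run, then remove untracked files while excluding autoresearch.*
# session files. Illustrative only; the plugin's implementation may differ.
def revert_failed_run(pre_run_sha)
  system("git", "reset", "--hard", pre_run_sha) or raise "git reset failed"
  # git clean -e adds an exclude pattern, so session files survive
  system("git", "clean", "-fd", "-e", "autoresearch.*") or raise "git clean failed"
end
```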
```
/autoresearch-finalize
```
Groups kept commits into independent branches from the merge-base. Each branch touches a disjoint set of files and can be reviewed and merged on its own.
| File | Purpose |
|---|---|
| `autoresearch.md` | Session document — objective, metrics, files in scope, what's been tried. A fresh agent resumes from this alone. |
| `autoresearch.sh` | Benchmark script — pre-checks, runs the workload, prints `METRIC name=number` lines. |
| `autoresearch.jsonl` | Append-only log of every run. Human readable, branch-aware, survives restarts. |
| `autoresearch.checks.sh` | (optional) Correctness checks (tests, types, lint). Runs after each passing benchmark. Failures block `keep`. |
| `autoresearch.ideas.md` | (optional) Backlog of promising ideas not yet tried. |
| `autoresearch.config.json` | (optional) `{ "maxIterations": 50, "workingDir": "..." }` |
After 3+ experiments in a session, autoresearch computes a confidence score — how the best improvement compares to the session's noise floor (Median Absolute Deviation).
| Confidence | Meaning |
|---|---|
| ≥ 2.0× | Improvement is likely real |
| 1.0–2.0× | Above noise but marginal |
| < 1.0× | Within noise — consider re-running |
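
For intuition, here is a minimal Ruby sketch of MAD-based scoring, assuming confidence is the best improvement over baseline expressed as a multiple of the session's median absolute deviation — the plugin's exact formula may differ:

```ruby
# Median of a list of numbers.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical confidence score: best improvement over baseline divided by
# the MAD noise floor of all runs. Assumes a lower-is-better metric.
def confidence(baseline, runs)
  med = median(runs)
  mad = median(runs.map { |v| (v - med).abs })
  return Float::INFINITY if mad.zero?  # no measurable noise at all
  (baseline - runs.min) / mad
end

confidence(41.7, [40.1, 39.8, 35.2, 39.9, 40.3])
# one run is far below the noise floor, so the score comes out well above 2.0
```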
Shown in the statusline and on every log_experiment response. Advisory only — never auto-discards.
Autoresearch loops run autonomously and can burn tokens:
- API key limits — set budgets in your Anthropic console.
- `maxIterations` — cap experiments per session in `autoresearch.config.json`.
- @davebcn87 — original pi-autoresearch design and reference implementation.
- Andrej Karpathy — the autoresearch idea this builds on.
- Tobi Lütke — co-author of upstream pi-autoresearch.
MIT. See LICENSE.
