senate

The Senate — multi-agent debate between coding CLIs

Multi-agent debate skills for coding CLIs. Orchestrates codex, gemini, cursor, kimi, and claude through structured debate formats — parliament, court, red-team, peer-review, committee, brainstorm — to reach more robust answers than any single model.

Background

Multi-agent debate is a well-studied technique for improving LLM reasoning: independent agents propose, critique, and revise answers under a structured protocol. Results are protocol-dependent — strong single-agent prompting can match it on some benchmarks — but a substantial body of work reports gains in factuality, divergent thinking, evaluation quality, and truthfulness.

senate ports the protocols humans already use to coordinate disagreement — parliaments, courts, peer review, RFCs — and packages them as agent skills you can run across heterogeneous CLIs. See dev/PRODUCT.md for the full thesis.

Install

Prerequisite: the CLIs you want to put in debates must be installed and authenticated locally — senate shells out to them. Each skills/invoke-agent/references/<cli>.md has a paste-ready install check for codex, gemini, cursor, kimi, and claude.
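As a quick sanity check before your first run, a loop like the following confirms each CLI is on your PATH. The binary names here are assumptions — each per-CLI reference file remains the authoritative install check:

```shell
# Hedged sketch: check that each CLI senate may shell out to is installed.
# Binary names are assumptions; see skills/invoke-agent/references/<cli>.md
# for each CLI's authoritative install check.
for cli in codex gemini cursor kimi claude; do
  if command -v "$cli" >/dev/null 2>&1; then
    echo "ok       $cli"
  else
    echo "missing  $cli"
  fi
done
```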

This repo ships as a Claude Code plugin and as a cross-agent bundle via the skills CLI (works with most coding agents that load skills — Claude Code, Codex, Cursor, OpenCode, Gemini CLI, …).

```
# Claude Code plugin
/plugin marketplace add SebastianElvis/senate
/plugin install senate@senate

# Any host agent
npx skills add SebastianElvis/senate
```

Useful flags for the skills CLI:

| Option | Description |
| --- | --- |
| `-g, --global` | Install globally (`~/<agent>/skills/`) instead of the current project |
| `-a, --agent <name...>` | Target specific agents (`claude-code`, `codex`, `cursor`, `opencode`, …) |
| `-s, --skill <name...>` | Install a subset (`--skill senate`) |
| `-l, --list` | List skills without installing |
| `-y, --yes` | Skip prompts |

Other commands: npx skills list, npx skills find <q>, npx skills update senate, npx skills remove senate. Source arguments accept GitHub shorthand, full URLs, git URLs, or local paths.

Usage

Ask your host agent for a debate in plain language:

  • "Run a parliament between codex, gemini, and kimi on whether to migrate this service to Rust."
  • "Hold a court debate — codex prosecutes my refactor, claude defends, gemini judges."
  • "Committee of three models drafts this API design."
  • "Red-team this deployment plan."
  • "Peer-review this design doc."
  • "Run a draft-review-finalize pipeline on this spec." (multi-stage)
  • "Which format should I use?" (the planner recommends one without running)

Run artifacts land in <cwd>/.senate/runs/<id>/ — never in this skill repo. End-to-end walk-throughs of the headline cases live in examples/.

Architecture

Five skills compose one debate lifecycle:

```
              user request
                   │
                   ▼
         ┌───────────────────┐
         │      senate       │   mints .senate/runs/<id>/
         │   (orchestrator)  │
         └─────────┬─────────┘
                   ▼
         ┌───────────────────┐
         │   debate-agenda   │ ──▶ agenda.md
         │     (planner)     │
         └─────────┬─────────┘
                   ▼
         ┌───────────────────┐  dispatches   ┌──────────────────┐
         │  moderate-debate  │ ─────────────▶│ per-turn subagent│
         │    (moderator)    │ ◀─────────────│ + invoke-agent   │
         └─────────┬─────────┘  result       └────────┬─────────┘
                   │ appends                          │ shells out
                   │   ▶ transcript.jsonl             ▼
                   │   ▶ context.md          codex · gemini · cursor
                   │   ▶ agents/<cli>.md     kimi  · claude
                   ▼
         ┌───────────────────┐ ──▶ notes.md
         │   meeting-note    │
         │     (scribe)      │
         └───────────────────┘
```

moderate-debate dispatches every turn into a fresh per-turn subagent that loads the relevant invoke-agent playbook, shells out to the CLI, validates the contract, and returns only a structured result. Multi-stage pipelines are expanded once by the planner into a single agenda.md; the moderator then runs each stage under stages/<N>-<name>/, calling back to the planner only for clarification or mid-run re-planning. meeting-note consolidates after the final stage.
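Concretely, the file contract a single turn leaves behind can be sketched as a runnable toy. Here `cat` stands in for a real coding CLI, and feeding the prompt on stdin is an assumption — the real invocation details live in the per-CLI invoke-agent playbooks:

```shell
# Toy sketch of one turn's file contract. `cat` stands in for a coding CLI;
# how each real CLI actually receives its prompt is defined by its playbook.
cli=cat
turn="$(mktemp -d)/001-codex-proposer"
mkdir -p "$turn"
printf 'Propose: should this service migrate to Rust?\n' >"$turn/prompt.derived.md"

# stdout is always captured; stderr.log is kept only when non-empty
"$cli" <"$turn/prompt.derived.md" >"$turn/stdout.log" 2>"$turn/stderr.log"
[ -s "$turn/stderr.log" ] || rm -f "$turn/stderr.log"

ls "$turn"
```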

| Skill | Purpose |
| --- | --- |
| `senate` | Top-level entry. Mints the run dir; routes through the lifecycle. |
| `debate-agenda` | Picks the format and roster, sequences pipeline stages, asks for clarification. Hosts formats at `formats/` and pipelines in `references/stages.md`. |
| `moderate-debate` | Drives turns by dispatching per-turn subagents; commits transcript/context; handles failures and checkpoints. |
| `meeting-note` | Reads agenda + transcript + context + verdicts; writes the user-facing `notes.md`. |
| `invoke-agent` | Per-CLI playbooks (codex, gemini, cursor, kimi, claude) loaded inside per-turn subagents. |

Every skill follows the Agent Skills spec: a SKILL.md plus on-demand references/. The evals/ directory is a sibling harness, not a shipped skill.
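For orientation, a skill's on-disk shape looks like this (illustrative sketch; only the SKILL.md entry point and the reference files already mentioned in this README are shown):

```
skills/invoke-agent/
  SKILL.md            # entry point the host agent loads
  references/
    codex.md          # per-CLI playbook, loaded on demand
    gemini.md
    ...
```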

Run-dir layout

```
.senate/runs/<id>/
  agenda.md            # the plan
  context.md           # shared scratchpad (delta-only)
  transcript.jsonl     # canonical per-turn record (errors live here as codes)
  state.json           # status, used for resume
  notes.md             # single user-facing summary
  bindings.json        # multi-stage only
  agents/
    moderator.md       # governance log
    <cli>.md           # per-CLI private memory
  stages/<n>-<name>/
    verdict.md
    turns/<NNN>-<cli>-<role>/
      prompt.derived.md
      stdout.log       # always present (may be empty on failure)
      stderr.log       # only if non-empty
      reply.md
```
Single-stage runs get exactly one stages/ entry. Full schema in skills/senate/references/workspace.md.
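Because transcript.jsonl is one JSON object per line, it is easy to inspect with standard tools. A self-contained sketch, tallying turns per CLI — the "agent" field name is an assumption here, and workspace.md has the real schema:

```shell
# Self-contained sketch: tally turns per CLI from a transcript.
# The "agent" field name is an assumption; see workspace.md for the schema.
transcript="$(mktemp)"
cat >"$transcript" <<'EOF'
{"turn": 1, "agent": "codex", "role": "proposer"}
{"turn": 2, "agent": "gemini", "role": "critic"}
{"turn": 3, "agent": "codex", "role": "reviser"}
EOF
sed -n 's/.*"agent": *"\([^"]*\)".*/\1/p' "$transcript" | sort | uniq -c
```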

Evaluating

evals/ runs fixture debates end-to-end and grades them on two tiers: deterministic schema/contract checks against the run-dir, plus LLM judges (notes, agenda, transcript-quality) invoked via claude -p. A separate pairwise judge does A/B comparisons between two completed runs of the same fixture (counterbalanced for position bias) — used explicitly when comparing skill edits, not as a default fixture rubric. No API key needed — judges use your Claude Code OAuth.

```
evals/run.sh                                          # all fixtures
evals/run.sh evals/fixtures/_smoke-parliament.md      # cheapest fixture
python3 evals/scripts/report.py                       # rollup
```

Scorecard rows record repo_commit, fixture_sha256, and claude_cli_version for reproducibility. Stub-CLI replay is available for fast CI; methodology follows Demystifying evals for AI agents.

Extending

  • New CLI — drop skills/invoke-agent/references/<name>.md modeled on an existing file.
  • New format — add skills/debate-agenda/formats/<name>.md only when it owns an interaction-contract axis no existing format owns; then add a row to formats/README.md.
  • New pipeline — add a recipe to skills/debate-agenda/references/stages.md referencing existing formats, and a row to formats/README.md.

No code to write. Markdown all the way down.

Roadmap

See dev/PRODUCT.md for the vision, design principles, and H0–H7 horizon plan.

References

Theoretical foundations this skill bundle builds on.

Multi-agent debate as an LLM technique:

  • Du et al. 2023, Improving Factuality and Reasoning in Language Models through Multiagent Debate (arXiv:2305.14325)
  • Liang et al. 2023, Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate (arXiv:2305.19118)
  • Chan et al. 2023, ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate (arXiv:2308.07201)
  • Khan et al. 2024, Debating with More Persuasive LLMs Leads to More Truthful Answers (arXiv:2402.06782)

Single-agent precursors and limits of debate:

  • Wang et al. 2022, Self-Consistency Improves Chain of Thought Reasoning in Language Models (arXiv:2203.11171). The single-agent precursor — sampling multiple reasoning paths from one model — that debate generalizes across models.
  • Wang et al. 2024, on the limits of multi-agent discussion vs. strong single-agent prompting (arXiv:2402.18272). Frames when debate is worth the cost.

Adjacent foundations for multi-agent LLM systems:

  • Park et al. 2023, Generative Agents: Interactive Simulacra of Human Behavior (arXiv:2304.03442). Role-playing agents under structured protocols.

License

MIT
