Confer with multiple LLMs from inside Claude Code. crosscheck-agent is a
compact MCP server that lets Claude ask peers from other model families
(GPT, Grok, Gemini, Mistral, Groq, DeepSeek) to reason, debate, plan, and
peer-review — then hands the synthesised answer back to Claude.
The server is Python, stdlib-only — no external dependencies, no build step. (Earlier versions shipped TypeScript/Rust/Perl mirrors; those have been dropped to remove the 4× maintenance tax. Python is canonical.)
```
              ┌──────────────┐
 Claude       │ Claude Code  │      MCP      ┌────────────────────────┐
 tool         │ (your IDE)   │ ─────────────▶│  crosscheck-agent MCP  │
 call         │              │     stdio     │       (Python)         │
              └──────────────┘               └───────────┬────────────┘
                                                         │ HTTPS
                                                         ▼
                             ┌────────────────────────────────────────┐
                             │ OpenAI · xAI · Gemini · Mistral · Groq │
                             │ DeepSeek · Anthropic (…)               │
                             └────────────────────────────────────────┘
```
| Tool | What it does |
|---|---|
| `list_providers` | Discover which LLMs are currently available and which are in the active set. |
| `confer` | Ask one or more providers the same question in parallel; return every answer. |
| `debate` | Bounded round-trip debate; the configured moderator synthesises a result. Optional `structured: true` makes the synthesis a JSON-schema-validated object (consensus, dissent, key claims, citations, open questions). |
| `plan` | Collaborative step-by-step planning with risks + alternatives. Honours `structured: true`. |
| `review` | Peer code / proposal review across one or more LLMs. Pass `untrusted_input: true` when the snippet may contain prompt injections. |
| `coordinate` | Structured Proposer → Critic(s) → Synthesizer flow. Each role emits a JSON envelope; the synthesizer emits a validated `StructuredSynthesis`. Persists key claims to the SQLite claim-list with supports/attacks edges when `session_id` is provided. |
| `triangulate` | Run a coordinate flow and return consensus + minority report plus per-provider weights drawn from accumulated bench / critic-ballot win-rate. The "give me one answer, but be honest about disagreement" tool. |
| `delegate` | Cross-model handshake: route a confer or review call through one named provider, with quota tracking by `session_id` and by requesting provider. Refused calls return `accepted: false` with a structured reason and the live quota envelope. |
| `bench` | Run repo-scoped goldens (`*.json` in `.crosscheck/goldens/`) against a panel; rule-based verifiers (`contains` / `regex_match` / `contains_any` / `contains_all` / `not_contains` / `min_length`) score each provider (see the sketch below the table), and the win-rate feeds `triangulate`'s weights. |
| `solve` | Iterative propose → verify → retry. A provider drafts a literal solution; a shell (sandboxed subprocess: timeout, `RLIMIT_AS`, `RLIMIT_CPU`, isolated tmpdir) or `regex_response` verifier accepts or rejects; failures feed back into the next attempt. Pass `target_path` to also get a unified-diff patch preview (the file is never modified). |
| `fetch` | Retrieve a URL with a deny-by-default allowlist (`fetch.url_allowlist` prefix list) and persist a sha256-keyed snapshot under `.crosscheck/evidence/`. Cached on repeat unless `force_refresh: true`. Use it to ground claims with reproducible evidence. |
| `pick` | Multi-criteria decision-making. Each provider scores every option on every criterion (0..1); the tool aggregates with criterion weights, returns a ranked list, and surfaces the top-K cross-provider disagreements as `dissent_deltas`. |
| `scoreboard` | Read-only snapshot: per-provider weight + wins/losses/abstains + delegations, plus totals for sessions/claims/links/delegations and (optionally) the last N redacted event lines. The data the UI panel reads. |
| `update_crosscheck` | Compares your local git HEAD against `main` at https://github.com/fxspeiser/crosscheck-agent. With `apply: true`, runs `git pull --ff-only` in the install directory; the server can't reload itself, so the response asks you to restart Claude Code. The first crosscheck call per server process runs the same check (cached 6h) and attaches an `update_notice` to the result so Claude can offer the upgrade proactively. |
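For a concrete feel of the bench verifier rules, here is a minimal sketch of how the six rule kinds could score an answer. The `passes` function and the fixture field names (`kind`, `value`) are illustrative assumptions, not the server's internals:

```python
import re

# Illustrative sketch, not the server's implementation. A golden fixture
# pairs a prompt with rule-based checks; the six rule kinds below mirror
# the ones the bench tool documents.
def passes(rule: dict, answer: str) -> bool:
    kind, arg = rule["kind"], rule.get("value")
    if kind == "contains":
        return arg in answer
    if kind == "regex_match":
        return re.search(arg, answer) is not None
    if kind == "contains_any":
        return any(s in answer for s in arg)
    if kind == "contains_all":
        return all(s in answer for s in arg)
    if kind == "not_contains":
        return arg not in answer
    if kind == "min_length":
        return len(answer) >= arg
    raise ValueError(f"unknown rule kind: {kind}")
```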
`confer`, `debate`, `plan`, and `review` all accept an optional `providers` array so you can assemble a panel on the fly instead of using the configured active set. Some useful patterns from inside Claude Code:
# "I want a fast second opinion from Grok only, skip everyone else."
confer(question="…", providers=["xai"])
# "Pit GPT against Gemini, let OpenAI moderate."
debate(topic="…", providers=["openai", "gemini"], moderator="openai")
# "Review this diff with just my local-fast models."
review(snippet="…", providers=["groq", "deepseek"])
# "Plan the migration; I want every model in the house."
plan(goal="…", providers=["anthropic","openai","xai","gemini","mistral","groq","deepseek"])
Not sure what's wired up? Call `list_providers` first — it returns every known provider, whether it has an API key in `.env`, and whether it's in the configured active set. If you ask for a provider that has no key, crosscheck returns a structured error telling you exactly what's missing so Claude can self-correct.
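The exact schema of that error isn't reproduced here; as a purely hypothetical illustration of the kind of payload to expect when a key is missing:

```python
# Hypothetical shape only; the real field names may differ.
error = {
    "error": "provider_unavailable",
    "provider": "mistral",
    "missing": "MISTRAL_API_KEY",
    "hint": "Add MISTRAL_API_KEY to .env, then retry.",
}
```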
Every run obeys the limits in `crosscheck.config.json`:

- `max_rounds` — hard cap on unsupervised round trips.
- `token_cap` — total token budget spread across providers × rounds.
- `max_time_seconds` — wall-clock deadline enforced per run.
- `cache.{enabled,ttl_seconds,max_entries,dir}` — SHA256 exact-match disk cache for provider responses. Cached calls return with `cache_hit: true` and `elapsed_ms: 0`. LRU eviction at write time; default TTL 7 days.
- `retries.{max_attempts,backoff_base_s}` — jittered exponential backoff on transient HTTP errors (429, 5xx, network, timeout). Honours upstream `Retry-After` (sketched after this list).
- `rate_limits.{<provider>|default}.{capacity,refill_per_sec}` — per-provider leaky-bucket rate limiter.
- `redaction.{enabled,patterns_extra}` — regex scrub for emails, IPv4, AWS keys, GitHub PATs, Slack tokens, OpenAI keys, bearer tokens, 16-digit cards. Applied recursively to ndjson event records and to JSON transcripts at write time.
- `provider_allowlist` — `null` = no restriction; otherwise the array is the only set of providers that may run, even when callers ask for others. Blocked providers surface in `blocked_by_allowlist`.
- `events_log` — path to the ndjson event trace (one structured event per `tool_start` / `provider_call` / `tool_end`). Use `scripts/replay` to inspect.
- `delegation.{max_per_session,max_per_requester}` — quota knobs for the `delegate` tool.
- `bench.goldens_dir` — directory holding bench fixtures.
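As a rough illustration of the retry policy those knobs control, a full-jitter exponential backoff that honours `Retry-After` can be written in a few lines. This is a sketch under assumed names, not the server's code:

```python
import random

# Sketch of the retry policy described above: full-jitter exponential
# backoff, deferring to an upstream Retry-After header when one was sent.
def backoff_delay(attempt: int, backoff_base_s: float,
                  retry_after: float | None = None) -> float:
    """Delay in seconds before retry `attempt` (0-based)."""
    if retry_after is not None:
        return retry_after  # upstream knows best
    # Uniform jitter in [0, base * 2^attempt]
    return random.uniform(0, backoff_base_s * (2 ** attempt))
```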
Every tool result includes a `budget` block (`wall_used_ms`, `wall_remaining_ms`, `cache_hits`, `provider_calls`). When `session_id` is supplied, a `session` block carries cumulative `{calls, wall_ms, cache_hits}` across calls — backed by SQLite at `.crosscheck/sessions.sqlite3` (override via `session_db`).
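Cache lookups are exact-match on the request. A minimal sketch of how a SHA256 cache key could be derived; the field set hashed here is an assumption, not the server's documented one:

```python
import hashlib
import json

# Sketch: an exact-match cache key over the request fields that affect the
# response. The exact field set is an assumption.
def cache_key(provider: str, model: str, prompt: str, temperature: float) -> str:
    blob = json.dumps(
        {"provider": provider, "model": model,
         "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()
```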
Once the MCP server is registered, you don't call the tools by hand — Claude does, based on what you say. A few prompts that work well inside Claude Code:
Quick sanity check across the panel
"Confer with the panel: is a
uuid.v7()primary key a bad idea for a high-write Postgres table?"
Pick a specific model on the fly
"Ask Grok only — what's the cheapest way to shard this Redis cluster?"
"Just confer with GPT and Gemini on whether this regex is ReDoS-safe."
Debate between specific peers
"Debate this with OpenAI and Gemini, let Claude moderate: should we use Server-Sent Events or WebSockets for the notification stream?"
Plan with a hand-picked team
"Plan the auth migration with Anthropic, OpenAI, and xAI. Constraint: zero downtime, Postgres-backed sessions, 3M active users."
Peer code review
"Have Groq and DeepSeek review this migration for race conditions:
<paste SQL>"
Discover what's wired up
"List the providers crosscheck has available and tell me which ones are missing an API key."
Structured coordination (Proposer → Critic → Synthesizer)
"Coordinate this with anthropic, openai, and gemini, session_id=auth-rewrite-1: 'should we move from JWT to opaque tokens for our internal-API auth?'"
Triangulate when you want consensus + dissent
"Triangulate across the panel: what's the right batch size for our embedding pipeline given a 16GB GPU and 4M docs?"
Cross-model delegation
"Delegate this code review to xai with requested_by=anthropic, session_id=migrations-2: paste-the-SQL-here."
Bench the panel
"Run bench against alpha, beta, gamma using the goldens in .crosscheck/goldens/ and rank them."
Self-update
"Check whether crosscheck-agent has an update." (if Claude already saw an
update_noticeon a previous tool call, it will surface it without prompting.)
"Yes, upgrade it." → Claude calls
update_crosscheckwithapply: true, then reminds you to restart Claude Code so the new server code loads.
Replay the event log
```bash
scripts/replay --tail 50                        # last 50 events
scripts/replay --tool coordinate                # only coordinate events
scripts/replay --provider gemini                # all Gemini calls
scripts/replay --kind provider_call --since 5m
```

Claude will call `list_providers`, `confer`, `debate`, `plan`, or `review`
under the hood, pass the subset you named, and stream the responses back.
If you name a provider that isn't configured, crosscheck returns a
structured error so Claude can ask you what to do instead of guessing.
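If you want to slice the ndjson event trace yourself rather than going through `scripts/replay`, the stdlib is enough. The file path and the `kind` field below are assumptions based on the event kinds named in the config section:

```python
import json
from pathlib import Path

# Assumed: events_log points at an ndjson file with one JSON object per
# line and a "kind" field matching tool_start / provider_call / tool_end.
def events(path: str, kind: str | None = None):
    for line in Path(path).read_text().splitlines():
        ev = json.loads(line)
        if kind is None or ev.get("kind") == kind:
            yield ev

for ev in events(".crosscheck/events.ndjson", kind="provider_call"):
    print(ev)
```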
```bash
git clone https://github.com/<you>/crosscheck-agent.git
cd crosscheck-agent

# 1. Interactive setup — writes .env + crosscheck.config.json
bash scripts/setup.sh

# 2. Sanity-check the server starts
python3 servers/python/crosscheck_server.py

# 3. Register with Claude Code
claude mcp add crosscheck -- python3 "$PWD/servers/python/crosscheck_server.py"
```

Then inside Claude Code:

```
/mcp
# call confer / debate / plan / review
```
Requires Python 3.10+. No pip install needed.
The `scripts/crosscheck` CLI edits `crosscheck.config.json` in place.
```bash
scripts/crosscheck config show
scripts/crosscheck config set max_rounds 5
scripts/crosscheck config set token_cap 16000
scripts/crosscheck config set max_time_seconds 300
scripts/crosscheck config set providers anthropic,openai,xai,gemini
scripts/crosscheck config set moderator openai

scripts/crosscheck providers list
scripts/crosscheck providers enable gemini
scripts/crosscheck providers disable xai

scripts/crosscheck doctor    # audit: keys present, config sane
```

Optionally, add this line to your shell rc file to make the CLI globally available:

```bash
export PATH="$PATH:/path/to/crosscheck-agent/scripts"
```

| Provider | Env var | Default model | Endpoint |
|---|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | `claude-opus-4-5` | native |
| OpenAI | `OPENAI_API_KEY` | `gpt-5` | Chat Completions |
| xAI (Grok) | `XAI_API_KEY` | `grok-4-latest` | OpenAI-compatible |
| Gemini | `GEMINI_API_KEY` | `gemini-2.5-pro` | Gemini API |
| Mistral | `MISTRAL_API_KEY` | `mistral-large-latest` | OpenAI-compatible |
| Groq | `GROQ_API_KEY` | `llama-3.3-70b-versatile` | OpenAI-compatible |
| DeepSeek | `DEEPSEEK_API_KEY` | `deepseek-chat` | OpenAI-compatible |
Any provider without a key in `.env` is silently skipped.
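That skip logic is easy to emulate: a provider is callable only when the env var from the table above is set. A sketch (the mapping comes from the table; the check itself is illustrative):

```python
import os

# Env var names from the provider table above.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "xai": "XAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}

available = [p for p, var in ENV_VARS.items() if os.environ.get(var)]
```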
`crosscheck.config.json`:

```json
{
  "max_rounds": 3,
  "token_cap": 8000,
  "max_time_seconds": 120,
  "providers": ["anthropic", "openai", "xai"],
  "moderator": "anthropic",
  "temperature": 0.4,
  "log_transcripts": true,
  "transcript_dir": ".crosscheck/transcripts"
}
```

When `log_transcripts` is on, every conferral / debate is persisted as JSON under `.crosscheck/transcripts/` (also gitignored).
`token_cap` is the total completion-token budget for a single tool call, split across providers × rounds. The default (60000, for coding work) gives each call ~6.6k tokens with the default 3 rounds × 3 providers.
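The arithmetic is simply the cap divided by providers × rounds:

```python
token_cap = 60_000
providers, rounds = 3, 3
per_call = token_cap // (providers * rounds)  # 6_666 tokens per provider call
```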
One gotcha worth knowing: OpenAI's GPT-5 and o-series are reasoning models. crosscheck-agent reserves `max_completion_tokens` per call, and OpenAI counts that reservation against your tier's per-request / per-minute limit before the call runs. On lower usage tiers, a 6.6k reservation can trip a 429. If you see `HTTP 429: exceeded your current quota` on OpenAI only (Anthropic + xAI still work), drop the cap:

```bash
scripts/crosscheck config set token_cap 20000    # ~2.2k per call
```

Or raise your OpenAI usage tier. crosscheck-agent automatically uses `max_completion_tokens` and omits `temperature` when the model name starts with `gpt-5`, `o1`, `o3`, or `o4`.
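A sketch of that parameter selection, mirroring the behaviour just described (the function and the non-reasoning branch are assumptions):

```python
def completion_params(model: str, cap: int, temperature: float) -> dict:
    # Reasoning models take max_completion_tokens and reject temperature.
    if model.startswith(("gpt-5", "o1", "o3", "o4")):
        return {"model": model, "max_completion_tokens": cap}
    # Assumed: other models get the classic knobs.
    return {"model": model, "max_tokens": cap, "temperature": temperature}
```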
```
crosscheck-agent/
├── .env.example
├── crosscheck.config.example.json
├── scripts/
│   ├── setup.sh       # interactive wizard
│   └── crosscheck     # config + providers CLI
└── servers/
    └── python/        # stdlib-only MCP server (canonical)
```

- `.env` is gitignored. The setup wizard chmods it to `600`.
- The `crosscheck` CLI never prints API keys, only whether they exist.
- Keys are read at startup and never written anywhere except stderr on an HTTP error (which may echo the remote error payload — be mindful if you pipe logs to third-party tools).
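Relatedly, the redaction scrub described in the configuration section is easy to emulate if you post-process logs yourself. These two patterns are illustrative examples, not the server's actual list (`patterns_extra` extends the real one):

```python
import re

# Illustrative patterns only; the server ships its own list.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # emails
    re.compile(r"sk-[A-Za-z0-9]{20,}"),      # OpenAI-style keys
]

def redact(text: str) -> str:
    for pat in PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```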
Issues and PRs welcome. Keep the tool surface (`confer`, `debate`, `plan`, `review`, `list_providers`) stable and dependency-light. Python stdlib only.
Built by Frank Speiser with pair-programming assistance from Claude (Anthropic). Mistakes are Frank's; good ideas are shared.
MIT. See LICENSE.