AI Experiment / Showcase — This project is built for educational and research purposes. It demonstrates how multiple AI models can be orchestrated into structured consensus processes. Not intended for production decision-making.
Put multiple AI models in a room. Give them personas. Watch them debate.
RoundTable runs the Consensus Validation Protocol (CVP) and a Blind Jury engine across any combination of AI providers — Grok, Claude, GPT, Gemini, Mistral, and more — with configurable personas, a non-voting Judge synthesizer, a live confidence trajectory chart, a disagreement ledger, a cost meter, shareable permalinks, and a premium dark interface designed for long sessions.
RoundTable is an open-source web application that orchestrates structured multi-round debates between AI models. Instead of asking one model and hoping for the best, RoundTable forces multiple models to:
- Analyze a topic independently
- Challenge each other's reasoning
- Assess the strength of evidence presented
- Synthesize a final consensus position
Each model is assigned a persona (Risk Analyst, First-Principles Engineer, Devil's Advocate, etc.) that shapes how it approaches the discussion. The result is a richer, more robust analysis than any single model can produce alone.
No database. No auth. No external services. Just add your API keys and go.
A single language model produces a single distribution over tokens. It has no mechanism to check its own reasoning against an independent perspective. The Consensus Validation Protocol (CVP) addresses this by running multiple models — each constrained to a distinct analytical persona — through a structured sequence of rounds where they must respond to each other's arguments. The goal is not to produce a "correct" answer by majority vote, but to surface disagreements, stress-test reasoning, and force each participant to update its position in light of criticism.
The result is a scored collection of final perspectives, not a merged conclusion. The human reader is the ultimate synthesizer.
CVP runs up to a configured number of rounds (1–10, default 5). Each round has a designated type that constrains what participants are asked to do. From Round 2 onward, participants are processed sequentially within each round — later participants in a round see earlier participants' responses from that same round, in addition to all responses from prior rounds. Round 1 runs in parallel with no cross-visibility by default (toggleable via the "Blind Round 1" option) so the first wave of analysis is not contaminated by whoever happened to answer first.
Round phases:
- Initial Analysis (Round 1) — Each participant provides an independent analysis of the prompt, shaped by its assigned persona. With "Blind Round 1" enabled (the default), every participant answers in parallel with no visibility into any other participant. Each response must end with a self-assessed confidence score (0–100).
- Counterarguments (Round 2) — Each participant reviews all Round 1 responses and identifies weaknesses, challenges assumptions, and highlights logical gaps. Confidence scores are updated.
- Evidence Assessment (Round 3) — Participants evaluate the strength of evidence presented so far, distinguish well-supported claims from speculation, and identify areas of emerging agreement.
- Synthesis (Rounds 4 through N) — Participants synthesize the discussion, acknowledge remaining uncertainties, and refine their positions. The final round is labeled "Final Synthesis" in the prompt, signaling participants to commit to a concluding position.
Randomized order. From Round 2 onward, participant order is shuffled per round by default to prevent the first-mover from disproportionately framing each round. Toggleable via the "Randomize order" option.
Early stopping. When the consensus score delta between two consecutive rounds drops to ≤ 3 points, the engine emits an early-stop event and terminates the run before exhausting all configured rounds. This is on by default and saves cost on runs that converge quickly. Toggleable via the "Early stop" option.
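The convergence check is simple enough to sketch. Here is a minimal TypeScript version of the delta test described above (the function and parameter names are illustrative, not the actual lib/consensus-engine.ts internals):

```typescript
// Early-stop check: terminate when the consensus score moved by <= 3 points
// between the last two completed rounds. `scores` holds one entry per round.
function shouldStopEarly(scores: number[], threshold = 3): boolean {
  if (scores.length < 2) return false; // need two rounds to measure a delta
  const last = scores[scores.length - 1];
  const prev = scores[scores.length - 2];
  return Math.abs(last - prev) <= threshold;
}
```

With the default threshold of 3, a run whose score goes 80 → 77 stops after the second round, while 80 → 70 keeps going.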
Persona injection: Each participant's system prompt is prepended with a persona definition (e.g., "You are a Risk Analyst. Your role is to surface hidden dangers, tail risks, and second-order effects."). Personas are defined server-side in lib/personas.ts and cannot be modified by the client.
Confidence extraction: Every response is expected to end with CONFIDENCE: [0-100]. A regex extracts this value. If absent, confidence defaults to 50.
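A minimal sketch of that extraction (the exact regex in the codebase may differ):

```typescript
// Pull the trailing "CONFIDENCE: NN" (optionally bracketed) out of a response.
// Falls back to 50 when the model omitted the line, as described above.
function extractConfidence(response: string): number {
  const match = response.match(/CONFIDENCE:\s*\[?(\d{1,3})\]?/i);
  if (!match) return 50; // default when the model ignored the instruction
  return Math.min(100, parseInt(match[1], 10)); // clamp malformed values
}
```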
Consensus scoring: After each round, a consensus score is computed:
consensus_score = avg(confidence) - 0.5 * stddev(confidence)
High average confidence with low variance yields a high score. Disagreement (high variance) penalizes the score even if individual confidences are high.
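In TypeScript, assuming population standard deviation (the convention that reproduces the worked example later in this README):

```typescript
// consensus_score = avg(confidence) - 0.5 * stddev(confidence)
// Population stddev: divide squared deviations by n, not n - 1.
function consensusScore(confidences: number[]): number {
  const n = confidences.length;
  const avg = confidences.reduce((a, b) => a + b, 0) / n;
  const variance = confidences.reduce((sum, c) => sum + (c - avg) ** 2, 0) / n;
  return avg - 0.5 * Math.sqrt(variance);
}
```

Three participants at 85, 75, and 90 score roughly 80 (avg 83.3, stddev 6.2), while unanimous 90s score a full 90: variance is what separates the two.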
Disagreement detection: After each round the engine scans every pair of participants. Any pair whose confidence diverges by ≥ 20 points is recorded in the disagreement ledger and surfaced live in the UI. The detection is intentionally deterministic and cheap — no extra LLM calls — which makes it robust to rate limits and reproducible across runs.
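A sketch of that pairwise scan, with illustrative types (the real ledger entries carry more fields):

```typescript
interface RoundResponse {
  participantId: string;
  confidence: number;
}

// Flag every pair whose self-reported confidence diverges by >= 20 points.
// Deterministic and O(n^2) over participants: no extra LLM calls.
function detectDisagreements(
  responses: RoundResponse[],
  threshold = 20,
): Array<[string, string, number]> {
  const flagged: Array<[string, string, number]> = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      const gap = Math.abs(responses[i].confidence - responses[j].confidence);
      if (gap >= threshold) {
        flagged.push([responses[i].participantId, responses[j].participantId, gap]);
      }
    }
  }
  return flagged;
}
```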
Judge synthesis (optional): When "Judge synthesis" is enabled, a dedicated non-voting model reads every participant's final-round response and produces a structured synthesis with four sections: Majority Position, Minority Positions, Unresolved Disputes, and Synthesis Confidence. The judge is forbidden from picking a winner or collapsing conditional minority views into the majority. Its output streams live to the UI and is included in all exports.
Cost meter. Every call is attributed to a participant and priced against the client-side table in lib/pricing.ts. The live meter shows total tokens (in/out) and estimated USD; totals include the judge. When the Vercel AI SDK reports token usage, the meter uses it directly; otherwise it falls back to a 4-chars-per-token heuristic.
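The fallback heuristic is trivial to sketch. The pricing figures below are placeholders for illustration, not the bundled lib/pricing.ts table:

```typescript
// Fallback when the SDK reports no usage: assume ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Prices are expressed per million tokens, as most provider pricing pages do.
function estimateCostUSD(
  inTokens: number,
  outTokens: number,
  pricePerMTokIn: number,
  pricePerMTokOut: number,
): number {
  return (inTokens * pricePerMTokIn + outTokens * pricePerMTokOut) / 1_000_000;
}
```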
Provider error resilience. When a participant's underlying provider call fails — wrong base URL, invalid API key, unknown model, upstream outage, 404 from a mismatched endpoint, you name it — the engine catches the error via the Vercel AI SDK's onError callback, formats it with the HTTP status code when available, logs the full error object server-side, and emits a participant-end event with an error field. The client renders that response as a red error card with the upstream message (not the usual content card), fires a toast identifying which provider/model broke, and excludes the errored response from both the consensus score and the disagreement ledger, so one broken provider can no longer tank the run. The remaining participants continue normally.
User Prompt + Round Count + Participant Config
|
v
┌─────────────────────────────────────────────┐
│ Round 1: Initial Analysis │
│ │
│ [Persona A / Model X] ──→ Response + Conf │
│ [Persona B / Model Y] ──→ Response + Conf │ (parallel; no cross-visibility
│ [Persona C / Model Z] ──→ Response + Conf │  with "Blind Round 1" on, the default)
│ │
│ consensus_score = avg(conf) - 0.5*std(conf)│
└─────────────────────┬───────────────────────┘
│ all responses passed forward
v
┌─────────────────────────────────────────────┐
│ Round 2: Counterarguments │
│ │
│ Each participant receives ALL prior round │
│ responses and must challenge assumptions. │
│ Updated confidence scores. │
└─────────────────────┬───────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Round 3: Evidence Assessment │
│ │
│ Evaluate evidence quality. │
│ Distinguish supported claims from │
│ speculation. Updated confidence scores. │
└─────────────────────┬───────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Rounds 4–N: Synthesis │
│ │
│ Refine positions. Final round prompts for │
│ a concluding stance. Final confidence. │
└─────────────────────┬───────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Output │
│ │
│ Final consensus score (last round only) │
│ All individual final-round responses │
│ Per-participant confidence trajectories │
│ No auto-merged conclusion — human reviews │
└─────────────────────────────────────────────┘
RoundTable also ships a Blind Jury engine alongside CVP. Where CVP is a multi-round debate, Blind Jury is a single-pass evaluation:
- Every participant answers the same prompt in parallel, with no cross-visibility into any other answer.
- A judge model synthesizes majority, minority, and unresolved positions from the independent responses.
- A disagreement ledger is computed from the pairwise confidence spread, exactly as in CVP.
Blind Jury is the right engine when you want independent signals rather than a negotiated consensus. Because there is no sequential visibility, it is immune to the anchoring bias that CVP needs randomized order and blind Round 1 to mitigate. It is also cheap: one API call per participant, plus one for the judge.
Switch engines from the sidebar ("Protocol" section). The Blind Jury engine ignores the round count and the CVP-specific toggles.
Majority vote asks N models the same question and picks the most common answer. CVP does something structurally different:
- Persona diversity forces coverage. A Risk Analyst and an Optimistic Futurist will examine different failure modes and opportunities from the same prompt. This isn't random variation — it's directed exploration of the problem space.
- Sequential visibility creates dialogue. Because participants within a round see earlier responses, later participants can directly respond to specific claims. This is closer to a structured debate than independent polling.
- Multi-round iteration forces updating. A model that states high confidence in Round 1 must confront counterarguments in Round 2 and defend or revise in subsequent rounds. The protocol mechanically prevents "fire and forget" responses.
- Confidence variance detects real disagreement. The consensus score penalizes high-confidence disagreement. If three models each claim 95% confidence but on different conclusions, the score drops. This surfaces cases where naive voting would mask genuine uncertainty.
- The human sees everything. CVP does not collapse the debate into a single answer. All intermediate reasoning is visible, streamed in real-time. The reader can trace exactly where participants agreed, where they diverged, and why.
Shared hallucinations. If all underlying models share the same training-data blind spot, personas will not fix it. A Risk Analyst running on GPT-4o and a Scientific Skeptic running on GPT-4o share the same parametric knowledge. Cross-provider diversity (e.g., mixing Grok, Claude, and Gemini) partially mitigates this, but cannot eliminate it.
Prompt bias propagation. The user's prompt frames the debate. If the prompt contains a false premise, all participants may accept it. Personas like First-Principles Engineer and Scientific Skeptic are designed to push back, but their effectiveness depends on the model's ability to detect the bias.
Sycophantic convergence. Models still tend to agree with prior responses, especially in later rounds. "Blind Round 1" and "Randomize order" reduce this bias but do not eliminate it — the last participant of any sequential round still sees the most prior context and may anchor to the emerging consensus rather than independently evaluating. Blind Jury avoids this failure mode entirely at the cost of giving up multi-round refinement.
Cost scales linearly. Each participant makes one API call per round. With 4 participants and 5 rounds, that is 20 API calls per consensus run, plus one for the judge if enabled. At 1,500 tokens per response, a single run can consume 30,000+ output tokens across providers. Early stopping and Blind Jury are the easiest levers to lower cost; the live cost meter in the floating run panel makes this concrete during a run.
Confidence scores are self-reported. Models assign their own confidence. There is no calibration, no ground truth, and no penalty for overconfidence. The consensus score is only as meaningful as the models' ability to self-assess — which is known to be unreliable. The judge synthesizer is deliberately not a calibrator: it summarizes what was said, it does not grade it.
Disagreement heuristic is confidence-based. The disagreement ledger flags pairs whose confidence diverges by ≥ 20 points. This catches substantive splits reliably but misses cases where two participants hold opposite positions with identical confidence. Treat the ledger as a lower bound on actual disagreement.
This is a simplified reconstruction, based on the actual code path, of how a 3-participant, 3-round run behaves. The prompt is: "Should a startup use microservices from day one?"
Round 1 — Initial Analysis:
Risk Analyst (Grok): Microservices introduce distributed systems complexity — network partitions, eventual consistency, deployment orchestration — before the team has validated product-market fit. Premature optimization of architecture is a top-5 startup killer. CONFIDENCE: 85
Optimistic Futurist (Claude): Starting with microservices positions the team for scale from day one. The tooling has matured — Kubernetes, service meshes, and managed cloud services reduce the operational burden significantly. CONFIDENCE: 75
First-Principles Engineer (GPT-4o): The question conflates architecture with deployment strategy. A modular monolith can be decomposed later. The real constraint is team size — a 3-person team cannot operate 8 services. CONFIDENCE: 90
Consensus score: 80 (avg=83.3, stddev=6.2)
Round 2 — Counterarguments:
Risk Analyst: The Optimistic Futurist overstates tooling maturity. Kubernetes is not "reduced complexity" — it's traded complexity. Agree with First-Principles on the team-size constraint. Revising slightly upward. CONFIDENCE: 88
Optimistic Futurist: The First-Principles argument about modular monoliths is valid. However, decomposing a monolith later is not free — the argument assumes clean module boundaries that rarely exist in practice. Lowering confidence. CONFIDENCE: 65
First-Principles Engineer: Risk Analyst's point about premature optimization aligns with my position. The Futurist raises a fair concern about decomposition cost, but this is solvable with discipline. Maintaining position. CONFIDENCE: 88
Consensus score: 75 (avg=80.3, stddev=10.8 — variance increased due to the Futurist's drop)
Round 3 — Evidence Assessment / Final Synthesis:
Risk Analyst: The evidence pattern across rounds: 2/3 participants favor monolith-first. The Futurist's tooling argument was partially conceded. Remaining disagreement is about decomposition cost — a real but secondary concern. CONFIDENCE: 90
Optimistic Futurist: I concede the monolith-first approach is lower-risk for most startups. My position narrows to: startups with known scale requirements (e.g., real-time data pipelines) may justify early microservices. CONFIDENCE: 70
First-Principles Engineer: Consensus is forming around monolith-first with clean boundaries. The Futurist's exception for known-scale cases is reasonable and worth noting. CONFIDENCE: 92
Final consensus score: 79 (avg=84, stddev=9.9)
The human reader sees three final positions that largely converge but preserve the Futurist's conditional exception — something a majority vote would have discarded.
The following are deliberate non-goals for v1 but would further tighten the protocol:
- Confidence calibration or external validation. Self-reported confidence is unreliable. A calibration step — comparing stated confidence to accuracy on known-answer questions — or a separate judge model that grades argument quality (as opposed to the current faithfulness-only synthesizer) would add grounding.
- Claim-level disagreement extraction. The current disagreement ledger detects confidence splits, not semantic contradictions. A follow-up pass that extracts the actual claims participants make and flags direct contradictions would be more precise, at the cost of extra LLM calls.
- Pluggable engines beyond CVP and Blind Jury. The engine interface is clean enough to support Delphi, Adversarial Red Team, Dialectical, and Ranked Choice variants. See the Roadmap table below.
This is experimental software with no authentication. If you publish a deployment with your keys, anyone who finds it can burn your tokens or run their own prompts through it, out of curiosity or malice.
| Feature | Description |
|---|---|
| Multi-Provider | Connect any OpenAI-compatible API — Grok, Claude, OpenAI, Mistral, Groq, Together, and more |
| 7 Built-in Personas | Risk Analyst, First-Principles Engineer, VC Specialist, Scientific Skeptic, Optimistic Futurist, Devil's Advocate, Domain Expert |
| Two Engines | CVP (multi-round debate) and Blind Jury (parallel independent responses + judge synthesis) — switch from the sidebar |
| Blind Round 1 | CVP's first round runs in parallel with zero cross-visibility so the first wave of analysis is not contaminated by speaking order |
| Randomized Order | CVP shuffles participant order in rounds 2+ to kill first-mover anchoring bias |
| Early Stopping | CVP detects convergence between rounds and terminates early, saving latency and tokens |
| Judge Synthesizer | Optional non-voting model that produces a structured Majority / Minority / Unresolved / Confidence summary over the final-round answers |
| Confidence Trajectory Chart | Live sparkline with one line per participant, so you can see drift, convergence, and sycophancy as the run unfolds |
| Disagreement Ledger | Deterministic confidence-spread detector grouping flagged pairs by round — click a row to jump to that round in the transcript |
| Cost Meter | Live total tokens and estimated USD per run, with a bundled pricing table for major frontier models |
| Floating Run Panel | On xl+ screens a pinned right-side container stacks the cost meter, confidence trajectory, disagreement ledger, and a collapsible UML-style message flow diagram, scrolling as a unit so all four stay in view throughout a long transcript. Below xl the same panels fall back into the left sidebar |
| Provider Error Handling | Errored participant calls render as red error cards with the upstream message + HTTP status, fire a per-participant toast, and are excluded from the consensus score and disagreement ledger so one broken provider can't tank a run |
| Prompt Library | 8 curated preset prompts surfaced under the textarea for first-time visitors to hit Run immediately |
| Session Export & Share | One-click download as Markdown or JSON, plus a permalink that encodes the full run into the URL hash (compressed when available) |
| Shared View Mode | Loading a #rt=… permalink rehydrates the run into a read-only viewer for review, embedding, or screenshots |
| Real-time SSE Streaming | Watch responses arrive token-by-token with live progress tracking |
| Cascaded Model Selector | Provider-first dropdown with persona assignment per participant |
| Copy to Clipboard | One-click raw markdown export per response |
| Cancel Anytime | Stop button + Escape key — abort signal propagates to the server and stops provider calls |
| Premium Dark UI | High-contrast, readable interface designed for extended analysis sessions |
| Rate-Limited API | In-memory per-IP rate limiting, server-side input validation, persona/model re-verification |
| No External Services | No database, no auth service, no persistence — Vercel-deployable in one click |
```bash
git clone https://github.com/entropyvortex/roundtable.git
cd roundtable
pnpm install
```

Copy the example environment file and add your API keys:

```bash
cp .env.example .env.local
```

Edit `.env.local` with your keys, then:

```bash
pnpm dev
```

Open http://localhost:3000. Add participants from the left sidebar, pick an engine in the Protocol panel (CVP or Blind Jury), optionally enable judge synthesis, type a prompt (or click a preset), and hit Run Consensus. On xl+ screens the cost meter, confidence trajectory, disagreement ledger, and message-flow diagram live in a floating panel pinned to the right of the viewport — watch them populate in real time as the debate streams. Below xl those same panels fall back into the left sidebar. When the run finishes, click Export in the results panel to download the transcript as Markdown/JSON or copy a permalink that rehydrates the run on any browser.
RoundTable uses a single AI_PROVIDERS environment variable containing a JSON array. Each provider specifies a base URL, API key reference, and available models.
```json
[
  {
    "id": "grok",
    "name": "Grok",
    "baseUrl": "https://api.x.ai/v1",
    "apiKey": "env:GROK_API_KEY",
    "models": ["grok-3", "grok-4-0709"]
  },
  {
    "id": "claude",
    "name": "Claude",
    "baseUrl": "https://api.anthropic.com/v1",
    "apiKey": "env:ANTHROPIC_API_KEY",
    "models": ["claude-sonnet-4-20250514"]
  },
  {
    "id": "openai",
    "name": "OpenAI",
    "baseUrl": "https://api.openai.com/v1",
    "apiKey": "env:OPENAI_API_KEY",
    "models": ["gpt-4o"]
  }
]
```

The `apiKey` field supports two formats:
| Format | Example | Behavior |
|---|---|---|
| `"env:VAR_NAME"` | `"env:GROK_API_KEY"` | Reads the value from the named environment variable at runtime |
| Literal string | `"xai-abc123..."` | Uses the value directly (not recommended for production) |
API keys are resolved server-side only and never exposed to the browser. All AI calls go through Next.js API routes.
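A sketch of that resolution logic. The real version lives server-side in lib/providers.ts and reads process.env; here the environment is passed explicitly so the example is self-contained:

```typescript
// Resolve an apiKey field from AI_PROVIDERS: "env:VAR_NAME" is looked up in
// the environment at runtime; anything else is treated as a literal key.
function resolveApiKey(
  apiKey: string,
  env: Record<string, string | undefined>,
): string | undefined {
  if (apiKey.startsWith("env:")) {
    return env[apiKey.slice(4)]; // strip the "env:" prefix and look up the name
  }
  return apiKey; // literal key — not recommended for production
}
```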
All consensus calls go through the OpenAI chat completions endpoint (POST /chat/completions), not the newer OpenAI Responses API. This is deliberate: /chat/completions is the one endpoint every provider's OpenAI-compat shim actually implements. In code we pin this by using provider.chat(modelId) instead of the default provider(modelId) — the latter targets /responses, which is OpenAI-only.
That means your baseUrl should be the provider's base that serves /chat/completions:
| Provider | Base URL | Notes |
|---|---|---|
| OpenAI | `https://api.openai.com/v1` | Native endpoint. |
| Anthropic | `https://api.anthropic.com/v1` | Requires Anthropic's OpenAI-SDK compatibility layer. Models include `claude-sonnet-4-20250514`, `claude-opus-4-1`, etc. |
| xAI (Grok) | `https://api.x.ai/v1` | Native OpenAI-compatible. |
| Groq | `https://api.groq.com/openai/v1` | Native OpenAI-compatible. |
| Together | `https://api.together.xyz/v1` | Native OpenAI-compatible. |
| Mistral | `https://api.mistral.ai/v1` | Native OpenAI-compatible. |
If a provider only ships a dedicated SDK with no /chat/completions shim, it is not currently supported.
Any OpenAI-compatible API works. Add an entry to the AI_PROVIDERS array with the correct baseUrl and you're done. Examples:
```json
{
  "id": "groq",
  "name": "Groq",
  "baseUrl": "https://api.groq.com/openai/v1",
  "apiKey": "env:GROQ_API_KEY",
  "models": ["llama-3.3-70b-versatile"]
}
```

```json
{
  "id": "together",
  "name": "Together",
  "baseUrl": "https://api.together.xyz/v1",
  "apiKey": "env:TOGETHER_API_KEY",
  "models": ["meta-llama/Llama-3-70b-chat-hf"]
}
```

```
app/
  api/
    consensus/route.ts       SSE streaming endpoint — validates options & dispatches to the engine
    providers/route.ts       Returns client-safe model list (no secrets)
  page.tsx                   Main dashboard — sidebar, prompt, results, SSE processor
  layout.tsx                 Root layout with Sonner toasts
components/
  AISelector.tsx             Cascaded provider/model picker + persona selector
  ConfigPanel.tsx            Engine selector, CVP toggles, judge model picker
  ResultPanel.tsx            Live streaming results, error cards, markdown rendering
  MessageFlowDiagram.tsx     Floating right-side panel: cost + trajectory + ledger + UML flow
  ConfidenceTrajectory.tsx   SVG sparkline of per-participant confidence across rounds
  DisagreementPanel.tsx      Grouped disagreement ledger with click-to-scroll
  CostMeter.tsx              Live token/USD totals
  JudgeCard.tsx              Non-voting judge synthesis output
  PromptLibrary.tsx          Preset prompt chips under the textarea
  SessionMenu.tsx            Export (Markdown/JSON) + copy permalink dropdown
  BackToTop.tsx              Scroll navigation
lib/
  consensus-engine.ts        CVP + Blind Jury orchestration, judge synthesizer, disagreement detection
  providers.ts               Server-side provider resolution (parses AI_PROVIDERS)
  personas.ts                7 participant personas + JUDGE_PERSONA
  pricing.ts                 Model pricing table + cost estimator
  prompt-library.ts          Preset prompts for the library UI
  session.ts                 Snapshot ↔ Markdown / JSON / URL-hash serializer
  store.ts                   Zustand global state, options bundle, snapshot load/save
  types.ts                   All TypeScript types
```
The consensus engine runs entirely server-side. Each round streams responses via Server-Sent Events. The client processes events through a single processEvent function that calls Zustand actions directly via getState() — no subscriptions, no re-renders from token events. The same event pipeline drives the confidence trajectory, the disagreement ledger, the cost meter, and the judge card — every panel reads from one coherent store.
| Layer | Technology |
|---|---|
| Framework | Next.js 15 (App Router, React 19) |
| Language | TypeScript (strict mode) |
| Styling | Tailwind CSS |
| State | Zustand (granular selectors for performance) |
| AI Integration | Vercel AI SDK (@ai-sdk/openai compatible adapters) |
| Markdown | react-markdown + remark-gfm |
| Icons | lucide-react |
| Toasts | Sonner |
Set your environment variables (GROK_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, AI_PROVIDERS) in the Vercel dashboard. No database or external services required.
Edit lib/personas.ts and add a new entry to the PERSONAS array:
```ts
{
  id: "philosopher",
  name: "Philosopher",
  emoji: "...",
  color: "#a78bfa",
  description: "Examines questions through ethical and epistemological frameworks",
  systemPrompt: `You are a Philosopher. Analyze through ethics, epistemology...`,
}
```

The new persona will appear in every selector automatically.
RoundTable ships with two engines today. The architecture is designed to support more:
| Engine | Status | Description |
|---|---|---|
| CVP (Consensus Validation Protocol) | Available | Multi-round structured debate with blind Round 1, randomized order, early stop, optional judge |
| Blind Jury | Available | Parallel independent responses with no cross-visibility, followed by a judge synthesis |
| Delphi Method | Planned | Anonymous multi-round forecasting with statistical aggregation between rounds |
| Adversarial Red Team | Planned | One model attacks, others defend — iterative stress-testing of ideas |
| Ranked Choice Synthesis | Planned | Each model proposes solutions, then ranks all proposals — converges via elimination |
| Dialectical Engine | Planned | Thesis / Antithesis / Synthesis structure with formal argument mapping |
The consensus engine is a single file (lib/consensus-engine.ts) with a clean interface — contributions for new engines are welcome.
RoundTable implements the Consensus Validation Protocol concept from askgrokmcp — an MCP server that brings Grok's multi-model consensus capabilities to any AI assistant.
Built by Marcelo Ceccon.
MIT License. See LICENSE for details.
If RoundTable is useful to you, consider giving it a star.
It helps others discover it and motivates continued development.
