A library for deciding whether an LLM-driven generator did its job.
You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.
import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'
const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
await session.startChat()
const ship = await session.ship({
harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
})
console.log(ship.result.passed, ship.result.score)
- You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
- You ship a content generator and need quality signal beyond "the LLM said it's good".
- You want a release gate that fails on regressions you can name, not vibes.
If that's you, start with docs/concepts.md — 5-minute mental model — then come back here.
The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.
npm i -g @tangle-network/agent-eval
# HTTP — long-running
agent-eval serve --port 5005
# stdio RPC — one-shot, batch
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judgePython:
pip install tangle-agent-eval
from tangle_agent_eval import Client
c = Client()
r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(r.composite, r.failure_modes)
See docs/wire-protocol.md for the full surface.
In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.
pnpm add @tangle-network/agent-eval
The recipe for a code-generator eval is in SKILL.md §Minimal working path.
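For a content gate rather than a code gate, the in-process path can be as small as the sketch below. This is a hedged illustration, not the canonical API: createAntiSlopJudge is listed in the module map, but the call and result shapes (judge(), composite, failureModes) are assumptions mirrored from the Python client above; SKILL.md is authoritative.

```ts
import { createAntiSlopJudge } from '@tangle-network/agent-eval'

// Assumed surface: an anti-slop judge that scores raw content and reports a
// composite score plus named failure modes (as the Python client's
// r.composite / r.failure_modes suggest). Check SKILL.md for the real signature.
const judge = createAntiSlopJudge()
const verdict = await judge.judge({ content: 'our scaffold ships zero-copy IO' })

if (verdict.composite < 0.7) { // threshold is illustrative, not a library default
  throw new Error(`anti-slop gate failed: ${JSON.stringify(verdict.failureModes)}`)
}
```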
- You're a human onboarding — read docs/concepts.md for the mental model, then docs/wire-protocol.md if you'll call from another language, or SKILL.md if you'll embed in TS.
- You're an LLM agent writing integration code — read SKILL.md. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.
| Module | What it does | Doc |
|---|---|---|
| `BuilderSession` | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
| `MultiLayerVerifier` | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
| `judges`, `createCustomJudge`, `createAntiSlopJudge` | LLM and deterministic judges. | SKILL.md |
| Wire protocol (`agent-eval serve` / `rpc`) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
| `clients/python/` | First-party Python client (tangle-agent-eval on PyPI). Version-locked to npm. | clients/python/README.md |
| `BenchmarkRunner`, `executeScenario`, `ConvergenceTracker` | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
| `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
| reflective-mutation (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
| `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
Closing the loop on a prompt or codebase is two adapters + a config. Compose runPromptEvolution with createCompositeMutator (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a SandboxPool.
import {
createSandboxPool,
createSandboxCodeMutator,
createCompositeMutator,
buildReflectionPrompt,
parseReflectionResponse,
runPromptEvolution,
MutationTelemetry,
LineageRecorder,
CostLedger,
JsonlTrialCache,
} from '@tangle-network/agent-eval'
// 1. Prompt mutator — reflective-mutation reasons over top/bottom trials
const promptMutator = {
async mutate({ parent, topTrials, bottomTrials, childCount }) {
const ctx = { target: 'forge-prompt', parentPayload: parent.payload, topTrials, bottomTrials, childCount }
const reflection = buildReflectionPrompt(ctx)
const raw = await yourLlm(reflection)
return parseReflectionResponse(raw, childCount).map((p, i) => ({
id: `${parent.id}.g${parent.generation + 1}.prompt.${i}`,
payload: p.payload,
generation: parent.generation + 1,
parentId: parent.id,
label: p.label,
rationale: p.rationale,
}))
},
}
// 2. Code mutator — runs a coding agent in a sandbox slot, captures the diff
const pool = createSandboxPool({
size: 4,
factory: {
async create(id) { return await yourSandboxClient.create({ name: id }) },
async reset(slot) { await slot.resource.exec('git reset --hard origin/main && git clean -fd') },
async destroy(slot) { await slot.resource.delete() },
},
})
const codeMutator = createSandboxCodeMutator({
pool,
runner: async ({ slot, parent, topTrials, bottomTrials }) => {
const result = await slot.resource.task(`Improve the prompt at /repo/forge-prompt.ts...`)
return [{ ok: true, latencyMs: result.durationMs, costUsd: result.costUsd, artifact: { diff: result.diff } }]
},
toVariantPayload: (outcome, parent) => ({ ...parent.payload, codeMutation: outcome.artifact }),
})
// 3. Compose — plateau policy auto-switches when prompt evolution stalls
const composite = createCompositeMutator({
primary: promptMutator,
secondary: codeMutator,
policy: 'plateau',
plateauThreshold: 0.02,
plateauPatience: 2,
})
// 4. Run — durable telemetry to disk, crash-resumable
const result = await runPromptEvolution({
runId: `forge_${Date.now()}`,
target: 'forge-prompt',
seedVariants: [{ id: 'v0', payload: { text: currentPrompt }, generation: 0, label: 'baseline' }],
scenarioIds: referenceCorpus.map(s => s.id),
reps: 3,
generations: 5,
populationSize: 4,
scoreAdapter: { /* runs your eval against (variant, scenario, rep) */ },
mutateAdapter: composite,
cache: new JsonlTrialCache('.evolve/cache.jsonl'),
objectives: [
{ name: 'score', direction: 'maximize', value: a => a.meanScore },
{ name: 'cost', direction: 'minimize', value: a => a.meanCost },
],
})
MutationTelemetry, LineageRecorder, and CostLedger can be passed into the code-mutator (and any other consumer that wants them) — they emit append-only JSONL for every attempt (success and failure, with reason) plus a snapshot lineage tree, so a finished run leaves a forensically complete trail under one directory.
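A rough sketch of that wiring, assuming the recorders take an output path and the code mutator accepts them as options (the constructor arguments and option names below are assumptions; the JSDoc in evolution-telemetry.ts and code-mutator.ts is the real reference):

```ts
// Hypothetical wiring — argument and option names are assumptions, not the shipped API.
const telemetry = new MutationTelemetry('.evolve/mutations.jsonl') // append-only JSONL, one line per attempt
const lineage = new LineageRecorder('.evolve/lineage.json')        // snapshot lineage tree
const costs = new CostLedger('.evolve/costs.jsonl')                // per-attempt cost records

const codeMutatorWithTrail = createSandboxCodeMutator({
  pool,                       // the SandboxPool from step 2
  runner: codeMutatorRunner,  // hypothetical name for the runner shown in step 2
  toVariantPayload: (outcome, parent) => ({ ...parent.payload, codeMutation: outcome.artifact }),
  telemetry,
  lineage,
  costLedger: costs,
})
```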
For the full primitive surface and rationale, read each module's JSDoc — prompt-evolution.ts, composite-mutator.ts, sandbox-pool.ts, code-mutator.ts, reflective-mutation.ts, evolution-telemetry.ts.
These are the primitives any team running prompt-optimization in production needs, regardless of whether they're writing a paper. v0.15 shipped them under "paper-grade" naming; v0.16 corrects that — they're production-first, paper-grade as a side effect.
- `HeldOutGate` — held-out paired-delta gate with `few_runs` / `negative_delta` / `overfit_gap` rejection codes and a full evidence block on every decision. Sits alongside the existing bootstrap-CI promotion-gate.ts: that one asks "is this real or noise?", this one asks "is this a real win on held-out data and not overfit?". Use both.
- `RunRecord` — typed run schema with mandatory snapshot-pinned `model`, `promptHash`, `configHash`, `commitSha`, `costUsd`, `splitTag`. The runtime validator throws on missing fields. Reproducibility falls out for free.
- `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` — statistical primitives every rigorous A/B test needs (see the sketch after this list). Already-existing primitives are re-exported under paper-style aliases.
- `runCanaries` — canary checks for silent judge fallback, calibration drift (KS test), and distribution shift (chi-square). Catches the failure mode where your judge silently degrades to a constant 0.30 confidence and you ship configs graded by a stub.
- `summaryTable`, `paretoChart`, `gainHistogram` — A/B reporting helpers. `summaryTable` emits markdown with means + 95% bootstrap CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. Useful for both internal status reports and paper Table 1s.
- `Researcher` — stable interface for an external agent that drives the meta-loop (`inspectFailures` → `proposeChange` → `applyChange` → `evaluateChange`). Ships a `NoopResearcher` as a placeholder; real implementations live downstream.
- `benchmarks/routing` — synthetic 16-task router benchmark we own. Ships in the package. Reference wrappers for GSM8K and SWE-Bench Lite live under examples/benchmarks/ — read, copy, adapt. All three implement one `BenchmarkAdapter` shape with deterministic splits and fail-loud env-var configuration.
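As an illustration of how the statistical primitives might compose (a minimal, hedged sketch: the argument and return shapes are assumptions; only the export names come from the list above):

```ts
import { pairedBootstrap, pairedWilcoxon, bhAdjust } from '@tangle-network/agent-eval'

// Paired per-scenario scores: same scenarios, baseline prompt vs. candidate prompt.
const baseline = [0.62, 0.71, 0.58, 0.8, 0.66]
const candidate = [0.7, 0.74, 0.61, 0.79, 0.72]

const delta = pairedBootstrap(baseline, candidate) // assumed: bootstrap CI on the mean paired delta
const p = pairedWilcoxon(baseline, candidate)      // assumed: signed-rank p-value on the same pairs
const [pAdjusted] = bhAdjust([p])                  // assumed: BH adjustment across all comparisons you ran
```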
- Renamed `paperTable` → `summaryTable`, `paretoFigure` → `paretoChart`, `gainDistributionFigure` → `gainHistogram`. Underlying semantics unchanged. Type names follow (`SummaryTable`, `SummaryTableOptions`, `SummaryTableRow`).
- File: src/paper-report.ts → src/summary-report.ts.
- Drop the "paper-grade" framing — the primitives are production-first.
See CHANGELOG.md for the full list. .claude/skills/agent-eval/SKILL.md covers usage directives and pitfalls.
- TypeScript strict, no semicolons, single quotes, 2-space indent
- tsup for bundling, vitest for tests
- @tangle-network/tcloud for LLM calls (judges, driver)
- hono + @asteasolutions/zod-to-openapi for the wire protocol
pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi # write dist/openapi.json from the wire schemas
# Run the server locally
node dist/cli.js serve --port 5005
# Python client tests (require pnpm build first)
cd clients/python && pip install -e ".[dev]" && pytest
@tangle-network/agent-eval (npm) and tangle-agent-eval (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.
MIT