Production 3-agent harness for autonomous multi-hour builds with parallel fleet execution.
I'm a solo developer building seekora.ai — a multi-tenant SaaS across 11 repos in Kubernetes (Go + TypeScript + Temporal + a stack of 7 datastores). Long-running build tasks — multi-file refactors, cross-repo features, end-to-end specs — fail in three ways with off-the-shelf AI tooling:
- Context loss mid-run. The model forgets earlier decisions partway through a multi-hour build.
- Confidently wrong output. It generates code that runs but is incorrect, and no one catches it.
- Session loss on crash. Laptop sleeps, network drops, terminal closes — and progress evaporates.
The harness is a production answer to each: planning is separated from generation, generation from evaluation, and any wave of work resumes cleanly from any crash point. Cursor's agent mode, Devin, and Cognition each cover one slice; none fit the specific shape of the multi-hour autonomous builds I actually run on my own SaaS.
```
Planner (Opus, multi-phase pipeline)
  → Architect → Adversary → Refiner → Validator
  → work_plan.json (Phase → Epic → Story → Task)

Generator (Sonnet, N sessions) ↔ Evaluator (Sonnet)
  → Sequential: one feature at a time
  → Parallel: wave-based fleet (Citadel-inspired)
  → scope claims, discovery relay, worktree isolation

Circuit Breaker (every cycle)
  → CLOSED / HALF_OPEN / OPEN stagnation detection

Control Plane (optional, localhost:7842)
  → Live tracking, wave events, work plan visualization
  → Offline buffering when the control plane is down
```
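The plan file is the contract between stages. A minimal sketch of the shape `work_plan.json` can take (field names here are illustrative, not the exact schema):

```json
{
  "phases": [{
    "name": "Phase 1: Auth",
    "epics": [{
      "name": "JWT authentication",
      "stories": [{
        "name": "Token issuance",
        "tasks": [{
          "id": "feat-001",
          "description": "Issue JWTs on login",
          "scope": ["src/auth/token.ts"],
          "acceptance_criteria": ["valid login returns a signed JWT"]
        }]
      }]
    }]
  }]
}
```

The `scope` and `acceptance_criteria` fields are what the parallel fleet and the Evaluator key off, respectively.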
```bash
# Install into the current directory
bash ~/harness/install.sh

# Install into a specific project
bash ~/harness/install.sh /path/to/project

# Overwrite an existing installation
bash ~/harness/install.sh /path/to/project --force
```

This copies the harness into `.harness/` inside your project.
```bash
cd /path/to/project

# Build from a prompt (the planner generates spec + plan automatically)
python3 .harness/run.py --prompt "Build a REST API with JWT auth and rate limiting"

# Build from an existing spec
python3 .harness/run.py --spec docs/specs/my-feature.md

# Build from an existing plan
python3 .harness/run.py --plan docs/plans/my-plan.md

# Resume after a crash or interruption
python3 .harness/run.py --resume

# With cost/time/iteration limits
python3 .harness/run.py --prompt "..." --max-cost 50 --max-duration 120 --max-iterations 20

# Override the model
python3 .harness/run.py --prompt "..." --model claude-sonnet-4-6
```

Edit `.harness/config.yaml` after install:
```yaml
# Models
model: "claude-sonnet-4-6"          # generator + evaluator
planner_model: "claude-opus-4-6"    # planner (Opus recommended)

# Limits
max_cost_usd: 100.0
max_duration_minutes: 240
max_iterations: 50

# Parallel fleet mode (opt-in)
parallel:
  enabled: false              # flip to true to activate
  max_workers: 3              # max concurrent generators per wave
  agent_timeout_minutes: 30
  discovery_relay: true       # share findings between waves
  merge_strategy: requeue     # requeue conflicting features

# Evaluator
evaluator:
  test_suite_command: "npm test"    # your test command
  browser_verification: auto        # auto | never
```

Start the control plane for live tracking:
```bash
cd ~/harness-control-plane
npm start    # runs on localhost:7842
```

The harness auto-detects it. If it's down, events buffer locally and drain when it comes back.
```bash
# Set explicitly via env var
HARNESS_CONTROL_PLANE_URL=http://localhost:7842 python3 .harness/run.py --prompt "..."
```
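The buffer-and-drain behavior reduces to a small pattern: append undeliverable events to a local file, replay them in order once the endpoint answers again. A minimal sketch of the idea (the file name, endpoint path, and function names are hypothetical, not the harness's actual internals):

```python
import json
import urllib.request
from pathlib import Path

BUFFER = Path(".harness/event-buffer.jsonl")    # hypothetical spill file

def post(event: dict, url: str) -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=2)

def drain(url: str) -> None:
    """Replay buffered events in order once the control plane is back."""
    if not BUFFER.exists():
        return
    pending = BUFFER.read_text().splitlines()
    BUFFER.unlink()
    for i, line in enumerate(pending):
        try:
            post(json.loads(line), url)
        except OSError:
            # still down: put the unsent remainder back and give up for now
            BUFFER.write_text("\n".join(pending[i:]) + "\n")
            raise

def emit(event: dict, url: str = "http://localhost:7842/events") -> None:
    """Send one event; if the control plane is down, spill it to disk."""
    try:
        drain(url)       # flush any backlog first so ordering is preserved
        post(event, url)
    except OSError:      # connection refused, timeout, DNS failure, ...
        BUFFER.parent.mkdir(exist_ok=True)
        with BUFFER.open("a") as f:
            f.write(json.dumps(event) + "\n")
```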
When `parallel.enabled: true`, independent features run simultaneously in isolated git worktrees:

```
Wave 1: feat-001 + feat-002 + feat-003 → run in parallel worktrees
        ← collect results, compress discovery briefs
        ← merge branches (requeue on conflict)
Wave 2: feat-004 (depends on 001, 002) → informed by Wave 1 discoveries
        ← collect, merge
```
Requirements:
- Tasks must have a `scope` field (the planner populates this automatically)
- Git must be initialized in the project
- Features in the same wave must not touch the same files
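One way to read those requirements: a wave is a set of ready features whose scopes don't overlap, capped at `max_workers`. A simplified sketch of that grouping (this is the idea, not the actual scheduler in `src/core/coordination`):

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    id: str
    scope: set[str] = field(default_factory=set)   # files the task may touch
    deps: set[str] = field(default_factory=set)    # feature ids it depends on

def plan_waves(features: list[Feature], max_workers: int = 3) -> list[list[Feature]]:
    """Group features into waves: deps satisfied, scopes disjoint, bounded width."""
    done: set[str] = set()
    remaining = list(features)
    waves: list[list[Feature]] = []
    while remaining:
        wave: list[Feature] = []
        claimed: set[str] = set()          # files already claimed this wave
        for f in remaining:
            ready = f.deps <= done                 # all dependencies finished
            disjoint = not (f.scope & claimed)     # no file overlap in-wave
            if ready and disjoint and len(wave) < max_workers:
                wave.append(f)
                claimed |= f.scope
        if not wave:
            raise RuntimeError("dependency cycle or unsatisfiable scope conflict")
        waves.append(wave)
        done |= {f.id for f in wave}
        remaining = [f for f in remaining if f.id not in done]
    return waves
```

Applied to the diagram above: feat-001 through feat-003 land in Wave 1 (no deps, disjoint scopes), while feat-004 waits for Wave 2 because its dependencies aren't yet done.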
- Planner (1 session, Opus) — reads your prompt/spec, produces a hierarchical work plan
- Generator (N sessions, Sonnet) — implements one feature per session
- Evaluator (N sessions, Sonnet) — grades each feature against acceptance criteria
- Circuit Breaker — detects stagnation, prevents infinite loops
- Resume — crash-safe; `--resume` picks up where it left off
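Crash-safety reduces to one rule: checkpoint at every cycle boundary, and write the checkpoint atomically. A minimal sketch of the pattern (the file name and fields are illustrative; the harness's actual design is in `specs/graceful-resume-design.md`):

```python
import json
import os
from pathlib import Path

STATE = Path(".harness/state.json")   # hypothetical checkpoint file

def save_state(state: dict) -> None:
    """Atomic write: a crash mid-save never corrupts the checkpoint."""
    tmp = STATE.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, STATE)            # atomic rename on POSIX

def load_state() -> dict | None:
    return json.loads(STATE.read_text()) if STATE.exists() else None

# Each cycle: do work, then checkpoint.
state = load_state() or {"completed": [], "current_wave": 0}
# ... run one generator/evaluator cycle ...
state["completed"].append("feat-001")
save_state(state)                     # --resume restarts from here after a crash
```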
Real plans shipped through the harness, from `docs/plans/`:
- `harness-kernel-streamlining-plan.md` — refactoring the orchestrator's core loop
- `token-efficiency-t0-t2.md` — three-tier token-budget optimization across the planner pipeline
- `submodule-merge-conflict-resolution.md` — automated cross-repo merge logic
- `harness-linear-equivalent-prototype-plan.md` — Linear-style work-tracking surface
- `harness-product-refactor-map.md` — restructuring around the product abstraction
- `spec-fidelity-kernel.md` — guarantees that generated code stays aligned with the spec
For each, the Planner reads the source spec and produces a hierarchical work plan (Phase → Epic → Story → Task), the Generator runs across N worktrees in parallel, and the Evaluator grades each feature against acceptance criteria before allowing a merge to the integration branch. Adversary + Refiner + Validator passes catch architectural drift and keep the output faithful to the spec.
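The merge step in that pipeline is deliberately simple: attempt a clean merge of the graded feature branch, and requeue the feature on conflict rather than resolving in place. A sketch of that `merge_strategy: requeue` behavior (simplified; branch names are illustrative):

```python
import subprocess

def run(*cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def try_merge(feature_branch: str, integration: str = "integration") -> str:
    """Merge a graded feature branch into integration; requeue on conflict."""
    run("git", "checkout", integration)
    merged = run("git", "merge", "--no-ff", feature_branch)
    if merged.returncode != 0:
        run("git", "merge", "--abort")   # leave the integration branch clean
        return "requeue"                 # conflicting feature goes back in the queue
    return "merged"
```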
Each of these failure modes has hit during real builds and shaped how the harness is structured:
- Stagnation. The Generator gets stuck rewriting the same file or re-running the same failing test. The Circuit Breaker (CLOSED → HALF_OPEN → OPEN) detects this and forces a re-plan instead of letting the loop run unbounded (sketched after this list). → `src/core/`
- Crash recovery. Laptop sleep, network drop, OOM, accidental terminal close. State persists per cycle; `--resume` picks up cleanly without redoing finished work. → `specs/graceful-resume-design.md`
- Cost overruns. Runaway loops would otherwise burn the entire budget. A hard `max_cost_usd` limit; the fleet halts the moment it's hit, with state preserved for resume after a budget top-up. → `src/config.yaml`
- Worktree corruption. Parallel sessions trying to touch the same files. Scope claims + per-worktree isolation make conflicts impossible at the filesystem level; merge-conflict requeue handles them at the branch level. → `src/core/coordination`
- Generator/Evaluator disagreement. The Generator says "done", the Evaluator says "missing X". Re-queue with updated requirements; the Circuit Breaker prevents oscillation by escalating to a re-plan after N rounds.
- Discovery blackholes. Parallel agents discovering conflicting facts about the codebase. The discovery relay shares findings between waves, so Wave 2 starts with Wave 1's compressed knowledge instead of rediscovering it. → `src/core/discovery`
- Spec drift. The Generator's interpretation diverges from the original spec over a long run. The spec-fidelity kernel re-anchors against the original spec at validator gates.
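The stagnation state machine itself is small. A sketch of its shape (the threshold and the failure signature are illustrative, not the implementation in `src/core/`):

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # healthy: cycles are making progress
    HALF_OPEN = "half_open"  # suspicious: the same failure is repeating
    OPEN = "open"            # stagnant: stop generating, force a re-plan

class CircuitBreaker:
    """Trips when consecutive cycles produce the same failure signature."""

    def __init__(self, threshold: int = 3):
        self.state = State.CLOSED
        self.last_signature = None
        self.repeats = 0
        self.threshold = threshold

    def record(self, signature: str) -> State:
        # signature: e.g. a hash of (files touched, failing tests) per cycle
        if signature == self.last_signature:
            self.repeats += 1
        else:
            self.repeats = 0
            self.state = State.CLOSED
        self.last_signature = signature
        if self.repeats >= self.threshold:
            self.state = State.OPEN          # escalate to a re-plan
        elif self.repeats > 0:
            self.state = State.HALF_OPEN
        return self.state
```

In this sketch the orchestrator would feed one signature per cycle; an OPEN result aborts generation and routes control back to the Planner.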
```bash
cd ~/harness

# Run all tests (539)
python3 -m pytest tests/ -v

# Run specific module tests
python3 -m pytest tests/test_coordination.py -v
python3 -m pytest tests/test_parallel_integration.py -v
```
```
~/harness/
  src/
    core/           — orchestrator, state, circuit breaker, parallel, coordination, discovery
    prompts/        — planner, generator, evaluator, architect, adversary, refiner, validator
    templates/      — retrospective templates
    hooks/          — stop-completion hook (Mode B)
    config.yaml     — default configuration
  run.py            — CLI entry point
  tests/            — 539 tests
  test-projects/    — sample projects (smoke-test, hook-test)
  install.sh        — project installer
```