docs: add meta-harness specification by xukai92 · Pull Request #423 · akashgit/remote-factory

xukai92 · 2026-06-02T02:55:09Z

Summary

add SPEC.md as a normative, language-agnostic meta-harness specification
use deployment profile / artifact emitter vocabulary to avoid overloading distribution concepts
keep this PR scoped to the specification only

Validation

uv run ruff check factory tests
uv run mypy factory/

Notes:

Full uv run ruff check . is blocked by unrelated untracked local files: scripts/build_issue_triage.py and skills/triage/.
Full uv run pytest -v reached 73% before timeout and hit an unrelated existing runner E2E failure: tests/test_runner_e2e.py::test_builder_makes_changes[codex].

Saved follow-up branch

The previous full Phase 0 implementation work was preserved on phase-0-harness-abstraction-full.

codecov · 2026-06-02T02:56:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.77%. Comparing base (5985563) to head (c826617).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #423   +/-   ##
=======================================
  Coverage   86.77%   86.77%           
=======================================
  Files          64       64           
  Lines       10027    10027           
=======================================
  Hits         8701     8701           
  Misses       1326     1326

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

osilkin98

Review: SPEC.md Meta-Harness Specification

Verdict: REQUEST_CHANGES — The spec is well-written and architecturally sound as a document, but has several significant domain model gaps that would cause a third-party implementor to build something materially different from the existing system.

Validation

ruff check factory tests — PASS (all checks passed)
mypy factory/ — PASS (no issues in 73 source files)
pytest -v -x — 309 passed, 1 failed — the single failure (test_interactive_task_contains_idea_text) is pre-existing on main, not introduced by this PR

HIGH — Should block merge

1. No Experiment entity in the domain model (Section 4)

The existing implementation revolves around ExperimentRecord as a first-class domain object (factory/models.py:314-330) — each has an ID, hypothesis, before/after scores, evidence, and verdict. The spec has WorkItem, Evidence, and Decision but no container entity that groups a hypothesis + its execution contract + collected evidence + the resulting decision into a single trackable attempt.

This is the central entity that ties the lifecycle together. Without it, the spec describes a generic "run agent, check results, decide" loop rather than the observe→hypothesize→experiment→learn cycle that defines re:factory.

Suggested fix: Add Section 4.X Experiment with fields: experiment_id, project_id, work_item_id, hypothesis, contract_ref, evidence_refs, decision_ref, score_before, score_after, status.

2. Strategy/hypothesis generation is completely absent (Section 3.1)

The implementation has a full strategy layer (factory/strategy.py) with the FEEC priority heuristic, stuck detection, plateau detection, anti-pattern matching, and hypothesis ranking. These are what make the system autonomous rather than a one-shot agent runner.

The spec's Work Item Source covers intake of external work, but the system also generates work internally through observation and strategizing. A conforming implementation reading only the spec would have no guidance on this.

Suggested fix: Add a component (e.g., Strategy Engine or Hypothesis Generator) to Section 3.1 covering autonomous work selection, prioritization heuristics, and anti-pattern avoidance.

3. Discovery/bootstrap phase missing from the lifecycle (Section 5)

The lifecycle is Intake → Scope → Dispatch → Execute → Validate → Decide → Publish → Learn → Resume. The existing implementation has an explicit Discovery phase (factory/discovery/) that detects project language/framework, builds an EvalProfile, and generates eval scripts. This is the prerequisite step before any meaningful execution can occur on a new project. It's tied to the NO_FACTORY → HAS_FACTORY state transition.

Suggested fix: Add a Bootstrap or Discover phase, or make Discovery an explicit sub-step of Scope with normative language about project introspection.

4. Decision kinds don't match the implementation (Section 4.9)

Spec defines: keep, revert, merge, park, retry, escalate, error. Implementation has: Literal["keep", "revert", "error"] (three values). The spec introduces merge, park, retry, escalate without any definition of what they mean, when they apply, or whether they're required.

Suggested fix: Either mark the additional kinds as OPTIONAL extensions with definitions, or reduce the core set to keep, revert, error with a note that implementations MAY extend the vocabulary.

MEDIUM — Should fix before or shortly after merge

5. No CEO/Orchestrator concept (Section 3.1.5) — The Lifecycle Coordinator partially covers the CEO agent, but the role-based delegation model (Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Distiller, Refiner) is architecturally significant and absent. Add a note that the coordinator MAY delegate to specialized sub-workers with defined roles.

6. ExecutionContract missing key fields (Section 4.5) — The implementation's effective contract includes eval_threshold, guards, smoke_test, hypothesis_budget, research_target, and eval_weights. Add normative language that contracts SHOULD include acceptance thresholds, validation weight policies, and budget limits.

7. No Eval/Score entity (Section 4) — The implementation has CompositeScore, EvalResult, EvalProfile, EvalDimension as structured entities. The spec subsumes all of this under "Evidence." Quantitative evaluation with weighted scoring, thresholds, and before/after comparison deserves explicit acknowledgment as a distinguished category of evidence.

8. Interactive and Refine modes not covered (Section 5) — The linear lifecycle doesn't account for Interactive mode (conversational ideation before building) or Refine mode (post-cycle refinement loop). Add a section on operational modes that describes how the lifecycle applies differently.

9. Conformance requirements not testable (Section 9.1) — Requirements like "represent work as work items" and "preserve durable memory" are too vague to verify. Add measurable criteria: e.g., "A conforming implementation MUST persist at minimum: work item identity, decision outcome, and at least one piece of supporting evidence per decision. These records MUST survive process restart."

LOW — Nice to have

10. implementation-defined formatting inconsistent — use consistent backtick formatting to signal the "MUST document" obligation applies.

11. StateBinding examples (Section 4.3) mix storage substrates with external system bindings — split into two groups.

12. cli-local profile (Section 6.1) underspecified — add minimum CLI capabilities.

13. Merge policies (Section 7) declared but not cross-referenced from domain model entities.

14. No versioning or evolution strategy for the spec itself.

Bottom Line

The spec is well-structured and clearly written. The RFC 2119 usage is mostly correct, the abstraction layers are a good organizing principle, and the multi-user state reconciliation design is forward-thinking. However, the four HIGH items represent real gaps: a third-party implementor reading only this spec would build a generic "bounded execution with guardrails" framework, not the self-evolving observe→hypothesize→experiment→learn system that re:factory actually is. The Experiment entity and Strategy component are the conceptual heart of the system and must be in the spec.

osilkin98 · 2026-06-11T17:58:24Z

Context for the review below: I've been testing the Claude Fable model for triaging the open PR backlog. It flagged the following on this PR, and this CEO review session independently verified or refuted each claim against the current branch head. Process context: #517.

The SPEC normatively specifies a system that does not exist: Project/repo_bindings, WorkItem, ExecutionContract, Contract Builder, Project Resolver, Deployment Profile, Artifact Emitter, StateRecord/StateConflict have no implementing code anywhere in the repo. Most MUST/SHALL clauses (esp. Section 9.1) are false of the current codebase on day one.
The 9-stage lifecycle (Intake→…→Resume) is a superset of the actual research→strategy→build→eval→keep/revert cycle — aspiration presented as specification.
This functions as an RFC for a rewrite (the PR body references a preserved Phase 0 implementation branch); merging it as SPEC.md with RFC 2119 language retroactively declares the working system non-conformant and pre-legitimizes the rewrite without the rewrite being reviewed.
There is no contribution-requirements section: the maintainers want every proposed change to carry a problem statement (what the author is trying to do, why it cannot be done today, why the change enables it). Evaluate whether this spec is the right home and where it would slot.
"Deployment Profile" and "Artifact Emitter" vocabulary is not load-bearing — they name nonexistent things ("runner config" and "output format" suffice).

osilkin98

❌ Factory Review: REVERT

Verdict: REVERT
Reason: The spec's domain model shares zero entity names with the implementation and would retroactively declare the working system non-conformant on merge day.

Precheck Gate

5 claims adjudicated: 5 confirmed, 0 refuted. 1 prior verdict reconciled.

Code Review Notes

Claims and provenance are in my comment above. Each note below is this review's independent adjudication of one claim.
CLAIM 1 CONFIRMED: The spec's core domain model (Project, RepoBinding, WorkItem, ExecutionContract, ContractBuilder, ProjectResolver, DeploymentProfile, ArtifactEmitter, StateRecord, StateConflict) has zero implementing code in the repo. Evidence: grep for all entity names across factory/ returns no hits. Why it matters: Section 9.1 conformance requirements are vacuously false of the current codebase on merge day.
CLAIM 2 CONFIRMED: The 9-stage lifecycle (Intake through Resume) is a superset of the actual cycle, adding Publish (not implemented) while omitting Discovery, autonomous hypothesis generation, and multi-gate validation. Evidence: factory/state.py defines 5 ProjectState values; the CEO prompt defines observe, hypothesize, build, guard, eval, precheck, decide, archive as the real cycle. Why it matters: The spec describes a different, more generic lifecycle than the one the system actually runs.
CLAIM 3 CONFIRMED: Merging SPEC.md with RFC 2119 MUST/SHALL language retroactively declares the working system non-conformant. Evidence: PR body references "phase-0-harness-abstraction-full" as a preserved implementation branch. Section 9.1 requires execution contracts, work items, and project bindings that do not exist in code. Why it matters: Creates ambiguity about whether the current system is "broken" or the spec is aspirational.
CLAIM 4 CONFIRMED: The PR has no problem statement explaining what cannot be done today without this spec. Evidence: No CONTRIBUTING.md or PR template exists in the repo. The PR body Summary has three bullets but no "why" for the spec's existence. Why it matters: Reviewers cannot evaluate whether a 535-line normative spec is the right artifact for the unstated problem.
CLAIM 5 CONFIRMED: "Deployment Profile" and "Artifact Emitter" name nonexistent things. Evidence: grep returns zero hits in factory/. The implementation uses factory/runners/protocol.py RunnerMeta and concrete runners (claude.py, bob.py, codex.py, opencode.py). Why it matters: Vocabulary mismatch between spec and implementation increases cognitive load for contributors who will search for "Deployment Profile" and find nothing.
PRIOR VERDICT: osilkin98 posted REQUEST_CHANGES with 4 HIGH blockers (missing Experiment entity, missing Strategy layer, missing Discovery phase, decision kinds mismatch), 5 MEDIUM, and 5 LOW items. All 4 HIGH items remain unaddressed at head c826617 (no new commits since that review). This review independently confirms all 4 HIGH items. The prior verdict is correct and still applies.
ADDITIONAL: The word "experiment" appears exactly once in the 535-line spec (line 327, in a Memory examples list). ExperimentRecord (factory/models.py:314-330) is the central domain entity tying hypothesis to execution to evidence to decision. Its absence is the single largest gap in the spec.
ADDITIONAL: Section 4.9 introduces 4 undefined decision kinds (merge, park, retry, escalate) beyond the 3 implemented ones (keep, revert, error at factory/models.py:328). Undefined vocabulary in a normative spec invites divergent interpretations.
ACCEPTANCE PATH: (1) Add an Experiment entity to Section 4 with fields mapping to ExperimentRecord. (2) Add a Strategy Engine component to Section 3.1 covering FEEC, hypothesis generation, and anti-pattern avoidance. (3) Add a Discovery or Bootstrap phase to Section 5. (4) Either reduce decision kinds to keep/revert/error or define the additional kinds with semantics and mark them OPTIONAL. (5) Add a problem statement explaining what this spec enables that the codebase alone does not. (6) Rename "Deployment Profile" to "Runner" and "Artifact Emitter" to something matching the implementation, or add a mapping table. These changes would flip the verdict to KEEP.

Posted by Factory CEO

akashgit · 2026-06-13T17:19:57Z

Review: Grounding the Spec in the Current System

The thinking here is valuable — formalizing the factory's domain model, lifecycle, and component boundaries. But the spec as written describes a system that doesn't exist yet, and merging it with RFC 2119 normative language would retroactively declare the working factory non-conformant on day one. The Fable triage flagged this accurately.

The good news: most of what this spec describes already exists in the codebase — it just has different names. Rather than an aspirational abstract spec, we can write a descriptive spec of the actual system using the same Symphony format (which is being adopted in #498 for project specifications).

Here's what that looks like — same structure, grounded in real code.

Mapping: Abstract Concepts → Current Implementation

SPEC.md Concept	What Exists Today	Where It Lives
Project	`FactoryConfig` + `.factory/` directory	`factory/models.py:FactoryConfig`, `factory/store.py:ExperimentStore`
Repo Binding	Single-repo assumption, git worktrees for experiments	`factory/store.py` (worktree management), CEO prompt (branch handling)
Work Item	Hypothesis string, backlog item, GitHub issue, focus directive	`ExperimentRecord.hypothesis`, `.factory/strategy/backlog.md`, `gh issue`
Execution Contract	Builder's `--task` prompt + guard constraints + mutable/fixed surfaces	CEO prompt (Step 2d), `factory guard`, `factory precheck`
Contract Builder	CEO's task composition logic	CEO prompt (Step 2c-2d: issue creation + Builder task)
Lifecycle Coordinator	CEO agent	`factory/agents/prompts/ceo.md` — the entire state machine
Worker Runtime	Runners: Claude, Bob, Codex, OpenCode	`factory/runners/` — `base.py`, `claude.py`, `bob.py`, `codex.py`, `opencode.py`
Guardrail Provider	Precheck gate + guard check + CEO review gates + Reviewer agent	`factory precheck`, `factory guard`, CEO review protocol, `factory/agents/prompts/reviewer.md`
Evidence	Per-experiment artifacts: hypothesis, eval before/after, diff, verdict	`.factory/experiments/NNN/` — managed by `factory/store.py`
Decision	Keep/revert verdict with structured metadata	`factory finalize --verdict keep/revert`, `ExperimentRecord` in `factory/models.py`
Memory	Archive notes, playbooks, observations, performance reports	`.factory/archive/`, ACE playbooks at `~/.factory/playbooks/`, `factory/report.py`
Deployment Profile	Runner selection + credential profile + config.toml	`factory/user_config.py`, `~/.factory/config.toml` `[credentials.*]` sections
State Backend	`.factory/` directory + TSV history + events.jsonl + global registry	`factory/store.py`, `factory/registry.py`
Artifact Emitter	GitHub PR + structured review comment	`factory review --verdict KEEP --pr N`, `gh pr create`
State Record	Events in `.factory/events.jsonl`, experiment records in `.factory/results.tsv`	`factory/store.py`, `factory log` CLI command

Lifecycle: Abstract vs Actual

The spec's 9-stage lifecycle maps directly to the CEO's state machine:

SPEC.md Stage	Factory Equivalent	CEO Step
Intake	Detect project state + parse input	`factory detect`, `_quick_classify()` in CLI
Scope	Route to mode (Build/Improve/Research/Design) + Strategy	Step 1 (Strategist), Step 2b (`factory begin`)
Dispatch	Spawn Builder agent with task	Step 2d (`factory agent builder`)
Execute	Builder implements on worktree branch	Builder agent runs, opens draft PR
Validate	Guard check + Eval + E2E + Precheck + Final review	Steps 2d-review through 2h-final
Decide	Keep/revert verdict	`factory finalize --verdict keep/revert`
Publish	PR ready-for-review + structured GitHub review	`gh pr ready`, `factory review --verdict KEEP`
Learn	Archivist records + ACE playbook evolution	Mandatory Archivist at every checkpoint, Meta mode ACE
Resume	Crash recovery from `.factory/` state	Step 0 (Assess Sprint State) in both Build and Improve modes

The lifecycle already exists. The factory already does all 9 stages. It just doesn't name them with these labels.

What a Grounded SPEC.md Could Look Like

Using the Symphony format (numbered sections, RFC 2119, grounded in code), here's a sketch of the key sections:

# re:factory — Specification

## Normative Language
RFC 2119 keywords...

## 1. Problem Statement
re:factory is a meta-harness that orchestrates coding agents through
bounded, measurable, reversible improvement cycles. It accepts work
(hypotheses, backlog items, specs), dispatches specialist agents under
scoped constraints, validates results through multi-layer guardrails,
and records explicit keep/revert decisions with durable evidence.

## 2. Goals and Non-Goals
### 2.1 Goals
- Orchestrate coding agents through a repeatable experiment lifecycle
- Make every change measurable (eval scores) and reversible (git revert)
- Separate orchestration (CEO) from execution (specialists)
- Support multiple coding agent backends (Claude, Codex, Bob, OpenCode)
- Preserve institutional memory across cycles (archive, playbooks)
- Self-improve agent behavior through experiment outcomes (ACE)

### 2.2 Non-Goals
- Multi-repo projects (single repo per project today)
- Multi-user concurrent state (single-operator model)
- Managed service or hosted control plane
- Rich web UI beyond the optional dashboard

## 3. System Overview
### 3.1 Three-Layer Architecture
1. Python CLI (factory/) — pure tools, no decisions
2. CEO Agent (factory/agents/prompts/ceo.md) — orchestrator, owns lifecycle
3. Specialist Agents (factory/agents/) — 8 roles, spawned via factory agent

### 3.2 Runner Abstraction
Worker runtimes are switchable via factory/runners/:
- claude (default), bob, codex, opencode
- Selected via FACTORY_RUNNER env var or --runner flag
- Runner MUST support: prompt injection, stdout capture, timeout, exit code

## 4. Core Domain Model

### 4.1 ProjectState (factory/models.py)
Enum: no_repo | incomplete | no_factory | evals_pending_review | has_factory
Detected by factory/state.py, drives CEO mode routing.

### 4.2 FactoryConfig (factory/models.py)
Parsed from factory.md. Fields: goal, scope, guards, eval_command,
threshold, target_branch, clean_pr, research_target, mutable_surfaces,
fixed_surfaces. Strict Pydantic model.

### 4.3 ExperimentRecord (factory/models.py)
One record per experiment. Fields: id, timestamp, hypothesis,
change_summary, issue_number, pr_number, score_before, score_after,
verdict, notes. Stored as TSV row in .factory/results.tsv.

### 4.4 EvalProfile + EvalDimension (factory/models.py)
Auto-discovered eval dimensions. 6 hygiene + 5 growth + optional
project-specific dimensions. Composite score = weighted average.

...

The rest would follow the same pattern: each section maps to actual code, with file paths and model names. The Reference Algorithms section would cover FEEC prioritization (factory/strategy.py:categorize_hypothesis), hypothesis similarity (factory/strategy.py:hypothesis_similarity), and composite scoring (factory/eval/scorer.py). The Test Matrix would reference tests/test_models.py, tests/test_cli.py, etc.

Recommendation

Don't merge this as-is. A normative spec that the codebase doesn't conform to creates confusion, not clarity.
Rewrite as a descriptive spec using the mapping above. Same Symphony format, same rigor, but every MUST/SHALL corresponds to code that exists today.
Move aspirational items to a separate RFC. Multi-repo bindings, state conflicts, extension profiles, artifact emitters — these are real future work. They belong in docs/rfc-multi-repo.md or similar, not in the system spec.
Use the Symphony format from feat: rename interactive mode to design + Symphony SPEC.md output #498. The factory is adopting this format for project specifications. Using it for the factory's own spec is consistent.

The value of a grounded spec: anyone reading it can grep for the concepts and find the code. An abstract spec requires a mental mapping layer that makes it harder, not easier, to understand the system.

Happy to help write the full grounded version if this direction makes sense.

docs: add meta-harness specification

c826617

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

xukai92 force-pushed the phase-0-harness-abstraction branch from beac808 to c826617 Compare June 9, 2026 03:12

xukai92 changed the title ~~feat: add phase 0 harness abstraction~~ docs: add meta-harness specification Jun 9, 2026

xukai92 requested a review from akashgit June 9, 2026 20:59

osilkin98 reviewed Jun 11, 2026

View reviewed changes

osilkin98 requested changes Jun 11, 2026

View reviewed changes

osilkin98 added kind:docs Documentation one-way-door Changes what the project is; needs owner decision stage:intent Capturing/protecting what the user wants (specs, scope, design) labels Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: add meta-harness specification#423

docs: add meta-harness specification#423
xukai92 wants to merge 1 commit into
mainfrom
phase-0-harness-abstraction

xukai92 commented Jun 2, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

osilkin98 left a comment

Uh oh!

osilkin98 commented Jun 11, 2026

Uh oh!

osilkin98 left a comment

Uh oh!

akashgit commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

xukai92 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Saved follow-up branch

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

osilkin98 left a comment

Choose a reason for hiding this comment

Review: SPEC.md Meta-Harness Specification

Validation

HIGH — Should block merge

MEDIUM — Should fix before or shortly after merge

LOW — Nice to have

Bottom Line

Uh oh!

osilkin98 commented Jun 11, 2026

Uh oh!

osilkin98 left a comment

Choose a reason for hiding this comment

❌ Factory Review: REVERT

Precheck Gate

Code Review Notes

Uh oh!

akashgit commented Jun 13, 2026

Review: Grounding the Spec in the Current System

Mapping: Abstract Concepts → Current Implementation

Lifecycle: Abstract vs Actual

What a Grounded SPEC.md Could Look Like

Recommendation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xukai92 commented Jun 2, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading