docs: add meta-harness specification#423
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #423 +/- ##
=======================================
Coverage 86.77% 86.77%
=======================================
Files 64 64
Lines 10027 10027
=======================================
Hits 8701 8701
Misses 1326 1326 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
beac808 to
c826617
Compare
osilkin98
left a comment
There was a problem hiding this comment.
Review: SPEC.md Meta-Harness Specification
Verdict: REQUEST_CHANGES — The spec is well-written and architecturally sound as a document, but has several significant domain model gaps that would cause a third-party implementor to build something materially different from the existing system.
Validation
ruff check factory tests— PASS (all checks passed)mypy factory/— PASS (no issues in 73 source files)pytest -v -x— 309 passed, 1 failed — the single failure (test_interactive_task_contains_idea_text) is pre-existing onmain, not introduced by this PR
HIGH — Should block merge
1. No Experiment entity in the domain model (Section 4)
The existing implementation revolves around ExperimentRecord as a first-class domain object (factory/models.py:314-330) — each has an ID, hypothesis, before/after scores, evidence, and verdict. The spec has WorkItem, Evidence, and Decision but no container entity that groups a hypothesis + its execution contract + collected evidence + the resulting decision into a single trackable attempt.
This is the central entity that ties the lifecycle together. Without it, the spec describes a generic "run agent, check results, decide" loop rather than the observe→hypothesize→experiment→learn cycle that defines re:factory.
Suggested fix: Add Section 4.X Experiment with fields: experiment_id, project_id, work_item_id, hypothesis, contract_ref, evidence_refs, decision_ref, score_before, score_after, status.
2. Strategy/hypothesis generation is completely absent (Section 3.1)
The implementation has a full strategy layer (factory/strategy.py) with the FEEC priority heuristic, stuck detection, plateau detection, anti-pattern matching, and hypothesis ranking. These are what make the system autonomous rather than a one-shot agent runner.
The spec's Work Item Source covers intake of external work, but the system also generates work internally through observation and strategizing. A conforming implementation reading only the spec would have no guidance on this.
Suggested fix: Add a component (e.g., Strategy Engine or Hypothesis Generator) to Section 3.1 covering autonomous work selection, prioritization heuristics, and anti-pattern avoidance.
3. Discovery/bootstrap phase missing from the lifecycle (Section 5)
The lifecycle is Intake → Scope → Dispatch → Execute → Validate → Decide → Publish → Learn → Resume. The existing implementation has an explicit Discovery phase (factory/discovery/) that detects project language/framework, builds an EvalProfile, and generates eval scripts. This is the prerequisite step before any meaningful execution can occur on a new project. It's tied to the NO_FACTORY → HAS_FACTORY state transition.
Suggested fix: Add a Bootstrap or Discover phase, or make Discovery an explicit sub-step of Scope with normative language about project introspection.
4. Decision kinds don't match the implementation (Section 4.9)
Spec defines: keep, revert, merge, park, retry, escalate, error. Implementation has: Literal["keep", "revert", "error"] (three values). The spec introduces merge, park, retry, escalate without any definition of what they mean, when they apply, or whether they're required.
Suggested fix: Either mark the additional kinds as OPTIONAL extensions with definitions, or reduce the core set to keep, revert, error with a note that implementations MAY extend the vocabulary.
MEDIUM — Should fix before or shortly after merge
5. No CEO/Orchestrator concept (Section 3.1.5) — The Lifecycle Coordinator partially covers the CEO agent, but the role-based delegation model (Researcher, Strategist, Builder, Reviewer, Evaluator, Archivist, Distiller, Refiner) is architecturally significant and absent. Add a note that the coordinator MAY delegate to specialized sub-workers with defined roles.
6. ExecutionContract missing key fields (Section 4.5) — The implementation's effective contract includes eval_threshold, guards, smoke_test, hypothesis_budget, research_target, and eval_weights. Add normative language that contracts SHOULD include acceptance thresholds, validation weight policies, and budget limits.
7. No Eval/Score entity (Section 4) — The implementation has CompositeScore, EvalResult, EvalProfile, EvalDimension as structured entities. The spec subsumes all of this under "Evidence." Quantitative evaluation with weighted scoring, thresholds, and before/after comparison deserves explicit acknowledgment as a distinguished category of evidence.
8. Interactive and Refine modes not covered (Section 5) — The linear lifecycle doesn't account for Interactive mode (conversational ideation before building) or Refine mode (post-cycle refinement loop). Add a section on operational modes that describes how the lifecycle applies differently.
9. Conformance requirements not testable (Section 9.1) — Requirements like "represent work as work items" and "preserve durable memory" are too vague to verify. Add measurable criteria: e.g., "A conforming implementation MUST persist at minimum: work item identity, decision outcome, and at least one piece of supporting evidence per decision. These records MUST survive process restart."
LOW — Nice to have
10. implementation-defined formatting inconsistent — use consistent backtick formatting to signal the "MUST document" obligation applies.
11. StateBinding examples (Section 4.3) mix storage substrates with external system bindings — split into two groups.
12. cli-local profile (Section 6.1) underspecified — add minimum CLI capabilities.
13. Merge policies (Section 7) declared but not cross-referenced from domain model entities.
14. No versioning or evolution strategy for the spec itself.
Bottom Line
The spec is well-structured and clearly written. The RFC 2119 usage is mostly correct, the abstraction layers are a good organizing principle, and the multi-user state reconciliation design is forward-thinking. However, the four HIGH items represent real gaps: a third-party implementor reading only this spec would build a generic "bounded execution with guardrails" framework, not the self-evolving observe→hypothesize→experiment→learn system that re:factory actually is. The Experiment entity and Strategy component are the conceptual heart of the system and must be in the spec.
|
Context for the review below: I've been testing the Claude Fable model for triaging the open PR backlog. It flagged the following on this PR, and this CEO review session independently verified or refuted each claim against the current branch head. Process context: #517.
|
osilkin98
left a comment
There was a problem hiding this comment.
❌ Factory Review: REVERT
Verdict: REVERT
Reason: The spec's domain model shares zero entity names with the implementation and would retroactively declare the working system non-conformant on merge day.
Precheck Gate
5 claims adjudicated: 5 confirmed, 0 refuted. 1 prior verdict reconciled.
Code Review Notes
- Claims and provenance are in my comment above. Each note below is this review's independent adjudication of one claim.
- CLAIM 1 CONFIRMED: The spec's core domain model (Project, RepoBinding, WorkItem, ExecutionContract, ContractBuilder, ProjectResolver, DeploymentProfile, ArtifactEmitter, StateRecord, StateConflict) has zero implementing code in the repo. Evidence: grep for all entity names across factory/ returns no hits. Why it matters: Section 9.1 conformance requirements are vacuously false of the current codebase on merge day.
- CLAIM 2 CONFIRMED: The 9-stage lifecycle (Intake through Resume) is a superset of the actual cycle, adding Publish (not implemented) while omitting Discovery, autonomous hypothesis generation, and multi-gate validation. Evidence: factory/state.py defines 5 ProjectState values; the CEO prompt defines observe, hypothesize, build, guard, eval, precheck, decide, archive as the real cycle. Why it matters: The spec describes a different, more generic lifecycle than the one the system actually runs.
- CLAIM 3 CONFIRMED: Merging SPEC.md with RFC 2119 MUST/SHALL language retroactively declares the working system non-conformant. Evidence: PR body references "phase-0-harness-abstraction-full" as a preserved implementation branch. Section 9.1 requires execution contracts, work items, and project bindings that do not exist in code. Why it matters: Creates ambiguity about whether the current system is "broken" or the spec is aspirational.
- CLAIM 4 CONFIRMED: The PR has no problem statement explaining what cannot be done today without this spec. Evidence: No CONTRIBUTING.md or PR template exists in the repo. The PR body Summary has three bullets but no "why" for the spec's existence. Why it matters: Reviewers cannot evaluate whether a 535-line normative spec is the right artifact for the unstated problem.
- CLAIM 5 CONFIRMED: "Deployment Profile" and "Artifact Emitter" name nonexistent things. Evidence: grep returns zero hits in factory/. The implementation uses factory/runners/protocol.py RunnerMeta and concrete runners (claude.py, bob.py, codex.py, opencode.py). Why it matters: Vocabulary mismatch between spec and implementation increases cognitive load for contributors who will search for "Deployment Profile" and find nothing.
- PRIOR VERDICT: osilkin98 posted REQUEST_CHANGES with 4 HIGH blockers (missing Experiment entity, missing Strategy layer, missing Discovery phase, decision kinds mismatch), 5 MEDIUM, and 5 LOW items. All 4 HIGH items remain unaddressed at head c826617 (no new commits since that review). This review independently confirms all 4 HIGH items. The prior verdict is correct and still applies.
- ADDITIONAL: The word "experiment" appears exactly once in the 535-line spec (line 327, in a Memory examples list). ExperimentRecord (factory/models.py:314-330) is the central domain entity tying hypothesis to execution to evidence to decision. Its absence is the single largest gap in the spec.
- ADDITIONAL: Section 4.9 introduces 4 undefined decision kinds (merge, park, retry, escalate) beyond the 3 implemented ones (keep, revert, error at factory/models.py:328). Undefined vocabulary in a normative spec invites divergent interpretations.
- ACCEPTANCE PATH: (1) Add an Experiment entity to Section 4 with fields mapping to ExperimentRecord. (2) Add a Strategy Engine component to Section 3.1 covering FEEC, hypothesis generation, and anti-pattern avoidance. (3) Add a Discovery or Bootstrap phase to Section 5. (4) Either reduce decision kinds to keep/revert/error or define the additional kinds with semantics and mark them OPTIONAL. (5) Add a problem statement explaining what this spec enables that the codebase alone does not. (6) Rename "Deployment Profile" to "Runner" and "Artifact Emitter" to something matching the implementation, or add a mapping table. These changes would flip the verdict to KEEP.
Posted by Factory CEO
Review: Grounding the Spec in the Current SystemThe thinking here is valuable — formalizing the factory's domain model, lifecycle, and component boundaries. But the spec as written describes a system that doesn't exist yet, and merging it with RFC 2119 normative language would retroactively declare the working factory non-conformant on day one. The Fable triage flagged this accurately. The good news: most of what this spec describes already exists in the codebase — it just has different names. Rather than an aspirational abstract spec, we can write a descriptive spec of the actual system using the same Symphony format (which is being adopted in #498 for project specifications). Here's what that looks like — same structure, grounded in real code. Mapping: Abstract Concepts → Current Implementation
Lifecycle: Abstract vs ActualThe spec's 9-stage lifecycle maps directly to the CEO's state machine:
The lifecycle already exists. The factory already does all 9 stages. It just doesn't name them with these labels. What a Grounded SPEC.md Could Look LikeUsing the Symphony format (numbered sections, RFC 2119, grounded in code), here's a sketch of the key sections: # re:factory — Specification
## Normative Language
RFC 2119 keywords...
## 1. Problem Statement
re:factory is a meta-harness that orchestrates coding agents through
bounded, measurable, reversible improvement cycles. It accepts work
(hypotheses, backlog items, specs), dispatches specialist agents under
scoped constraints, validates results through multi-layer guardrails,
and records explicit keep/revert decisions with durable evidence.
## 2. Goals and Non-Goals
### 2.1 Goals
- Orchestrate coding agents through a repeatable experiment lifecycle
- Make every change measurable (eval scores) and reversible (git revert)
- Separate orchestration (CEO) from execution (specialists)
- Support multiple coding agent backends (Claude, Codex, Bob, OpenCode)
- Preserve institutional memory across cycles (archive, playbooks)
- Self-improve agent behavior through experiment outcomes (ACE)
### 2.2 Non-Goals
- Multi-repo projects (single repo per project today)
- Multi-user concurrent state (single-operator model)
- Managed service or hosted control plane
- Rich web UI beyond the optional dashboard
## 3. System Overview
### 3.1 Three-Layer Architecture
1. Python CLI (factory/) — pure tools, no decisions
2. CEO Agent (factory/agents/prompts/ceo.md) — orchestrator, owns lifecycle
3. Specialist Agents (factory/agents/) — 8 roles, spawned via factory agent
### 3.2 Runner Abstraction
Worker runtimes are switchable via factory/runners/:
- claude (default), bob, codex, opencode
- Selected via FACTORY_RUNNER env var or --runner flag
- Runner MUST support: prompt injection, stdout capture, timeout, exit code
## 4. Core Domain Model
### 4.1 ProjectState (factory/models.py)
Enum: no_repo | incomplete | no_factory | evals_pending_review | has_factory
Detected by factory/state.py, drives CEO mode routing.
### 4.2 FactoryConfig (factory/models.py)
Parsed from factory.md. Fields: goal, scope, guards, eval_command,
threshold, target_branch, clean_pr, research_target, mutable_surfaces,
fixed_surfaces. Strict Pydantic model.
### 4.3 ExperimentRecord (factory/models.py)
One record per experiment. Fields: id, timestamp, hypothesis,
change_summary, issue_number, pr_number, score_before, score_after,
verdict, notes. Stored as TSV row in .factory/results.tsv.
### 4.4 EvalProfile + EvalDimension (factory/models.py)
Auto-discovered eval dimensions. 6 hygiene + 5 growth + optional
project-specific dimensions. Composite score = weighted average.
...The rest would follow the same pattern: each section maps to actual code, with file paths and model names. The Reference Algorithms section would cover FEEC prioritization ( Recommendation
The value of a grounded spec: anyone reading it can Happy to help write the full grounded version if this direction makes sense. |
Summary
SPEC.mdas a normative, language-agnostic meta-harness specificationValidation
uv run ruff check factory testsuv run mypy factory/Notes:
uv run ruff check .is blocked by unrelated untracked local files:scripts/build_issue_triage.pyandskills/triage/.uv run pytest -vreached 73% before timeout and hit an unrelated existing runner E2E failure:tests/test_runner_e2e.py::test_builder_makes_changes[codex].Saved follow-up branch
The previous full Phase 0 implementation work was preserved on
phase-0-harness-abstraction-full.