feat: rename interactive mode to design + Symphony SPEC.md output by akashgit · Pull Request #498 · akashgit/remote-factory

akashgit · 2026-06-05T22:09:51Z

Factory experiment 2. Closes #440 #497.

akashgit · 2026-06-05T22:14:09Z

❌ Factory Review: REVERT

Verdict: REVERT
Reason: Guard violation: branch not rooted at declared baseline (merge-base 0bd6ed7 != baseline 61d86a7). No eval scores recorded for before/after comparison.

Experiment: #2
Hypothesis: Rename interactive mode to design mode and adopt Symphony-style SPEC.md output

Score Comparison

| Metric | Value |
|---|---|
| Before | 0.0000 |
| After | 0.0000 |
| Delta | +0.0000 |
| Threshold | 0.6000 |

Guard Checks

| Check | Result |
|---|---|
| eval_immutable | ✅ PASS |
| scope | ✅ PASS |
| baseline | ❌ FAIL |

Precheck Gate

VIOLATION: Branch is not rooted at baseline 61d86a7a (merge-base: 0bd6ed77). Eval scores not recorded.

Code Review Notes

- Code quality is clean: ~125 consistent renames from interactive→design, a backward-compat alias in cmd_ceo(), argparse choices preserved, step labels I0-I4→D0-D4, and the Symphony SPEC.md template replacing the idea.md format. No bugs, no security issues. Tests updated and a new backward-compat test added.
- The guard violation is the sole blocker: the branch was created from 0bd6ed7 (PR #482, "ci: allow Cole and Shiv to trigger @ceo-review"), but the declared baseline is 61d86a7 (PR #495, "ci: allow osilkin98 and gx-ai-architect to trigger @ceo-review"). Rebasing onto current main and re-running evals would resolve this.
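
The baseline check described in this review can be sketched roughly as follows. This is a minimal illustration, not the factory's actual precheck: `same_commit`, `merge_base`, and `baseline_guard` are hypothetical names, and the only external command assumed is `git merge-base`.

```python
import subprocess


def same_commit(sha_a: str, sha_b: str) -> bool:
    """Compare two commit SHAs that may be abbreviated to different lengths."""
    a, b = sha_a.strip().lower(), sha_b.strip().lower()
    if not a or not b:
        return False
    shorter, longer = sorted((a, b), key=len)
    return longer.startswith(shorter)


def merge_base(branch: str, target: str = "main", repo: str = ".") -> str:
    """Return the common ancestor of two refs via `git merge-base`."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", target, branch],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def baseline_guard(merge_base_sha: str, baseline_sha: str) -> str:
    """Hypothetical guard: PASS only when the branch forks from the baseline."""
    return "PASS" if same_commit(merge_base_sha, baseline_sha) else "FAIL"
```

With the SHAs from this review, `baseline_guard("0bd6ed77", "61d86a7a")` evaluates to `"FAIL"`, which matches the reported violation.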

Posted by Factory CEO

codecov · 2026-06-05T22:36:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.77%. Comparing base (5985563) to head (854f955).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #498   +/-   ##
=======================================
  Coverage   86.77%   86.77%           
=======================================
  Files          64       64           
  Lines       10027    10029    +2     
=======================================
+ Hits         8701     8703    +2     
  Misses       1326     1326


akashgit · 2026-06-05T22:49:52Z

✅ Factory Review: KEEP

Verdict: KEEP
Reason: Rename interactive→design mode + Symphony SPEC.md output. All 2218 tests pass. Score held: 0.782→0.782. Precheck: all 4 gates pass. Code review: CLEAN on structured + headless review.

Experiment: #2
Hypothesis: Rename interactive mode to design mode and adopt Symphony-style SPEC.md output

Score Comparison

| Metric | Value |
|---|---|
| Before | 0.7820 |
| After | 0.7822 |
| Delta | +0.0002 |
| Threshold | 0.6000 |

Guard Checks

| Check | Result |
|---|---|
| scope | ✅ PASS |
| eval_immutable | ✅ PASS |

Posted by Factory CEO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The interactive→design rename changed the distiller output from Vision/Core Features/Architecture to numbered Symphony sections. Update test_has_output_format to check for the new section headers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

akashgit · 2026-06-09T02:46:48Z

@ceo-review

github-actions

❌ Factory Review: REVERT

Verdict: REVERT
Reason: Incomplete renaming — documentation files not updated

Code Review Notes

- The core code changes are excellent and fully backward-compatible, but README.md, docs/*.md, and CHANGELOG.md still reference 'interactive mode' and 'idea.md'. Update the docs to use --mode design as the primary flag.
- The Symphony SPEC.md format is well-structured with RFC 2119 normative language, a good improvement for buildability.
- The backward-compat alias (interactive→design) is properly implemented in cli.py:2293-2294.
- All Python tests are updated correctly, including the new backward-compat test at test_cli.py:825.
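
For readers unfamiliar with the alias pattern mentioned in these notes, here is a minimal sketch of how a deprecated `--mode` value can be kept in the argparse choices and normalized afterwards. It is not the code at cli.py:2293-2294; `MODES`, `MODE_ALIASES`, the `build` default, and `resolve_mode` are assumptions made for illustration.

```python
import argparse

# Hypothetical mode list and alias table; the real values live in cli.py.
MODES = ["design", "build", "improve", "research", "meta"]
MODE_ALIASES = {"interactive": "design"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="factory-sketch")
    # Keep the old name in `choices` so existing invocations still parse.
    parser.add_argument(
        "--mode",
        choices=MODES + sorted(MODE_ALIASES),
        default="build",  # assumed default, for illustration only
    )
    return parser


def resolve_mode(argv: list) -> str:
    """Parse argv and map any deprecated alias to its canonical mode."""
    args = build_parser().parse_args(argv)
    return MODE_ALIASES.get(args.mode, args.mode)
```

Under these assumptions, `resolve_mode(["--mode", "interactive"])` and `resolve_mode(["--mode", "design"])` both return `"design"`, so downstream code only ever sees the new name.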

Posted by Factory CEO

github-actions

✅ Factory Review: KEEP

Verdict: KEEP
Reason: Code implementation complete and backward-compatible. Documentation gap to be addressed in follow-up issue.

Posted by Factory CEO

osilkin98 · 2026-06-11T02:59:29Z

@akashgit triage: #492, #494, and #498 are three generations of the same change (interactive→design rename + Symphony SPEC.md). Since they're all yours — which one should survive? Happy to close the other two once you pick. For what it's worth, #492 has the most precise scope description (explicitly excludes the runner-concept references), while this one is the newest.

akashgit · 2026-06-13T16:52:15Z

Proposal: Merge Distiller into Strategist + Standardize on SPEC.md

After studying this PR and thinking through the implications, here's a proposal for how to evolve the design mode work further.

The Problem

Right now, when a user goes through design mode and approves a spec, the system re-plans the work downstream. The Strategist in Build mode re-decomposes the spec into phases, and in that process, things get moved to the backlog that the user already approved. The spec is treated as a suggestion, not a contract. This causes scope erosion — the user approved 10 features but only 6 get built.

The root cause: there's a redundant re-planning step between the user-approved spec and the Builder. The Strategist re-interprets work the user already signed off on.

Proposal

1. Merge the Distiller agent into the Strategist.

The Distiller and the Strategist (in Build mode) do complementary work on the same artifact:

- Distiller: synthesizes research + raw idea → structured spec (what to build, how, why)
- Strategist (Build mode): takes the spec → decomposes into ordered phases (what order, what scope per PR, what's blocked)

These should be one agent. The Strategist already knows how to prioritize, order by dependency, and scope to one-PR-per-phase. Teaching it to also write a spec is easier than teaching the Distiller strategic thinking.

In design mode, the Strategist would:

1. Read the research (from the Researcher, which still runs first)
2. Synthesize the research + raw idea into a full SPEC.md (what the Distiller does today)
3. Add strategic decomposition: dependency ordering, phase scoping, prioritization (what the Build-mode Strategist does today)
4. Produce a single artifact: SPEC.md with an Implementation Plan section

The user iterates on this in the design loop — they see not just the features but the build order. Once approved, the SPEC.md is a contract.

The Distiller agent gets retired. Its prompt gets folded into the Strategist's design-mode behavior.

2. Standardize the Strategist's output to SPEC.md format across all modes.

Instead of the Strategist producing different formats in different modes (current.md with hypotheses in Improve mode, phased plans in Build mode, etc.), it always produces a SPEC.md. The format adapts to the context but the structure is consistent.

3. Eliminate the Strategist and Researcher steps in Build mode (B0, B1) when a user-approved SPEC.md exists.

If the user already approved a SPEC.md through design mode, the CEO reads the Implementation Plan directly and feeds phases to the Builder. No re-research, no re-planning, no opportunity to downscope.
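
The routing rule in point 3 can be sketched as a small decision function. This is only an illustration of the intended behavior, since the CEO is driven by a prompt rather than by this code; `build_steps`, the `user_approved` flag, and the heading regex are assumptions, while the step labels B0, B1, and B3 are the ones used in this thread.

```python
import re
from typing import Optional

# Matches "## Implementation Plan" with or without a section number.
PLAN_HEADING = re.compile(
    r"^##\s+(?:\d+\.\s+)?Implementation Plan\s*$", re.MULTILINE
)


def build_steps(spec_text: Optional[str], user_approved: bool) -> list:
    """Return the Build-mode steps to run for a given SPEC.md state."""
    has_plan = bool(spec_text) and PLAN_HEADING.search(spec_text) is not None
    if user_approved and has_plan:
        # Approved spec with a plan: go straight to the Builder.
        return ["B3"]
    # Otherwise keep the current flow: Researcher, Strategist, Builder.
    return ["B0", "B1", "B3"]
```

The point of the sketch is that the skip depends on two conditions at once: the spec must be user-approved and it must actually contain an Implementation Plan section.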

The Hybrid SPEC.md Format

This combines what this PR proposes for the Distiller's Symphony output with the Strategist's build-planning capabilities. The top half is the spec (Symphony format from this PR). The bottom half is the strategic decomposition (what the Strategist currently puts in current.md).

# Project Name — Specification

## Normative Language
RFC 2119 keywords...

## 1. Problem Statement
What problem this solves and why it matters.

## 2. Goals and Non-Goals
### 2.1 Goals
- ...
### 2.2 Non-Goals
- ...

## 3. System Overview
### 3.1 Architecture
- ...
### 3.2 Tech Stack
- Language/framework choices with rationale grounded in research

## 4. Core Domain Model
Key entities and their relationships.

## 5. Detailed Specification
### 5.1 Feature: Location Lookup
- **What:** User-visible behavior
- **How:** Implementation approach — libraries, data flow
- **Why:** Research-grounded rationale

### 5.2 Feature: Forecast Display
- **What:** ...
- **How:** ...
- **Why:** ...

## 6. Reference Algorithms
Any non-trivial algorithms or protocols.

## 7. Test and Validation Matrix
How to verify each feature works.

## 8. Implementation Plan

### Phase 1: Project scaffold + eval harness
- [ ] Initialize repo, pyproject.toml, dependencies
- [ ] Create eval/score.py with baseline dimensions
- [ ] Set up CI configuration
- **Scope:** one PR
- **Priority:** FIX (foundation must exist first)

### Phase 2: Core data model + location lookup (§5.1)
- [ ] Implement Location model
- [ ] Implement geocoding API client
- [ ] Add unit tests for location resolution
- **Depends on:** Phase 1
- **Scope:** one PR
- **Priority:** EXPLORE

### Phase 3: Forecast display + CLI (§5.2)
- [ ] Implement forecast rendering
- [ ] Add CLI argument parsing
- [ ] Add error handling for API failures
- **Depends on:** Phase 2
- **Scope:** one PR
- **Priority:** EXPLORE

### Blocked (requires user input)
- Stripe billing — needs STRIPE_API_KEY from user
- Deployment target — user must choose hosting provider

Key differences from the current Symphony Implementation Checklist:

- Phases are grouped and ordered by dependency, not a flat checkbox list
- Each phase is scoped to one PR
- Each phase has a FEEC priority tag
- Phases cross-reference the Detailed Specification sections (§5.1, §5.2)
- There's a Blocked section for things that genuinely need human input (not a dumping ground for deferred work)
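
To make the phased format concrete, here is a rough sketch of how the Implementation Plan above could be read into phases and ordered by dependency. Nothing in the proposal requires a Python parser; the `Phase` dataclass, `parse_phases`, `build_order`, and the regular expressions are assumptions that only follow the template shown in this comment.

```python
import re
from dataclasses import dataclass, field

PHASE_RE = re.compile(r"^### Phase (\d+): (.+)$")
TASK_RE = re.compile(r"^- \[[ xX]\] (.+)$")
DEPENDS_RE = re.compile(r"^- \*\*Depends on:\*\* Phase (\d+)")
PRIORITY_RE = re.compile(r"^- \*\*Priority:\*\* (\w+)")


@dataclass
class Phase:
    number: int
    title: str
    tasks: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)
    priority: str = ""


def parse_phases(spec_text: str) -> list:
    """Collect '### Phase N: ...' blocks with their tasks and tags."""
    phases, current = [], None
    for raw in spec_text.splitlines():
        line = raw.strip()
        match = PHASE_RE.match(line)
        if match:
            current = Phase(int(match.group(1)), match.group(2))
            phases.append(current)
        elif line.startswith("#"):
            current = None  # any other heading (e.g. "### Blocked") ends the phase
        elif current is not None:
            if (m := TASK_RE.match(line)):
                current.tasks.append(m.group(1))
            elif (m := DEPENDS_RE.match(line)):
                current.depends_on.append(int(m.group(1)))
            elif (m := PRIORITY_RE.match(line)):
                current.priority = m.group(1)
    return phases


def build_order(phases: list) -> list:
    """Return phase numbers so each phase follows the phases it depends on."""
    by_number = {p.number: p for p in phases}
    ordered, done = [], set()

    def visit(phase, stack):
        if phase.number in done:
            return
        if phase.number in stack:
            raise ValueError(f"dependency cycle at phase {phase.number}")
        for dep in phase.depends_on:
            if dep not in by_number:
                raise ValueError(f"phase {phase.number} depends on unknown phase {dep}")
            visit(by_number[dep], stack | {phase.number})
        done.add(phase.number)
        ordered.append(phase.number)

    for phase in sorted(phases, key=lambda p: p.number):
        visit(phase, frozenset())
    return ordered
```

Items under the Blocked heading are deliberately not parsed as phases, which mirrors the intent that blocked work is never fed to the Builder.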

What Changes in Each Mode

Design mode (new projects):

- Researcher runs first (unchanged)
- Strategist replaces the Distiller: synthesizes research into SPEC.md with Implementation Plan
- User iterates on the full SPEC.md (features AND build order)
- Once approved, transitions to Build mode

Design mode (existing projects):

- Same flow, but the SPEC.md is an improvement spec scoped to the changes
- Implementation Plan contains the specific changes to make

Build mode:

- When a user-approved SPEC.md exists: skip B0 (Researcher) and B1 (Strategist). The CEO reads the Implementation Plan section and feeds phases directly to the Builder.
- When no SPEC.md exists (e.g., `factory ceo /path` without `--mode design`): the current flow stays (Researcher → Strategist → Builder). The Strategist produces a SPEC.md internally.

Improve mode:

No changes to the Improve mode flow. The Strategist still reads eval data, experiment history, and backlog to generate hypotheses. The output format could adopt the SPEC.md structure (observations map to Problem Statement, hypotheses map to Implementation Plan phases), but the function is unchanged.

What Changes in the CEO Prompt

- Remove all Distiller invocations and replace them with Strategist invocations in design mode.
- In Build mode: when SPEC.md exists with an Implementation Plan section, skip B0/B1 and go directly to B3 (Builder). The CEO reads phases from the Implementation Plan.
- The review gates (B3r, code quality, guard checks) stay identical: they review the Builder's output, not the plan.
- GitHub issue creation reads from SPEC.md phases instead of current.md hypotheses.

Impact on This PR

This PR's rename from interactive → design and the Symphony format are both the right direction. The proposal here builds on top of them:

- The Symphony SPEC.md format from this PR becomes the base format
- The Implementation Checklist section gets upgraded to the phased Implementation Plan described above
- The Distiller prompt (factory/agents/prompts/distiller.md) gets merged into the Strategist prompt
- The CEO prompt's Phase 0 section invokes the Strategist instead of the Distiller
- The CEO prompt's Build mode section skips B0/B1 when SPEC.md exists

The rename and backward-compat alias from this PR are good as-is. The format and agent changes would be follow-up work on top of this PR's foundation.

akashgit · 2026-06-13T17:02:45Z

Follow-up: Standardizing on SPEC.md — Impact Analysis and Testing Plan

Building on the proposal above, here's the detailed breakdown of what actually changes, what doesn't, and how to make sure we don't break anything.

The Core Insight

The only thing that changes is the output format of the Strategist. The Strategist's logic — FEEC prioritization, growth/hygiene balance, backlog convergence, stuck protocol, design space scoring — all stays identical. We're reformatting the output, not rewriting the brain.

The Python code (factory/strategy.py, factory/models.py, factory/store.py, factory/cli.py) doesn't change at all. It treats hypothesis as an opaque string everywhere — ExperimentRecord.hypothesis is just a str, categorize_hypothesis() does keyword matching on free text, hypothesis_similarity() does Jaccard similarity on words. None of it parses markdown structure. We pass the phase description as the hypothesis string and everything works.
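
As an illustration of why word-level matching is insensitive to markdown structure, here is a minimal Jaccard similarity sketch. It is not the code in factory/strategy.py; the tokenization rule and the names `word_set` and `jaccard_similarity` are assumptions.

```python
import re


def word_set(text: str) -> set:
    """Lowercase the text and keep alphanumeric words only."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined set."""
    words_a, words_b = word_set(a), word_set(b)
    union = words_a | words_b
    if not union:
        return 0.0  # convention chosen here for two empty strings
    return len(words_a & words_b) / len(union)
```

For example, "Add structured logging" and "Add structured logging to API modules" share 3 of 6 distinct words, so the similarity is 0.5 whether the text came from a hypothesis block or a phase description.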

What the SPEC.md Format Looks Like in Improve Mode

The key constraint: this format must be friendly to the existing CEO review logic. All the tags the CEO currently checks for (**Growth dimension:**, **Category:**, **Backlog item:**, **Type:** operational, **Execution step:**) stay exactly as they are. They just live inside SPEC.md sections instead of hypothesis blocks.

# Improvement Cycle — Specification

## 1. Current State
- Composite: 0.72
- Weakest: observability (0.3)
- Last 3 experiments: #5 keep (+0.02), #6 revert (-0.01), #7 keep (+0.03)
- Pattern: observability consistently underserved

## 2. Goals and Non-Goals
### 2.1 Goals
- Improve observability from 0.3 to 0.6
- Fix flaky auth test
### 2.2 Non-Goals
- Not optimizing API latency this cycle

## 3. Design Space
| Dimension | Score | Notes |
|---|---|---|
| Features | 4 | Well-explored |
| Instrumentation | 1 | Underserved |
| ... | ... | ... |

**Underserved:** Instrumentation, Operational execution, Knowledge management

## 4. Detailed Specification

### 4.1 Fix flaky auth test
- **What:** Mock the external OAuth endpoint in test suite
- **How:** Use responses library to stub OAuth token endpoint
- **Why:** Test suite fails intermittently, blocking reliable evals
- **Category:** FIX
- **Expected impact:** tests 0.8→0.9
- **Priority:** high

### 4.2 Add structured logging
- **What:** Add structlog to payment, auth, API modules
- **How:** Replace print statements with structlog, add request ID middleware
- **Why:** Observability is weakest dimension at 0.3
- **Category:** EXPLOIT
- **Backlog item:** add logging to API modules
- **Growth dimension:** observability
- **Expected impact:** observability 0.3→0.6
- **Priority:** high

## 5. Implementation Plan

### Phase 1: Fix flaky auth test (§4.1, FIX)
- [ ] Add responses mock for OAuth endpoint
- [ ] Verify test passes 10 consecutive runs
- **Scope:** one PR

### Phase 2: Add structured logging (§4.2, EXPLOIT)
- [ ] Add structlog to payment module
- [ ] Add structlog to auth module
- [ ] Add structlog to API module
- [ ] Add request ID middleware
- **Depends on:** Phase 1
- **Scope:** one PR

## 6. Anti-patterns
- Don't retry the same prompt change (reverted 3x)

## 7. Blocked (requires user input)
- (none this cycle)

## 8. Proposed Backlog Additions
- Add rate limiting to API

Notice the tags are identical to today's hypothesis format — **Category:** FIX, **Growth dimension:** observability, **Backlog item:** ..., **Type:** operational, **Execution step:**. The CEO's review criteria don't need new logic, they just look in ## 4. Detailed Specification sections instead of #### H1: blocks.
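
To show that the same tags remain machine-readable in the new layout, here is a small sketch that collects the `**Tag:** value` lines under each numbered Detailed Specification subsection. The CEO's review is prompt-based, so this is only an illustration; `extract_tags` and both regular expressions are assumptions based on the template above.

```python
import re

SECTION_RE = re.compile(r"^### (?P<id>\d+\.\d+) (?P<title>.+)$")
TAG_RE = re.compile(r"^- \*\*(?P<name>[^:*]+):\*\*\s*(?P<value>.*)$")


def extract_tags(spec_text: str) -> dict:
    """Map each '### N.M Title' subsection to its tag/value pairs."""
    sections = {}
    current = None
    for raw in spec_text.splitlines():
        line = raw.strip()
        section = SECTION_RE.match(line)
        if section:
            current = sections.setdefault(
                section["id"], {"title": section["title"]}
            )
        elif line.startswith("#"):
            current = None  # a different heading closes the subsection
        elif current is not None:
            tag = TAG_RE.match(line)
            if tag:
                current[tag["name"]] = tag["value"].strip()
    return sections
```

Phase headings such as `### Phase 1: ...` do not match the numbered-subsection pattern, so tags inside the Implementation Plan are not mixed into the specification entries.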

The Implementation Plan (§5) adds the dependency ordering and phase scoping that the CEO currently gets from the Strategist's hypothesis ordering. In Improve mode, this is mostly the same as FEEC ordering — FIX phases first, then EXPLOIT, then EXPLORE. The cross-references (§4.1, §4.2) connect phases back to the detailed spec.

Exact Scope of Changes

Prompt files (the real work):

| File | Change | Detail |
|---|---|---|
| `strategist.md` | Rewrite output section | Replace the hypothesis template (lines 98-147) with SPEC.md template. Add the Distiller's spec-writing and grounding protocol for design mode. All logic sections (FEEC, growth/hygiene, backlog priority, stuck protocol, design space, research mode) stay word-for-word. |
| `ceo.md` | ~15 string replacements + design mode routing | "For each hypothesis" → "For each phase in the Implementation Plan". `## New Backlog Items` → `## Proposed Backlog Additions`. `## Deferred` → `## Blocked`. Design mode: invoke Strategist instead of Distiller. Build mode: skip B0/B1 when user-approved SPEC.md exists. |
| `distiller.md` | Delete | Capabilities merged into Strategist. |
| `builder.md` | 1 line | "translates hypotheses" → "translates specifications" |
| `reviewer.md` | 1 line | "the experiment hypothesis" → "the experiment specification" |
| `evaluator.md` | 1 line | Wording only |

Python code (no functional changes):

| File | Change |
|---|---|
| `factory/cli.py` | Remove `distiller` from agent role list if registered separately. Everything else passes strings, with no format parsing. |
| `factory/models.py` | None. `ExperimentRecord.hypothesis` stays as `str`. |
| `factory/strategy.py` | None. `categorize_hypothesis()` and `hypothesis_similarity()` work on free text. |
| `factory/store.py` | None. TSV stores strings. |

Testing Plan — Mode by Mode

This is a format change to the Strategist's output. Every mode that reads the Strategist's output must be explicitly tested to verify nothing breaks.

1. Improve mode (MOST CRITICAL — must not regress)

Improve mode works well today. The format change must be invisible to the downstream pipeline. Test:

2. Design mode (new projects)

This is the mode that benefits most from the change. Test:

- Strategist (replacing Distiller) synthesizes research + raw idea into full SPEC.md
- SPEC.md includes both specification sections (Problem Statement, Goals, System Overview, Detailed Spec) AND Implementation Plan with phased ordering
- User can iterate on the SPEC.md with feedback (refinement loop works)
- Grounding protocol enforced: research citations in every feature
- Once approved, CEO transitions to Build mode and reads the Implementation Plan directly
- B0 (Researcher) and B1 (Strategist) are skipped, with no re-planning
- Builder receives phases from the approved SPEC.md
- No scope erosion: everything the user approved gets built
- `## Blocked` section only contains items genuinely blocked on human input

3. Design mode (existing projects)

- Strategist produces an improvement-scoped SPEC.md (not a full project spec)
- Current State section reflects eval scores and recent history
- Implementation Plan is scoped to the improvement, not the whole project
- Transitions to Improve mode with the approved spec as focus

4. Build mode (without design mode — e.g., factory ceo /path on a new project)

When there's no user-approved SPEC.md, the Strategist still runs at B1 and produces a SPEC.md internally. Test:

- Strategist produces a valid SPEC.md with Implementation Plan at B1
- CEO reviews the SPEC.md with the same criteria (phase scoping, deferral strictness)
- Builder reads phases from the Implementation Plan
- `## Blocked` section replaces `## Deferred`, with the same semantics and the same strictness

5. Research mode

Research mode has its own Strategist template with **Failure mode:** and **Mutable surface:** fields. Test:

- These fields appear in `## 4. Detailed Specification` sections
- Surface constraints are preserved in the SPEC.md
- CEO review correctly validates surfaces are within mutable set
- Phase ordering follows research-mode FEEC (FIX is primary)

6. Meta mode

Meta mode is Improve mode + ACE playbook evolution. If Improve mode works, Meta mode should work. Test:

- Full Meta cycle completes with SPEC.md format
- ACE reads experiment history correctly (hypothesis strings are just phase descriptions)

7. Backward compatibility

- `--mode interactive` still works (alias from this PR)
- Existing `.factory/strategy/current.md` files from prior runs don't crash anything (CEO handles old format gracefully during resume)
- `factory history` displays phase descriptions readably
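
One way to reason about the old-format case is a small format detector that a resume path could consult before reading a strategy file. This is a hypothetical helper, not part of the proposal: `detect_strategy_format` and both patterns are assumptions, based only on the `#### H1:` blocks and numbered `##` sections mentioned earlier in this comment.

```python
import re

# Old layout: hypothesis blocks such as "#### H1: ...".
HYPOTHESIS_BLOCK = re.compile(r"^#### H\d+:", re.MULTILINE)
# New layout: numbered SPEC.md sections.
SPEC_SECTION = re.compile(
    r"^## \d+\. (?:Detailed Specification|Implementation Plan)\s*$",
    re.MULTILINE,
)


def detect_strategy_format(text: str) -> str:
    """Classify a strategy file as 'spec', 'hypothesis', or 'unknown'."""
    if SPEC_SECTION.search(text):
        return "spec"
    if HYPOTHESIS_BLOCK.search(text):
        return "hypothesis"
    return "unknown"
```

Returning `"unknown"` instead of raising keeps the resume path from crashing on files that match neither layout.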

Recommended Implementation Order

1. Merge this PR first (rename to design + Symphony format), since it's the foundation
2. Update the SPEC.md Implementation Checklist to the phased Implementation Plan format (add dependency ordering, scope per phase, FEEC tags, `## Blocked` section)
3. Rewrite the Strategist's output section to produce SPEC.md in all modes, keeping all logic sections unchanged
4. Merge the Distiller's spec-writing capabilities (grounding protocol, What/How/Why structure, research config for research mode) into the Strategist prompt
5. Update the CEO prompt: ~15 touchpoints where "hypothesis" → "phase", section name changes, and design mode routing
6. Delete distiller.md
7. Run the full testing plan above, especially Improve mode e2e
8. Update CLAUDE.md: architecture docs, agent list, .factory/ layout

Steps 2-6 can be one PR. Step 7 gates the merge.

akashgit · 2026-06-13T22:09:45Z

The Distiller+Strategist merge proposal from this thread is now tracked as #523 and is being implemented. The scope is narrower than the full SPEC.md standardization discussed here — just merge the two agents, skip the redundant B0+B1 steps in interactive/research modes, and retire the Distiller. Format changes (SPEC.md etc) are follow-up work.

Related issues from the same discussion:

- #524: Parallelize Researcher spawning (fan out multiple researchers per invocation)
- #525: Replace wall-clock timeout with inactivity-based timeout in agent runner

akashgit force-pushed the factory/run-0e0b2fb8 branch from 788dcea to 3fbede3 on June 5, 2026 22:25

akashgit marked this pull request as ready for review June 5, 2026 22:49

akashgit and others added 2 commits June 9, 2026 02:44

feat: rename interactive mode to design mode + Symphony SPEC.md output

180d579

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

akashgit force-pushed the factory/run-0e0b2fb8 branch from 3d42a2e to 854f955 on June 9, 2026 02:46

github-actions Bot requested changes Jun 9, 2026

github-actions Bot approved these changes Jun 9, 2026

osilkin98 added the labels competing (Another open PR solves the same problem), kind:capability (Does something new), and stage:intent (Capturing/protecting what the user wants: specs, scope, design) on Jun 11, 2026

This was referenced Jun 11, 2026:

- feat: rename --mode interactive to --mode design + Symphony-style SPEC.md output #494 (Closed)
- Rename --mode interactive to --mode design + adopt Symphony-style SPEC.md output #492 (Closed)

This was referenced Jun 13, 2026:

- docs: add meta-harness specification #423 (Open)
- Merge Distiller into Strategist — eliminate redundant re-planning in interactive/research modes #523 (Closed)

akashgit wants to merge 2 commits into main from factory/run-0e0b2fb8
