
Concentrate agent autofix-boundary and report-reading guidance#55

Merged
pengfei-threemoonslab merged 4 commits into main from claude/sweet-mcclintock-c3ba52
May 8, 2026

@pengfei-threemoonslab
Contributor

Summary

  • Add four agent-facing docs that concentrate guidance previously scattered across AGENTS.md, agent-contract-current.md, autofix-policy.md, agent-recipes.md, and target-repo-agent-snippets.md
  • Cross-link sweep across seven existing files so any agent-facing entry point reaches the new pages in two hops or fewer

Type

  • Check or risk-model change
  • Input adapter change
  • CLI or GitHub Action behavior
  • Report, schema, or SARIF output
  • Documentation only

Why

A new agent or CI integrator coming to Agents Shipgate had to read five files in the right order to learn (a) what an agent may safely do mechanically, (b) what conclusions an agent must defer to a human reviewer, and (c) how to walk report.json correctly. The guidance was complete but spread thin. This PR concentrates it into purpose-built pages and wires every agent-facing entry point through to them.

What's in each new doc

  • docs/agent-autofix-boundary.md — behavioral counterpart to the mechanical autofix-policy.md. Opens with the load-bearing distinction (autofix-policy.md answers "will apply-patches run this?"; this page answers "what may an agent assert in a PR comment?"). Includes a check-ID mapping table covering the seven categories (approval, confirmation, idempotency, broad-scope, prohibited-action, runtime trace evidence, plus an override-refusal script).
  • docs/report-reading-for-agents.md — reader's primer that walks report.json in order: release_decision.decision first, then supporting fields, then findings, then per-finding autofix flags, then packet human_in_the_loop. Anti-patterns include a concrete summary.status -> release_decision.decision code rewrite. Includes a full schema-version table.
  • docs/agents/use-with-codex.md — mirror of use-with-claude-code.md for OpenAI Codex. On-ramp is the canonical AGENTS.md snippet from target-repo-agent-snippets.md; no slash command or skill bundle to install.
  • docs/agents/use-with-cursor.md — mirror for Cursor. On-ramp is the auto-attach .cursor/rules/agents-shipgate.mdc rule; explains the globs: + alwaysApply: false mechanism.
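
The reading order described for report-reading-for-agents.md can be sketched roughly as follows. This is a minimal illustration, not the primer's own code: only `release_decision.decision`, `summary.status`, and `findings` are named in this PR, and the sample values (`"passed"`, `"blocked"`, `"low"`, `SHIP-EXAMPLE`) are invented.

```python
import json
from typing import Any


def read_report(report: dict[str, Any]) -> dict[str, Any]:
    """Walk a report dict in the order the primer recommends."""
    release = report.get("release_decision") or {}
    # 1. The release decision is the headline; summary.status is not a substitute.
    decision = release.get("decision")
    # 2. Supporting fields: everything else under release_decision.
    supporting = {k: v for k, v in release.items() if k != "decision"}
    # 3. Findings are read after the decision, never instead of it.
    findings = report.get("findings") or []
    return {
        "decision": decision,
        "supporting": supporting,
        "finding_count": len(findings),
    }


# Hypothetical report fragment; every value here is invented for illustration.
sample = json.loads(
    '{"summary": {"status": "passed"},'
    ' "release_decision": {"decision": "blocked",'
    '                      "evidence_coverage": {"level": "low"}},'
    ' "findings": [{"check_id": "SHIP-EXAMPLE"}]}'
)
result = read_report(sample)
print(result["decision"], result["finding_count"])
```

The sample deliberately pairs a passing `summary.status` with a non-passing decision to show why the primer treats reading `summary.status` as an anti-pattern.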

Cross-link sweep

  • docs/INDEX.md — "For agents" lists all four new docs plus the previously-missing use-with-claude-code.md and agent-contract-current.md.
  • AGENTS.md — Task 2 points at the report primer; "What you can't do" leads with a CLI-vs-agent boundary clarification; the Claude Code paragraph at line 415 is now an "Editor / agent integrations" subsection covering all three editors.
  • docs/agent-contract-current.md — new "See also" section beneath "Authoritative references" (kept separate so schemas stay in the authoritative list and reader's-guides land in see-also).
  • docs/autofix-policy.md — "See also" leads with the boundary doc as the behavioral counterpart.
  • docs/agent-recipes.md — "Reference" list expanded.
  • prompts/README.md and docs/agents/use-with-claude-code.md — Codex/Cursor punts replaced with real links; Aider stays paste-only.

Vocabulary discipline

The canonical six-item phrase ("approval, confirmation, idempotency, broad-scope, or prohibited-action policy decisions") that already lived in target-repo-agent-snippets.md is now used verbatim in the four new docs, with runtime trace evidence added as the seventh category per the brief — flipping a trace patches the evidence record, not the runtime gate.

Verification

CI is authoritative for python -m ruff check ., python -m compileall -q src tests, and python -m pytest.

Additional local checks run:

  • Link integrity: extracted every relative link from the four new docs (grep -oE '\]\([^)]+\)') and confirmed every target file/anchor exists. All ten check-ID anchors (#ship-policy-approval-missing etc.) match ### SHIP-... headers in docs/checks.md.
  • Vocabulary parity: the canonical six-item phrase appears verbatim in the four new docs plus the existing target-repo-agent-snippets.md; "runtime trace evidence" is called out as the seventh.
  • Discoverability: release_decision.decision is now mentioned in 13 files across docs/, AGENTS.md, and prompts/README.md. Sampled three entry points (AGENTS.md Task 2, docs/INDEX.md "For agents", prompts/README.md) — each reaches both new top-level docs in one hop.
  • No regressions: git diff on the seven modified existing files is additive only (new lines, new bullets, new "See also" entries) plus the targeted "Editor / agent integrations" subsection split in AGENTS.md.
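
For readers who want to repeat the link-integrity check, a rough Python equivalent of the grep step is sketched below. It is an assumption-laden illustration rather than the script actually used: `broken_links` is a hypothetical helper, and it only verifies that target files exist, not that the `#anchor` part matches a header.

```python
import re
import tempfile
from pathlib import Path

# Same shape as the grep pattern in the PR description: "](target)".
LINK_RE = re.compile(r"\]\(([^)]+)\)")


def broken_links(doc: Path) -> list[str]:
    """Return relative link targets in `doc` whose file does not exist."""
    missing = []
    for target in LINK_RE.findall(doc.read_text(encoding="utf-8")):
        if re.match(r"[a-z][a-z0-9+.-]*:", target, re.I):
            continue  # absolute URL (https:, mailto:, ...), not a relative link
        path_part = target.split("#", 1)[0]
        if not path_part:
            continue  # same-page anchor; anchor checking is out of scope here
        if not (doc.parent / path_part).exists():
            missing.append(target)
    return missing


# Tiny throwaway fixture so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "checks.md").write_text("### SHIP-EXAMPLE\n", encoding="utf-8")
    page = root / "page.md"
    page.write_text(
        "[ok](checks.md#ship-example) [gone](missing.md) [web](https://example.com)",
        encoding="utf-8",
    )
    demo_missing = broken_links(page)

print(demo_missing)
```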

Release-readiness notes

  • No user-code import added to default scan paths
  • No network access added to default scan paths
  • New or changed check IDs are documented in docs/checks.md (no new check IDs; existing IDs are referenced)
  • Report/schema changes are additive or documented in STABILITY.md (no schema changes)

🤖 Generated with Claude Code

pengfei-threemoonslab and others added 4 commits May 8, 2026 15:34
Boundary and report-reading guidance was scattered across AGENTS.md,
agent-contract-current.md, autofix-policy.md, agent-recipes.md, and
target-repo-agent-snippets.md, so a new agent integrator had to hop
between files to learn what is mechanically safe vs. what requires
human review and how to walk report.json. Concentrate that guidance
into purpose-built pages and cross-link every agent-facing entry
point so the right reading order is hard to miss.

Add four docs: agent-autofix-boundary.md (behavioral counterpart to
the mechanical autofix-policy.md, with a check-ID mapping table and
override-refusal script), report-reading-for-agents.md (reader's
primer that walks report.json starting from release_decision.decision,
with concrete summary.status -> release_decision.decision rewrite),
agents/use-with-codex.md, and agents/use-with-cursor.md (mirrors of
use-with-claude-code.md for editors that lack a slash-command/skill
bundle; on-ramps are the existing AGENTS.md and .cursor/rules
snippets in target-repo-agent-snippets.md).

Cross-link sweep: docs/INDEX.md "For agents" lists all four new docs
plus the previously-missing use-with-claude-code.md and
agent-contract-current.md; AGENTS.md Task 2 points at the report
primer; AGENTS.md "What you can't do" leads with a CLI-vs-agent
boundary clarification; AGENTS.md gains an "Editor / agent
integrations" subsection covering all three editors;
agent-contract-current.md gets a See-also section; autofix-policy.md
See-also leads with the boundary doc as the behavioral counterpart;
agent-recipes.md References gains both new docs;
prompts/README.md and use-with-claude-code.md replace their Codex/
Cursor punts with real links.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three corrections to the new agent docs in PR #55:

- agent-autofix-boundary.md: SHIP-EVIDENCE-APPROVAL-TRACE-MISSING was
  miscategorized under "Approval policy." It fires when an
  approval-required tool is missing local HITL trace evidence — the
  boundary is about trace artifacts, not about the policy declaration.
  Move it to the runtime trace evidence row and broaden that row's
  handoff text to cover "trace evidence missing OR trace shows
  policy-controlled call without approval/confirmation."
- agent-autofix-boundary.md: prohibited-action row promised check IDs
  but listed only the manifest field. Add SHIP-SCOPE-PROHIBITED-TOOL-
  PRESENT as the canonical check ID an agent walking findings[] will
  actually see, with handoff text matching the check description.
- use-with-cursor.md: Cursor's auto-attach rules require a matching
  file in chat context (alwaysApply: false). The verify step said "in
  a fresh chat" which would not trigger the rule. Make the second
  step explicitly require the matching file be open or @-referenced,
  and add a sentence explaining alwaysApply: false is the intended
  behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Codex integration guide claimed Codex has no slash-command or
skill mechanism. That is stale: Codex Skills ship in the CLI, IDE
extension, and app, with bundles installable at .codex/skills/
(project-scoped) or ~/.codex/skills/ (user-scoped), invoked via
/skill-name or implicitly when Codex decides.

Reframe the integration story: this repo does not currently ship a
Codex skill bundle (the parallel to skills/agents-shipgate/ for
Claude Code has not been authored), so the AGENTS.md snippet is the
minimal on-ramp that works today — not an inherent Codex limitation.
Add a "What's next" section documenting the SKILL.md path
(.codex/skills/agents-shipgate/SKILL.md) and the building blocks
(prompts/ recipes, advisory CI workflow, agent-autofix-boundary.md)
for assembling one locally before this repo ships an official one.

Update AGENTS.md "Editor / agent integrations" entry to match.

Source: https://developers.openai.com/codex/skills

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rageDecision fields

Two factual errors in PR #55:

- Codex skill paths/invocation. The doc said .codex/skills/ and
  ~/.codex/skills/ with /skill-name invocation. Per current OpenAI
  docs, Codex scans .agents/skills/ in every directory from the
  working directory up to the repo root, plus $HOME/.agents/skills/
  for user-scoped skills, and invocation is /skills or $<skill-name>
  (or implicit when Codex matches the task). Following the wrong
  paths would put SKILL.md in a directory Codex never scans. Fix all
  four occurrences (use-with-codex.md intro, surface table, What's
  next install location, AGENTS.md Editor integrations entry).

- EvidenceCoverageDecision schema. report-reading-for-agents.md
  documented release_decision.evidence_coverage.{level,
  human_review_recommended, warnings} but the v0.10 schema has no
  warnings field. The actual fields are level,
  human_review_recommended, low_confidence_tool_count, and
  source_warning_count (docs/report-schema.v0.10.json:275-302).
  Replace warnings with the two real count fields.
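
As an illustration of the corrected field list, a reader could pull the four `evidence_coverage` fields like this. The helper name and the sample values are hypothetical; only the field names come from this commit message.

```python
from typing import Any

# The four fields this commit names for release_decision.evidence_coverage.
EVIDENCE_FIELDS = (
    "level",
    "human_review_recommended",
    "low_confidence_tool_count",
    "source_warning_count",
)


def evidence_coverage(report: dict[str, Any]) -> dict[str, Any]:
    """Pick out the documented evidence_coverage fields, tolerating gaps."""
    coverage = (report.get("release_decision") or {}).get("evidence_coverage") or {}
    return {name: coverage.get(name) for name in EVIDENCE_FIELDS}


# Invented sample values, for illustration only.
sample = {
    "release_decision": {
        "evidence_coverage": {
            "level": "partial",
            "human_review_recommended": True,
            "low_confidence_tool_count": 2,
            "source_warning_count": 1,
        }
    }
}
picked = evidence_coverage(sample)
print(picked)
```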

Source: https://developers.openai.com/codex/skills
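
The corrected scan locations from the first bullet can be enumerated with a short sketch. It models only the description in this commit message (every directory from the working directory up to the repo root, plus the user-scoped directory); the function name and the nearest-first ordering are assumptions, and this is not Codex's own lookup code.

```python
from pathlib import PurePosixPath


def skill_search_dirs(
    cwd: PurePosixPath, repo_root: PurePosixPath, home: PurePosixPath
) -> list[PurePosixPath]:
    """List candidate .agents/skills directories, nearest first (assumed order)."""
    dirs = []
    current = cwd
    while True:
        dirs.append(current / ".agents" / "skills")
        # Stop at the repo root, or at the filesystem root as a safety net.
        if current == repo_root or current == current.parent:
            break
        current = current.parent
    dirs.append(home / ".agents" / "skills")  # user-scoped location
    return dirs


demo = skill_search_dirs(
    cwd=PurePosixPath("/repo/services/api"),
    repo_root=PurePosixPath("/repo"),
    home=PurePosixPath("/home/dev"),
)
for directory in demo:
    print(directory)
```

A SKILL.md placed under `.codex/skills/` would not appear in this list, which is the failure mode the commit describes.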

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pengfei-threemoonslab merged commit e00ddc5 into main on May 8, 2026
1 check passed
pengfei-threemoonslab deleted the claude/sweet-mcclintock-c3ba52 branch on May 8, 2026 23:31
pengfei-threemoonslab added a commit that referenced this pull request May 8, 2026
Picks up: contract CLI command (#52), v0.10.0 release tag (#53),
init --agent-instructions (#54), HITL evidence provenance (#50),
agent autofix-boundary docs (#55), packet schema v0.3.

Conflict resolution: kept v0.11 report-schema references on top of
main's v0.10.0 release / packet-schema v0.3 / contract-command
additions. AGENTS.md and SKILL.md adopt main's centralized
"contract lives in agent-contract-current.md" pattern; the v0.11
provenance line lives there now. test_public_surface_contract.py
adopts main's derive-from-model approach for the current schema
constants and just adds v0.10 to the legacy-pattern list.

Also fixes a SARIF regression flagged in review: ``_location()``
chose the structured branch whenever ``source.path`` was set, so a
finding with ``path="foo.py"`` and legacy
``location="foo.py:10"`` emitted no ``region``. Hybrid / plugin
findings now fall back to ``_split_location(source.location or
source.ref)`` when ``start_line`` is absent. Adds a regression
test.
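
The fallback described above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: the `Source` dataclass is a hypothetical stand-in for the real model, and `split_location` / `location` only mirror the behavior the message attributes to `_split_location()` and `_location()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Source:  # hypothetical stand-in for the finding's source model
    path: Optional[str] = None
    start_line: Optional[int] = None
    location: Optional[str] = None
    ref: Optional[str] = None


def split_location(text: Optional[str]) -> tuple[Optional[str], Optional[int]]:
    """Split a legacy 'file:line' string into (path, line)."""
    if not text:
        return None, None
    path, sep, line = text.rpartition(":")
    if sep and line.isdigit():
        return path, int(line)
    return text, None


def location(source: Source) -> dict:
    """Build a SARIF-style physicalLocation, keeping the line when known."""
    legacy_path, legacy_line = split_location(source.location or source.ref)
    # Prefer structured fields; fall back to the legacy string when absent.
    line = source.start_line if source.start_line is not None else legacy_line
    physical = {"artifactLocation": {"uri": source.path or legacy_path}}
    if line is not None:
        physical["region"] = {"startLine": line}
    return physical


# The hybrid case from the commit message: structured path, legacy line.
hybrid = location(Source(path="foo.py", location="foo.py:10"))
print(hybrid)
```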

After merge: 805 passed (+3 skipped), ruff clean,
``agents-shipgate contract --json`` reports
``report_schema_version: "0.11"``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pengfei-threemoonslab added a commit that referenced this pull request May 9, 2026
CI on PR #56 caught the stale llms-full.txt: the merge of origin/main
brought in three commits (#54 init --agent-instructions, #55 autofix-
boundary docs, #51 v0.11 schema with source provenance) that touched
AGENTS.md and the report-schema literals, so the committed llms-full.txt
no longer matched what `scripts/build-llms-full.py` produces from the
post-merge sources.

Re-ran the build script and committed the regenerated file. Now:

- 859 tests pass + 3 skipped (was 748 on this branch; +111 from main's
  new tests for source provenance, agent-instructions renderers,
  managed blocks, etc.)
- ruff clean
- llms-full.txt is byte-identical to `render(REPO_ROOT)`, so
  test_llms_full_is_up_to_date passes

No code changes — purely the rebuilt artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pengfei-threemoonslab added a commit that referenced this pull request May 9, 2026
* Add trigger catalog, zero-install detector, and adoption benchmark

Shipping the first two weeks of the agent-adoption strategy in one
PR. The unifying goal is to make Agents Shipgate self-discoverable to
coding agents (Claude Code, Codex, Cursor) that haven't already
adopted it: give them machine-readable trigger rules, a zero-install
path to verify relevance, and a public benchmark that measures whether
adoption-improving edits are working.

Trigger surface:

- docs/triggers.json — machine-readable mirror of the AGENTS.md
  trigger table. Every rule carries an `agents_md_row` field that the
  contract test pins against the AGENTS.md prose, so the two cannot
  drift. Stable for 0.x.
- src/agents_shipgate/triggers.py — canonical loader and evaluator
  with predicate vocabulary documented in triggers.json itself.
  `python -m agents_shipgate.triggers shipgate.yaml prompts/refund.md`
  returns a run/skip verdict plus matched rules.
- prompts/decide-shipgate-relevance.md — relevance-decision prompt
  that walks an agent through fetching triggers.json and applying it
  to a PR diff before any other Shipgate prompt fires.
- AGENTS.md, llms.txt, .well-known/agents-shipgate.json, pyproject.toml,
  README — cross-link the new surface so every entry surface points
  at every other.

Long-form reference:

- llms-full.txt — concatenated AGENTS.md + recipes + contract +
  checks + concepts + autofix-policy in one document for AI search
  engines and coding agents that prefer one fetch.
- scripts/build-llms-full.py — deterministic generator; the
  contract test fails if a source file changes without regenerating.
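
The generator-plus-drift-guard idea can be illustrated with a small sketch. The real `scripts/build-llms-full.py` is not shown in this PR, so the `render` signature, the separator comment, and `is_up_to_date` below are all hypothetical; the point is only that a deterministic render makes staleness a plain string comparison.

```python
import tempfile
from pathlib import Path


def render(root: Path, sources: list[str]) -> str:
    """Concatenate source files in a fixed order so output is deterministic."""
    parts = []
    for rel in sources:
        body = (root / rel).read_text(encoding="utf-8").rstrip("\n")
        parts.append(f"<!-- source: {rel} -->\n{body}\n")
    return "\n".join(parts)


def is_up_to_date(root: Path, sources: list[str], artifact: str = "llms-full.txt") -> bool:
    """True when the committed artifact equals a fresh render."""
    return (root / artifact).read_text(encoding="utf-8") == render(root, sources)


# Throwaway fixture: build the artifact, then change a source without rebuilding.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "AGENTS.md").write_text("# Agents\n", encoding="utf-8")
    (root / "llms-full.txt").write_text(render(root, ["AGENTS.md"]), encoding="utf-8")
    fresh_before = is_up_to_date(root, ["AGENTS.md"])
    (root / "AGENTS.md").write_text("# Agents\nNew rule\n", encoding="utf-8")
    fresh_after = is_up_to_date(root, ["AGENTS.md"])

print(fresh_before, fresh_after)
```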

Zero-install path:

- tools/shipgate-detect.py — stdlib-only Python detector that
  replicates the structural verdict of `agents-shipgate detect --json`
  without requiring a local install. Pinned to the canonical CLI by
  tests/test_zero_install_detector.py across all 8 sample fixtures
  (same is_agent_project, same fired frameworks, same suggested
  sources).
- docs/zero-install.md — three zero-install paths (single-file
  detector, uvx, GitHub Action) with a decision matrix.
- docs/quickstart.md now leads with the zero-install detector before
  the install section.

Benchmark scaffolding:

- benchmark/ — frozen archetypes, four prompts (none mention
  Shipgate by name), five setup variants, tester-facing runbook,
  results CSV schema, and an upstream-PR tracker. The headline
  metric is the delta between `00-no-hints` and `10-agents-md` on
  the discovery rubric. Manual W2 baseline run is the next step.

Drift guards:

- tests/test_public_surface_contract.py extends the existing
  drift suite with checks for triggers.json/AGENTS.md row parity,
  llms-full.txt freshness via the build script's render(), and a
  parametrized "every prompt is mirrored to skills/" assertion.
- tests/test_zero_install_detector.py adds 24 parity tests
  pinning the zero-install script to the canonical CLI on every
  sample.

Verification: 733 pytest passes (709 W1 baseline + 24 new); ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix trigger evaluator precedence and decorator-rule reachability

Addresses three review findings on the trigger catalog landed in the
parent commit:

P1 (decorator rules unreachable from the prompt):
  prompts/decide-shipgate-relevance.md piped only `git diff --name-only`
  into the evaluator, so `diff_contains` rules (TRIGGER-FUNCTION-TOOL-
  DECORATOR, TRIGGER-FRAMEWORK-VERSION-BUMP, TRIGGER-SHIPGATE-CI-WORKFLOW
  Action match) silently never fired — agents following the prompt
  would skip a PR that only adds `@function_tool`.

  Add `--git-diff [REVSPEC]` to `python -m agents_shipgate.triggers`,
  which shells out to `git diff --name-only [REVSPEC]` AND `git diff
  [REVSPEC]` to populate paths and diff body in one call. Update the
  prompt's Option B to use it.

P2 (run_shipgate silently overrode skip_shipgate):
  A README-only diff that incidentally mentioned `@tool` (or quoted
  the Action URL) returned `run_shipgate: true` because the evaluator
  treated `has_run` as winning over `has_skip`, making the docs-only
  negative rule effectively dead.

  Reorder the precedence: stop_conditions → force_run → skip → run →
  dry_run. `skip_shipgate` now beats `run_shipgate`. To preserve the
  "manifest present means always run" semantic, promote
  TRIGGER-EXISTING-MANIFEST-PRESENT to a new action `force_run` that
  overrides skip — an opted-in repo's docs-only PR still scans because
  the cost is low and tool-adjacent prose can matter.

P3 (dry_run rules silently dead):
  When only TRIGGER-FRAMEWORK-VERSION-BUMP fired, the evaluator
  reported "No rules matched" — the prompt translated that to "do not
  propose Shipgate", making the dry_run rule non-actionable despite
  being in the catalog.

  Add a `dry_run_recommended` field to the evaluator output. When only
  dry_run rules match, `run_shipgate` stays false but the field is
  true and the rationale names the matched rules. The prompt now
  routes this state to "propose a non-mutating scan; do not propose
  init --write".

triggers.json gains an `actions` block describing each action's
semantics and an `action_precedence` array documenting the
high-to-low order. Both are reference material for an agent reading
the catalog directly.
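
To make the reordered precedence concrete, here is a minimal sketch of an evaluator that follows it. It is not the code in src/agents_shipgate/triggers.py: the rule shape (`id` plus `action`) is assumed, the rationale strings are invented, and `stop_conditions` is left out because its semantics are not described in this message.

```python
from typing import Any


def evaluate(matched: list[dict[str, str]]) -> dict[str, Any]:
    """Resolve matched rules using force_run > skip > run > dry_run."""
    actions = {rule["action"] for rule in matched}
    if "force_run" in actions:
        run, why = True, "force_run overrides skip"
    elif "skip_shipgate" in actions:
        run, why = False, "skip_shipgate beats run_shipgate"
    elif "run_shipgate" in actions:
        run, why = True, "run_shipgate matched"
    else:
        run, why = False, "no run rule matched"
    # dry_run is only surfaced when nothing stronger matched.
    dry_only = bool(actions) and actions <= {"dry_run"}
    return {
        "run_shipgate": run,
        "dry_run_recommended": dry_only,
        "matched_rules": [rule["id"] for rule in matched],
        "rationale": why,
    }


docs_rule = {"id": "TRIGGER-DOCS-ONLY-NEGATIVE", "action": "skip_shipgate"}
decorator_rule = {"id": "TRIGGER-FUNCTION-TOOL-DECORATOR", "action": "run_shipgate"}
manifest_rule = {"id": "TRIGGER-EXISTING-MANIFEST-PRESENT", "action": "force_run"}
bump_rule = {"id": "TRIGGER-FRAMEWORK-VERSION-BUMP", "action": "dry_run"}

docs_only = evaluate([docs_rule, decorator_rule])
opted_in = evaluate([docs_rule, decorator_rule, manifest_rule])
bump_only = evaluate([bump_rule])
print(docs_only["run_shipgate"], opted_in["run_shipgate"], bump_only["dry_run_recommended"])
```

The three sample calls correspond to the P2 and P3 scenarios: a docs-only diff that mentions a decorator, the same diff in a repo with a manifest, and a version bump on its own.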

Tests:
- New: skip beats run on docs-only with @tool in prose
- New: force_run beats skip when manifest present
- New: dry_run sets dry_run_recommended; rule appears in matched_rules
- New: pin TRIGGER-EXISTING-MANIFEST-PRESENT.action == "force_run"
- Updated: _VALID_TRIGGER_ACTIONS includes "force_run"

Verification: 737 pytest passes (was 733; +4 new); ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Cover tests in docs-only skip; fix bare --git-diff; update prompt verification

Three more review findings from PR #56:

P2a (tests-only diff with @tool slips through):
  AGENTS.md says "Pure read-only doc/test changes with no manifest
  impact" should skip, but TRIGGER-DOCS-ONLY-NEGATIVE only matched
  `**/*.md`. A tests-only diff that incidentally mentions
  `@function_tool` (in a fixture or assertion) returned
  `run_shipgate: true` because the broad decorator rule fired and no
  skip rule counterbalanced it.

  Extended the `every_file_matches` predicate to accept either a
  string (existing form, back-compat) or a list (any-of within the
  predicate). Updated the rule's pattern list to include `tests/**`,
  `test/**`, `**/tests/**`, `**/test/**`, `**/test_*.py`,
  `**/*_test.py`, `**/conftest.py`. Mixed docs+tests PRs now skip;
  code+tests mixes still trigger normally.
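
A sketch of the string-or-list predicate is shown below. The glob translation is a simplified assumption (`**/` matches zero or more directories, `*` stays within one path segment) and may differ from the matcher the project actually uses; the empty-diff behavior is also an assumption.

```python
import re
from typing import Union


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob dialect into an anchored regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything within one segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def every_file_matches(paths: list[str], patterns: Union[str, list[str]]) -> bool:
    """True when every path matches at least one pattern (any-of)."""
    if isinstance(patterns, str):  # back-compat: single string form
        patterns = [patterns]
    regexes = [glob_to_regex(p) for p in patterns]
    # Assumption: an empty diff does not count as docs/tests only.
    return bool(paths) and all(
        any(rx.match(path) for rx in regexes) for path in paths
    )


DOCS_AND_TESTS = [
    "**/*.md", "tests/**", "test/**", "**/tests/**", "**/test/**",
    "**/test_*.py", "**/*_test.py", "**/conftest.py",
]

mixed_docs_tests = every_file_matches(["README.md", "tests/test_refund.py"], DOCS_AND_TESTS)
code_plus_tests = every_file_matches(["src/tools.py", "tests/test_tools.py"], DOCS_AND_TESTS)
legacy_string = every_file_matches(["docs/a.md"], "**/*.md")
print(mixed_docs_tests, code_plus_tests, legacy_string)
```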

P2b (bare --git-diff misses staged and untracked):
  The prompt advertised bare `--git-diff` for "uncommitted changes"
  but the implementation ran plain `git diff`, which only shows
  unstaged changes. A staged `@function_tool` addition silently
  returned no matched rules.

  Bare flag now runs `git diff HEAD` for both paths and content
  (covering BOTH staged and unstaged tracked changes), then appends
  untracked file *paths* via `git ls-files --others --exclude-standard`.
  Untracked file *content* is not captured (reading arbitrary
  untracked files into memory is risky); the prompt documents the
  limitation explicitly.
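
The git invocations involved can be sketched as follows. The helper names are hypothetical and this is not the body of `_git_diff_context`; it only lists the commands the message describes. Skipping the untracked-path listing when a revspec is given is an assumption.

```python
import subprocess
from typing import Optional


def git_commands(revspec: Optional[str] = None) -> list[list[str]]:
    """Return the git invocations for changed paths, diff body, and untracked paths."""
    target = revspec or "HEAD"  # bare flag: HEAD covers staged and unstaged changes
    commands = [
        ["git", "diff", "--name-only", target],
        ["git", "diff", target],
    ]
    if revspec is None:
        # Untracked files never appear in `git diff`; list their paths only.
        commands.append(["git", "ls-files", "--others", "--exclude-standard"])
    return commands


def run(command: list[str], cwd: str = ".") -> str:
    """Run one git command and return its stdout (raises on non-zero exit)."""
    return subprocess.run(
        command, cwd=cwd, check=True, capture_output=True, text=True
    ).stdout


bare = git_commands()
ranged = git_commands("origin/main...HEAD")
for command in bare:
    print(" ".join(command))
```

Plain `git diff` with no revision compares the working tree to the index, which is why a staged-only change was invisible before this fix.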

P3 (prompt verification contradicts the dry_run path):
  prompts/decide-shipgate-relevance.md's verification checklist named
  the output keys as `run_shipgate, matched_rules, rationale` (no
  `dry_run_recommended`) and asserted "no Shipgate command appears"
  whenever `run_shipgate: false`. That contradicts the dry_run path
  added in the previous commit, which explicitly proposes a
  non-mutating scan when `dry_run_recommended: true`.

  Updated the verification checklist to:
  - List all four canonical output keys including `dry_run_recommended`
  - Allow exactly one Shipgate command (a non-mutating scan) when
    dry_run_recommended is true and run_shipgate is false
  - Forbid Shipgate commands only when both are false
  Added two NOT-to-do bullets: never propose `init --write` on a
  dry_run-only match; bare --git-diff doesn't surface untracked file
  content. Mirrored to the skill copy.

Tests added (10 cases):
- Parametrized: 6 tests-only path patterns with @function_tool in
  diff all return run_shipgate=false
- Code+test mix with @function_tool returns run_shipgate=true
  (negative case for the every_file_matches expansion)
- _eval_predicate accepts every_file_matches as both string and list
- _git_diff_context with no revspec captures staged-only changes
- _git_diff_context with no revspec surfaces untracked file paths
  (and confirms untracked content is NOT in diff_text)

Verification: 747 pytest passes (was 737; +10 new); ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify zero-install detector is a structural subset, not drop-in

P3 review: docs/zero-install.md and llms.txt implied the zero-install
script has the same JSON shape as `agents-shipgate detect --json`. It
doesn't — the canonical CLI emits `diagnostics[]` and `next_actions[]`
arrays (the diagnostic engine), which are intentionally out of scope
for the stdlib-only zero-install path. The script emits a structural
subset of `DetectResult` plus `script_version`.

- llms.txt: "same JSON shape" → "same structural verdict … emits the
  canonical `DetectResult` fields plus `script_version`, but NOT the
  CLI's `diagnostics` or `next_actions` arrays."
- docs/zero-install.md: rephrased "Output mirrors `agents-shipgate
  detect --json` (plus a `script_version` field)" to "structural
  subset … not a drop-in replacement." Closing line now reads
  "structural verdict parity" and explicitly notes "field-by-field
  byte parity is not pinned and not promised."
- tools/shipgate-detect.py docstring: lists `diagnostics[]` and
  `next_actions[]` in the "Intentional simplifications" section
  alongside the existing items (no git fast path, descriptive
  evidence strings, ±0.5 score variance).

Test: pin the absence of `diagnostics` and `next_actions` keys in the
script output so a future change that adds them is forced to update
the wording surfaces in the same PR.

Verification: 748 pytest passes (was 747; +1 new); ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Regenerate llms-full.txt after merging main (v0.11 schema bump)

CI on PR #56 caught the stale llms-full.txt: the merge of origin/main
brought in three commits (#54 init --agent-instructions, #55 autofix-
boundary docs, #51 v0.11 schema with source provenance) that touched
AGENTS.md and the report-schema literals, so the committed llms-full.txt
no longer matched what `scripts/build-llms-full.py` produces from the
post-merge sources.

Re-ran the build script and committed the regenerated file. Now:

- 859 tests pass + 3 skipped (was 748 on this branch; +111 from main's
  new tests for source provenance, agent-instructions renderers,
  managed blocks, etc.)
- ruff clean
- llms-full.txt is byte-identical to `render(REPO_ROOT)`, so
  test_llms_full_is_up_to_date passes