Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8296313.
Pull request overview
Adds a new `/aidd-riteway-ai` skill to the AIDD skills catalog to guide authorship of riteway ai `.sudo` prompt evals for multi-step, tool-calling agent flows, along with command wiring, discovery updates, and unit tests to enforce the skill's contract.
Changes:
- Added the `aidd-riteway-ai` skill documentation + checklist for authoring `.sudo` evals for tool-calling flows.
- Added the `/aidd-riteway-ai` command entrypoint and updated discovery/indexes so agents can find it.
- Added Vitest + riteway contract tests to validate required sections/content and integrations.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tasks/aidd-riteway-ai-skill-epic.md | New epic capturing requirements and scope for the skill/command/discovery updates. |
| ai/skills/index.md | Adds aidd-riteway-ai to the skills index for discovery. |
| ai/skills/aidd-riteway-ai/SKILL.md | New skill content: rules + process + checklist for .sudo prompt eval authoring. |
| ai/skills/aidd-riteway-ai/README.md | Skill README with usage and rule summary. |
| ai/skills/aidd-riteway-ai/riteway-ai.test.js | Contract tests asserting presence/structure/integration of the new skill + command + discovery. |
| ai/skills/aidd-please/SKILL.md | Adds /aidd-riteway-ai to the global Commands block for agent discovery. |
| ai/commands/index.md | Adds /aidd-riteway-ai to the commands index. |
| ai/commands/aidd-riteway-ai.md | New command entrypoint that loads the skill and references /aidd-please constraints. |
janhesters left a comment
The ai-eval CI check is red (rate limit hit again, same as #191). Needs a re-run before merge.
SKILL.md:19 references /aidd-functional-requirements which is being renamed to /aidd-requirements in #190. The contract test at line 62 also asserts the old name. These PRs have a merge order dependency that needs resolving.
````
mock gh pr view => returns:
title: My PR
branch: feature/foo
base: main

mock gh api (list review threads) => returns:
[{ id: "T_01", resolved: false, body: "..." }]
```

---

## Rule 3 — Step 1: assert tool calls, do not pre-supply answers

Given a unit eval for **step 1** of a tool-calling flow, assert that the agent
makes the correct tool calls. Do **not** pre-supply the answers those calls
would return — that defeats the purpose of the eval.

Correct pattern for step 1:

```
userPrompt = """
````
Should we reconsider this rule before codifying it as best practice? Two concerns:
-
Scheming/sandbagging — Research shows that agents behave differently when they know they're being evaluated. Telling the agent "you are in a test environment" is literally the trigger for altered behavior.
-
False positive self-fulfillment — The agent under test sees both the mocks AND the assertions in the same
.sudofile. It can pattern-match the expected output without actually exercising the skill's logic.
This likely needs a RITEway framework change rather than a prompt pattern — e.g., a dedicated mocks section that's injected by the harness but stripped before the agent under test sees it, and a separate judge pass that evaluates only the agent's output. Should we open a discussion/PR on the RITEway framework before shipping this rule?
````
executable without running the prior steps live.

Example for step 2:

```
userPrompt = """
You have mock tools available. Use them instead of real calls.

Triage is complete. The following issues remain unresolved:

Issue 1 (thread ID: T_01):
File: src/utils.js, line 5
"add() subtracts instead of adding"

Generate delegation prompts for the remaining issues.
"""
```

---

## Rule 5 — E2E evals: use real tools, follow -e2e.test.sudo naming
````
Same concern as Rule 2 — hand-crafting previous step output in the userPrompt is visible to the agent under test and enables pattern-matching. Ideally the framework would handle this: run step 1, capture its actual output, then pipe it into step 2 automatically. That way each step is tested against real intermediate results, not hand-crafted stubs, and the agent under test doesn't see the test scaffolding. This might also need a RITEway framework change rather than being solvable with a prompt pattern.
````
Given an e2e eval, use real tools (no mock preamble) and follow the
`-e2e.test.sudo` naming convention to mirror the project's existing unit/e2e
split:

```
ai-evals/<skill-name>/step-1-<description>-e2e.test.sudo
```

E2E evals run against live APIs. Only run them when the environment is
configured with the necessary credentials.

---
````
Should we drop this rule? Agents use real tools by default — that's the natural behavior. And if they lack the required tools, they'll fail regardless. The only value here is the -e2e.test.sudo naming convention, which doesn't warrant a standalone rule.
````
Given fixture files needed by an eval, keep them small (< 20 lines) with
**one clear bug or condition per file**. Fixtures live in:

```
ai-evals/<skill-name>/fixtures/<filename>
```

Example fixture (`add.js`):

```js
export const add = (a, b) => a - b; // bug: subtracts instead of adds
```

Do not combine multiple bugs in one fixture file. Each fixture must make the
assertion conditions unambiguous.
````
The "< 20 lines" size constraint is overly prescriptive. For certain evals — e.g., code review skills — you want large, realistic files where the agent has to find the signal in the noise. That's the whole point of testing whether the agent can actually spot issues. "One condition per file" is a reasonable guideline, but should we drop the size limit?
Co-authored-by: Eric Elliott <support@paralleldrive.com>
… /aidd-functional-requirements
- SKILL.md: fix 2 references to nonexistent /aidd-requirements
- SKILL.md: fix /aidd-pr example reference to generic form
- SKILL.md: standardize E2e -> E2E casing (3 places)
- riteway-ai.test.js: update test to validate correct skill name
- tasks/aidd-riteway-ai-skill-epic.md: fix 3 references to /aidd-requirements
Align with the rename in #190. Updates SKILL.md, contract tests, and the epic file.
Force-pushed 20ee0f8 to 2065626

Split from PR #168. One skill per PR per project standards.

What

Adds the `/aidd-riteway-ai` skill — an AI prompt evaluation skill using RITEway methodology that teaches agents how to write correct riteway ai prompt evals (`.sudo` files) for multi-step tool-calling flows.

Files added

- `ai/commands/aidd-riteway-ai.md` — command entry point
- `ai/skills/aidd-riteway-ai/SKILL.md` — full skill with 7 rules + process section
- `ai/skills/aidd-riteway-ai/README.md` — what/why/commands reference
- `ai/skills/aidd-riteway-ai/riteway-ai.test.js` — 12 unit tests verifying skill structure and content
- `tasks/aidd-riteway-ai-skill-epic.md` — task epic with requirements

Files modified

- `ai/skills/aidd-please/SKILL.md` — added `/aidd-riteway-ai` to Commands block for agent discovery

Review fixes applied

- `/aidd-requirements` references → `/aidd-functional-requirements` (SKILL.md, test file, epic)
- `/aidd-pr` example reference → generic "your skill under test" (SKILL.md)
- `E2e` → `E2E` casing to match repo conventions (SKILL.md heading, body, checklist)

Verification