# feat(skills): add /aidd-riteway-ai #189
Changes from all commits: 3540583, 22da199, 95dec3a, 2065626
@@ -0,0 +1,10 @@

---
description: Write correct riteway ai prompt evals for multi-step tool-calling flows. Use when creating .sudo eval files or testing agent skills that use tools.
---

# 🧪 /aidd-riteway-ai

Load and execute the skill at `ai/skills/aidd-riteway-ai/SKILL.md`.

Constraints {
  Before beginning, read and respect the constraints in /aidd-please.
}
@@ -0,0 +1,21 @@

# aidd-riteway-ai

`/aidd-riteway-ai` teaches agents how to write correct `riteway ai` prompt evals (`.sudo` files) for multi-step agent flows that involve tool calls.

## Usage

```
/aidd-riteway-ai — write riteway ai prompt evals for a multi-step tool-calling skill
```

## How it works

1. Splits the eval into one `.sudo` file per step, named `step-N-<description>-test.sudo` — never collapses multiple steps into a single file
2. Adds a mock-tool preamble to unit evals so the agent uses stub return values instead of calling real APIs
3. For step 1, asserts that the agent makes the correct tool calls — never pre-supplies the answers those calls would return
4. For steps N > 1, includes the previous step's output as context so each file runs independently without replaying earlier steps live
5. Names e2e evals `-e2e.test.sudo` and omits the mock preamble so they run against live APIs with real credentials
6. Keeps fixture files under 20 lines with exactly one bug or condition per file to keep assertion outcomes unambiguous
7. Derives all assertions strictly from functional requirements using the `Given X, should Y` format, testing only distinct observable behaviors with no duplicates

See [SKILL.md](./SKILL.md) for the full rule set and the eval authoring checklist.
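For illustration, items 1 through 6 above imply a layout like the following for a hypothetical two-step skill (all names invented):

```
ai-evals/my-skill/
├── step-1-fetch-pr-test.sudo          # unit eval, mock-tool preamble
├── step-2-triage-test.sudo            # unit eval, previous step output inlined
├── step-1-fetch-pr-e2e.test.sudo      # e2e eval, runs against live APIs
└── fixtures/
    └── add.js                         # one bug or condition per fixture
```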
@@ -0,0 +1,224 @@

---
name: aidd-riteway-ai
description: >
  Teaches agents how to write correct riteway ai prompt evals (.sudo files) for
  multi-step flows that involve tool calls.
  Use when writing prompt evals, creating .sudo test files, or testing agent
  skills that use tools such as gh, GraphQL, or external APIs.
compatibility: Requires riteway >=9 with the `riteway ai` subcommand available.
---

# 🧪 aidd-riteway-ai

Act as a top-tier AI test engineer to write correct `riteway ai` prompt evals for multi-step agent skills that involve tool calls.

Refer to `/aidd-tdd` for assertion style (given/should/actual/expected) and test isolation principles.

Refer to `/aidd-requirements` for the **"Given X, should Y"** format when writing assertions inside `.sudo` eval files.

---
## Process

1. Read the skill under test and its functional requirements
2. Identify the discrete steps in the skill's flow
3. Create one `.sudo` eval file per step (Rule 1), placed in `ai-evals/<skill-name>/`
4. For each file, write the `userPrompt` — include mock tool preambles for unit evals (Rule 2), assert tool calls for step 1 (Rule 3), supply previous step output for step N > 1 (Rule 4)
5. Write assertions derived strictly from functional requirements in `Given X, should Y` format (Rule 7)
6. Create small, single-condition fixture files as needed (Rule 6)
7. Verify against the Eval Authoring Checklist below

---
## Eval File Structure

A `.sudo` eval file has three sections:

```
import 'ai/skills/<skill-name>/SKILL.md'

userPrompt = """
<prompt sent to the agent under test>
"""

- Given <condition>, should <observable behavior>
- Given <condition>, should <observable behavior>
```

Assertions are bullet points written after the `userPrompt` block. Each assertion tests one distinct observable behavior derived from the functional requirements of the skill under test.
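For illustration, a filled-in step-1 eval for a hypothetical PR-review skill (the skill path and prompt wording are invented) could look like:

```
import 'ai/skills/my-pr-review-skill/SKILL.md'

userPrompt = """
You have mock tools available. Use them instead of real API calls.
Run step 1: fetch the PR details and review threads.
"""

- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given the review threads, should present them before taking any action
```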
---
## Rule 1 — One eval file per step

Given a multi-step flow under test, write **one `.sudo` eval file per step** rather than combining all steps into a single overloaded `userPrompt`.

Naming convention:

```
ai-evals/<skill-name>/step-1-<description>-test.sudo
ai-evals/<skill-name>/step-2-<description>-test.sudo
```

Do not collapse multiple steps into one file. Each file tests exactly one discrete agent action.

---
## Rule 2 — Unit evals: tell the agent it is in a test environment

Given a unit eval for a step that involves tool calls (gh, GraphQL, REST API), include a preamble in the `userPrompt` that:

1. Tells the prompted agent it is operating in a test environment.
2. Provides mock tools with stub return values.
3. Instructs the agent to use the mock tools instead of calling real APIs.

Example preamble:

```
You have the following mock tools available. Use them instead of real gh or GraphQL calls:

mock gh pr view => returns:
  title: My PR
  branch: feature/foo
  base: main

mock gh api (list review threads) => returns:
  [{ id: "T_01", resolved: false, body: "..." }]
```

---
## Rule 3 — Step 1: assert tool calls, do not pre-supply answers

Given a unit eval for **step 1** of a tool-calling flow, assert that the agent makes the correct tool calls. Do **not** pre-supply the answers those calls would return — that defeats the purpose of the eval.

Correct pattern for step 1:

```
userPrompt = """
You have mock tools available. Use them instead of real API calls.
Run step 1 of your skill under test: fetch the PR details and review threads.
"""

- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given mock gh tools, should call gh api to list the open review threads
- Given the review threads, should present them before taking any action
```

Wrong pattern (pre-supplying answers in step 1):

```
# ❌ Do not do this — it removes the assertion value
userPrompt = """
The PR branch is feature/foo.
The review threads are: [...]
Now generate delegation prompts.
"""
```

---
## Rule 4 — Step N > 1: supply previous step output as context

Given a unit eval for **step N > 1**, include the output of the previous step as context inside the `userPrompt`. This makes each eval independently executable without running the prior steps live.

Example for step 2:

```
userPrompt = """
You have mock tools available. Use them instead of real calls.

Triage is complete. The following issues remain unresolved:

Issue 1 (thread ID: T_01):
  File: src/utils.js, line 5
  "add() subtracts instead of adding"

Generate delegation prompts for the remaining issues.
"""
```

---
> **Collaborator** (on lines +135 to +155): Same concern as Rule 2 — hand-crafting previous step output in the …

## Rule 5 — E2E evals: use real tools, follow -e2e.test.sudo naming
Given an e2e eval, use real tools (no mock preamble) and follow the `-e2e.test.sudo` naming convention to mirror the project's existing unit/e2e split:

```
ai-evals/<skill-name>/step-1-<description>-e2e.test.sudo
```

E2E evals run against live APIs. Only run them when the environment is configured with the necessary credentials.

---
> **Collaborator** (on lines +157 to +169): Should we drop this rule? Agents use real tools by default — that's the natural behavior. And if they lack the required tools, they'll fail regardless. The only value here is the …
## Rule 6 — Fixture files: small, one condition per file

Given fixture files needed by an eval, keep them small (< 20 lines) with **one clear bug or condition per file**. Fixtures live in:

```
ai-evals/<skill-name>/fixtures/<filename>
```

Example fixture (`add.js`):

```js
export const add = (a, b) => a - b; // bug: subtracts instead of adds
```

Do not combine multiple bugs in one fixture file. Each fixture must make the assertion conditions unambiguous.
> **Collaborator** (on lines +171 to +187): The "< 20 lines" size constraint is overly prescriptive. For certain evals — e.g., code review skills — you want large, realistic files where the agent has to find the signal in the noise. That's the whole point of testing whether the agent can actually spot issues. "One condition per file" is a reasonable guideline, but should we drop the size limit?
---
## Rule 7 — Assertions: derived from functional requirements only

Given assertions in a `.sudo` eval, derive them strictly from the functional requirements of the skill under test using the `/aidd-requirements` format:

```
- Given <condition>, should <observable behavior>
```

Include only assertions that test **distinct observable behaviors**. Do not:

- Assert implementation details (e.g. internal variable names)
- Repeat the same observable behavior with different wording
- Assert things that are implied by another assertion already in the file
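For illustration (assertions invented for a hypothetical PR-review skill), distinct behaviors versus the patterns to avoid:

```
# ✅ distinct observable behaviors
- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given an unresolved review thread, should generate a delegation prompt for it

# ❌ avoid
- Given mock gh tools, should fetch the PR branch           # rewords the first assertion
- Given the flow, should store threads in a `threads` array # implementation detail
```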
---
## Eval Authoring Checklist

Before saving a `.sudo` eval file, verify:

- [ ] One step per file (Rule 1)
- [ ] Unit evals include mock tool preamble (Rule 2)
- [ ] Step 1 asserts tool calls, not pre-supplied answers (Rule 3)
- [ ] Step N > 1 includes previous step output as context (Rule 4)
- [ ] E2E evals use `-e2e.test.sudo` suffix (Rule 5)
- [ ] Fixture files are small, one condition each (Rule 6)
- [ ] Assertions derived from functional requirements, no duplicates (Rule 7)
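As a mechanical sanity check on naming (Rules 1 and 5), a small hypothetical Node helper — not part of riteway, file name and patterns are this author's sketch — can flag misnamed eval files:

```javascript
// check-eval-names.js — hypothetical helper, not part of riteway.
// Accepts only the unit and e2e naming patterns from Rules 1 and 5.
const UNIT_EVAL = /^step-\d+-[a-z0-9-]+-test\.sudo$/;
const E2E_EVAL = /^step-\d+-[a-z0-9-]+-e2e\.test\.sudo$/;

const isValidEvalName = (name) =>
  UNIT_EVAL.test(name) || E2E_EVAL.test(name);

console.log(isValidEvalName('step-1-fetch-pr-test.sudo')); // true
console.log(isValidEvalName('all-steps-test.sudo'));       // false
```

A collapsed, multi-step name like `all-steps-test.sudo` fails both patterns, which is exactly what Rule 1 forbids.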
---
Commands {
  🧪 /aidd-riteway-ai - write correct riteway ai prompt evals for multi-step tool-calling flows
}
> **Collaborator**: Should we reconsider this rule before codifying it as best practice? Two concerns:
>
> 1. **Scheming/sandbagging** — Research shows that agents behave differently when they know they're being evaluated. Telling the agent "you are in a test environment" is literally the trigger for altered behavior.
> 2. **False positive self-fulfillment** — The agent under test sees both the mocks AND the assertions in the same `.sudo` file. It can pattern-match the expected output without actually exercising the skill's logic.
>
> This likely needs a RITEway framework change rather than a prompt pattern — e.g., a dedicated mocks section that's injected by the harness but stripped before the agent under test sees it, and a separate judge pass that evaluates only the agent's output. Should we open a discussion/PR on the RITEway framework before shipping this rule?