# feat(skills): add /aidd-riteway-ai #189
Changes from all commits: 3540583, 22da199, 95dec3a, 2065626
@@ -0,0 +1,10 @@

---
description: Write correct riteway ai prompt evals for multi-step tool-calling flows. Use when creating .sudo eval files or testing agent skills that use tools.
---

# 🧪 /aidd-riteway-ai

Load and execute the skill at `ai/skills/aidd-riteway-ai/SKILL.md`.

Constraints {
  Before beginning, read and respect the constraints in /aidd-please.
}
@@ -0,0 +1,21 @@

# aidd-riteway-ai

`/aidd-riteway-ai` teaches agents how to write correct `riteway ai` prompt evals (`.sudo` files) for multi-step agent flows that involve tool calls.

## Usage

```
/aidd-riteway-ai — write riteway ai prompt evals for a multi-step tool-calling skill
```

## How it works

1. Splits the eval into one `.sudo` file per step, named `step-N-<description>-test.sudo` — never collapses multiple steps into a single file
2. Adds a mock-tool preamble to unit evals so the agent uses stub return values instead of calling real APIs
3. For step 1, asserts that the agent makes the correct tool calls — never pre-supplies the answers those calls would return
4. For steps N > 1, includes the previous step's output as context so each file runs independently without replaying earlier steps live
5. Names e2e evals `-e2e.test.sudo` and omits the mock preamble so they run against live APIs with real credentials
6. Keeps fixture files under 20 lines with exactly one bug or condition per file to keep assertion outcomes unambiguous
7. Derives all assertions strictly from functional requirements using the `Given X, should Y` format, testing only distinct observable behaviors with no duplicates

See [SKILL.md](./SKILL.md) for the full rule set and the eval authoring checklist.
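For illustration, items 1 through 6 above imply a layout like the following for a hypothetical two-step skill (all names invented):

```
ai-evals/my-skill/
├── step-1-fetch-pr-test.sudo          # unit eval, mock-tool preamble
├── step-2-triage-test.sudo            # unit eval, previous step output inlined
├── step-1-fetch-pr-e2e.test.sudo      # e2e eval, runs against live APIs
└── fixtures/
    └── add.js                         # one bug or condition per fixture
```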
@@ -0,0 +1,224 @@

---
name: aidd-riteway-ai
description: >
  Teaches agents how to write correct riteway ai prompt evals (.sudo files) for
  multi-step flows that involve tool calls.
  Use when writing prompt evals, creating .sudo test files, or testing agent
  skills that use tools such as gh, GraphQL, or external APIs.
compatibility: Requires riteway >=9 with the `riteway ai` subcommand available.
---

# 🧪 aidd-riteway-ai

Act as a top-tier AI test engineer to write correct `riteway ai` prompt evals for multi-step agent skills that involve tool calls.

Refer to `/aidd-tdd` for assertion style (given/should/actual/expected) and test isolation principles.

Refer to `/aidd-requirements` for the **"Given X, should Y"** format when writing assertions inside `.sudo` eval files.

---
## Process

1. Read the skill under test and its functional requirements
2. Identify the discrete steps in the skill's flow
3. Create one `.sudo` eval file per step (Rule 1), placed in `ai-evals/<skill-name>/`
4. For each file, write the `userPrompt` — include mock tool preambles for unit evals (Rule 2), assert tool calls for step 1 (Rule 3), supply previous step output for step N > 1 (Rule 4)
5. Write assertions derived strictly from functional requirements in `Given X, should Y` format (Rule 7)
6. Create small, single-condition fixture files as needed (Rule 6)
7. Verify against the Eval Authoring Checklist below

---
## Eval File Structure

A `.sudo` eval file has three sections:

```
import 'ai/skills/<skill-name>/SKILL.md'

userPrompt = """
<prompt sent to the agent under test>
"""

- Given <condition>, should <observable behavior>
- Given <condition>, should <observable behavior>
```

Assertions are bullet points written after the `userPrompt` block. Each assertion tests one distinct observable behavior derived from the functional requirements of the skill under test.
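For illustration, a filled-in step-1 eval for a hypothetical PR-review skill (the skill path and prompt wording are invented) could look like:

```
import 'ai/skills/my-pr-review-skill/SKILL.md'

userPrompt = """
You have mock tools available. Use them instead of real API calls.
Run step 1: fetch the PR details and review threads.
"""

- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given the review threads, should present them before taking any action
```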
---
## Rule 1 — One eval file per step

Given a multi-step flow under test, write **one `.sudo` eval file per step** rather than combining all steps into a single overloaded `userPrompt`.

Naming convention:

```
ai-evals/<skill-name>/step-1-<description>-test.sudo
ai-evals/<skill-name>/step-2-<description>-test.sudo
```

Do not collapse multiple steps into one file. Each file tests exactly one discrete agent action.

---
## Rule 2 — Unit evals: tell the agent it is in a test environment

Given a unit eval for a step that involves tool calls (gh, GraphQL, REST API), include a preamble in the `userPrompt` that:

1. Tells the prompted agent it is operating in a test environment.
2. Provides mock tools with stub return values.
3. Instructs the agent to use the mock tools instead of calling real APIs.

Example preamble:

```
You have the following mock tools available. Use them instead of real gh or GraphQL calls:

mock gh pr view => returns:
  title: My PR
  branch: feature/foo
  base: main

mock gh api (list review threads) => returns:
  [{ id: "T_01", resolved: false, body: "..." }]
```

---
## Rule 3 — Step 1: assert tool calls, do not pre-supply answers

Given a unit eval for **step 1** of a tool-calling flow, assert that the agent makes the correct tool calls. Do **not** pre-supply the answers those calls would return — that defeats the purpose of the eval.

Correct pattern for step 1:

```
userPrompt = """
You have mock tools available. Use them instead of real API calls.
Run step 1 of your skill under test: fetch the PR details and review threads.
"""

- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given mock gh tools, should call gh api to list the open review threads
- Given the review threads, should present them before taking any action
```

Wrong pattern (pre-supplying answers in step 1):

```
# ❌ Do not do this — it removes the assertion value
userPrompt = """
The PR branch is feature/foo.
The review threads are: [...]
Now generate delegation prompts.
"""
```

---
## Rule 4 — Step N > 1: supply previous step output as context

Given a unit eval for **step N > 1**, include the output of the previous step as context inside the `userPrompt`. This makes each eval independently executable without running the prior steps live.

Example for step 2:

```
userPrompt = """
You have mock tools available. Use them instead of real calls.

Triage is complete. The following issues remain unresolved:

Issue 1 (thread ID: T_01):
  File: src/utils.js, line 5
  "add() subtracts instead of adding"

Generate delegation prompts for the remaining issues.
"""
```

---
> **Collaborator** (on lines +135 to +155): Same concern as Rule 2 — hand-crafting previous step output in the …

## Rule 5 — E2E evals: use real tools, follow -e2e.test.sudo naming
Given an e2e eval, use real tools (no mock preamble) and follow the `-e2e.test.sudo` naming convention to mirror the project's existing unit/e2e split:

```
ai-evals/<skill-name>/step-1-<description>-e2e.test.sudo
```

E2E evals run against live APIs. Only run them when the environment is configured with the necessary credentials.

---
> **Collaborator** (on lines +157 to +169): Should we drop this rule? Agents use real tools by default — that's the natural behavior. And if they lack the required tools, they'll fail regardless. The only value here is the …
## Rule 6 — Fixture files: small, one condition per file

Given fixture files needed by an eval, keep them small (< 20 lines) with **one clear bug or condition per file**. Fixtures live in:

```
ai-evals/<skill-name>/fixtures/<filename>
```

Example fixture (`add.js`):

```js
export const add = (a, b) => a - b; // bug: subtracts instead of adds
```

Do not combine multiple bugs in one fixture file. Each fixture must make the assertion conditions unambiguous.
> **Collaborator** (on lines +171 to +187): The "< 20 lines" size constraint is overly prescriptive. For certain evals — e.g., code review skills — you want large, realistic files where the agent has to find the signal in the noise. That's the whole point of testing whether the agent can actually spot issues. "One condition per file" is a reasonable guideline, but should we drop the size limit?
---
## Rule 7 — Assertions: derived from functional requirements only

Given assertions in a `.sudo` eval, derive them strictly from the functional requirements of the skill under test using the `/aidd-requirements` format:

```
- Given <condition>, should <observable behavior>
```

Include only assertions that test **distinct observable behaviors**. Do not:

- Assert implementation details (e.g. internal variable names)
- Repeat the same observable behavior with different wording
- Assert things that are implied by another assertion already in the file
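For illustration (assertions invented for a hypothetical PR-review skill), distinct behaviors versus the patterns to avoid:

```
# ✅ distinct observable behaviors
- Given mock gh tools, should call gh pr view to retrieve the PR branch name
- Given an unresolved review thread, should generate a delegation prompt for it

# ❌ avoid
- Given mock gh tools, should fetch the PR branch           # rewords the first assertion
- Given the flow, should store threads in a `threads` array # implementation detail
```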
---
## Eval Authoring Checklist

Before saving a `.sudo` eval file, verify:

- [ ] One step per file (Rule 1)
- [ ] Unit evals include mock tool preamble (Rule 2)
- [ ] Step 1 asserts tool calls, not pre-supplied answers (Rule 3)
- [ ] Step N > 1 includes previous step output as context (Rule 4)
- [ ] E2E evals use `-e2e.test.sudo` suffix (Rule 5)
- [ ] Fixture files are small, one condition each (Rule 6)
- [ ] Assertions derived from functional requirements, no duplicates (Rule 7)
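As a mechanical sanity check on naming (Rules 1 and 5), a small hypothetical Node helper — not part of riteway, file name and patterns are this author's sketch — can flag misnamed eval files:

```javascript
// check-eval-names.js — hypothetical helper, not part of riteway.
// Accepts only the unit and e2e naming patterns from Rules 1 and 5.
const UNIT_EVAL = /^step-\d+-[a-z0-9-]+-test\.sudo$/;
const E2E_EVAL = /^step-\d+-[a-z0-9-]+-e2e\.test\.sudo$/;

const isValidEvalName = (name) =>
  UNIT_EVAL.test(name) || E2E_EVAL.test(name);

console.log(isValidEvalName('step-1-fetch-pr-test.sudo')); // true
console.log(isValidEvalName('all-steps-test.sudo'));       // false
```

A collapsed, multi-step name like `all-steps-test.sudo` fails both patterns, which is exactly what Rule 1 forbids.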
---
Commands {
  🧪 /aidd-riteway-ai - write correct riteway ai prompt evals for multi-step tool-calling flows
}
> **Collaborator**: Should we reconsider this rule before codifying it as best practice? Two concerns:
>
> 1. **Scheming/sandbagging** — Research shows that agents behave differently when they know they're being evaluated. Telling the agent "you are in a test environment" is literally the trigger for altered behavior.
> 2. **False positive self-fulfillment** — The agent under test sees both the mocks AND the assertions in the same `.sudo` file. It can pattern-match the expected output without actually exercising the skill's logic.
>
> This likely needs a RITEway framework change rather than a prompt pattern — e.g., a dedicated mocks section that's injected by the harness but stripped before the agent under test sees it, and a separate judge pass that evaluates only the agent's output. Should we open a discussion/PR on the RITEway framework before shipping this rule?