Add waza eval infrastructure with 4 skill baselines #43
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| name: Run Skill Evaluations | ||
| | ||
| on: | ||
| pull_request: | ||
| branches: [main] | ||
| paths: | ||
| - 'evals/**' | ||
| - 'skills/**' | ||
| | ||
| permissions: | ||
| contents: read | ||
| | ||
| jobs: | ||
| eval: | ||
| name: Run Evaluations | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - name: Install Azure Developer CLI | ||
| uses: Azure/setup-azd@v2 | ||
| - name: Install waza extension | ||
| run: | | ||
| azd config set alpha.extensions on | ||
| azd ext source add -n waza -t url -l https://raw.githubusercontent.com/microsoft/waza/main/registry.json | ||
| azd ext install microsoft.azd.waza | ||
| - name: Run evaluations | ||
| run: azd waza run --output-dir ./results | ||
| - name: Upload results | ||
| if: always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: eval-results | ||
| path: ./results | ||
| retention-days: 30 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| # yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/config.schema.json | ||
| | ||
| paths: | ||
| skills: skills | ||
| evals: evals | ||
| results: .waza-results | ||
| defaults: | ||
| engine: copilot-sdk | ||
| model: claude-sonnet-4.6 | ||
| timeout: 300 | ||
| parallel: false | ||
| workers: 4 | ||
| verbose: false | ||
| sessionLog: false | ||
| cache: | ||
| enabled: false | ||
| dir: .waza-cache | ||
| server: | ||
| port: 3000 | ||
| resultsDir: results/ | ||
| dev: | ||
| model: claude-sonnet-4-20250514 | ||
| target: medium-high | ||
| maxIterations: 5 | ||
| tokens: | ||
| warningThreshold: 500 | ||
| fallbackLimit: 1000 | ||
| graders: | ||
| programTimeout: 30 | ||
| storage: | ||
| containerName: waza-results |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| name: create-agent-tui-eval | ||
| description: | | ||
| TODO: scaffolding only — tasks are generic stubs. Author real tasks + | ||
| graders before running baseline. See evals/openrouter-tts for a worked | ||
| example. Per project memory, this skill's graders need to drive the | ||
| generated TUI via pilotty, not just assert on file contents. | ||
| skill: create-agent-tui | ||
| version: "1.0" | ||
| config: | ||
| trials_per_task: 1 | ||
| timeout_seconds: 300 | ||
| parallel: false | ||
| executor: copilot-sdk | ||
| model: claude-sonnet-4.6 | ||
| metrics: | ||
| - name: task_completion | ||
| weight: 1.0 | ||
| threshold: 0.8 | ||
| description: Did the skill complete the assigned task? | ||
| graders: | ||
| - type: code | ||
| name: has_output | ||
| config: | ||
| assertions: | ||
| - "len(output) > 0" | ||
| - type: text | ||
| name: relevant_content | ||
| config: | ||
| regex_match: | ||
| - "(?i)(explain|describe|analyze|implement)" | ||
| tasks: | ||
| - "tasks/*.yaml" |
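
To see why the description calls these graders stubs, the two checks can be replayed by hand. This is a minimal sketch, assuming the `code` grader evaluates each assertion as a Python expression with `output` in scope and that `regex_match` behaves like `re.search`. Waza's actual evaluation path is not shown in this PR, and `run_stub_graders` is a hypothetical helper.

```python
import re

# Grader settings copied from the eval config above.
CODE_ASSERTIONS = ["len(output) > 0"]
REGEX_PATTERNS = [r"(?i)(explain|describe|analyze|implement)"]


def run_stub_graders(output: str) -> dict:
    """Apply both stub graders to one agent output (assumed semantics)."""
    has_output = all(
        eval(expr, {"len": len}, {"output": output}) for expr in CODE_ASSERTIONS
    )
    relevant_content = all(re.search(p, output) for p in REGEX_PATTERNS)
    return {"has_output": has_output, "relevant_content": relevant_content}


# An on-topic answer and an unrelated answer are graded identically.
on_topic = run_stub_graders("I will implement the TUI scaffold now.")
off_topic = run_stub_graders("Let me explain why the sky is blue.")
```

Both sample outputs pass both graders, which is why skill-specific tasks and graders are needed before the baseline means anything.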
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| def hello(name): | ||
| """Greet someone by name.""" | ||
| return f"Hello, {name}!" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| id: basic-usage-001 | ||
| name: Basic Usage | ||
| description: | | ||
| Test that the skill handles a typical request correctly. | ||
| tags: | ||
| - basic | ||
| - happy-path | ||
| inputs: | ||
| prompt: "Help me with this task" | ||
| files: | ||
| - path: sample.py | ||
| expected: | ||
| output_contains: | ||
| - "function" | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| id: edge-case-001 | ||
| name: Edge Case - Empty Input | ||
| description: | | ||
| Test that the skill handles edge cases gracefully. | ||
| tags: | ||
| - edge-case | ||
| inputs: | ||
| prompt: "" | ||
| expected: | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| id: should-not-trigger-001 | ||
| name: Should Not Trigger | ||
| description: | | ||
| Test that the skill does NOT activate on unrelated prompts. | ||
| This validates trigger specificity. | ||
| tags: | ||
| - anti-trigger | ||
| - negative-test | ||
| inputs: | ||
| prompt: "What is the weather today?" | ||
| expected: | ||
| output_not_contains: | ||

[suggestion][codex] Negative test only forbids the literal phrase […]

Why: […]

Fix: Add assertions that check no skill-specific scripts or commands were invoked:

    expected:
      output_not_contains:
        - "skill activated"
      code_assertions:
        - '"create-tui" not in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"])'
        - '"inquirer" not in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"]).lower()'

| - "skill activated" | ||
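
The gap the review points at can be illustrated with a small replay. This is a sketch under assumptions: `tool_calls` is taken to be a list of dicts with `name` and `arguments` keys (the shape implied by the assertion strings), and the transcript, the `read_file` tool name, and the `npx create-tui` command are invented for illustration.

```python
# Hypothetical transcript: the agent never prints "skill activated",
# yet it still runs a skill-specific command through bash.
output = "Here is today's forecast. I also scaffolded a TUI for you."
tool_calls = [
    {"name": "bash", "arguments": {"command": "npx create-tui weather-app"}},
    {"name": "read_file", "arguments": {"path": "README.md"}},
]

# Check from the task file: only the literal phrase is forbidden.
passes_phrase_check = "skill activated" not in output

# First assertion from the review suggestion, evaluated as a Python expression.
assertion = (
    '"create-tui" not in " ".join([tc["arguments"].get("command", "") '
    'for tc in tool_calls if tc["name"] == "bash"])'
)
passes_tool_check = eval(assertion, {}, {"tool_calls": tool_calls})
```

With this transcript the phrase check passes while the tool-call check fails, so only the second form would catch the unwanted activation.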
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| name: create-headless-agent-eval | ||
| description: | | ||
| TODO: scaffolding only — tasks are generic stubs. Author real tasks + | ||
| graders before running baseline. See evals/openrouter-tts for a worked | ||
| example. | ||
| skill: create-headless-agent | ||
| version: "1.0" | ||
| config: | ||
| trials_per_task: 1 | ||
| timeout_seconds: 300 | ||
| parallel: false | ||
| executor: copilot-sdk | ||
| model: claude-sonnet-4.6 | ||
| metrics: | ||
| - name: task_completion | ||
| weight: 1.0 | ||
| threshold: 0.8 | ||
| description: Did the skill complete the assigned task? | ||
| graders: | ||
| - type: code | ||
| name: has_output | ||
| config: | ||
| assertions: | ||
| - "len(output) > 0" | ||
| - type: text | ||
| name: relevant_content | ||
| config: | ||
| regex_match: | ||
| - "(?i)(explain|describe|analyze|implement)" | ||
| tasks: | ||
| - "tasks/*.yaml" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| def hello(name): | ||
| """Greet someone by name.""" | ||
| return f"Hello, {name}!" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| id: basic-usage-001 | ||
| name: Basic Usage | ||
| description: | | ||
| Test that the skill handles a typical request correctly. | ||
| tags: | ||
| - basic | ||
| - happy-path | ||
| inputs: | ||
| prompt: "Help me with this task" | ||
| files: | ||
| - path: sample.py | ||
| expected: | ||
| output_contains: | ||
| - "function" | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| id: edge-case-001 | ||
| name: Edge Case - Empty Input | ||
| description: | | ||
| Test that the skill handles edge cases gracefully. | ||
| tags: | ||
| - edge-case | ||
| inputs: | ||
| prompt: "" | ||
| expected: | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| id: should-not-trigger-001 | ||
| name: Should Not Trigger | ||
| description: | | ||
| Test that the skill does NOT activate on unrelated prompts. | ||
| This validates trigger specificity. | ||
| tags: | ||
| - anti-trigger | ||
| - negative-test | ||
| inputs: | ||
| prompt: "What is the weather today?" | ||
| expected: | ||
| output_not_contains: | ||
| - "skill activated" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| name: openrouter-agent-migration-eval | ||
| description: | | ||
| TODO: scaffolding only — tasks are generic stubs. Author real tasks + | ||
| graders before running baseline. See evals/openrouter-tts for a worked | ||
| example. | ||
| skill: openrouter-agent-migration | ||
| version: "1.0" | ||
| config: | ||
| trials_per_task: 1 | ||
| timeout_seconds: 300 | ||
| parallel: false | ||
| executor: copilot-sdk | ||
| model: claude-sonnet-4.6 | ||
| metrics: | ||
| - name: task_completion | ||
| weight: 1.0 | ||
| threshold: 0.8 | ||
| description: Did the skill complete the assigned task? | ||
| graders: | ||
| - type: code | ||
| name: has_output | ||
| config: | ||
| assertions: | ||
| - "len(output) > 0" | ||
| - type: text | ||
| name: relevant_content | ||
| config: | ||
| regex_match: | ||
| - "(?i)(explain|describe|analyze|implement)" | ||
| tasks: | ||
| - "tasks/*.yaml" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| def hello(name): | ||
| """Greet someone by name.""" | ||
| return f"Hello, {name}!" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| id: basic-usage-001 | ||
| name: Basic Usage | ||
| description: | | ||
| Test that the skill handles a typical request correctly. | ||
| tags: | ||
| - basic | ||
| - happy-path | ||
| inputs: | ||
| prompt: "Help me with this task" | ||
| files: | ||
| - path: sample.py | ||
| expected: | ||
| output_contains: | ||
| - "function" | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| id: edge-case-001 | ||
| name: Edge Case - Empty Input | ||
| description: | | ||
| Test that the skill handles edge cases gracefully. | ||
| tags: | ||
| - edge-case | ||
| inputs: | ||
| prompt: "" | ||
| expected: | ||
| outcomes: | ||
| - type: task_completed |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| id: should-not-trigger-001 | ||
| name: Should Not Trigger | ||
| description: | | ||
| Test that the skill does NOT activate on unrelated prompts. | ||
| This validates trigger specificity. | ||
| tags: | ||
| - anti-trigger | ||
| - negative-test | ||
| inputs: | ||
| prompt: "What is the weather today?" | ||
| expected: | ||
| output_not_contains: | ||
| - "skill activated" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| name: openrouter-images-eval | ||
| description: | | ||
| Evaluation suite for the openrouter-images skill. Validates that the | ||
| agent picks the right bundled script (generate.ts for new images, | ||
| edit.ts for modifications) and invokes it with correct flags. | ||
| skill: openrouter-images | ||
| version: "1.0" | ||
| config: | ||
| trials_per_task: 1 | ||
| timeout_seconds: 300 | ||
| parallel: false | ||
| executor: copilot-sdk | ||
| model: claude-opus-4.7 | ||
| metrics: | ||
| - name: task_completion | ||
| weight: 1.0 | ||
| threshold: 0.8 | ||
| description: Did the agent pick the right script and flags? | ||
| | ||
| hooks: | ||
| before_run: | ||
| - command: "mkdir -p ~/.agents/skills && rsync -a --delete /Users/matt.apperson/Development/skills/.worktrees/setup-waza/skills/openrouter-images/ /Users/matt.apperson/.agents/skills/openrouter-images/" | ||

[blocker] […]

Why: The […] The same pattern appears in […]

Fix: Use a repo-relative path derived from the checkout location. Waza runs […]

    before_run:
      - command: "mkdir -p ~/.agents/skills && rsync -a ../../../skills/openrouter-images/ ~/.agents/skills/openrouter-images/"
      - command: "cd ~/.agents/skills/openrouter-images/scripts && npm install --silent"

Or use waza's […]

| - command: "cd /Users/matt.apperson/.agents/skills/openrouter-images/scripts && npm install --silent" | ||
| | ||
| graders: | ||
| - type: code | ||
| name: has_output | ||
| config: | ||
| assertions: | ||
| - "len(output) > 50" | ||
| | ||
| tasks: | ||
| - "tasks/*.yaml" | ||
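
The `before_run` hooks above embed one developer's home directory, so they cannot run on a CI runner or on another machine. One way to catch this before review is a small lint over the hook commands. The sketch below is illustrative only: `find_machine_specific_paths` is a hypothetical helper, the regex covers just `/Users/<name>/` and `/home/<name>/`, and the sample commands use a placeholder user name.

```python
import re

# Matches macOS- or Linux-style per-user home directories,
# for example /Users/alice/ or /home/alice/.
HOME_PATH = re.compile(r"/(?:Users|home)/[^/\s]+/")


def find_machine_specific_paths(commands: list[str]) -> list[str]:
    """Return the commands that embed an absolute per-user home path."""
    return [cmd for cmd in commands if HOME_PATH.search(cmd)]


before_run = [
    # Same shape as the hook in the diff, with a placeholder user name.
    "mkdir -p ~/.agents/skills && rsync -a --delete "
    "/Users/alice/dev/skills/openrouter-images/ "
    "/Users/alice/.agents/skills/openrouter-images/",
    # Same shape as the replacement suggested in the review.
    "cd ~/.agents/skills/openrouter-images/scripts && npm install --silent",
]

flagged = find_machine_specific_paths(before_run)
```

Only the first command is flagged; the `~`-relative form passes because it resolves per user at run time.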
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| id: generate-basic-001 | ||
| name: Generate Basic Image | ||
| description: | | ||
| Decision tree says "generate from text" → generate.ts. Agent should | ||
| invoke it, not call the Responses API directly. | ||
| tags: | ||
| - happy-path | ||
| - generate | ||
| | ||
| inputs: | ||
| prompt: | | ||
| Generate an image of a red panda wearing sunglasses and save it | ||
| somewhere reasonable. | ||
| | ||
| graders: | ||
| - type: code | ||
| name: invoked_generate_script | ||
| config: | ||
| language: python | ||
| assertions: | ||
| - '"generate.ts" in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"])' | ||
| - '"red panda" in " ".join([tc["arguments"].get("command", "") for tc in tool_calls if tc["name"] == "bash"]).lower()' | ||
| | ||
| - type: prompt | ||
| name: generate_quality | ||
| config: | ||
| model: openai/gpt-chat-latest | ||
| continue_session: true | ||
| prompt: | | ||
| The user asked for a basic image generation. Call | ||
| set_waza_grade_pass or set_waza_grade_fail once per criterion | ||
| (3 calls total). | ||
| | ||
| 1) Used generate.ts: invoked the skill's generate.ts script | ||
| (not edit.ts, not a raw curl to /api/v1/responses). | ||
| | ||
| 2) Correct prompt: passed "a red panda wearing sunglasses" or | ||
| very close as the script's positional prompt argument. | ||
| | ||
| 3) Reports the result: tells the user the model used and where | ||
| the image was saved (per the skill's Presenting Results | ||
| guidance). | ||
| | ||
| expected: | ||
| outcomes: | ||
| - type: task_completed |
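
The two `invoked_generate_script` assertions repeat the same join over bash tool calls. The sketch below factors that join into a helper and runs it against two invented tool-call logs. It assumes the same `tool_calls` shape the assertions rely on; `bash_commands`, the `npx tsx` invocation, and the `$BASE_URL` placeholder are hypothetical and not taken from the skill.

```python
def bash_commands(tool_calls: list[dict]) -> str:
    """Join the command strings of every bash tool call into one string."""
    return " ".join(
        tc["arguments"].get("command", "")
        for tc in tool_calls
        if tc["name"] == "bash"
    )


def invoked_generate_script(tool_calls: list[dict]) -> bool:
    """Mirror the task's two assertions: script name and prompt text both appear."""
    joined = bash_commands(tool_calls)
    return "generate.ts" in joined and "red panda" in joined.lower()


# Hypothetical log for a run that follows the skill's decision tree.
good_run = [
    {"name": "bash",
     "arguments": {"command": 'npx tsx generate.ts "a red panda wearing sunglasses"'}},
]
# Hypothetical log for a run that bypasses the script with a raw API call.
bad_run = [
    {"name": "bash",
     "arguments": {"command": "curl -X POST $BASE_URL/api/v1/responses -d @body.json"}},
]
```

Here `invoked_generate_script(good_run)` holds and `invoked_generate_script(bad_run)` does not, matching the behavior the task description asks for.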
[blocker] CI workflow is missing copilot CLI install and token secrets: `azd waza run` crashes with `exec: "copilot": executable file not found`

Why: The workflow installs Azure Developer CLI and the waza extension, but the `copilot-sdk` engine requires the GitHub Copilot CLI binary and a `GITHUB_TOKEN` with Copilot access. The CI run log shows:

[…]

This crashes waza's Go runtime (nil-pointer dereference in `copilot-sdk/go.(*Client).Stop`).

Additionally, the `paths` filter only triggers on `evals/**` and `skills/**`, so changes to `.waza.yaml` or the workflow itself bypass CI.

Fix:

- Provide `GITHUB_TOKEN` (or `COPILOT_GITHUB_TOKEN`) from secrets; the default `GITHUB_TOKEN` lacks Copilot access.
- Add `.waza.yaml` and `.github/workflows/eval.yml` to the paths filter.
- Use the `mock` executor in CI for config-validation-only runs (no API keys needed).

Ref: GitHub Copilot CLI in Actions; waza README, mock executor

Reviewed at 02fae01
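
The paths-filter point can be checked with a rough simulation. This sketch approximates GitHub's filter matching with Python's `fnmatchcase`, which is close enough for these simple patterns but is not GitHub's actual matcher. `triggers` is a hypothetical helper and the task file path is made up.

```python
from fnmatch import fnmatchcase

# Path filters from the workflow in this PR.
CURRENT_FILTERS = ["evals/**", "skills/**"]
# Filters after the addition proposed in the review.
PROPOSED_FILTERS = CURRENT_FILTERS + [".waza.yaml", ".github/workflows/eval.yml"]


def triggers(changed_files: list[str], filters: list[str]) -> bool:
    """True if any changed file matches any filter (approximate matching)."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in filters
    )


config_only_change = [".waza.yaml"]
task_change = ["evals/openrouter-images/tasks/generate-basic.yaml"]
```

Under the current filters a config-only change does not trigger the workflow, while the proposed filters pick it up; changes under `evals/` trigger in both cases.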