fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push by james-tn · Pull Request #431 · microsoft/OpenAIWorkshop

james-tn · 2026-06-10T03:10:40Z

Summary

Fixes three evaluation-correctness defects discovered while running the MLADS agent-evaluation lab locally. Together they made two metrics — tool_call_accuracy and task_adherence — report misleadingly low scores that reflected measurement bugs, not real agent quality. Also includes a fix that makes Foundry --remote push work again on the current SDK.

Changes

1. Foundry `--remote` push (azure-ai-projects 2.1.0)

get_openai_client(default_query={"api-version": ...}) breaks on azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the legacy api-version query parameter (400: api-version query parameter is not allowed when using /v1 path), so every --remote run failed to push results. Now calls get_openai_client() plainly with a fallback for older SDKs.

2. Streaming tool-call capture (agent-framework ≥ 1.7)

Every function_call delta chunk now carries the tool name + a stable call_id, but the code assumed only the first chunk had a name. Each argument fragment ({", customer, _id, …) was finalized as its own malformed tool call, turning one get_customer_detail({"customer_id":5}) into 6+ garbage calls with {"_raw": ...} args (and spamming duplicate tool_called UI events). track_function_call_start() now de-duplicates on call_id.

Before: 6–42 malformed fragment calls per query → After: 2–4 clean calls, 0 malformed.

3. Tool results captured + fed to the task-adherence judge

function_result output (keyed by call_id) was discarded; the mixin now records it via track_function_result().
evaluate_task_adherence passed tool calls but no tool results, so grounded answers looked fabricated (judge: "no corroborating tool interactions"). It now emits role:tool result messages so the judge can verify grounding.
azure-ai-evaluation 1.14.0 made TaskAdherenceEvaluator a binary "flagged" grader (score ∈ {0,1} + pass/fail result). The old code treated it as 1–5 and thresholded >= 3, so every PASS was recorded as a fail and displayed as 1.0/5. Scoring now honors the pass/fail result and maps the binary verdict to the 1–5 display.

Verification

5-case handoff run (local):

Tool calls captured cleanly (2–4/query, 0 malformed vs 6–42 before).
task_adherence average 0.2 → 3.4; the remaining low scores are genuine hallucinations (agent citing payment data absent from tool output) — exactly what the metric should catch.
--remote push verified working (portal link generated).

Scope / risk

Touches the shared ToolCallTrackingMixin and the single/handoff/reflection agents' streaming loops. Changes are strict improvements: fewer duplicate UI events and correct argument parsing. The Autogen serialization path (backend.py) is unaffected (Autogen delivers complete arguments, not deltas). No existing unit tests cover these paths; validated via live runs.

get_openai_client(default_query={'api-version': ...}) breaks on azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the legacy api-version query parameter (400: 'api-version query parameter is not allowed when using /v1 path'), so every --remote run failed to push results to Azure AI Foundry. Call get_openai_client() without forcing api-version, falling back to an explicit api_version kwarg only for older SDKs that require it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three related defects made tool_call_accuracy (~2.0) and task_adherence (~0.2) misleadingly low, independent of real agent quality. 1. Streaming tool-call capture (agent-framework >= 1.7): every function_call delta chunk now carries the tool name + a stable call_id, but the code assumed only the first chunk had a name. Each argument fragment was finalized as its own malformed tool call, turning one get_customer_detail({"customer_id":5}) into 6+ garbage calls with {"_raw": ...} args (and spamming duplicate tool_called UI events). track_function_call_start() now de-duplicates on call_id and only starts/broadcasts a genuinely new call. One real call = one clean call. 2. Tool results were never captured. function_result content carries the tool output (keyed by call_id) but was discarded. The mixin now records it via track_function_result(), so backend tools_used and the evaluator can see what each tool returned. 3. task_adherence judge could not verify grounding. metrics.py passed tool calls but no tool results, so grounded answers looked fabricated (judge: "no corroborating tool interactions"). It now emits role:tool result messages. Additionally, azure-ai-evaluation 1.14.0 made TaskAdherenceEvaluator a binary "flagged" grader (score in {0,1} plus a pass/fail result); the old code treated it as 1-5 and thresholded >= 3, so every PASS was recorded as a fail and shown as 1.0/5. Scoring now honors the pass/fail result and maps the binary verdict to a 1-5 display. Verified on a 5-case handoff run: tool calls captured cleanly (2-4 per query, 0 malformed); task_adherence avg 0.2 -> 3.4, with the remaining low scores being genuine hallucinations (agent citing data absent from tool output) rather than measurement artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

James N. and others added 2 commits June 9, 2026 19:25

james-tn temporarily deployed to production June 10, 2026 03:11 — with GitHub Actions Inactive

james-tn merged commit 9c91bf7 into main Jun 10, 2026
18 checks passed

james-tn deleted the fix/eval-foundry-remote-client branch June 10, 2026 03:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push#431

fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push#431
james-tn merged 2 commits into
mainfrom
fix/eval-foundry-remote-client

james-tn commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

james-tn commented Jun 10, 2026

Summary

Changes

1. Foundry --remote push (azure-ai-projects 2.1.0)

2. Streaming tool-call capture (agent-framework ≥ 1.7)

3. Tool results captured + fed to the task-adherence judge

Verification

Scope / risk

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Foundry `--remote` push (azure-ai-projects 2.1.0)