fix(eval): correct tool-call capture, task-adherence scoring, and Foundry --remote push #431
Merged
get_openai_client(default_query={'api-version': ...}) breaks on
azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the
legacy api-version query parameter (400: 'api-version query parameter is
not allowed when using /v1 path'), so every --remote run failed to push
results to Azure AI Foundry.
Call get_openai_client() without forcing api-version, falling back to an
explicit api_version kwarg only for older SDKs that require it.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
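
The fallback described in this commit can be sketched as below. This is a minimal illustration, not the PR's actual code: `open_openai_client` and `legacy_api_version` are hypothetical names, and the assumption that an older SDK signals a missing version with `TypeError` or `ValueError` is not confirmed by the source.

```python
from typing import Any


def open_openai_client(project_client: Any, legacy_api_version: str) -> Any:
    """Get an OpenAI client without forcing api-version on the /v1 path.

    `project_client` is anything exposing get_openai_client(), such as an
    AIProjectClient. `legacy_api_version` is a hypothetical parameter that
    is used only when the plain call is rejected.
    """
    try:
        # Preferred path: send no api-version, since the unified /v1 route
        # rejects that query parameter with a 400.
        return project_client.get_openai_client()
    except (TypeError, ValueError):
        # Assumed failure mode of an older SDK that requires a version.
        return project_client.get_openai_client(api_version=legacy_api_version)
```

Trying the plain call first keeps the current SDK on the supported path and confines the version string to the legacy branch.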
Three related defects made tool_call_accuracy (~2.0) and task_adherence
(~0.2) misleadingly low, independent of real agent quality.
1. Streaming tool-call capture (agent-framework >= 1.7): every function_call
delta chunk now carries the tool name + a stable call_id, but the code
assumed only the first chunk had a name. Each argument fragment was
finalized as its own malformed tool call, turning one
get_customer_detail({"customer_id":5}) into 6+ garbage calls with
{"_raw": ...} args (and spamming duplicate tool_called UI events).
track_function_call_start() now de-duplicates on call_id and only
starts/broadcasts a genuinely new call. One real call = one clean call.
2. Tool results were never captured. function_result content carries the
tool output (keyed by call_id) but was discarded. The mixin now records
it via track_function_result(), so backend tools_used and the evaluator
can see what each tool returned.
3. task_adherence judge could not verify grounding. metrics.py passed tool
calls but no tool results, so grounded answers looked fabricated (judge:
"no corroborating tool interactions"). It now emits role:tool result
messages. Additionally, azure-ai-evaluation 1.14.0 made
TaskAdherenceEvaluator a binary "flagged" grader (score in {0,1} plus a
pass/fail result); the old code treated it as 1-5 and thresholded >= 3,
so every PASS was recorded as a fail and shown as 1.0/5. Scoring now
honors the pass/fail result and maps the binary verdict to a 1-5 display.
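
As a rough sketch of the scoring change in item 3, the verdict can drive both the pass flag and the displayed value. The result key `task_adherence_result` and the display values (pass shown as 5, fail as 1) are assumptions made for illustration; the commit does not state the exact mapping, and `score_task_adherence` is a hypothetical helper name.

```python
from typing import Any, Mapping, Tuple

# Assumed display values on the 1-5 scale; the source does not specify them.
PASS_DISPLAY = 5.0
FAIL_DISPLAY = 1.0


def score_task_adherence(result: Mapping[str, Any]) -> Tuple[bool, float]:
    """Return (passed, display_score) from a binary evaluator result.

    The pass/fail verdict is read directly instead of thresholding the
    numeric score at >= 3, which can never succeed for a score in {0, 1}.
    """
    verdict = str(result.get("task_adherence_result", "")).strip().lower()
    passed = verdict == "pass"
    return passed, PASS_DISPLAY if passed else FAIL_DISPLAY
```

Under this assumed mapping, a five-case run with three passes and two fails would display an average of 3.4.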
Verified on a 5-case handoff run: tool calls captured cleanly (2-4 per
query, 0 malformed); task_adherence avg 0.2 -> 3.4, with the remaining low
scores being genuine hallucinations (agent citing data absent from tool
output) rather than measurement artifacts.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Fixes three evaluation-correctness defects discovered while running the MLADS agent-evaluation lab locally. Together they made two metrics, tool_call_accuracy and task_adherence, report misleadingly low scores that reflected measurement bugs, not real agent quality. Also includes a fix that makes Foundry --remote push work again on the current SDK.
Changes
1. Foundry --remote push (azure-ai-projects 2.1.0)
get_openai_client(default_query={"api-version": ...}) breaks on azure-ai-projects >= 2.1.0: the new unified /v1 Foundry path rejects the legacy api-version query parameter (400: api-version query parameter is not allowed when using /v1 path), so every --remote run failed to push results. Now calls get_openai_client() plainly, with a fallback for older SDKs.
2. Streaming tool-call capture (agent-framework ≥ 1.7)
Every function_call delta chunk now carries the tool name + a stable call_id, but the code assumed only the first chunk had a name. Each argument fragment ({", customer, _id, …) was finalized as its own malformed tool call, turning one get_customer_detail({"customer_id":5}) into 6+ garbage calls with {"_raw": ...} args (and spamming duplicate tool_called UI events). track_function_call_start() now de-duplicates on call_id.
Before: 6–42 malformed fragment calls per query → After: 2–4 clean calls, 0 malformed.
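
The de-duplication described above can be illustrated with a small tracker. This is a sketch under stated assumptions, not the actual ToolCallTrackingMixin: the method names `track_function_call_start` and `track_function_result` come from the PR, but their signatures here are guesses, and `ToolCallTracker`, `TrackedCall`, and `append_arguments` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TrackedCall:
    call_id: str
    name: str
    raw_arguments: str = ""
    result: Optional[Any] = None

    def arguments(self) -> Dict[str, Any]:
        """Parse the accumulated fragments once the stream has finished."""
        try:
            parsed = json.loads(self.raw_arguments or "{}")
        except json.JSONDecodeError:
            return {"_raw": self.raw_arguments}
        return parsed if isinstance(parsed, dict) else {"_raw": self.raw_arguments}


class ToolCallTracker:
    """Hypothetical stand-in for the mixin's call bookkeeping."""

    def __init__(self) -> None:
        self._calls: Dict[str, TrackedCall] = {}
        self.ui_events: List[str] = []

    def track_function_call_start(self, call_id: str, name: str) -> bool:
        """Register a call; return False if this call_id was already seen."""
        if call_id in self._calls:
            return False  # repeated delta chunk for the same call
        self._calls[call_id] = TrackedCall(call_id, name)
        self.ui_events.append(name)  # broadcast once per real call
        return True

    def append_arguments(self, call_id: str, fragment: str) -> None:
        """Accumulate one streamed argument fragment for a known call."""
        self._calls[call_id].raw_arguments += fragment

    def track_function_result(self, call_id: str, result: Any) -> None:
        """Attach the tool output to the call it belongs to, if known."""
        call = self._calls.get(call_id)
        if call is not None:
            call.result = result

    def calls(self) -> List[TrackedCall]:
        return list(self._calls.values())
```

Keying on call_id rather than on "does this chunk carry a name" means every fragment of one streamed call accumulates into a single argument string, which is parsed only once at the end.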
3. Tool results captured + fed to the task-adherence judge
function_result output (keyed by call_id) was discarded; the mixin now records it via track_function_result().
evaluate_task_adherence passed tool calls but no tool results, so grounded answers looked fabricated (judge: "no corroborating tool interactions"). It now emits role:tool result messages so the judge can verify grounding.
azure-ai-evaluation 1.14.0 made TaskAdherenceEvaluator a binary "flagged" grader (score ∈ {0,1} + pass/fail result). The old code treated it as 1–5 and thresholded >= 3, so every PASS was recorded as a fail and displayed as 1.0/5. Scoring now honors the pass/fail result and maps the binary verdict to the 1–5 display.
Verification
5-case handoff run (local):
task_adherence average 0.2 → 3.4; the remaining low scores are genuine hallucinations (agent citing payment data absent from tool output), which is exactly what the metric should catch.
--remote push verified working (portal link generated).
Scope / risk
Touches the shared ToolCallTrackingMixin and the single/handoff/reflection agents' streaming loops. Changes are strict improvements: fewer duplicate UI events and correct argument parsing. The Autogen serialization path (backend.py) is unaffected (Autogen delivers complete arguments, not deltas). No existing unit tests cover these paths; validated via live runs.