feat: agent self-correction via validation feedback loop#57
Merged
Conversation
Restructure validation into composable steps so typecheck (~5s) runs independently before full validation. Quick checks short-circuit on typecheck failure and format errors as actionable agent prompts, laying the foundation for the agent retry loop.
Extend the async generator in agent-interface to yield follow-up correction prompts when quick-checks (typecheck/build) fail. The agent retains full conversation context and gets up to 2 chances to fix its own mistakes before results surface to the user. Configurable via maxRetries option (default 2, 0 to disable).
Add retry-aware execution to AgentExecutor using the same async generator + quick-checks pattern from production. Evals now track three tiers: first-attempt, with-correction, and with-retry pass rates. Adds --no-correction flag to disable for baseline comparison.
AgentExecutor now delegates to the production runAgent instead of reimplementing the retry-aware async generator. Exports AgentRunConfig so evals can construct it directly, adds onMessage hook for latency tracking. Includes 13 tests verifying the wiring.
…rics First-attempt now means zero corrections, which is stricter than before. Lower threshold to 30% (aspirational), add withCorrectionPassRate at 90% as the primary quality gate, keep withRetryPassRate at 95%.
Two eval runs show ~21-27% first-attempt rate. The correction loop consistently brings it to 93-100%. Set threshold at 20% to catch regressions without failing on normal variance.
…hreshold detectTypecheckCommand was falling back to npx tsc --noEmit for every project including Python, Ruby, Go, etc. Now checks for tsconfig.json before falling back — no tsconfig means skip typecheck entirely. This eliminates false correction triggers on non-JS frameworks. Raises first-attempt threshold to 50% since the false positives were the main driver of the low rate.
…port Extend quick-checks to auto-detect Go (go.mod), Elixir (mix.exs), .NET (*.csproj), and Kotlin/Java (build.gradle) build commands from project files. Interpreted languages (Python, Ruby, PHP) pass through silently — no universal build command exists for them.
…parsing Raise firstAttemptPassRate from 50% to 80% now that false positives from non-TS projects are eliminated (85.7% observed in latest run). Fix quality grader parsing: the greedy regex matched braces inside <thinking> tags. Now extracts JSON only after </thinking> and uses a non-greedy pattern to avoid capturing nested objects.
…move dead code Extract passResult helper (4 identical object literals → 1 function), unify parseTypecheckErrors into single regex with Set dedup, extract quickCheckValidateAndFormat shared between agent-runner and eval executor, remove getIntegration indirection and dead continueUrl param.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
runAgentso evals exercise the actual retry pathWhy
The installer ran its agent as a single-shot operation — when validation caught fixable issues, the results went to the user, not back to the agent. The agent never got a chance to fix its own mistakes.
Eval results (14 frameworks,
--state=example):Architecture
The retry loop uses an async generator that yields follow-up user messages into the SDK's
query(). The agent retains full conversation context.Changes
Quick checks (
src/lib/validation/quick-checks.ts): Typecheck + build as composable steps. Short-circuits on typecheck failure.quickCheckValidateAndFormatshared between production and evals.Multi-ecosystem build detection (
src/lib/validation/build-validator.ts):detectBuildCommandchecks package.json, go.mod, mix.exs, *.csproj, build.gradle. Returns null for interpreted languages.Retry loop (
src/lib/agent-interface.ts): Async generator yields correction prompts on validation failure. Promise-based turn coordination. ExportsAgentRunConfig+onMessagehook for evals.Evals (
tests/evals/agent-executor.ts): Delegates to productionrunAgent. Three-tier success criteria: first-attempt (80%), with-correction (90%), with-retry (95%).--no-correctionflag.Validator composability (
src/lib/validation/validator.ts): ExportedvalidatePackages,validateEnvVars,validateFiles,validateFrameworkSpecificwith return-based signatures.Notes
<thinking>tags)