refactor(skill-runner): split into SkillDefinitionGAgent + SkillExecutionGAgent#497
refactor(skill-runner): split into SkillDefinitionGAgent + SkillExecutionGAgent#497
Conversation
…Agent + SkillExecutionGAgent Separates the long-lived subscription/schedule state from per-execution session state so retry/schedule collisions are eliminated and new templates require zero code changes (issue #491). - SkillDefinitionGAgent: owns schedule, config, outbound config; spawns SkillExecutionGAgent on each cron fire - SkillExecutionGAgent: session-scoped, owns LLM execution, streaming delivery, tool failure safety net, retry semantics - New proto files for definition and execution state/commands/events - Legacy aliases for SkillRunnerState → SkillDefinitionState migration - CI guard: tools/ci/skill_runner_lifecycle_split_guard.sh - All 667 channel runtime tests + 105 architecture tests pass Closes #491 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review: SkillDefinitionGAgent + SkillExecutionGAgent SplitOverall the split is clean — long-lived definition actor vs session-scoped execution actor is the right architecture per #491 and aligns with the AGENTS.md lifecycle rules. A few issues worth addressing: 1. HandleTriggerAsync: fire-and-forget execution dispatch with no error propagationIn catch (Exception ex)
{
Logger.LogWarning(ex, "Skill definition {ActorId} failed to spawn execution {ExecutionId}", Id, executionId);
}
await Scheduler.ScheduleNextRunAsync(now, CancellationToken.None);If the execution actor spawn or dispatch fails (actor runtime overloaded, transient infrastructure error), the definition actor silently schedules the next cron. No Suggestion: On spawn/dispatch failure, emit a domain event and update the registry status so the missed run is visible in 2. SkillExecutionGAgent.HandleStartAsync: retry counter resets are lost on crashIn But: There's a subtle bug — Suggestion: Either schedule a watchdog timeout immediately after persisting the started event (so a crash is recovered), or document that this is acceptable for session-scoped actors. 3. UpdateRegistryExecutionAsync uses definitionId as the registry key, but error_count is overwritten to 0 on successIn await UpdateRegistryExecutionAsync(
SkillDefinitionDefaults.StatusRunning,
Timestamp.FromDateTimeOffset(now),
0, // errorCount always 0
string.Empty,
CancellationToken.None);And in the failure path: await UpdateRegistryExecutionAsync(
SkillDefinitionDefaults.StatusError,
Timestamp.FromDateTimeOffset(now),
1, // hardcoded to 1, not cumulative
ex.Message,
CancellationToken.None);The 4. _retryLease is a mutable field on the actor but is not part of persisted state
The Suggestion: Add an idempotency guard at the top of 5. SkillRunnerToolFailureCounter — name not updatedThe class Minor
|
ReviewThe split of Issues to fix: 1. HandleTriggerAsync swallows spawn failures and still advances the schedule
Recommendation: Either persist a 2.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 91b545e309
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| await UpdateRegistryExecutionAsync( | ||
| SkillDefinitionDefaults.StatusRunning, | ||
| Timestamp.FromDateTimeOffset(now), |
There was a problem hiding this comment.
Preserve disabled status when execution completes
This execution-path update unconditionally writes StatusRunning for the definition entry, so a long-running execution can overwrite a user-issued disable that happened while the run was in flight. In the split design, SkillDefinitionGAgent and SkillExecutionGAgent now run concurrently, so this race is real: disable sets catalog status to disabled, then completion flips it back to running even though scheduling remains off. That leaves lifecycle state inconsistent and can mislead run_agent/status flows that key off catalog status.
Useful? React with 👍 / 👎.
| var command = new UserAgentCatalogExecutionUpdateCommand | ||
| { | ||
| AgentId = definitionId, Status = status, | ||
| LastRunAt = lastRunAt, | ||
| ErrorCount = errorCount, LastError = lastError ?? string.Empty, |
There was a problem hiding this comment.
Keep next scheduled run when updating execution state
This command omits NextRunAt, but UserAgentCatalogGAgent.ApplyExecutionUpdated assigns existing.NextRunAt = evt.NextRunAt directly; because the field is unset here, each execution update clears the catalog's next-fire timestamp. After the first run, agent status/read models lose next_scheduled_run, which breaks schedule visibility and any polling logic that relies on that field.
Useful? React with 👍 / 👎.
eanzhao
left a comment
There was a problem hiding this comment.
P2: SkillExecutionState.scheduled_at never populated
SkillExecutionGAgent.cs HandleStartAsync receives StartSkillExecutionCommand.scheduled_at (set by the definition actor at SkillDefinitionGAgent.cs:133) but the value is discarded:
SkillExecutionStartedEventhas noscheduled_atfieldApplyStarted(SkillExecutionGAgent.cs:515) does not setnext.ScheduledAt
Result: SkillExecutionState.scheduled_at stays null forever. Either propagate the field through the event and applier, or add reserved 2 to SkillExecutionState to match how SkillDefinitionState handled fields 8/10/11/12.
| private async Task EnsureSkillRunnerActorAsync(string agentId, CancellationToken ct) | ||
| private async Task EnsureDefinitionActorAsync(string agentId, CancellationToken ct) | ||
| { | ||
| _ = await _actorRuntime.GetAsync(agentId) |
There was a problem hiding this comment.
Existing scheduled agents were created with the Orleans runtime state AgentTypeName set to the old SkillRunnerGAgent assembly-qualified type. GetAsync(agentId) only checks that this stored type name is non-empty, so it will return a non-null actor here, but when the grain later handles the envelope it calls Type.GetType on the deleted SkillRunnerGAgent type, fails initialization, and drops the command. That means existing /daily agents cannot be triggered/disabled/enabled after this deploy. Please add an actor-type migration/alias path before treating existing actors as valid SkillDefinitionGAgents, or keep a compatible old type that forwards to the new definition actor.
| Enabled = evt.Enabled, | ||
| ScopeId = evt.ScopeId, | ||
| ProviderName = evt.ProviderName, | ||
| Model = evt.Model, |
There was a problem hiding this comment.
This legacy replay path drops the optional LLM tuning fields from persisted SkillRunnerInitializedEvents (temperature, max_tokens, max_tool_rounds, max_history_messages). Existing agents that customized these will silently replay into SkillDefinitionState with temperature/max_tokens unset and max tool/history reset to defaults, changing runtime behavior after the split. Please copy the Has* fields here (and in HandleLegacyInitializeAsync) the same way HandleInitializeAsync does.
eanzhao
left a comment
There was a problem hiding this comment.
Review
整体架构拆分方向正确,生命周期分离清晰,proto 定义和 legacy alias 的迁移设计合理。发现几个问题需要确认或修改:
1. SkillDefinitionGAgent.HandleTriggerAsync 中 spawn execution 失败的异常被静默吞掉
SkillDefinitionGAgent.cs 的 HandleTriggerAsync 方法中,try-catch 在 CreateAsync/DispatchAsync 失败时只记录 warning,不会向调用方或 registry 报告失败。这意味着:
- 用户手动 trigger 后,如果 execution actor 创建失败,没有任何错误反馈
- registry 中的
last_error/error_count不会更新 - 用户无法感知到持续失败
建议:在 catch 中至少更新 registry status 为 error,或者 persist 一个事件让 projection 可以观察到失败。
2. SkillExecutionGAgent.HandleRetryAsync 没有更新 State.RetryAttempts
SkillExecutionGAgent.cs 的 HandleRetryAsync 使用 command.RetryAttempt 做判断,但 State.RetryAttempts 在 retry 链中不会被更新(ApplyStarted 没有设置它,retry 路径也没有 persist 事件更新它)。当前 MaxRetryAttempts = 1 所以这个 bug 不会触发,但如果将来增加重试次数,会导致基于 state 的判断出错。建议增加一个 SkillExecutionRetryStartedEvent 来更新 RetryAttempts。
3. ChannelArchitectureTests 中 SkillDefinitionGAgent 的 proactiveCallerPatterns 注释已过时
ChannelArchitectureTests.cs 中 SkillDefinitionGAgent 被加入 proactiveCallerPatterns,但它现在不再直接调用任何 channel/Conversation 方法(只负责创建 SkillExecutionGAgent)。注释描述 "must dispatch through ConversationGAgent command envelopes rather than directly invoking IChannelOutboundPort.ContinueConversationAsync" 对 SkillDefinitionGAgent 已不准确。建议更新注释说明,或考虑是否需要将其保留在该列表中。
4. SkillExecutionState proto 中 field 9 跳过缺少说明
skill_execution.proto 中 SkillExecutionState 从 field 8 (status) 跳到 field 10 (skill_content),跳过了 field 9。虽然 protobuf 允许,但建议加一行注释说明 field 9 的预留意图,避免后续维护者困惑。
5. 缺少 SkillExecutionGAgent 核心逻辑(retry、failure)的独立单元测试
当前 SkillDefinitionGAgentTests.cs 主要测试了 SendOutputAsync,但 HandleStartAsync 的 catch-retry 路径、HandleRetryAsync 的最终失败路径等核心逻辑缺少独立的单元测试。虽然这些路径在旧版中也没有被直接测试,但拆分后 SkillExecutionGAgent 是一个新的核心 actor,建议至少增加 retry 调度和失败处理的 mock 测试(无需真实 LLM provider)。
其他方面(proto reserved 字段、legacy alias、backward-compat handlers、CI guard)都处理得很好,没有问题。
|
Blocking on #498 This PR deletes
Per #498, the correct resolution is the stable Do not merge until #498 Phase 1 is in |
Summary
Splits
SkillRunnerGAgentinto two generic actors, keeping template differences purely in instance data (#491):SkillDefinitionGAgent(long-lived fact owner): owns subscription, schedule, config, outbound config. On each cron fire, spawns a session-scoped execution actor with a config snapshot.SkillExecutionGAgent(session-scoped): owns LLM execution, streaming-edit delivery, tool failure safety net (bug(daily): SkillRunner masks GitHub tool failures as silent "no activity" successful runs #439), and retry semantics. Naturally discardable; run history is its own projection.SkillDefinitionGAgentinstances with differentskill_content+ toolset.tools/ci/skill_runner_lifecycle_split_guard.shvalidates: noSkillRunnerGAgentclass, no per-execution fields in definition state, no per-template actor types.SkillRunnerStateevents toSkillDefinitionStatefor seamless migration on next activation.Changes
skill_definition.proto(definition state withreserved 8,10,11,12for dropped execution fields) andskill_execution.proto(session state + config snapshot)SkillRunnerGAgent→SkillDefinitionGAgent(GAgentBase, no LLM) + newSkillExecutionGAgent(AIGAgentBase, LLM execution)ISkillRunnerCommandPort→ISkillDefinitionCommandPort+SkillDefinitionCommandPortSkillRunnerDefaults→SkillDefinitionDefaults(same constant values for backward compat)ScheduledServiceCollectionExtensionsregistrationAgentBuilderTool,AgentBuilderCardFlow,NyxRelayAgentBuilderFlowupdated to new typesSkillDefinitionGAgentfor existing persisted events/durable callbackstools/ci/skill_runner_lifecycle_split_guard.shTest plan
dotnet build aevatar.slnx— build succeedsdotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/— 667/667 passdotnet test test/Aevatar.Architecture.Tests/— 105/105 passbash tools/ci/skill_runner_lifecycle_split_guard.sh— okbash tools/ci/architecture_guards.sh— all relevant guards passCloses #491
🤖 Generated with Claude Code