Skip to content

refactor(skill-runner): split into SkillDefinitionGAgent + SkillExecutionGAgent#497

Open
eanzhao wants to merge 1 commit intodevfrom
refactor/2026-04-28_skill-runner-lifecycle-split
Open

refactor(skill-runner): split into SkillDefinitionGAgent + SkillExecutionGAgent#497
eanzhao wants to merge 1 commit intodevfrom
refactor/2026-04-28_skill-runner-lifecycle-split

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Apr 28, 2026

Summary

Splits SkillRunnerGAgent into two generic actors, keeping template differences purely in instance data (#491):

  • SkillDefinitionGAgent (long-lived fact owner): owns subscription, schedule, config, outbound config. On each cron fire, spawns a session-scoped execution actor with a config snapshot.
  • SkillExecutionGAgent (session-scoped): owns LLM execution, streaming-edit delivery, tool failure safety net (bug(daily): SkillRunner masks GitHub tool failures as silent "no activity" successful runs #439), and retry semantics. Naturally discardable; run history is its own projection.
  • New template = new configuration, zero code changes — daily report, social media, any future template are all SkillDefinitionGAgent instances with different skill_content + toolset.
  • CI guard tools/ci/skill_runner_lifecycle_split_guard.sh validates: no SkillRunnerGAgent class, no per-execution fields in definition state, no per-template actor types.
  • Legacy aliases map existing persisted SkillRunnerState events to SkillDefinitionState for seamless migration on next activation.

Changes

Area What changed
Proto New skill_definition.proto (definition state with reserved 8,10,11,12 for dropped execution fields) and skill_execution.proto (session state + config snapshot)
Actors SkillRunnerGAgentSkillDefinitionGAgent (GAgentBase, no LLM) + new SkillExecutionGAgent (AIGAgentBase, LLM execution)
Command port ISkillRunnerCommandPortISkillDefinitionCommandPort + SkillDefinitionCommandPort
Defaults SkillRunnerDefaultsSkillDefinitionDefaults (same constant values for backward compat)
DI Updated ScheduledServiceCollectionExtensions registration
Integration AgentBuilderTool, AgentBuilderCardFlow, NyxRelayAgentBuilderFlow updated to new types
Migration Legacy aliases + backward-compat command handlers on SkillDefinitionGAgent for existing persisted events/durable callbacks
CI guard tools/ci/skill_runner_lifecycle_split_guard.sh
Tests All 667 channel runtime tests + 105 architecture tests pass

Test plan

  • dotnet build aevatar.slnx — build succeeds
  • dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/ — 667/667 pass
  • dotnet test test/Aevatar.Architecture.Tests/ — 105/105 pass
  • bash tools/ci/skill_runner_lifecycle_split_guard.sh — ok
  • bash tools/ci/architecture_guards.sh — all relevant guards pass

Closes #491

🤖 Generated with Claude Code

…Agent + SkillExecutionGAgent

Separates the long-lived subscription/schedule state from per-execution
session state so retry/schedule collisions are eliminated and new
templates require zero code changes (issue #491).

- SkillDefinitionGAgent: owns schedule, config, outbound config;
  spawns SkillExecutionGAgent on each cron fire
- SkillExecutionGAgent: session-scoped, owns LLM execution, streaming
  delivery, tool failure safety net, retry semantics
- New proto files for definition and execution state/commands/events
- Legacy aliases for SkillRunnerState → SkillDefinitionState migration
- CI guard: tools/ci/skill_runner_lifecycle_split_guard.sh
- All 667 channel runtime tests + 105 architecture tests pass

Closes #491

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 28, 2026

Review: SkillDefinitionGAgent + SkillExecutionGAgent Split

Overall the split is clean — long-lived definition actor vs session-scoped execution actor is the right architecture per #491 and aligns with the AGENTS.md lifecycle rules. A few issues worth addressing:


1. HandleTriggerAsync: fire-and-forget execution dispatch with no error propagation

In SkillDefinitionGAgent.HandleTriggerAsync, the execution actor is created and dispatched inside a catch-all that only logs:

catch (Exception ex)
{
    Logger.LogWarning(ex, "Skill definition {ActorId} failed to spawn execution {ExecutionId}", Id, executionId);
}
await Scheduler.ScheduleNextRunAsync(now, CancellationToken.None);

If the execution actor spawn or dispatch fails (actor runtime overloaded, transient infrastructure error), the definition actor silently schedules the next cron. No SkillExecutionFailedEvent is ever emitted, no registry status update occurs, and the user has zero visibility into the missed execution. Per AGENTS.md: "已提交领域事件必须可观察" — if the failure is never committed as a domain event, it is invisible to the system.

Suggestion: On spawn/dispatch failure, emit a domain event and update the registry status so the missed run is visible in /agent-status.


2. SkillExecutionGAgent.HandleStartAsync: retry counter resets are lost on crash

In HandleStartAsync, on failure the retry is scheduled via ScheduleRetryAsync which stores the _retryLease as an in-memory field. If the actor deactivates between retries (it's session-scoped, so this is plausible), _retryLease is lost. When the durable callback fires and HandleRetryAsync runs, it receives command.RetryAttempt from the callback payload — so the retry attempt counter is correct. However, the actor will re-execute from scratch, and the HandleStartAsync path on fresh activation won't be triggered (only HandleRetryAsync fires from the callback). This seems fine for the retry path.

But: There's a subtle bug — HandleStartAsync does await PersistDomainEventAsync(started) then runs ExecuteSkillAsync. If the actor crashes mid-execution (e.g. process restart), on reactivation the actor replays SkillExecutionStartedEvent and has status = "running" in state, but no one will ever trigger the retry or mark it failed. The durable timeout for retry is only scheduled after ExecuteSkillAsync throws. A crash before that point means the execution is permanently stuck in "running" with no recovery path.

Suggestion: Either schedule a watchdog timeout immediately after persisting the started event (so a crash is recovered), or document that this is acceptable for session-scoped actors.


3. UpdateRegistryExecutionAsync uses definitionId as the registry key, but error_count is overwritten to 0 on success

In HandleStartAsync success path:

await UpdateRegistryExecutionAsync(
    SkillDefinitionDefaults.StatusRunning,
    Timestamp.FromDateTimeOffset(now),
    0,          // errorCount always 0
    string.Empty,
    CancellationToken.None);

And in the failure path:

await UpdateRegistryExecutionAsync(
    SkillDefinitionDefaults.StatusError,
    Timestamp.FromDateTimeOffset(now),
    1,          // hardcoded to 1, not cumulative
    ex.Message,
    CancellationToken.None);

The errorCount in the failure path is always 1 regardless of how many retries actually failed. Since this updates the definition's registry entry, it should probably reflect the actual retry attempt count (e.g. State.RetryAttempts + 1).


4. _retryLease is a mutable field on the actor but is not part of persisted state

_retryLease is an in-memory RuntimeCallbackLease? field. Per AGENTS.md: "Actor/模块运行态只能在事件处理主线程修改" — this is fine for single-threaded actor turns. But if the actor is re-activated after deactivation, _retryLease is null, and CancelRetryLeaseAsync becomes a no-op. This means the old durable callback lease is leaked (never cancelled) if the execution completes after a reactivation. The durable callback will eventually fire and call HandleRetryAsync on a completed/stale execution, which will try to re-execute.

The HandleRetryAsync method doesn't guard against re-execution on an already-completed actor (no status check like if (State.Status != "running") return;).

Suggestion: Add an idempotency guard at the top of HandleRetryAsync (and HandleStartAsync).


5. SkillRunnerToolFailureCounter — name not updated

The class SkillRunnerToolFailureCounter and SkillRunnerStreamingReplySink still carry the SkillRunner prefix while everything else has been renamed to SkillDefinition/SkillExecution. The CI guard (skill_runner_lifecycle_split_guard.sh) only checks for SkillRunnerGAgent class, so these are technically allowed, but the naming inconsistency makes the codebase harder to navigate. Consider renaming to SkillExecutionToolFailureCounter etc. in a follow-up.


Minor

  • SkillExecutionState field 9 is skipped (jumps from 8 to 10). This is fine for protobuf but worth a comment in the proto for future readers.
  • The SkillDefinitionDefaults.GenerateExecutionId method uses Guid.NewGuid():N which is non-deterministic. If HandleTriggerAsync is retried (e.g. at-least-once delivery), a new execution actor is created each time. This may be intentional but should be documented.

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 28, 2026

Review

The split of SkillRunnerGAgent into SkillDefinitionGAgent (long-lived fact owner) + SkillExecutionGAgent (session-scoped) is architecturally sound and well-aligned with the CLAUDE.md rules. Proto design is clean with proper reserved fields. Legacy migration via aliases + compat handlers is thorough. Good CI guard.

Issues to fix:

1. HandleTriggerAsync swallows spawn failures and still advances the schedule

SkillDefinitionGAgent.cs:124-168: If actorRuntime.CreateAsync or dispatchPort.DispatchAsync throws, the catch block logs a warning and falls through to ScheduleNextRunAsync. This means if the execution actor can never be created (e.g., DI misconfiguration), the definition will keep firing on schedule, logging warnings, and producing no output indefinitely — with no error surfaced to the catalog or user.

Recommendation: Either persist a SkillDefinitionExecutionFailedEvent (so the catalog reflects the error), or re-throw after logging so the durable callback system can surface the failure. At minimum, update the registry status to StatusError in the catch block.

2. _retryLease should not be mutable instance state on the actor

SkillExecutionGAgent.cs:26: The field _retryLease is written in ScheduleRetryAsync and CancelRetryLeaseAsync during event handling. While safe due to single-threaded actor turns, this is a durable callback handle that can be restored from state on reactivation. If the execution actor is passivated and reactivated mid-retry, _retryLease will be null and CancelRetryLeaseAsync will silently skip cancellation — the retry will still fire (durable callback survives passivation), but the agent won't be able to cancel it if it completes before the retry fires.

Recommendation: Either (a) don't store the lease at all (let the durable callback fire and handle it idempotently in HandleRetryAsync), or (b) persist the lease in actor state so it survives passivation.

3. HandleEnableAsync skips the event when already enabled

SkillDefinitionGAgent.cs:197-210: If State.Enabled is already true, no SkillDefinitionEnabledEvent is persisted but ScheduleNextRunAsync and UpdateRegistryStatusAsync are still called. This is subtly different from HandleDisableAsync which always persists an event. The asymmetry is intentional (idempotent enable), but worth documenting with a comment since it could look like an oversight.

4. UpdateRegistryStatusAsync always sends ErrorCount = 0

SkillDefinitionGAgent.cs:256-264: This method hardcodes ErrorCount = 0 and LastError = string.Empty. It's called from UpsertRegistryAsync (after initialization) and HandleDisableAsync/HandleEnableAsync. While the execution agent's UpdateRegistryExecutionAsync corrects the error count after execution, there's a window after HandleDisableAsync where the error count from a previous failed execution gets wiped to 0. Consider preserving the existing error count or only resetting it on HandleEnableAsync.

5. Minor: test file still named SkillRunnerGAgentTests.cs

The test class was renamed to SkillDefinitionGAgentTests but the file is still SkillRunnerGAgentTests.cs. Not a blocker but inconsistent with the naming convention.


Otherwise the design is solid. The execution config snapshot pattern (copy definition config into StartSkillExecutionCommand → persist in SkillExecutionStartedEvent) ensures execution state is self-contained and survives replay. The SkillRunnerToolFailureCounter being per-agent-instance via BuildToolMiddlewareChain is correct — no cross-execution contamination. The CI guard script is well-structured.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 91b545e309

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +151 to +153
await UpdateRegistryExecutionAsync(
SkillDefinitionDefaults.StatusRunning,
Timestamp.FromDateTimeOffset(now),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve disabled status when execution completes

This execution-path update unconditionally writes StatusRunning for the definition entry, so a long-running execution can overwrite a user-issued disable that happened while the run was in flight. In the split design, SkillDefinitionGAgent and SkillExecutionGAgent now run concurrently, so this race is real: disable sets catalog status to disabled, then completion flips it back to running even though scheduling remains off. That leaves lifecycle state inconsistent and can mislead run_agent/status flows that key off catalog status.

Useful? React with 👍 / 👎.

Comment on lines +506 to +510
var command = new UserAgentCatalogExecutionUpdateCommand
{
AgentId = definitionId, Status = status,
LastRunAt = lastRunAt,
ErrorCount = errorCount, LastError = lastError ?? string.Empty,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep next scheduled run when updating execution state

This command omits NextRunAt, but UserAgentCatalogGAgent.ApplyExecutionUpdated assigns existing.NextRunAt = evt.NextRunAt directly; because the field is unset here, each execution update clears the catalog's next-fire timestamp. After the first run, agent status/read models lose next_scheduled_run, which breaks schedule visibility and any polling logic that relies on that field.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: SkillExecutionState.scheduled_at never populated

SkillExecutionGAgent.cs HandleStartAsync receives StartSkillExecutionCommand.scheduled_at (set by the definition actor at SkillDefinitionGAgent.cs:133) but the value is discarded:

  1. SkillExecutionStartedEvent has no scheduled_at field
  2. ApplyStarted (SkillExecutionGAgent.cs:515) does not set next.ScheduledAt

Result: SkillExecutionState.scheduled_at stays null forever. Either propagate the field through the event and applier, or add reserved 2 to SkillExecutionState to match how SkillDefinitionState handled fields 8/10/11/12.

private async Task EnsureSkillRunnerActorAsync(string agentId, CancellationToken ct)
private async Task EnsureDefinitionActorAsync(string agentId, CancellationToken ct)
{
_ = await _actorRuntime.GetAsync(agentId)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing scheduled agents were created with the Orleans runtime state AgentTypeName set to the old SkillRunnerGAgent assembly-qualified type. GetAsync(agentId) only checks that this stored type name is non-empty, so it will return a non-null actor here, but when the grain later handles the envelope it calls Type.GetType on the deleted SkillRunnerGAgent type, fails initialization, and drops the command. That means existing /daily agents cannot be triggered/disabled/enabled after this deploy. Please add an actor-type migration/alias path before treating existing actors as valid SkillDefinitionGAgents, or keep a compatible old type that forwards to the new definition actor.

Enabled = evt.Enabled,
ScopeId = evt.ScopeId,
ProviderName = evt.ProviderName,
Model = evt.Model,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This legacy replay path drops the optional LLM tuning fields from persisted SkillRunnerInitializedEvents (temperature, max_tokens, max_tool_rounds, max_history_messages). Existing agents that customized these will silently replay into SkillDefinitionState with temperature/max_tokens unset and max tool/history reset to defaults, changing runtime behavior after the split. Please copy the Has* fields here (and in HandleLegacyInitializeAsync) the same way HandleInitializeAsync does.

Copy link
Copy Markdown
Contributor Author

@eanzhao eanzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

整体架构拆分方向正确,生命周期分离清晰,proto 定义和 legacy alias 的迁移设计合理。发现几个问题需要确认或修改:

1. SkillDefinitionGAgent.HandleTriggerAsync 中 spawn execution 失败的异常被静默吞掉

SkillDefinitionGAgent.csHandleTriggerAsync 方法中,try-catchCreateAsync/DispatchAsync 失败时只记录 warning,不会向调用方或 registry 报告失败。这意味着:

  • 用户手动 trigger 后,如果 execution actor 创建失败,没有任何错误反馈
  • registry 中的 last_error / error_count 不会更新
  • 用户无法感知到持续失败

建议:在 catch 中至少更新 registry status 为 error,或者 persist 一个事件让 projection 可以观察到失败。

2. SkillExecutionGAgent.HandleRetryAsync 没有更新 State.RetryAttempts

SkillExecutionGAgent.csHandleRetryAsync 使用 command.RetryAttempt 做判断,但 State.RetryAttempts 在 retry 链中不会被更新(ApplyStarted 没有设置它,retry 路径也没有 persist 事件更新它)。当前 MaxRetryAttempts = 1 所以这个 bug 不会触发,但如果将来增加重试次数,会导致基于 state 的判断出错。建议增加一个 SkillExecutionRetryStartedEvent 来更新 RetryAttempts

3. ChannelArchitectureTestsSkillDefinitionGAgent 的 proactiveCallerPatterns 注释已过时

ChannelArchitectureTests.csSkillDefinitionGAgent 被加入 proactiveCallerPatterns,但它现在不再直接调用任何 channel/Conversation 方法(只负责创建 SkillExecutionGAgent)。注释描述 "must dispatch through ConversationGAgent command envelopes rather than directly invoking IChannelOutboundPort.ContinueConversationAsync" 对 SkillDefinitionGAgent 已不准确。建议更新注释说明,或考虑是否需要将其保留在该列表中。

4. SkillExecutionState proto 中 field 9 跳过缺少说明

skill_execution.protoSkillExecutionState 从 field 8 (status) 跳到 field 10 (skill_content),跳过了 field 9。虽然 protobuf 允许,但建议加一行注释说明 field 9 的预留意图,避免后续维护者困惑。

5. 缺少 SkillExecutionGAgent 核心逻辑(retry、failure)的独立单元测试

当前 SkillDefinitionGAgentTests.cs 主要测试了 SendOutputAsync,但 HandleStartAsync 的 catch-retry 路径、HandleRetryAsync 的最终失败路径等核心逻辑缺少独立的单元测试。虽然这些路径在旧版中也没有被直接测试,但拆分后 SkillExecutionGAgent 是一个新的核心 actor,建议至少增加 retry 调度和失败处理的 mock 测试(无需真实 LLM provider)。


其他方面(proto reserved 字段、legacy alias、backward-compat handlers、CI guard)都处理得很好,没有问题。

@eanzhao
Copy link
Copy Markdown
Contributor Author

eanzhao commented Apr 28, 2026

Blocking on #498

This PR deletes Aevatar.GAgents.Scheduled.SkillRunnerGAgent and replaces it
with SkillDefinitionGAgent + SkillExecutionGAgent. Live SkillRunner actors
in production have RuntimeActorGrainState.AgentTypeName = "Aevatar.GAgents.Scheduled.SkillRunnerGAgent" written by every prior daily
activation. After deployment, RuntimeActorGrain.OnActivateAsync calls
ResolveAgentTypeType.GetType + AppDomain.GetAssemblies() and finds
nothing — the class no longer exists. Activation fails; subsequent envelopes
are dropped (Dropping envelope for actor {ActorId}: initialization failed).

SkillRunnerLegacyAliases.cs ([LegacyClrTypeName] / [LegacyProtoFullName])
only handles payload codec compatibility via
ProtobufContractCompatibility.IsCompatibleClrTypeName — it does not
rewrite the runtime actor implementation CLR type stored in
RuntimeActorGrainState.AgentTypeName. This is exactly the activation-failure
mode PR #495 fixes for the deleted ChannelRuntime types, but applied to live
business actors (every scheduled daily / social / report skill) instead of
retired ones.

Per #498, the correct resolution is the stable AgentKind + IAgentKindRegistry
identity model, not a one-off retirement spec or preserve/migrate runtime-type
rewrite. Once #498 Phase 1 lands in dev, this PR ships as a kind-alias
registration on SkillDefinitionGAgent (e.g. [LegacyAgentKind("skill-runner")])
with zero production state mutation and natural mixed-version rollout safety.

Do not merge until #498 Phase 1 is in dev and this PR is rebased on top of
it.
Architecture-correct path is the long-term direction; we are not
shortcutting it with a one-time governance patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor(skill-runner): split SkillRunnerGAgent into SkillDefinitionGAgent + SkillExecutionGAgent

1 participant