
Fix retired ChannelRuntime startup cleanup #495

Merged
eanzhao merged 11 commits into dev from fix/2026-04-28_retired-channelruntime-cleanup
Apr 28, 2026

Conversation


@eanzhao eanzhao commented Apr 28, 2026

Problem

Persisted runtime actors from the retired Aevatar.GAgents.ChannelRuntime assembly can be activated during pod startup before the new registration projection startup path rebuilds them. Orleans then fails to resolve old agent/materialization-context type names and the host can abort startup.

Solution

  • Add a maintenance-only event-store reset capability and implement it for the InMemory, File, and Garnet event stores (a sketch of the reset surface follows this list).
  • Add RetiredChannelRuntimeActorCleanupHostedService in Mainnet host composition. It takes a one-time event-store marker lease, probes persisted actor runtime types, removes retired root/projection actors, removes stale relays, clears stale read models, resets retired actor streams, and writes a completion marker.
  • Register the cleanup hosted service before ChannelRuntime, Device, and Scheduled projection startup services.
  • Document the upgrade/runbook contract for operators.
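A minimal sketch of the maintenance-only reset surface described in the first bullet; only IEventStoreMaintenance and ResetStreamAsync are names taken from this PR, the namespace and exact signature are assumptions.

// Sketch only: namespace and signature details are assumptions, not the PR's code.
using System.Threading;
using System.Threading.Tasks;

namespace Aevatar.Foundation.Abstractions.Maintenance // assumed location
{
    // Maintenance-only surface: removes every persisted event (plus any
    // version/index bookkeeping) for one actor's stream. The PR implements
    // this for the InMemory, File, and Garnet event stores.
    public interface IEventStoreMaintenance
    {
        Task ResetStreamAsync(string actorId, CancellationToken cancellationToken = default);
    }
}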

Impact Paths

  • src/Aevatar.Mainnet.Host.Api/Hosting/Migration/*
  • src/Aevatar.Foundation.*/*EventStore*
  • src/Aevatar.Mainnet.Host.Api/Hosting/MainnetHostBuilderExtensions.cs
  • docs/operations/2026-04-28-retired-channelruntime-cleanup-runbook.md

Validation

  • dotnet test test/Aevatar.Foundation.Core.Tests/Aevatar.Foundation.Core.Tests.csproj --nologo --filter "ResetStreamAsync"
  • dotnet test test/Aevatar.Hosting.Tests/Aevatar.Hosting.Tests.csproj --nologo --filter "RetiredChannelRuntimeActorCleanupHostedServiceTests|AddAevatarMainnetHost_ShouldRunRetiredChannelRuntimeCleanup"
  • bash tools/ci/test_stability_guards.sh
  • bash tools/ci/query_projection_priming_guard.sh


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0776a16a69


Comment on lines +141 to +145
var runtimeTypeName = await _typeProbe
.GetRuntimeAgentTypeNameAsync(target.ActorId, ct)
.ConfigureAwait(false);
if (!MatchesRetiredType(target, runtimeTypeName))
return;

P1: Continue reset path on reruns after partial cleanup

This early return makes the cleanup non-idempotent across crashes: if a pod destroys a retired actor but dies before ResetStreamAsync or lease completion, the next pod will read a now-empty runtime type, hit this return, and permanently skip stream reset for that actor. That leaves stale committed events in the store, so recreated actors can replay old ChannelRuntime history/state even though the migration marker eventually completes.



codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 77.26161% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.06%. Comparing base (dff5b4b) to head (eaa0abf).
⚠️ Report is 12 commits behind head on dev.

Files with missing lines Patch % Lines
...ng/Maintenance/RetiredActorCleanupHostedService.cs 72.93% 55 Missing and 17 partials ⚠️
...istence.Implementations.Garnet/GarnetEventStore.cs 0.00% 13 Missing ⚠️
...ion.Abstractions/Maintenance/RetiredActorTarget.cs 94.64% 1 Missing and 2 partials ⚠️
....Hosting/Maintenance/RetiredActorCleanupOptions.cs 93.54% 0 Missing and 2 partials ⚠️
...r.Foundation.Runtime/Persistence/FileEventStore.cs 86.66% 1 Missing and 1 partial ⚠️
...ation.Abstractions/Maintenance/RetiredActorSpec.cs 80.00% 1 Missing ⚠️
@@            Coverage Diff             @@
##              dev     #495      +/-   ##
==========================================
+ Coverage   71.04%   71.06%   +0.02%     
==========================================
  Files        1230     1235       +5     
  Lines       89062    89463     +401     
  Branches    11654    11705      +51     
==========================================
+ Hits        63270    63573     +303     
- Misses      21256    21332      +76     
- Partials     4536     4558      +22     
Flag Coverage Δ
ci 71.06% <77.26%> (+0.02%) ⬆️


Files with missing lines Coverage Δ
.../RetiredActorCleanupServiceCollectionExtensions.cs 100.00% <100.00%> (ø)
...DependencyInjection/ServiceCollectionExtensions.cs 98.36% <100.00%> (+0.11%) ⬆️
...DependencyInjection/ServiceCollectionExtensions.cs 92.48% <100.00%> (+0.11%) ⬆️
...mplementations.Orleans/Grains/RuntimeActorGrain.cs 62.00% <100.00%> (-0.65%) ⬇️
...DependencyInjection/ServiceCollectionExtensions.cs 62.50% <100.00%> (+3.40%) ⬆️
...undation.Runtime/Persistence/InMemoryEventStore.cs 93.65% <100.00%> (+0.66%) ⬆️
...t.Host.Api/Hosting/MainnetHostBuilderExtensions.cs 83.33% <100.00%> (+0.21%) ⬆️
...ation.Abstractions/Maintenance/RetiredActorSpec.cs 80.00% <80.00%> (ø)
....Hosting/Maintenance/RetiredActorCleanupOptions.cs 93.54% <93.54%> (ø)
...r.Foundation.Runtime/Persistence/FileEventStore.cs 72.38% <86.66%> (+1.09%) ⬆️
... and 3 more

... and 4 files with indirect coverage changes



@eanzhao eanzhao left a comment


Reviewed the cleanup path. I found one blocker for the old-agent cleanup claim and left an inline comment.

"Aevatar.GAgents.ChannelRuntime.AgentRegistryMaterializationContext",
],
SourceStreamId: "agent-registry-store"),
];
Contributor Author

This target list still cannot clean up the old user agents. It only covers the 3 well-known store actors and the 3 projection scope actors, but historical Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent / WorkflowAgentGAgent actors generated with skill-runner-* / workflow-agent-* IDs are never enumerated, never DestroyAsync'd or ResetStreamAsync'd, and PurgeActorAsync never clears their durable callbacks. As a result, after the catalog/read model is wiped, the old agent grains still keep their retired AgentTypeName; when a reminder later fires or a command port is called with an old agent_id, GetAsync returns them as already-initialized actors and dispatch still lands on the old runtime type that cannot be resolved. Before deleting/resetting the catalog, the old agent ids need to be read from the old catalog and those agents destroyed + callback-purged + event-stream-reset, or the migration should state explicitly that it only cleans stores/projections and does not guarantee cleanup of old agents.


eanzhao commented Apr 28, 2026

Review: Retired ChannelRuntime startup cleanup

Overall direction is correct — probe → match retired type → destroy → reset stream is the right flow. A few issues that need addressing before landing:

1. Read model cleanup is unreachable at startup

RetiredChannelRuntimeActorCleanupHostedService runs before ChannelBotRegistrationStartupService, DeviceRegistrationStartupService, and UserAgentCatalogStartupService (validated in MainnetHostCompositionTests). If those startup services are responsible for initializing the projection store (Elasticsearch index creation, etc.), the IProjectionDocumentReader / IProjectionWriteDispatcher queries in CleanupReadModelsAsync will either:

  • Return empty (store not ready → index doesn't exist → query returns nothing), silently skipping cleanup, OR
  • Throw (store throws on uninitialized index), which propagates uncaught out of StartAsync and crashes pod startup

The second case is the dangerous one — StartAsync only catches OperationCanceledException, so any projection store exception kills the host.

Recommendation: Either:

  • Wrap CleanupReadModelsAsync in a try-catch that logs a warning and continues (sketched after this list), or
  • Defer read model cleanup to after projection startup (e.g., a separate second-phase hosted service), or
  • Default CleanupReadModels to false and make it opt-in
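A hedged fragment of the first option above; the later commits adopt this shape as CleanupReadModelsBestEffortAsync, but the wrapped method, the _logger field, and the exact signatures here are assumptions.

// Sketch of the "log a warning and continue" option. CleanupReadModelsAsync,
// _logger, and RetiredActorTarget are assumed members/types of the hosted service.
private async Task CleanupReadModelsBestEffortAsync(RetiredActorTarget target, CancellationToken ct)
{
    try
    {
        await CleanupReadModelsAsync(target, ct).ConfigureAwait(false);
    }
    catch (OperationCanceledException)
    {
        throw; // host shutdown/cancellation must still propagate
    }
    catch (Exception ex)
    {
        // Projection store not initialized yet, or transiently unavailable:
        // skip read-model cleanup instead of failing host startup.
        _logger.LogWarning(ex, "Read model cleanup skipped for {ActorId}", target.ActorId);
    }
}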

2. Read model cleanup has zero test coverage

Both test cases (StartAsync_ShouldDestroyRetiredActors_RemoveRelays_AndResetEventStreams and StartAsync_ShouldNotDestroyActor_WhenRuntimeTypeIsCurrent) pass an empty ServiceCollection().BuildServiceProvider(). No projection IProjectionDocumentReader/IProjectionWriteDispatcher are registered, so CleanupReadModelsAsync silently does nothing. The read-model cleanup path is completely unexercised.

Given issue #1 above, this path needs a test that:

  • Registers real (or stub) projection readers/writers
  • Verifies matching documents are deleted
  • Verifies non-matching documents are untouched
  • Verifies the service doesn't crash when the store throws

3. Garnet ResetStreamAsync is untested (0% patch coverage)

GarnetEventStore.ResetStreamAsync has 13 uncovered lines. There are no tests for the Garnet implementation of IEventStoreMaintenance. The InMemory and File stores are tested, but Garnet is the production path.

4. Type probe triggers noisy error logs per cleaned actor

RuntimeActorGrain.OnActivateAsync calls InitializeAgentInternalAsync, which calls ResolveAgentType for the retired type name. This logs:

_logger.LogError("Unable to resolve agent type {AgentTypeName}", agentTypeName);

For each of the 6 targets, the probe triggers an ERROR log during cleanup. This is acceptable for a one-time migration but should be documented explicitly in the runbook so operators don't mistake it for a problem.

5. MatchesRetiredType uses string.Contains — could false-positive

return target.RetiredTypeTokens.Any(token =>
    runtimeTypeName.Contains(token, StringComparison.Ordinal));

Contains with Ordinal comparison will match any substring. Given Aevatar.GAgents.ChannelRuntime.ChannelBotRegistrationGAgent as the token, the risk is low in practice, but consider whether StartsWith or exact assembly-qualified name matching (after stripping the assembly suffix) would be more precise.
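A hedged sketch of one stricter alternative: require a type-name boundary (bracket, comma, space, or the string edge) on both sides of the token, so a token never matches as a bare substring. Later commits add a similar boundary-aware ContainsTypeNameToken check, though this exact body is illustrative.

// Sketch: match the retired-type token only when it is delimited by
// type-name boundaries, not when it appears as an arbitrary substring.
private static bool ContainsTypeNameToken(string runtimeTypeName, string token)
{
    var index = 0;
    while ((index = runtimeTypeName.IndexOf(token, index, StringComparison.Ordinal)) >= 0)
    {
        var before = index == 0 ? ' ' : runtimeTypeName[index - 1];
        var afterIndex = index + token.Length;
        var after = afterIndex >= runtimeTypeName.Length ? ' ' : runtimeTypeName[afterIndex];

        if (IsTypeNameBoundary(before) && IsTypeNameBoundary(after))
            return true; // e.g. "...SkillRunnerGAgent, Assembly" or "[[...MaterializationContext]]"

        index += 1; // keep scanning past partial matches such as "...GAgentProxy"
    }

    return false;
}

private static bool IsTypeNameBoundary(char value) =>
    value is '[' or ']' or ',' or ' ';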


eanzhao commented Apr 28, 2026

Review Summary

The approach is sound for the startup-blocking problem (projection scopes, registration stores). The probe → destroy → reset → cleanup chain is correct. But there are gaps in coverage.


Issue 1: ChannelUserGAgent (and other dynamic actors) are NOT cleaned up

The Targets array is a hardcoded list of 6 known actor IDs. But the old Aevatar.GAgents.ChannelRuntime namespace had 7 GAgent classes, 3 of which have dynamic IDs and are NOT in the list:

Actor ID Pattern In Cleanup?
ChannelBotRegistrationGAgent channel-bot-registration-store Yes
DeviceRegistrationGAgent device-registration-store Yes
UserAgentCatalogGAgent agent-registry-store Yes
AgentRegistryGAgent agent-registry-store Yes
ChannelUserGAgent channel-user-{platform}-{registrationId}-{senderId} No
SkillRunnerGAgent dynamic No
WorkflowAgentGAgent dynamic No

ChannelUserGAgent is the most significant — it is a per-user, per-channel actor with potentially many instances in production. Each one has event-sourced state (ChannelUserState) persisted in both Orleans grain storage and the Garnet event store.

These won't cause startup failure (they are not activated at startup), but they will fail when a user interacts with an old channel session, and they leave orphaned data in Redis forever.

Recommendation: Add a Redis SCAN-based discovery step for the channel-user-* prefix, or document that these will be garbage-collected through natural TTL / manual intervention.
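A hedged sketch of the SCAN-based option using StackExchange.Redis; the "channel-user-*" key pattern and the idea that the grain-storage keys contain that prefix are assumptions that would have to be checked against the actual Orleans Redis storage key layout before anything is deleted.

// Sketch only: enumerate candidate grain-storage keys for per-user channel
// actors with SCAN (cursor-based, non-blocking) so operators can review them
// before deciding on deletion. The key pattern is an assumption.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class ChannelUserKeyScan
{
    public static async Task<int> ListCandidateKeysAsync(string connectionString)
    {
        using var mux = await ConnectionMultiplexer.ConnectAsync(connectionString);
        var count = 0;

        foreach (var endpoint in mux.GetEndPoints())
        {
            var server = mux.GetServer(endpoint);
            if (!server.IsConnected || server.IsReplica)
                continue;

            await foreach (var key in server.KeysAsync(pattern: "*channel-user-*", pageSize: 500))
            {
                count++;
                Console.WriteLine(key); // review only; deletion stays a manual step
            }
        }

        return count;
    }
}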


Issue 2: The PR description claims "removes retired root/projection actors" — scope is narrower

The PR description says it "removes retired root/projection actors, removes stale relays, clears stale read models, resets retired actor streams." This is accurate for the 6 hardcoded targets. But operators reading the runbook might assume ALL old ChannelRuntime actors are cleaned up, which is not the case.

The runbook should explicitly state that only the 6 well-known projection/registration actors are covered, and that dynamic per-user actors (ChannelUserGAgent) are out of scope.


Issue 3: Orleans grain state key is emptied, not deleted

PurgeAsync() writes zeroed-out state back to Redis rather than deleting the key. This is functionally correct (the AgentTypeName is null, so Orleans will not try to instantiate the old type), but leaves empty stub keys in Redis. Minor concern — worth noting for operators who want a clean Redis.


What works correctly

  • IActorTypeProbe reads persisted AgentTypeName from grain state without needing to instantiate the agent — correct approach
  • DestroyAsync → PurgeAsync clears AgentTypeName and writes to storage — prevents future activation failures
  • ResetStreamAsync deletes all 3 Garnet keys per agent (version, index, data) — complete
  • RemoveRelayAsync properly cleans up StreamTopologyGrain persisted state
  • Lease mechanism with stale detection is well-designed for multi-pod safety
  • Ordering guarantee: cleanup runs before ChannelRuntime/Device/Scheduled projection startup services (verified by test)
  • Read model cleanup uses ActorId filter — correctly scoped to the target actor

Verdict

The PR solves the startup-blocking problem — the 6 projection/registration actors that cause pod startup abort will be cleaned up correctly. But it does not cover ChannelUserGAgent and other dynamic actors, which will remain orphaned. Whether this is acceptable depends on whether production has active ChannelUserGAgent instances from before the namespace split.


eanzhao commented Apr 28, 2026

Review Summary

Can it solve the problem?

Yes, but the coverage is limited. This PR only cleans up the 6 hardcoded root/store actors; there is no cleanup logic for the child actors those actors created dynamically (such as the actual channel bot instances). If the cluster holds many dynamic actors, they will still hit type-resolution failures on later Orleans activations.

Key issues

  1. The hardcoded Targets list is too narrow
    Only six actor IDs are listed. The ChannelRuntime business logic creates many dynamic actors (e.g. one agent per channel bot), and those are never cleaned up. Suggestion: either add discovery/cleanup logic for dynamic actors, or state clearly in the runbook that operators must handle the leftover dynamic actors manually.

  2. The retired-type check uses loose string matching
    The matching is too permissive. If the full name of some future type happens to contain one of these tokens, it gets destroyed by mistake. Suggestion: switch to exact type-name matching, or at least anchor the token with the assembly separator (comma).

  3. The marker stream and the actor streams share the same event store
    The marker uses a reserved stream ID; if that stream is cleaned up by other logic, or the name collides, progress is lost. More importantly, the marker has no TTL/expiry: if a pod crashes mid-cleanup without timing out, other pods have to wait out the in-progress timeout (5 minutes by default) before taking over, which blocks startup in a high-availability setup.

  4. No cleanup of child/orphaned actors
    Only the root actors and projection scopes are cleaned; child actors or links those actors may have created are not handled. If Orleans later tries to activate these orphaned children, activation still fails because the parent type no longer exists.

  5. Exception handling and partial failure
    The steps (RemoveRelay → CleanupReadModels → Destroy → ResetStream) have no transactional guarantee between them. If RemoveRelay or Destroy succeeds but ResetStream fails, the actor is already destroyed while its stream remains, and the next startup may try to activate a stateless leftover actor.

  6. Insufficient test coverage
    The existing tests verify the happy path and the skip-current-type case, but are missing:

    • lease contention / timeout scenarios
    • retry behaviour after a partial failure
    • mutual exclusion when multiple pods start concurrently
    • behaviour after the marker stream is cleaned up unexpectedly

Recommendations

  • If the goal is to fully resolve startup failures after retiring ChannelRuntime, add scanning and cleanup of dynamic actors (or at least provide an operations runbook for manual cleanup).
  • The current PR can be merged as a mitigation, but the runbook should state explicitly, as a known limitation, that dynamically created channel bot agents are not cleaned up.


eanzhao commented Apr 28, 2026

Review: this PR cannot reliably clean up old agents

I traced all the runtime dependencies carefully and found several fundamental problems:


P0: probing retired grains via IActorTypeProbe itself triggers the activation failure

OrleansActorTypeProbe.GetRuntimeAgentTypeNameAsync calls grain.GetAgentTypeNameAsync(), which triggers a full Orleans grain activation. In RuntimeActorGrain.OnActivateAsync:

  • it subscribes to the self-stream first (line 64) and only then tries to resolve the agent type (lines 66-67)
  • when InitializeAgentInternalAsync cannot resolve Aevatar.GAgents.ChannelRuntime.*, the grain is already activated and carries a zombie stream subscription

In other words, the check itself causes exactly the activation failure this PR is meant to prevent. If the retired type's assembly is completely absent on Mainnet, a failing Type.GetType may make grain activation throw outright and the cleanup hosted service gets stuck entirely.

Suggestion: read the AgentTypeName from the grain state blob directly at the storage layer (Redis/Garnet), bypassing Orleans activation.


P0: a rerun after a crash permanently skips the event stream reset (partially flagged by Codecov)

In CleanupTargetAsync:

var runtimeTypeName = await _typeProbe.GetRuntimeAgentTypeNameAsync(target.ActorId, ct);
if (!MatchesRetiredType(target, runtimeTypeName))
    return;  // ← early return

If the pod crashes after DestroyAsync but before lease completion:

  1. the next pod re-probes → PurgeAsync has already nulled AgentTypeName → runtimeTypeName is empty
  2. MatchesRetiredType returns false → ResetStreamAsync is skipped permanently
  3. the old events remain in the event store
  4. the migration marker eventually completes and no further cleanup runs

This means the orphaned event streams of partially cleaned actors stay around forever. If the new projection startup creates an actor with the same ID, it may replay old ChannelRuntime history.

Suggestion: since this is a one-time migration, run ResetStreamAsync unconditionally for every target, or record per-target completion status in the marker stream. A fragment sketching the unconditional-reset option follows.
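A hedged fragment of the first option (unconditional reset for a one-shot migration); Targets, DestroyActorAsync, and _eventStoreMaintenance are assumed members, only the probe and ResetStreamAsync names come from the PR.

// Sketch: for a one-time migration, reset every target's event stream even when
// the probe no longer reports a retired type (covers the crashed-midway case).
foreach (var target in Targets)
{
    var runtimeTypeName = await _typeProbe
        .GetRuntimeAgentTypeNameAsync(target.ActorId, ct)
        .ConfigureAwait(false);

    if (MatchesRetiredType(target, runtimeTypeName))
        await DestroyActorAsync(target, ct).ConfigureAwait(false); // destroy only confirmed retired actors

    // Reset unconditionally so a partially cleaned actor cannot replay old events.
    await _eventStoreMaintenance.ResetStreamAsync(target.ActorId, ct).ConfigureAwait(false);
}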


P1: DestroyAsync only writes empty state, it does not delete the storage record

RuntimeActorGrain.PurgeAsync calls WriteStateAsync() with emptied fields but never ClearStateAsync(). The grain state blob still exists in Redis (AgentId="", AgentTypeName=null).

Combined with the [ImplicitStreamSubscription("aevatar.actor.events")] on RuntimeActorGrain, any stale message arriving on the actor's stream makes Orleans re-activate this grain → an empty zombie.

Suggestion: use ClearStateAsync() in the purge path, or delete the grain storage explicitly from the cleanup service.


P1: the destroyed actor's own StreamTopologyGrain is not cleaned up

Cleanup only removes relay bindings where the destroyed actor is the target (via RemoveRelayAsync); it never clears the StreamTopologyGrain keyed by the destroyed actor's ID. If the destroyed actor has outgoing forwarding bindings, those remain.


P2: read-model cleanup queries 4 document types for every target

CleanupReadModelsAsync queries ChannelBotRegistrationDocument, DeviceRegistrationDocument, UserAgentCatalogDocument, and UserAgentCatalogNyxCredentialDocument for every target actor. For example, agent-registry-store is queried against ChannelBotRegistrationDocument, which can never match. Not a correctness problem, but wasteful and confusing.


Summary

The current "activate → check → destroy" flow has a fundamental contradiction: the check itself triggers the problem. Suggested changes:

  1. read the grain state directly from the storage layer to decide whether cleanup is needed (bypass activation)
  2. delete the grain storage record (rather than writing empty state)
  3. make the one-shot cleanup reset every target's event stream unconditionally
  4. clear each destroyed actor's StreamTopologyGrain
  5. record per-target completion status in the marker stream to support crash recovery


eanzhao commented Apr 28, 2026

Fix Review

The new commit (4ba78d05) addresses SkillRunnerGAgent and WorkflowAgentGAgent — good approach reading the catalog event stream to discover dynamic actor IDs instead of hardcoding.

What is fixed

  • SkillRunnerGAgent / WorkflowAgentGAgent — dynamically discovered from agent-registry-store events via DiscoverRetiredCatalogUserAgentTargetsAsync. Actors matching skill-runner-* or workflow-agent-* prefixes are cleaned up before the catalog itself is reset. ✅
  • ResetWhenRuntimeTypeUnavailable — handles partial cleanup from a crashed previous pod (actor destroyed but event stream remains). ✅
  • Read model cleanup is now best-effort with try-catch — a projection store failure no longer blocks actor destruction. ✅
  • Type matching improved with ContainsTypeNameToken boundary checks — prevents false positives on partial name matches. ✅
  • Runbook updated to document the discovery mechanism and out-of-scope actors. ✅
  • Good test coverage: catalog discovery, current-type skip, already-destroyed recovery, read model cleanup, read model failure resilience. ✅

Remaining gap: ChannelUserGAgent

ChannelUserGAgent (channel-user-{platform}-{registrationId}-{senderId}) is still not covered. It is not in the catalog event stream — it was a per-conversation actor created on-demand, not through the agent registry.

Whether this matters depends on production state:

  • If no ChannelUserGAgent instances with old type names exist in Redis → no issue
  • If they do exist → they will fail when a user interacts with an old channel session, and leave orphaned grain state + event streams in Redis forever

This is not a startup blocker (these actors are activated on-demand, not at startup), so it is acceptable to defer. But the runbook should explicitly call this out so operators know to check.

Suggested runbook addition

## Out of scope

`ChannelUserGAgent` instances (keyed `channel-user-*`) were per-conversation actors
not registered in the agent catalog. They are not discovered by this cleanup service.

If production had active ChannelUserGAgent instances before the namespace split,
their grain state and event streams remain in Redis. They will fail on user
interaction with "Unable to resolve agent type Aevatar.GAgents.ChannelRuntime.ChannelUserGAgent".

To clean these up manually, scan Redis for keys matching the Orleans grain storage
pattern with `AgentTypeName` containing `Aevatar.GAgents.ChannelRuntime.ChannelUserGAgent`,
then delete them along with their event store keys.

Verdict

LGTM for the startup-blocking problem. The ChannelUserGAgent gap is a deferred cleanup concern, not a blocker, but should be documented.


eanzhao commented Apr 28, 2026

Follow-up review

All four issues from the previous review are addressed:

Concern Status Fix
Read model cleanup crashing pod Fixed CleanupReadModelsBestEffortAsync with try-catch (re-throws only OperationCanceledException)
Read model cleanup zero test coverage Fixed ShouldDeleteMatchingReadModels + ShouldContinue_WhenReadModelCleanupThrows
Garnet ResetStreamAsync untested Fixed Integrated into GarnetEventStore_ShouldAppendReplayAndCompactEvents
MatchesRetiredType Contains false-positive risk Fixed Replaced with ContainsTypeNameToken using IsTypeNameBoundary ('[', ']', ',', ' ')

New additions look correct:

  • Catalog user agent discovery from events — SkillRunnerGAgent / WorkflowAgentGAgent discovered and cleaned before the catalog itself
  • ResetWhenRuntimeTypeUnavailable handles partially-cleaned state (grain state gone but event stream persists)
  • ShouldResetStream_WhenActorStateWasAlreadyDestroyed tests that case

One minor observation: DiscoverCatalogUserAgentIdsAsync replays all catalog events. For a long-running catalog with many agents, this could be heavy — but it's a one-shot migration so acceptable.


eanzhao commented Apr 28, 2026

Review of fix (commit 3/3)

Quite a lot was fixed and the core logic is clearly better. Going through the previous review point by point:


P0: rerun after partial cleanup — ✅ fixed

CleanupTargetAsync now handles the case where runtimeTypeName is null:

if (!matchesRetiredRuntimeType)
{
    if (!string.IsNullOrWhiteSpace(runtimeTypeName))
        return;
    shouldContinueReset = target.ResetWhenRuntimeTypeUnavailable &&
                          await HasEventStreamAsync(target.ActorId, ct);
    if (!shouldContinueReset)
        return;
}

AgentTypeName 已被 PurgeAsync 清空但 event stream 还在时,会继续执行 ResetStreamAsync。这解决了崩溃重跑跳过 stream reset 的问题。


P0: IActorTypeProbe 激活 retired grain — ⚠️ 未修,但加了缓解

GetRuntimeAgentTypeNameAsync 仍然会触发 grain 激活。代码没有改为直接读存储层。

但有一个实际缓解:cleanup service 注册在所有 projection startup service 之前,所以 Orleans silo 在 cleanup 执行时可能还没完全启动。如果 Orleans runtime 在 AddHostedServiceStartAsync 阶段还不能 activate grain,那 type probe 调用会直接失败而非触发 broken activation。

但这不是可靠的假设。 Orleans silo 可能在 IHostedService.StartAsync 之前就已经接受 grain activation 请求了。建议确认 Orleans lifecycle 行为,或仍然改为直接读存储。


P1: DestroyAsync 不删除 storage record — ❌ 未修

RuntimeActorGrain.PurgeAsync 仍然是 WriteStateAsync() 空 state,不是 ClearStateAsync()。grain state blob 在 Redis 里依然存在。配合 [ImplicitStreamSubscription],stale stream message 可以复活 zombie grain。

这个问题要改 RuntimeActorGrain 核心代码,可以理解不在本 PR scope,但建议:

  1. 至少在 runbook 里记录这个风险
  2. 在 cleanup 里显式删除被销毁 actor 的 StreamTopologyGrain

P1: StreamTopologyGrain 未清理 — ❌ 未修

Cleanup 移除了 relay binding(target 是 destroyed actor 的),但没有 ClearAsync() destroyed actor 自己的 StreamTopologyGrain。outgoing forwarding binding 会残留。


新增功能 — UserAgent 动态发现

DiscoverRetiredCatalogUserAgentTargetsAsync 从 catalog event stream 中解析所有 user agent ID,然后按前缀/类型过滤出 SkillRunner/WorkflowAgent。这是对上次 review 关于"只清理 6 个硬编码 target"的回应。

实现基本正确,但有几个注意点:

  1. event stream 可能在 cleanup 执行前已被部分清理:如果上一轮 cleanup 已经 reset 了 agent-registry-store 的 event stream,这一轮 DiscoverRetiredCatalogUserAgentTargetsAsync 读到的就是空,所有动态 user agent 都会被跳过。但因为 lease marker 是全局一次性的,这种情况只会发生在上一轮部分成功但 marker 未 complete 时。结合 HasEventStreamAsync fallback,这个 edge case 可以工作。

  2. IsGeneratedUserAgent 只按前缀和 AgentType 匹配:如果历史上有非标准前缀的 agent,会被遗漏。对于 migration 场景可以接受。

  3. ReadModel cleanup 现在有 best-effort wrapperCleanupReadModelsBestEffortAsync catch 异常后继续,这是对的。


总结

问题 状态
P0: 崩溃重跑跳过 reset ✅ 已修
P0: type probe 触发激活 ⚠️ 未修,有隐式缓解但不可靠
P1: storage record 未删 ❌ 未修(需改核心 grain)
P1: StreamTopologyGrain 残留 ❌ 未修
动态 user agent 发现 ✅ 新增,基本正确
ReadModel cleanup 容错 ✅ 已修

建议: P0 activation 问题至少需要在 runbook 里记录风险。StreamTopologyGrain 清理可以在本 PR 里加上(调用 _streamProvider.GetStream(target.ActorId).ClearAsync() 或等效操作)。其余可以 ack 后 merge。


eanzhao commented Apr 28, 2026

Re-review: assessment after the fix ✅

Status of the core issues

Can it clean up old agents? Mostly, yes.

Issue Status Notes
6 hardcoded root actors ✅ kept channel-bot-registration-store and the others
Dynamic skill-runner-* / workflow-agent-* added reverse discovery from the agent-registry-store event stream
Contains false-positive risk fixed switched to ContainsTypeNameToken with [ ] , boundary checks
Re-entry after partial cleanup fixed shouldContinueReset + HasEventStreamAsync; a crashed pod can resume
Read-model failure blocking startup fixed CleanupReadModelsBestEffortAsync catches the exception and only logs

Key improvements

  1. The dynamic actor discovery logic is reasonable
    DiscoverCatalogUserAgentIdsAsync reads UserAgentCatalogUpsertedEvent/TombstonedEvent/ExecutionUpdatedEvent/TombstonesCompactedEvent, de-duplicates with a HashSet, and then filters skill-runner-* and workflow-agent-* through IsGeneratedUserAgent. This fits the actual data distribution better than hardcoding.

  2. Type matching is safer

    ContainsTypeNameToken(runtimeTypeName, token)
    // checks that the characters around the token are [ ] , space boundaries

    This essentially removes the chance of Aevatar.GAgents.ChannelRuntime.V2.Xxx being destroyed by mistake.

Missing tests

Recommend a dedicated unit test for the ContainsTypeNameToken boundary behaviour:

  • Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent → match ✅
  • Aevatar.GAgents.ChannelRuntime.V2.SkillRunnerGAgent → no match ✅
  • Aevatar.CQRS.Projection.Core.Orchestration.ProjectionMaterializationScopeGAgent`1[[Aevatar.GAgents.ChannelRuntime.ChannelBotRegistrationMaterializationContext]] → match ✅

The current tests only contain positive cases; negative cases are missing. A sketch of such a test follows.
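A hedged sketch of the suggested boundary tests as xUnit theories; the static matcher class used here is hypothetical (the later refactor exposes the matcher via RetiredActorTarget.MatchesRuntimeType), so the call site would need adjusting to however the production matcher is reachable from tests.

// Sketch: positive and negative boundary cases for the retired-type token match.
// RetiredTypeTokenMatcher is a hypothetical accessor for the production matcher.
using FluentAssertions;
using Xunit;

public class RetiredTypeTokenBoundaryTests
{
    private const string Token = "Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent";

    [Theory]
    [InlineData("Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent", true)]                  // exact name
    [InlineData("Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent, Aevatar.GAgents", true)] // assembly-qualified
    [InlineData("Aevatar.GAgents.ChannelRuntime.V2.SkillRunnerGAgent", false)]              // different namespace
    [InlineData("Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgentProxy", false)]            // token is only a prefix
    public void Matches_only_at_type_name_boundaries(string runtimeTypeName, bool expected)
    {
        RetiredTypeTokenMatcher.ContainsTypeNameToken(runtimeTypeName, Token)
            .Should().Be(expected);
    }
}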

Conclusion

This fix promotes the PR from "mitigation" to "production-usable". Dynamic actor cleanup + re-entry safety + lenient failure handling now cover the main leftovers of retiring ChannelRuntime. Suggest adding the ContainsTypeNameToken boundary tests before merging.


@eanzhao eanzhao left a comment


Found a scenario that still misses old-agent cleanup; flagged inline.


private async Task<IReadOnlyList<string>> DiscoverCatalogUserAgentIdsAsync(CancellationToken ct)
{
var events = await _eventStore
Contributor Author

There is still a blocker here: scanning only the agent-registry-store event stream is not reliable in production.

UserAgentCatalogGAgent is a GAgentBase<UserAgentCatalogState>, and the runtime enables snapshot + compaction by default (EnableSnapshots=true, SnapshotInterval=200, EnableEventCompaction=true, RetainedEventsAfterSnapshot=0). After the actor deactivates, the current catalog state is written into RuntimeActorGrainState.AgentStateSnapshot, and compaction then deletes the events at and before the snapshot version.

In that case GetEventsAsync(agent-registry-store) only sees the increments after the snapshot; skill-runner-* / workflow-agent-* entries that are already in the snapshot but unchanged recently are never discovered. This service then destroys/resets the catalog, wiping the snapshot and the remaining stream, so the missed old agent ids can no longer be recovered or cleaned up.

The catalog snapshot/current state (or a reliable read model) needs to be read and merged with the remaining events before the catalog is reset, ideally with a test that old agents can still be discovered after the catalog has been snapshotted + compacted. Otherwise this fix still cannot guarantee old-agent cleanup.

eanzhao and others added 3 commits April 28, 2026 19:33
Extract the one-off ChannelRuntime cleanup into a generic, spec-driven
mechanism that runs on every pod startup and is intrinsically idempotent.

- IRetiredActorSpec abstraction in Aevatar.Foundation.Abstractions.Maintenance:
  modules declare retired type tokens, well-known targets, optional dynamic
  discovery, and optional read-model cleanup.
- Generic RetiredActorCleanupHostedService in Aevatar.Foundation.Runtime.Hosting
  iterates every registered IRetiredActorSpec under a per-spec lease
  (__maintenance:retired-actor-cleanup:{specId}). No "completed forever"
  marker — each spec runs every startup and is a no-op when no targets
  match.
- ChannelRuntime/Device/Scheduled each ship their own IRetiredActorSpec
  alongside their DI extension; the user-agent catalog walker lives with
  Scheduled. Adding a new retired type only requires registering a new
  spec — Mainnet.Host.Api stays untouched.
- Mainnet.Host.Api wires AddRetiredActorCleanup() in place of the old
  module-specific hosted service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao commented Apr 28, 2026

Fix Review (d13b1c2b)

Changes look solid. All previously raised issues are addressed.

What changed

ClearStateAsync() instead of WriteStateAsync() in PurgeAsync — the Orleans grain state key is now actually deleted from Redis, not just emptied. This is correct. After ClearStateAsync(), OnActivateAsync will see _state.State.AgentTypeName as null (default constructor) and skip InitializeAgentInternalAsync. No stale type resolution possible. ✅

CleanupOutgoingRelaysBestEffortAsync — lists outgoing relays on the actor stream and removes them one by one. Best-effort with try-catch. This catches the edge case where a stale child relay entry lives on the parent stream after the parent-child relationship was broken. ✅

Runbook: Out of Scope section — documents ChannelUserGAgent explicitly with manual cleanup instructions (scan Orleans grain storage for channel-user-* records with retired AgentTypeName). ✅

New tests:

  • ShouldNotDestroyActor_WhenRetiredTypeTokenIsOnlySubstring — verifies boundary check rejects ChannelBotRegistrationGAgentProxy matching ChannelBotRegistrationGAgent. ✅
  • skill-runner-proxy in catalog discovery test — same boundary check for dynamic actors. ✅
  • SeedRelay + stale-child-stream assertion — verifies outgoing relay cleanup. ✅
  • ClearCount tracking in OrleansDistributedCoverageTests — aligns with ClearStateAsync change. ✅

One minor observation

PurgeAsync now does _state.State = new RuntimeActorGrainState() then ClearStateAsync(). The assignment is redundant since ClearStateAsync() deletes the key regardless. Not a bug, just unnecessary — _state.ClearStateAsync() alone would suffice. Very minor, ignore if you want to keep it readable.

Verdict

LGTM. Ready to merge.


eanzhao commented Apr 28, 2026

Review of latest commits (generic cleanup)

Checking the previous review's issues one by one:


P1: PurgeAsync storage record — ✅ fixed

- _state.State.AgentId = string.Empty;
- _state.State.AgentTypeName = null;
- ...
- await _state.WriteStateAsync();
+ _state.State = new RuntimeActorGrainState();
+ await _state.ClearStateAsync();

ClearStateAsync() really deletes the grain storage record, so no empty state blob is left behind. That also removes the risk of [ImplicitStreamSubscription] resurrecting a zombie. The tests were updated accordingly (OrleansDistributedCoverageTests: WriteCount → ClearCount).


P1: StreamTopologyGrain outgoing relays — ✅ fixed

Added CleanupOutgoingRelaysBestEffortAsync:

var relays = await stream.ListRelaysAsync(ct);
foreach (var relay in relays)
    await stream.RemoveRelayAsync(relay.TargetStreamId, ct);

It cleans up the destroyed actor's outgoing forwarding bindings, wrapped best-effort so a failure does not block.

The test verifies it too: streamProvider.SeedRelay("channel-bot-registration-store", "stale-child-stream") and streamProvider.RemovedRelays.Should().Contain(("channel-bot-registration-store", "stale-child-stream")).


Type matching precision — ✅ fixed

ContainsTypeNameToken now uses boundary-aware matching (checking that the neighbouring characters are '[', ']', ',', or the string boundary) instead of a plain Contains.

The new test ShouldNotDestroyActor_WhenRetiredTypeTokenIsOnlySubstring verifies that SkillRunnerGAgentProxy is not matched as SkillRunnerGAgent.


Runbook updates — ✅

An "Out of Scope" section was added stating that ChannelUserGAgent is not covered by automatic cleanup, with manual cleanup guidance. A note about the activation warnings was added as well.


New issues

1. Probing the catalog grain inside DiscoverRetiredCatalogUserAgentTargetsAsync can fail

Line 215 calls _typeProbe.GetRuntimeAgentTypeNameAsync(UserAgentCatalogGAgent.WellKnownId, ct). That probe activates the catalog grain. If the catalog grain itself is the retired type, OnActivateAsync may throw (depending on how InitializeAgentInternalAsync handles an unresolvable type), failing the whole discovery phase and skipping every dynamic user agent.

The stub in the tests does not activate a real grain, so this never shows up there. In a production Orleans environment, though, this is the only target that must be probed during the discovery phase (all other targets are probed inside CleanupTargetAsync).

Suggestion: catch grain activation failures in DiscoverRetiredCatalogUserAgentTargetsAsync and fall back to checking whether the event stream exists (same idea as the shouldContinueReset path). A sketch follows.
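A hedged fragment of that catch + fallback; _typeProbe, _logger, HasEventStreamAsync, and MatchesRetiredType mirror names already discussed in this PR, while catalogTarget and the overall shape are illustrative.

// Sketch: treat a failed catalog probe (e.g. activation throwing on the retired
// type) the same as "runtime type unavailable" and fall back to the event-stream check.
string? catalogRuntimeTypeName = null;
try
{
    catalogRuntimeTypeName = await _typeProbe
        .GetRuntimeAgentTypeNameAsync(UserAgentCatalogGAgent.WellKnownId, ct)
        .ConfigureAwait(false);
}
catch (OperationCanceledException)
{
    throw;
}
catch (Exception ex)
{
    _logger.LogWarning(ex, "Catalog type probe failed; falling back to event-stream presence check");
}

var catalogLooksRetired =
    MatchesRetiredType(catalogTarget, catalogRuntimeTypeName) ||
    (string.IsNullOrWhiteSpace(catalogRuntimeTypeName) &&
     await HasEventStreamAsync(UserAgentCatalogGAgent.WellKnownId, ct).ConfigureAwait(false));

if (!catalogLooksRetired)
    return [];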

2. The type probe in the ShouldNotDestroyActor_WhenRetiredTypeTokenIsOnlySubstring test configures only one actor

var typeProbe = new StubActorTypeProbe(new Dictionary<string, string?>
{
    ["channel-bot-registration-store"] = "...GAgentProxy,...",
});

The other 5 hardcoded targets plus the catalog are not in the typeProbe, so they return null → take the shouldContinueReset path → continue cleanup only if an event stream exists.

In other words, this test passes because, apart from channel-bot-registration-store, none of the targets has an event stream (no AppendSingleEventAsync). If someone later adds more setup to this test, it may break unexpectedly.

Not a bug, but the test's implicit precondition is fragile.

3. The marker stream is never cleaned up

The __maintenance:retired-channelruntime-cleanup:v1 marker events (InProgress + Completed) stay in the event store forever. Acceptable for a one-time migration, but if InProgressTimeoutSeconds is misconfigured and causes repeated retries, marker events accumulate.

Low priority; does not affect correctness.


Summary

All 4 P0/P1 items from the previous review are fixed; the cleanup can now reliably:

  • fully delete the grain storage record (no empty state left behind)
  • clean up the outgoing stream topology
  • match the retired type precisely (no false positives)
  • continue the reset when rerun after a crash

The remaining risk is the catalog grain probe in the discovery phase possibly failing in production. Suggest adding a catch + fallback, or confirming that RuntimeActorGrain.OnActivateAsync does not throw when the agent type cannot be resolved (only logs an error and skips init).

Overall LGTM; fine to merge.


eanzhao commented Apr 28, 2026

Generalization review

The overall architecture is sound: the hardcoded cleanup has evolved into a pluggable, spec-driven framework. Issues to watch:

🔴 Must fix

1. RetiredActorReadModelHelpers pollutes cross-module dependencies

DeleteByActorAsync lives in the Aevatar.GAgents.Channel.Runtime namespace, but DeviceRetiredActorSpec and ScheduledRetiredActorSpec both reference it. That gives the Device/Scheduled modules an unnecessary compile-time dependency on Channel.Runtime.

Suggestion: move this helper to Aevatar.Foundation.Abstractions.Maintenance or Aevatar.Foundation.Runtime.Hosting as framework-level infrastructure.

2. Race window between dynamic discovery and crash re-entry

ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync reads the agent-registry-store event stream to discover skill-runner-* / workflow-agent-*. If a pod crashes after destroying the catalog, the pod that takes over will find:

  • the catalog stream may already be reset (version=0)
  • dynamic targets can no longer be discovered
  • leftover skill-runner/workflow-agent actors are never cleaned up

Suggestion: in RetiredActorCleanupHostedService.RunSpecAsync, cache the dynamic discovery results locally (or in the marker stream) so that even if the catalog stream is reset, a later re-entry can still clean the already-discovered dynamic targets.

🟡 Suggested improvements

3. Lease timeout does not match cleanup duration

InProgressTimeoutSeconds defaults to 300s. If a spec has a large number of targets (for example thousands of dynamically generated skill-runners), cleanup may take longer than 300s, the lease gets stolen by another pod, and the work runs twice.

Suggestion:

  • either raise the default InProgressTimeoutSeconds (e.g. 1800s)
  • or refresh the lease periodically during cleanup (heartbeat) instead of waiting for a single completion

4. IEventStore used directly inside the spec

ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync injects IEventStore directly to read the catalog stream. Cleanup is a special case, but this still violates layering (the spec sits in the application layer and depends directly on infrastructure).

Suggestion: provide a higher-level abstraction, e.g. an IRetiredActorDiscoveryPort implemented in the infrastructure layer, with the spec depending only on the port.

5. Missing distributed-scenario tests

The existing tests cover:

  • ✅ single-spec cleanup
  • ✅ multi-spec cleanup
  • ✅ dynamic discovery
  • ✅ boundary matching
  • ✅ failure tolerance

but not:

  • ❌ lease contention (two pods trying to take the same spec's lease at the same time)
  • ❌ takeover after a stale lease timeout
  • ❌ state consistency after cancellation mid-cleanup (OperationCanceledException)

✅ Design highlights

  1. Pluggable architecture: IRetiredActorSpec + TryAddEnumerable lets each module declare its retired actors self-contained, with no changes to the core cleanup service
  2. Idempotency: the "released" semantics allow repeated runs, and the typeProbe check before actual cleanup avoids repeated work
  3. Best-effort tolerance: read-model and relay cleanup failures do not block the main flow
  4. ContainsTypeNameToken boundary check: variants such as GAgentProxy are not destroyed by mistake

Conclusion

The generalization direction is right and the architecture is clean. Fix issue 1 (helper location) and issue 2 (re-entry safety of dynamic discovery) first; those are the production risks. Issues 3 and 4 can be follow-up optimizations.

@eanzhao eanzhao left a comment


I reviewed the latest commits after the generalization; there is still one blocker that affects the completeness of old-agent cleanup, flagged inline.

IEventStore eventStore,
CancellationToken ct)
{
var events = await eventStore
Contributor Author

The blocker is still here; it only moved from the hosted service into the Scheduled spec: reading only the agent-registry-store event stream cannot reliably discover the old generated agents.

UserAgentCatalogGAgent is an event-sourced actor, and the runtime enables snapshot + compaction by default (EnableSnapshots=true, SnapshotInterval=200, EnableEventCompaction=true, RetainedEventsAfterSnapshot=0). The catalog's current entries may exist only in RuntimeActorGrainState.AgentStateSnapshot, with the upsert/tombstone events at and before the snapshot version deleted by compaction. GetEventsAsync(UserAgentCatalogGAgent.WellKnownId) then only sees the post-snapshot increments, and skill-runner-* / workflow-agent-* entries that are in the snapshot but unchanged recently are missed.

Because this spec later destroys/resets agent-registry-store, the missed actor ids disappear along with the catalog snapshot/stream and can never be cleaned automatically afterwards. The catalog current state/snapshot needs to be read and merged with the remaining events before the catalog is reset, or a reliable read model / dedicated maintenance reader should be used instead; also add a test that old generated agents can still be discovered after the catalog has been snapshotted + compacted.

[EnumeratorCancellation] CancellationToken ct)
{
var eventStore = services.GetRequiredService<IEventStore>();
var agentIds = await DiscoverCatalogUserAgentIdsAsync(eventStore, ct).ConfigureAwait(false);
Contributor Author

This drops the original trigger condition: DiscoverDynamicTargetsAsync now unconditionally replays agent-registry-store on every startup and then type-probes every skill-runner-* / workflow-agent-* actor in the catalog. For a cluster that has already migrated, where the catalog's runtime type is already in the new namespace, this scans the entire catalog history and activates/probes a large number of current actors on every pod startup; the bigger the catalog, the higher the startup cost and the noise. The previous implementation only did dynamic discovery when agent-registry-store itself was still the retired type, or when the catalog state was cleared but the stream still existed (the partial-cleanup case). Suggest putting that gate back at the spec level, or having the generic cleanup check the spec's catalog/root target before invoking dynamic discovery.


eanzhao commented Apr 28, 2026

Generalization review - latest commits

Architecture assessment

The generalization direction is right: the hardcoded ChannelRuntime cleanup has evolved into a pluggable IRetiredActorSpec framework. Each module declares its retired actors self-contained, and the core cleanup service supports new modules with zero changes.

Fix status for the previous review's issues

Issue Status Notes
RetiredActorReadModelHelpers cross-module dependency not fixed still in the Aevatar.GAgents.Channel.Runtime namespace, referenced by Device/Scheduled
Dynamic discovery + crash re-entry race not fixed ScheduledRetiredActorSpec still reads the catalog stream directly, no caching
Lease timeout 300s without heartbeat not fixed the lease can be stolen by another pod when there are many targets
IEventStore used directly in the spec not fixed still violates layering

New observations

1. ✅ Marker semantics improved
Changed from "completed forever" to "released"; the cleanup service runs on every startup and is naturally idempotent. More robust than the previous one-shot marker.

2. ✅ PurgeAsync correctness fix

// Before: manually clear the fields
_state.State.AgentId = string.Empty;
// ...
await _state.WriteStateAsync();

// After: use ClearStateAsync
_state.State = new RuntimeActorGrainState();
await _state.ClearStateAsync();

This is more correct with respect to Orleans persistence semantics.

3. ⚠️ Registration position of AddRetiredActorCleanup()
In MainnetHostBuilderExtensions it sits before AddChannelRuntime/AddDeviceRegistration/AddScheduledAgents, which is meant to ensure the cleanup hosted service runs before each module's startup service.

Note, though: RetiredActorCleanupHostedService is registered as an IHostedService via TryAddEnumerable, and the module startup services are also IHostedServices. Whether the host starts hosted services sequentially in registration order or concurrently depends on host configuration, so registration order on its own is not a guarantee of ordering.

Suggestion: if cleanup must run before the module startups, use an explicit synchronization mechanism (e.g. module startup services awaiting a dependency completed by the cleanup service) rather than relying only on DI registration order. A sketch of that option follows.
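A hedged sketch of one explicit synchronization option: a shared gate the cleanup service completes and module startup services await. Every type name here is hypothetical; none of it is part of this PR.

// Sketch only: all names are hypothetical, not part of this PR.
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Shared singleton: completed once retired-actor cleanup has finished.
public sealed class RetiredActorCleanupGate
{
    private readonly TaskCompletionSource _done =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task Completion => _done.Task;
    public void MarkCompleted() => _done.TrySetResult();
}

// The cleanup hosted service signals the gate when it finishes.
public sealed class CleanupHostedService : BackgroundService
{
    private readonly RetiredActorCleanupGate _gate;
    public CleanupHostedService(RetiredActorCleanupGate gate) => _gate = gate;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // ... run the spec-driven cleanup here ...
        await Task.CompletedTask;
        _gate.MarkCompleted();
    }
}

// A module startup service waits for the gate before priming its projections.
public sealed class ModuleStartupService : IHostedService
{
    private readonly RetiredActorCleanupGate _gate;
    public ModuleStartupService(RetiredActorCleanupGate gate) => _gate = gate;

    public async Task StartAsync(CancellationToken ct)
    {
        await _gate.Completion.WaitAsync(ct); // holds this module until cleanup signals
        // ... existing projection priming ...
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}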

Code-level issues

4. ContainsTypeNameToken is still not fully safe

private static bool IsTypeNameBoundary(char value) =>
    value is '[' or ']' or ',' or ' ';

If a retired type name is a prefix of another type name, false positives are still possible. For example:

  • Token: Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent
  • Actual type: Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgentV2

There is no boundary character before "V2" (the preceding character is 't'), so that one is not wrongly matched. But with:

  • Token: Aevatar.GAgents.ChannelRuntime.Agent
  • Actual type: Aevatar.GAgents.ChannelRuntime.AgentRegistryGAgent

it would be dangerous. Fortunately the current tokens are all full class names, not prefixes.

Suggestion: state explicitly in the XML docs for RetiredActorTarget that RetiredTypeTokens must be full type names (or at least include a boundary character such as '.' after the class name).

5. Exception handling in DiscoverDynamicTargetsAsync

public override async IAsyncEnumerable<RetiredActorTarget> DiscoverDynamicTargetsAsync(...)
{
    var eventStore = services.GetRequiredService<IEventStore>();
    var agentIds = await DiscoverCatalogUserAgentIdsAsync(eventStore, ct).ConfigureAwait(false);
    // ...
}

If GetRequiredService<IEventStore>() fails (for example, not registered in a test environment), the exception propagates and aborts the cleanup for the whole spec.

Suggestion: wrap it in a try/catch, or provide default exception handling in the RetiredActorSpec base class.

Conclusion

The generalized architecture is well designed, but none of the 4 production issues raised in the previous review are fixed. Suggested priorities:

  1. Fix now: move RetiredActorReadModelHelpers to the Foundation layer (compile-time dependency pollution)
  2. High priority: cache dynamic discovery results (avoid losing targets after the catalog stream is reset)
  3. Medium priority: lease heartbeat or a larger default timeout (avoid duplicate cleanup in large clusters)
  4. Low priority: explicit synchronization of IHostedService start ordering

The rest can be merged.


eanzhao commented Apr 28, 2026

Review: generalization round

The additions look solid overall: outgoing relay cleanup, boundary-aware type token matching, catalog user agent discovery, and partial-cleanup resilience all make sense. But there are two relay-cleanup issues that can crash pod startup:

1. Incoming relay removal is not best-effort — can crash the pod

In CleanupTargetAsync:

// ⚠️ NO error handling
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}

// ✅ wrapped in try-catch
await CleanupOutgoingRelaysBestEffortAsync(target.ActorId, ct).ConfigureAwait(false);

The outgoing relay cleanup is wrapped in try-catch (CleanupOutgoingRelaysBestEffortAsync), but the incoming relay removal is not. If RemoveRelayAsync on the parent stream throws (e.g., parent already destroyed, relay already deleted), the exception propagates out of StartAsync and crashes pod startup.

Fix: wrap the incoming removal similarly — either extract a CleanupIncomingRelaysBestEffortAsync pattern, or fold it into the existing best-effort method, or skip it for projection scopes entirely (see issue 2).
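A hedged fragment of the first option; a later commit adds CleanupIncomingRelayBestEffortAsync along these lines, though this exact body is illustrative and the _streamProvider/_logger members are assumed.

// Sketch: mirror the outgoing-relay pattern and keep incoming-relay removal best-effort.
private async Task CleanupIncomingRelayBestEffortAsync(RetiredActorTarget target, CancellationToken ct)
{
    if (string.IsNullOrWhiteSpace(target.SourceStreamId))
        return;

    try
    {
        await _streamProvider
            .GetStream(target.SourceStreamId)
            .RemoveRelayAsync(target.ActorId, ct)
            .ConfigureAwait(false);
    }
    catch (OperationCanceledException)
    {
        throw;
    }
    catch (Exception ex)
    {
        // The relay may already be gone (parent cleaned first) or the provider may be
        // transiently unavailable; neither should abort destroy/reset for this target.
        _logger.LogWarning(ex, "Incoming relay removal skipped for {ActorId} from {SourceStreamId}",
            target.ActorId, target.SourceStreamId);
    }
}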

2. Double relay cleanup on projection scope targets

Root actors (e.g. channel-bot-registration-store) run first. Their CleanupOutgoingRelaysBestEffortAsync already deletes ALL outgoing relays, including the relay forwarding to projection.durable.scope:channel-bot-registration:....

Then when the projection scope target runs, SourceStreamId = "channel-bot-registration-store" causes a second RemoveRelayAsync on the same (now-deleted) relay. If the relay implementation is not idempotent on missing relays, this throws — and because it's the uncaught incoming path (issue 1), it crashes the pod.

Fix: either make incoming relay removal best-effort, or skip SourceStreamId-based removal when the source stream has already been cleaned (e.g., check HasEventStreamAsync on the source before trying RemoveRelayAsync).

3. Minor: HasEventStreamAsync in DiscoverRetiredCatalogUserAgentTargetsAsync

if (!catalogIsRetired &&
    !(string.IsNullOrWhiteSpace(catalogRuntimeTypeName) &&
      await HasEventStreamAsync(UserAgentCatalogGAgent.WellKnownId, ct).ConfigureAwait(false)))
{
    return [];
}

The HasEventStreamAsync check calls GetVersionAsync again even though GetRuntimeAgentTypeNameAsync already activated the grain and loaded its state. Consider caching the version from the type probe result to avoid a redundant event-store read — minor perf, but the catalog event replay that follows is the real cost anyway.


eanzhao commented Apr 28, 2026

Final review: assessment of the completed generalization

After several review rounds, the core functionality of this PR is mature. Final assessment:

✅ Resolved (production-ready)

Issue Status Verification
Dynamic actor discovery ✅ fixed ScheduledRetiredActorSpec discovers skill-runner/workflow-agent from the catalog event stream
Type-matching boundary check ✅ fixed ContainsTypeNameToken prevents GAgentProxy false positives
Crash re-entry safety ✅ fixed ResetWhenRuntimeTypeUnavailable + HasEventStreamAsync
Read-model tolerance ✅ fixed CleanupReadModelsBestEffortAsync catches non-cancellation exceptions
Grain storage cleanup ✅ fixed PurgeAsync uses ClearStateAsync instead of WriteStateAsync
Outgoing relay cleanup ✅ fixed CleanupOutgoingRelaysBestEffortAsync removes the actor's outgoing bindings
Generalized architecture ✅ fixed IRetiredActorSpec + RetiredActorCleanupHostedService

⚠️ Known limitations (suggest fixing in follow-up PRs)

Issue Priority Notes
RetiredActorReadModelHelpers cross-module dependency P2 lives in the Channel.Runtime namespace, referenced by Device/Scheduled
No caching for dynamic discovery P2 once the catalog stream is reset, a re-entering pod cannot discover dynamic targets
Lease without heartbeat P3 300s timeout; other pods may steal the lease when there are many targets
IEventStore used directly in specs P3 violates layering, but acceptable for a cleanup scenario
IHostedService start ordering P3 hosted services may start concurrently; registration order alone does not guarantee cleanup runs before module startup

📋 Test coverage

  • RetiredActorCleanupHostedServiceTests (551 lines) covers:
    • retired actor cleanup (relay + stream reset)
    • current-type skip
    • boundary matching (GAgentProxy not matched)
    • continued stream cleanup for already-destroyed actors
    • catalog dynamic discovery (skill-runner/workflow-agent)
    • read-model deletion and failure tolerance
    • multiple specs executing together
  • MainnetHostCompositionTests verifies the cleanup hosted service registration order
  • GarnetEventStoreIntegrationTests verifies ResetStreamAsync
  • OrleansDistributedCoverageTests verifies ClearStateAsync

🎯 Conclusion

Approve — this PR resolves the startup-blocking problem caused by retiring ChannelRuntime, and the generalized architecture provides an extensible framework for future retirement cleanups.

Suggest handling the P2/P3 items in later iterations after merging, in particular:

  1. move RetiredActorReadModelHelpers to the Foundation layer
  2. add a caching mechanism for dynamic discovery results (or document the limitation)

- Restore the catalog discovery gate that the refactor accidentally
  dropped: ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync now
  runs only when the catalog itself still looks retired (matches a
  retired runtime-type token, or runtime type unavailable but stream
  still has events). Warm clusters no longer replay
  agent-registry-store on every startup.
- Read generated agent ids from UserAgentCatalogDocument first, then
  merge catalog upsert events. The previous event-only walk dropped
  entries that had been swallowed by snapshot + compaction; the
  read-model survives that case.
- Hoist the type-token matching logic onto RetiredActorTarget so the
  hosted service and spec gate share one boundary-aware matcher.
- Tests: catalog walk skipped when runtime type is current; ids
  discovered from the read model when the catalog event stream has
  been compacted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
string? cursor = null;
do
{
var result = await reader.QueryAsync(
Contributor Author

This read-model supplementary discovery is best made best-effort. Right now a non-cancellation exception thrown by reader.QueryAsync(...) propagates from DiscoverDynamicTargetsAsync straight into RunSpecAsync, and the hosted service's outer handler only swallows OperationCanceledException. So when the projection store is temporarily unavailable, the query is unsupported, or mapping fails, the startup cleanup itself takes down pod startup, and neither the event-stream fallback nor the static target cleanup that follows ever runs. This read-model path exists to plug the snapshot+compaction hole; suggest catching non-cancellation exceptions, logging a warning, and continuing with DiscoverFromCatalogEventsAsync. Non-compacted clusters can then still be cleaned, and the compaction case merely degrades instead of blocking startup.


eanzhao commented Apr 28, 2026

Fix Review: Address spec-driven cleanup review comments ✅

Commit c3b96ec4 resolves several key issues from the previous review.

✅ Fixed

1. The catalog discovery gate is restored

ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync now runs only when the catalog itself still looks retired:

  • it matches a retired runtime-type token, or
  • the runtime type is unavailable but the event stream still has events

This stops warm clusters from replaying agent-registry-store on every startup.

2. Dual discovery path (read model + event stream)

// read from the read model first (survives snapshot+compaction)
await foreach (var actorId in DiscoverFromReadModelAsync(services, ct))
    agentIds.Add(actorId);

// then merge entries from the event stream that are not yet projected
foreach (var actorId in await DiscoverFromCatalogEventsAsync(eventStore, ct))
    agentIds.Add(actorId);

This fixes the earlier problem of the event-only walk losing entries after compaction.

3. Type-matching logic hoisted onto RetiredActorTarget

RetiredActorTarget.MatchesRuntimeType() now contains the boundary-aware ContainsTypeNameToken, so the spec gate and the hosted service share one matcher.

✅ New test coverage

Test Scenario
ShouldSkipCatalogWalk_WhenCatalogRuntimeTypeIsAlreadyCurrent warm clusters do not execute the catalog walk
ShouldDiscoverRetiredUserAgentsFromReadModel_WhenCatalogStreamHasBeenCompacted discovery still works from the read model after compaction

⚠️ Remaining limitations

Issue Priority Status
RetiredActorReadModelHelpers cross-module dependency P2 not fixed (still in the Channel.Runtime namespace)
Lease without heartbeat (300s timeout) P3 not fixed
IHostedService start ordering P3 not fixed

Conclusion

This fix resolves the regression introduced by the generalization (the catalog discovery gate being dropped accidentally) and the data-loss problem in the compaction scenario. The dual discovery path is a sound design and the test coverage is adequate.

LGTM; fine to merge. Suggest moving RetiredActorReadModelHelpers to the Foundation layer in a later iteration.


eanzhao commented Apr 28, 2026

Review: Spec-driven refactor + latest fixes

This refactor turns the hardcoded cleanup into a generic IRetiredActorSpec plugin architecture; the design is good. Going through it:


Architecture — ✅ excellent

  • IRetiredActorSpec / the RetiredActorSpec base class + the RetiredActorTarget value object live in Foundation.Abstractions
  • RetiredActorCleanupHostedService lives in Foundation.Runtime.Hosting (no longer Mainnet.Host specific)
  • Each module registers its own spec (TryAddEnumerable); the Mainnet host only calls AddRetiredActorCleanup()
  • The runbook has an "Adding a New Retired-Actor Spec" section, clearly written
  • MatchesRuntimeType moved from the hosted service onto the RetiredActorTarget record, where it belongs

All issues from previous reviews — ✅ still covered

  • ClearStateAsync() deletes the grain storage record
  • CleanupOutgoingRelaysBestEffortAsync cleans outgoing relays
  • boundary-aware type matching
  • the continue-reset path for crash reruns
  • the read-model cleanup best-effort wrapper

New highlights

  1. Read-model discovery fallback: ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync reads agent IDs from the UserAgentCatalogDocument read model first and then supplements from the event stream. This solves the loss of the original UserAgentCatalogUpsertedEvent entries from the event stream after snapshot+compaction. The test ShouldDiscoverRetiredUserAgentsFromReadModel_WhenCatalogStreamHasBeenCompacted covers that scenario.

  2. Catalog gate: ShouldDiscoverFromCatalogAsync checks whether the catalog runtime type is retired. When the catalog is already in the new namespace, discovery is skipped (avoiding a pointless scan on every startup). The test ShouldSkipCatalogWalk_WhenCatalogRuntimeTypeIsAlreadyCurrent verifies this.

  3. Improved lease semantics: the marker changed from Completed (permanent) to Released (re-acquired on every startup). Cleanup is now an idempotent resident mechanism rather than one-shot; future specs take effect automatically without touching the Mainnet host.

  4. Unrelated file changes removed: the earlier ConversationGAgent / ChannelConversationTurnRunner / test changes are no longer in the PR. Clean diff.


Points to note

1. Every spec runs on every startup — an implicit performance assumption

Changing the marker from Completed to Released means every pod startup acquires the lease for every spec and probes all targets. On a fully migrated cluster, GetRuntimeAgentTypeNameAsync still triggers a grain activation per target (even though the result is the new namespace and cleanup is skipped).

With 3 specs × 6 hardcoded targets + the catalog gate probe, that is roughly 7 grain activations per startup. Acceptable. But if specs or target lists grow, this cost grows linearly.

Suggestion: not a blocking issue, but record the expected startup cost in the runbook to make future startup-latency investigation easier.

2. DiscoverFromReadModelAsync filters on doc.Id rather than a dedicated agent ID field

// ScheduledRetiredActorSpec line ~175
foreach (var doc in result.Items)
{
    if (IsGeneratedUserAgent(doc.Id, doc.AgentType))
        yield return doc.Id.Trim();
}

The query filters UserAgentCatalogDocument by ActorId == UserAgentCatalogGAgent.WellKnownId but returns doc.Id. If UserAgentCatalogDocument.Id happens to be the agent ID (it appears to be, since the catalog document is per agent entry), this is fine. But the implicit assumption is Id == AgentId; if the document schema ever changes (e.g. Id becomes a composite key), this breaks.

Suggestion: confirm that UserAgentCatalogDocument.Id really is the agent ID. If unsure, a dedicated AgentId / EntryAgentId field is safer.

3. RetiredActorReadModelHelpers.DeleteByActorAsync is defined in ChannelRuntimeRetiredActorSpec.cs

This static helper is shared by ChannelRuntimeRetiredActorSpec, DeviceRetiredActorSpec, and ScheduledRetiredActorSpec. Because it is defined in ChannelRuntimeRetiredActorSpec.cs, Device and Scheduled depend on the Channel.Runtime package just to call it.

If RetiredActorReadModelHelpers is a generic helper, it belongs in Foundation.Abstractions or Foundation.Runtime.Hosting, not in a concrete module's spec file. Otherwise the Device and Scheduled dependency on Channel.Runtime stays implicit.

Suggestion: move RetiredActorReadModelHelpers to Foundation.Abstractions or Foundation.Runtime.Hosting.


Summary

This is a good refactor. The generalization is done right, and the test coverage is thorough (9 test cases including the compaction fallback, substring protection, and multi-spec). None of the 3 points above is blocking; they can be handled in a follow-up.

LGTM.


eanzhao commented Apr 28, 2026

Review: spec-driven refactor

The spec-driven architecture is the right direction. IRetiredActorSpec + RetiredActorTarget.MatchesRuntimeType + per-module specs + TryAddEnumerable is clean and extensible. The new DiscoverDynamicTargetsAsync with read-model fallback for post-compaction catalogs is a good defensive addition.

One remaining issue:

Incoming relay removal still not best-effort

In RetiredActorCleanupHostedService.CleanupTargetAsync:

// ❌ no error handling
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}

// ✅ has error handling
await CleanupOutgoingRelaysBestEffortAsync(spec, target.ActorId, ct).ConfigureAwait(false);

The outgoing path is wrapped in try-catch; the incoming path is not. When a root actor is cleaned first (removing its outgoing relay to the projection scope via CleanupOutgoingRelaysBestEffortAsync), the projection scope target then tries to RemoveRelayAsync the same (now-deleted) relay from the parent stream. If the IStream implementation throws on missing relays, this crashes pod startup.

Fix: wrap the incoming removal in try-catch, matching the outgoing pattern, or skip it when HasEventStreamAsync(sourceStreamId) returns false (parent already cleaned).

Minor: shared helper location

RetiredActorReadModelHelpers is in agents/Aevatar.GAgents.Channel.Runtime/ChannelRuntimeRetiredActorSpec.cs but is a cross-module utility (used by the Device and Scheduled specs too). using Aevatar.GAgents.Channel.Runtime references already exist in those projects, so this is not a new dependency — but consider moving it to Aevatar.Foundation.Abstractions.Maintenance alongside the other abstractions for clarity.


eanzhao commented Apr 28, 2026

Final Review: Current State Assessment

Latest commit on the PR: c3b96ec4

Checklist

Item Notes
Generalized architecture IRetiredActorSpec + RetiredActorCleanupHostedService, sound design
Static target cleanup ChannelRuntime/Device/Scheduled with 2 targets each
Dynamic target discovery dual path: catalog read model + event stream
Discovery gate the catalog walk runs only when the catalog is retired
Type-matching boundary check RetiredActorTarget.MatchesRuntimeType is boundary-aware
Crash re-entry safety ResetWhenRuntimeTypeUnavailable + HasEventStreamAsync
Grain storage cleanup PurgeAsync uses ClearStateAsync
Outgoing relay cleanup CleanupOutgoingRelaysBestEffortAsync
Read-model tolerance CleanupReadModelsBestEffortAsync
Lease mutual exclusion per-spec marker stream + OCC
Test coverage 7 new tests covering the core paths

Remaining risks (non-blocking)

  1. IActorTypeProbe may trigger a retired-grain activation failure

    • GetRuntimeAgentTypeNameAsync activates the grain to read AgentTypeName
    • if the retired assembly is absent, activation may throw
    • CleanupTargetAsync currently has no try-catch, so an exception aborts the whole spec
    • mitigation: cleanup runs before module startup, when the Orleans silo may not be fully ready
  2. No exception isolation around the critical steps in CleanupTargetAsync

    • a failure in RemoveRelayAsync, DestroyAsync, or ResetStreamAsync aborts the spec
    • impact: the lease is not released, and other pods wait out the 300s timeout before retrying
  3. Location of RetiredActorReadModelHelpers

    • still in the Aevatar.GAgents.Channel.Runtime namespace
    • the Device/Scheduled modules carry a compile-time dependency on it

Verdict

Approve — the core functionality is complete, the architecture is extensible, and the test coverage is adequate.

Risks 1 and 2 above are extreme edge cases; suggest hardening CleanupTargetAsync with a per-target try-catch in a follow-up PR. For risk 3, move the helper to the Foundation layer.

The current PR is safe to merge.

The previous fix routed ScheduledRetiredActorSpec discovery through the
UserAgentCatalogDocument read model first to survive snapshot+compaction,
but a non-cancellation exception from QueryAsync would propagate to the
hosted service's StartAsync and abort pod startup — also skipping the
event-stream fallback and every static target cleanup that follows.

Wrap the read-model probe in a try/catch (excluding cancellation),
log a warning, and fall through to the catalog-event walk. Read-model
remains the fix for compacted catalogs; transient projection-store
failures merely degrade the discovery instead of blocking the host.

Adds a test that injects a throwing IProjectionDocumentReader and
asserts the static and event-stream-derived targets are still cleaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

eanzhao commented Apr 28, 2026

Review: latest commit (7997680)

The DiscoverFromReadModelBestEffortAsync addition in ScheduledRetiredActorSpec is good — projection store unavailability during catalog discovery now degrades gracefully to event-stream walk instead of crashing startup.

The incoming relay removal issue in RetiredActorCleanupHostedService.CleanupTargetAsync is still unfixed:

// Still no try-catch
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}

This is the only remaining non-best-effort destructive path in the cleanup flow. If it's intentional (the IStream production implementation is known to be idempotent for missing relays), worth a comment documenting that assumption. Otherwise wrap it and resolve.


eanzhao commented Apr 28, 2026

Review: best-effort read-model discovery

The change is small and correct:

  • DiscoverFromReadModelAsync changed from IAsyncEnumerable to Task<IReadOnlyList>, wrapped in DiscoverFromReadModelBestEffortAsync, which catches all non-cancellation exceptions, logs a warning, and falls back to an empty list
  • the new test ShouldFallBackToCatalogEvents_WhenReadModelDiscoveryThrows verifies that when the read model throws, skill-runner-recent is still discovered from the event stream and cleaned up normally

This also addresses the flip side of the doc.Id implicit-assumption point from my previous review: if the projection store is completely unavailable, startup is not blocked; discovery degrades gracefully to the event-stream walk. Losing the compacted entries in that case is an acceptable degradation.

LGTM, ship it.

eanzhao and others added 2 commits April 28, 2026 20:58
CleanupTargetAsync still issued RemoveRelayAsync against the parent
SourceStreamId without exception handling, while every other relay /
read-model / outgoing-stream call in the cleanup pipeline is already
best-effort. A transient stream-provider failure on the incoming-relay
removal would have aborted the destroy + ResetStreamAsync the cleanup
exists to perform.

Wrap the call in CleanupIncomingRelayBestEffortAsync (matches the
existing CleanupOutgoingRelaysBestEffortAsync / CleanupReadModels…
pattern) — log a warning and continue.

Adds StartAsync_ShouldStillDestroyActor_WhenIncomingRelayRemovalThrows,
which injects a throwing parent stream and asserts the projection-scope
actor is still destroyed and its event stream reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two non-blocking review followups.

- Reject non-FQ tokens at the RetiredActorTarget call site. The matcher
  intentionally treats '.' as part of the token (boundary set is
  '[' / ']' / ',' / ' '), so a bare token like "GAgent" would match
  every type ending with that suffix in any namespace. Throw
  ArgumentException at construction so specs cannot ship a foot-gun.
- Distinguish the orphaned-event-stream recovery path from the normal
  retired-type-match path on cleanup logs via a structured
  cleanupReason field (retired-type-match | orphaned-event-stream),
  so dashboards can count the recovery path separately.

Adds RetiredActorTargetTests covering FQ acceptance, bare/empty/
whitespace token rejection, and the boundary-aware proxy-vs-actual
match assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>