Fix retired ChannelRuntime startup cleanup #495
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0776a16a69
```csharp
var runtimeTypeName = await _typeProbe
    .GetRuntimeAgentTypeNameAsync(target.ActorId, ct)
    .ConfigureAwait(false);
if (!MatchesRetiredType(target, runtimeTypeName))
    return;
```
Continue reset path on reruns after partial cleanup
This early return makes the cleanup non-idempotent across crashes: if a pod destroys a retired actor but dies before ResetStreamAsync or lease completion, the next pod will read a now-empty runtime type, hit this return, and permanently skip stream reset for that actor. That leaves stale committed events in the store, so recreated actors can replay old ChannelRuntime history/state even though the migration marker eventually completes.
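One way to keep reruns idempotent is to fall through to the stream reset whenever the runtime type is already gone but committed events remain — a minimal sketch, assuming the `HasEventStreamAsync` helper that also appears later in this thread (the PR's follow-up commits adopt a similar shape):

```csharp
// Sketch: don't bail out just because the runtime type is already empty.
// A crashed predecessor may have destroyed the actor but not reset the stream.
if (!MatchesRetiredType(target, runtimeTypeName))
{
    // Non-empty but non-retired type: the actor is live, leave it alone.
    if (!string.IsNullOrWhiteSpace(runtimeTypeName))
        return;

    // Empty type + leftover committed events => resume the interrupted reset.
    if (!await HasEventStreamAsync(target.ActorId, ct).ConfigureAwait(false))
        return;
}

await _eventStore.ResetStreamAsync(target.ActorId, ct).ConfigureAwait(false);
```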
Codecov Report

❌ Patch coverage is

```
@@           Coverage Diff            @@
##              dev     #495    +/-   ##
==========================================
+ Coverage   71.04%   71.06%   +0.02%
==========================================
  Files        1230     1235       +5
  Lines       89062    89463     +401
  Branches    11654    11705      +51
==========================================
+ Hits        63270    63573     +303
- Misses      21256    21332      +76
- Partials     4536     4558      +22
```
Flags with carried forward coverage won't be shown.

... and 4 files with indirect coverage changes
eanzhao left a comment
Reviewed the cleanup path. I found one blocker for the old-agent cleanup claim and left an inline comment.
| "Aevatar.GAgents.ChannelRuntime.AgentRegistryMaterializationContext", | ||
| ], | ||
| SourceStreamId: "agent-registry-store"), | ||
| ]; |
This target list still cannot clean up old user agents. It covers only the 3 well-known store actors and the 3 projection scope actors, but historical Aevatar.GAgents.ChannelRuntime.SkillRunnerGAgent / WorkflowAgentGAgent actors — generated with skill-runner-* and workflow-agent-* ids — are never enumerated, DestroyAsync'd, or ResetStreamAsync'd, and their durable callbacks are never PurgeActorAsync'd either. The result: after the catalog/read model is wiped, the old agent grains still hold the retired AgentTypeName; when a reminder later fires or the command port is called with an old agent_id, GetAsync returns them as already-initialized actors, and dispatch still lands on the old runtime type that cannot be resolved. You need to grab the agent ids from the old catalog before deleting/resetting it, and run destroy + callback purge + event stream reset on those old agents — or state explicitly that this migration only cleans stores/projections and makes no guarantee about old agents.
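The ordering being asked for can be sketched as follows — `DiscoverCatalogUserAgentIdsAsync` is the helper a later commit actually adds, while `_actorHost` and `_callbacks` are hypothetical stand-ins for the PR's destroy/purge services:

```csharp
// Sketch: harvest the old agent ids BEFORE the catalog is destroyed/reset.
var oldAgentIds = await DiscoverCatalogUserAgentIdsAsync(ct).ConfigureAwait(false);

foreach (var agentId in oldAgentIds)
{
    await _actorHost.DestroyAsync(agentId, ct).ConfigureAwait(false);      // clears AgentTypeName
    await _callbacks.PurgeActorAsync(agentId, ct).ConfigureAwait(false);   // drops durable callbacks
    await _eventStore.ResetStreamAsync(agentId, ct).ConfigureAwait(false); // deletes committed events
}

// Only after the old ids are harvested and cleaned is it safe to
// destroy and reset the catalog/store actors themselves.
```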
Review: Retired ChannelRuntime startup cleanup

Overall direction is correct — probe → match retired type → destroy → reset stream is the right flow. A few issues need addressing before landing:

1. Read model cleanup is unreachable at startup

The second case is the dangerous one —

Recommendation: Either:

2. Read model cleanup has zero test coverage

Both test cases (

Given issue #1 above, this path needs a test that:

3. Garnet
Review Summary

The approach is sound for the startup-blocking problem (projection scopes, registration stores). The probe → destroy → reset → cleanup chain is correct. But there are gaps in coverage.

Issue 1:
| Actor | ID Pattern | In Cleanup? |
|---|---|---|
| ChannelBotRegistrationGAgent | channel-bot-registration-store | ✅ |
| DeviceRegistrationGAgent | device-registration-store | ✅ |
| UserAgentCatalogGAgent | agent-registry-store | ✅ |
| AgentRegistryGAgent | agent-registry-store | ✅ |
| ChannelUserGAgent | channel-user-{platform}-{registrationId}-{senderId} | ❌ |
| SkillRunnerGAgent | dynamic | ❌ |
| WorkflowAgentGAgent | dynamic | ❌ |
ChannelUserGAgent is the most significant — it is a per-user, per-channel actor with potentially many instances in production. Each one has event-sourced state (ChannelUserState) persisted in both Orleans grain storage and the Garnet event store.
These won't cause startup failure (they are not activated at startup), but they will fail when a user interacts with an old channel session, and they leave orphaned data in Redis forever.
Recommendation: Add a Redis SCAN-based discovery step for the channel-user-* prefix, or document that these will be garbage-collected through natural TTL / manual intervention.
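A SCAN-based discovery pass could look roughly like this with StackExchange.Redis, whose `IServer.Keys` pages through SCAN internally — a sketch only, assuming the persisted keys share the channel-user-* prefix named above:

```csharp
using StackExchange.Redis;

// Sketch: enumerate persisted channel-user-* keys without blocking Redis.
// pageSize bounds each SCAN round trip.
static IEnumerable<string> DiscoverChannelUserActorIds(IConnectionMultiplexer mux)
{
    foreach (var endpoint in mux.GetEndPoints())
    {
        var server = mux.GetServer(endpoint);
        if (server.IsReplica)
            continue; // scan primaries only

        foreach (var key in server.Keys(pattern: "channel-user-*", pageSize: 500))
            yield return key.ToString();
    }
}
```

Each discovered id would then feed the same destroy + purge + stream-reset pipeline as the well-known targets.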
Issue 2: The PR description claims "removes retired root/projection actors" — scope is narrower
The PR description says it "removes retired root/projection actors, removes stale relays, clears stale read models, resets retired actor streams." This is accurate for the 6 hardcoded targets. But operators reading the runbook might assume ALL old ChannelRuntime actors are cleaned up, which is not the case.
The runbook should explicitly state that only the 6 well-known projection/registration actors are covered, and that dynamic per-user actors (ChannelUserGAgent) are out of scope.
Issue 3: Orleans grain state key is emptied, not deleted
PurgeAsync() writes zeroed-out state back to Redis rather than deleting the key. This is functionally correct (the AgentTypeName is null, so Orleans will not try to instantiate the old type), but leaves empty stub keys in Redis. Minor concern — worth noting for operators who want a clean Redis.
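A later commit in this thread switches PurgeAsync to Orleans' deletion path; the distinction, as a minimal sketch over an `IPersistentState<RuntimeActorGrainState>`:

```csharp
// Writing a zeroed object keeps the Redis key alive as an empty stub:
_state.State = new RuntimeActorGrainState();
await _state.WriteStateAsync();

// ClearStateAsync instead removes the stored record entirely
// (with the Redis storage provider this typically deletes the key):
await _state.ClearStateAsync();
```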
What works correctly

- `IActorTypeProbe` reads persisted `AgentTypeName` from grain state without needing to instantiate the agent — correct approach
- `DestroyAsync` → `PurgeAsync` clears `AgentTypeName` and writes to storage — prevents future activation failures
- `ResetStreamAsync` deletes all 3 Garnet keys per agent (version, index, data) — complete
- `RemoveRelayAsync` properly cleans up `StreamTopologyGrain` persisted state
- Lease mechanism with stale detection is well-designed for multi-pod safety
- Ordering guarantee: cleanup runs before ChannelRuntime/Device/Scheduled projection startup services (verified by test)
- Read model cleanup uses `ActorId` filter — correctly scoped to the target actor
Verdict
The PR solves the startup-blocking problem — the 6 projection/registration actors that cause pod startup abort will be cleaned up correctly. But it does not cover ChannelUserGAgent and other dynamic actors, which will remain orphaned. Whether this is acceptable depends on whether production has active ChannelUserGAgent instances from before the namespace split.
Review Summary

Does it solve the problem? Yes, but the coverage is limited. This PR only cleans up the 6 hardcoded root/store actors; child actors dynamically created by them (such as the actual channel bot instances) get no cleanup logic at all. If the cluster holds many dynamic actors, they will still trigger type-resolution failures on later Orleans activations.

Key issues

Recommendations
Review: this PR cannot reliably clean up old agents

I carefully traced all the runtime dependencies and found several fundamental problems:

P0: using
Fix Review

The new commit (

What is fixed

Remaining gap:
Follow-up review

All four issues from the previous review are addressed:

New additions look correct:

One minor observation:
Review of fix (commit 3/3)

Quite a lot was fixed; the core logic is clearly improved. Checking point by point against the last review:

P0: rerun after partial cleanup — ✅ fixed

```csharp
if (!matchesRetiredRuntimeType)
{
    if (!string.IsNullOrWhiteSpace(runtimeTypeName))
        return;

    shouldContinueReset = target.ResetWhenRuntimeTypeUnavailable &&
        await HasEventStreamAsync(target.ActorId, ct);

    if (!shouldContinueReset)
        return;
}
```

When

P0: IActorTypeProbe activates the retired grain —
| Issue | Status |
|---|---|
| P0: crash + rerun skips reset | ✅ fixed |
| P0: type probe triggers activation | |
| P1: storage record not deleted | ❌ not fixed (requires changing the core grain) |
| P1: StreamTopologyGrain leftovers | ❌ not fixed |
| Dynamic user agent discovery | ✅ newly added, basically correct |
| ReadModel cleanup fault tolerance | ✅ fixed |

Recommendation: the P0 activation issue at least needs its risk documented in the runbook. The StreamTopologyGrain cleanup could be added in this PR (call _streamProvider.GetStream(target.ActorId).ClearAsync() or an equivalent operation). The rest can be ack'd and merged.
Re-review: post-fix assessment ✅

Core-issue fix status: can it clean up old agents now? Mostly, yes.

Key improvements

Missing tests: recommend adding a dedicated unit test verifying

The current tests contain only positive cases; negative cases are missing.

Conclusion: this fix lifts the PR from "mitigation" to "production-ready". Dynamic actor cleanup + re-entrancy safety + lenient failure handling now cover the main leftover scenarios of the ChannelRuntime retirement. Recommend adding, before merge, the
eanzhao left a comment

Found a scenario where old agents would still be missed by the cleanup; marked inline.
```csharp
private async Task<IReadOnlyList<string>> DiscoverCatalogUserAgentIdsAsync(CancellationToken ct)
{
    var events = await _eventStore
```
There is another blocker here: scanning only the agent-registry-store event stream is not reliable in production.

UserAgentCatalogGAgent is a GAgentBase<UserAgentCatalogState>, and the runtime enables snapshot + compaction by default (EnableSnapshots=true, SnapshotInterval=200, EnableEventCompaction=true, RetainedEventsAfterSnapshot=0). After actor deactivation, the current catalog state is written into RuntimeActorGrainState.AgentStateSnapshot, and compaction then deletes the events at and before the snapshot version.

In that case GetEventsAsync(agent-registry-store) only sees the increments after the snapshot; skill-runner-* / workflow-agent-* entries that are already in the snapshot but have not changed recently will not be discovered. This service then goes on to destroy/reset the catalog, wiping both the snapshot and the remaining stream, so the missed old agent ids can never be recovered or cleaned up.

Before resetting the catalog, you need to also read the catalog snapshot/current state (or a reliable read model) and merge it with the remaining events — ideally with a test that old agents are still discovered after the catalog has been snapshotted and compacted. Otherwise this fix still cannot guarantee old agents get cleaned up.
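The merge being described could be sketched as a union of snapshot-derived ids and whatever events remain — `ReadCatalogSnapshotEntriesAsync` and the `AgentUpserted` event type are hypothetical names for illustration; `GetEventsAsync` is the call quoted above:

```csharp
// Sketch: union snapshot state with post-snapshot events BEFORE the catalog
// is destroyed, so compacted-away entries are still discovered.
var ids = new HashSet<string>(StringComparer.Ordinal);

// 1. Entries already folded into the catalog snapshot (hypothetical reader).
foreach (var entry in await ReadCatalogSnapshotEntriesAsync("agent-registry-store", ct))
    ids.Add(entry.AgentId);

// 2. Incremental upserts committed after the snapshot version.
foreach (var evt in await _eventStore.GetEventsAsync("agent-registry-store", ct))
    if (evt is AgentUpserted upsert) // hypothetical event type
        ids.Add(upsert.AgentId);
```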
Extract the one-off ChannelRuntime cleanup into a generic, spec-driven
mechanism that runs on every pod startup and is intrinsically idempotent.
- IRetiredActorSpec abstraction in Aevatar.Foundation.Abstractions.Maintenance:
modules declare retired type tokens, well-known targets, optional dynamic
discovery, and optional read-model cleanup.
- Generic RetiredActorCleanupHostedService in Aevatar.Foundation.Runtime.Hosting
iterates every registered IRetiredActorSpec under a per-spec lease
(__maintenance:retired-actor-cleanup:{specId}). No "completed forever"
marker — each spec runs every startup and is a no-op when no targets
match.
- ChannelRuntime/Device/Scheduled each ship their own IRetiredActorSpec
alongside their DI extension; the user-agent catalog walker lives with
Scheduled. Adding a new retired type only requires registering a new
spec — Mainnet.Host.Api stays untouched.
- Mainnet.Host.Api wires AddRetiredActorCleanup() in place of the old
module-specific hosted service.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
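The commit text above implies an abstraction roughly like the following — a hypothetical rendering for orientation only; the real IRetiredActorSpec in Aevatar.Foundation.Abstractions.Maintenance may differ in names and shape:

```csharp
// Hypothetical shape inferred from the commit message; member names are guesses.
public interface IRetiredActorSpec
{
    // Lease key suffix: __maintenance:retired-actor-cleanup:{SpecId}
    string SpecId { get; }

    // Fully-qualified type names that mark a persisted actor as retired.
    IReadOnlyList<string> RetiredTypeTokens { get; }

    // Hardcoded store/projection actors probed on every startup.
    IReadOnlyList<RetiredActorTarget> WellKnownTargets { get; }

    // Optional: walk catalogs/read models for generated actor ids.
    IAsyncEnumerable<RetiredActorTarget> DiscoverDynamicTargetsAsync(
        IServiceProvider services, CancellationToken ct);

    // Optional: delete read-model documents scoped to a destroyed actor.
    Task CleanupReadModelsAsync(
        IServiceProvider services, string actorId, CancellationToken ct);
}
```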
Fix Review (
Review of latest commits (generic cleanup)

Checking the issues from the last review one by one:

P1: PurgeAsync storage record — ✅ fixed

```diff
- _state.State.AgentId = string.Empty;
- _state.State.AgentTypeName = null;
- ...
- await _state.WriteStateAsync();
+ _state.State = new RuntimeActorGrainState();
+ await _state.ClearStateAsync();
```

P1: StreamTopologyGrain outgoing relays — ✅ fixed

Newly added:

```csharp
var relays = await stream.ListRelaysAsync(ct);
foreach (var relay in relays)
    await stream.RemoveRelayAsync(relay.TargetStreamId, ct);
```

This cleans up the destroyed actor's outgoing forwarding bindings, wrapped best-effort so a failure doesn't block. The tests verify it too:

Type matching precision — ✅ fixed

New tests

Runbook updated — ✅

An "Out of Scope" section was added that states explicitly

New issues

1. line 215 calls

The test stubs don't activate real grains, so the problem never triggers there. But in a production Orleans environment this is the only target that has to be probed during the discovery phase (every other target's probe happens in

Recommendation: in

2.

```csharp
var typeProbe = new StubActorTypeProbe(new Dictionary<string, string?>
{
    ["channel-bot-registration-store"] = "...GAgentProxy,...",
});
```

The other 5 hardcoded targets plus the catalog are not in typeProbe, so they return null → take the

In other words, this test passes only because, apart from

That's not a bug, but the test's implicit premise is fragile.

3. The marker stream is never cleaned up

Low priority; doesn't affect correctness.

Summary

All 4 P0/P1 issues raised last time are fixed; cleanup can now reliably:

The remaining risk is that the catalog grain probe in the discovery phase can fail in production. Recommend adding a catch + fallback, or confirming

Overall LGTM, OK to merge.
Generalization refactor review

The overall architecture is sound — it evolved from hardcoded cleanup into a pluggable, spec-driven framework. Issues worth attention:

🔴 Must fix

1.

Recommendation: move this helper to

2. Dynamic discovery + crash-reentry race window

Recommendation: in

🟡 Suggested improvements

3. Lease timeout doesn't match the cleanup duration (a heartbeat sketch follows this review)

Recommendation:

4.

Recommendation: provide a higher-level abstraction, such as

5. Missing distributed-scenario tests

Existing tests cover:

But missing:

✅ Design highlights

Conclusion

The generalization direction is correct and the architecture is clean. Fix issue 1 (helper location) and issue 2 (dynamic-discovery reentry safety) first — those are the production risk points. Issues 3 and 4 can be follow-up optimizations.
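For item 3 (lease timeout vs. cleanup duration), one common shape is a renewal heartbeat — a minimal sketch; `ILeaseHandle` and its `RenewAsync` are hypothetical stand-ins for the PR's lease abstraction:

```csharp
// Sketch: keep extending the lease while a long cleanup runs, so a slow pod
// isn't mistaken for a dead one after the fixed 300s window.
static async Task RunWithHeartbeatAsync(
    ILeaseHandle lease, Func<CancellationToken, Task> cleanup, CancellationToken ct)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    var heartbeat = Task.Run(async () =>
    {
        using var timer = new PeriodicTimer(TimeSpan.FromSeconds(60));
        while (await timer.WaitForNextTickAsync(cts.Token))
            await lease.RenewAsync(cts.Token); // hypothetical renewal API
    }, cts.Token);

    try { await cleanup(ct); }
    finally { cts.Cancel(); await Task.WhenAny(heartbeat); }
}
```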
eanzhao left a comment

I reviewed the latest commits after the generalization; there is still one blocker affecting the completeness of old-agent cleanup, marked inline.
```csharp
    IEventStore eventStore,
    CancellationToken ct)
{
    var events = await eventStore
```
The blocker is still here — it just moved from the hosted service into the Scheduled spec: reading only the agent-registry-store event stream cannot reliably discover old generated agents.

UserAgentCatalogGAgent is an event-sourced actor, and the runtime enables snapshot + compaction by default (EnableSnapshots=true, SnapshotInterval=200, EnableEventCompaction=true, RetainedEventsAfterSnapshot=0). The catalog's current entries may already live only in RuntimeActorGrainState.AgentStateSnapshot, with the upsert/tombstone events at and before the snapshot version deleted by compaction. GetEventsAsync(UserAgentCatalogGAgent.WellKnownId) then sees only the increments after the snapshot, so skill-runner-* / workflow-agent-* entries that are in the snapshot but unchanged recently are missed.

Since this spec goes on to destroy/reset agent-registry-store, the missed actor ids disappear along with the catalog snapshot/stream and can never be cleaned automatically afterwards. Before resetting the catalog, read the catalog current state/snapshot and merge it with the remaining events, or switch to a reliable read model / dedicated maintenance reader; also add a test that old generated agents are still discovered after the catalog has been snapshotted and compacted.
```csharp
    [EnumeratorCancellation] CancellationToken ct)
{
    var eventStore = services.GetRequiredService<IEventStore>();
    var agentIds = await DiscoverCatalogUserAgentIdsAsync(eventStore, ct).ConfigureAwait(false);
```
This drops the original trigger condition: DiscoverDynamicTargetsAsync now unconditionally replays agent-registry-store on every startup and then type-probes every skill-runner-* / workflow-agent-* actor in the catalog. For clusters that have already migrated — where the catalog's runtime type is already in the new namespace — this scans the entire catalog history and activates/probes a large number of current actors on every pod startup; the bigger the catalog, the higher the startup cost and noise. The previous implementation only did dynamic discovery when agent-registry-store itself was still a retired type, or in the partial-cleanup scenario where the catalog state was cleared but the stream still existed. Suggestion: put this gate back at the spec level, or have the generic cleanup check the spec's catalog/root target before invoking dynamic discovery.
Review of the generalization refactor — latest commits

Architecture assessment: the generalization direction is correct, evolving from the hardcoded ChannelRuntime cleanup into a pluggable

Fix status of the last review's issues

New observations

1. ✅ Marker semantics improved

2. ✅

```csharp
// Before: manually zero out the fields
_state.State.AgentId = string.Empty;
// ...
await _state.WriteStateAsync();

// After: use ClearStateAsync
_state.State = new RuntimeActorGrainState();
await _state.ClearStateAsync();
```

This is more correct with respect to Orleans persistence semantics.

3. But note:

Recommendation: if cleanup must be guaranteed to run before each module's startup, an explicit synchronization mechanism is needed (such as

Code-level issues

4.

```csharp
private static bool IsTypeNameBoundary(char value) =>
    value is '[' or ']' or ',' or ' ';
```

If a retired type name is a prefix of another type name, this can still over-match. For example: there is no boundary character before V2 (it is

That would be dangerous. Fortunately, today's tokens are all complete class names, not prefixes. Recommendation: in

(A boundary-aware matcher sketch follows at the end of this review.)

5.

```csharp
public override async IAsyncEnumerable<RetiredActorTarget> DiscoverDynamicTargetsAsync(...)
{
    var eventStore = services.GetRequiredService<IEventStore>();
    var agentIds = await DiscoverCatalogUserAgentIdsAsync(eventStore, ct).ConfigureAwait(false);
    // ...
}
```

If

Recommendation: wrap it in try/catch, or in the base class

Conclusion

The generalized architecture is well designed, but the 4 production-environment issues raised last time are all unfixed. Recommendations:

The rest is fine to merge.
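To make the prefix concern in item 4 concrete, a boundary-aware containment check might look like this — a sketch reusing the `IsTypeNameBoundary` helper quoted above, not the PR's actual matcher:

```csharp
// Sketch: a token only matches when both of its edges sit on a type-name
// boundary, so "Foo.Bar" cannot match inside "Foo.BarV2" or "Foo.BarProxy".
private static bool ContainsTypeToken(string runtimeTypeName, string token)
{
    var index = runtimeTypeName.IndexOf(token, StringComparison.Ordinal);
    while (index >= 0)
    {
        var end = index + token.Length;
        var startOk = index == 0 || IsTypeNameBoundary(runtimeTypeName[index - 1]);
        var endOk = end == runtimeTypeName.Length || IsTypeNameBoundary(runtimeTypeName[end]);
        if (startOk && endOk)
            return true;
        index = runtimeTypeName.IndexOf(token, index + 1, StringComparison.Ordinal);
    }
    return false;
}
```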
Review: generalization round

The additions look solid overall: outgoing relay cleanup, boundary-aware type token matching, catalog user agent discovery, and partial-cleanup resilience all make sense. But there are two relay-cleanup issues that can crash pod startup:

1. Incoming relay removal is not best-effort — can crash the pod

In

```csharp
// ⚠️ NO error handling
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}

// ✅ wrapped in try-catch
await CleanupOutgoingRelaysBestEffortAsync(target.ActorId, ct).ConfigureAwait(false);
```

The outgoing relay cleanup is wrapped in try-catch (

Fix: wrap the incoming removal similarly — either extract a

2. Double relay cleanup on projection scope targets

Root actors (e.g.

Then when the projection scope target runs,

Fix: either make incoming relay removal best-effort, or skip

3. Minor:
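The symmetric wrapper the review asks for is straightforward — a sketch mirroring the existing outgoing-path pattern (a follow-up commit in this thread lands essentially this as CleanupIncomingRelayBestEffortAsync):

```csharp
private async Task CleanupIncomingRelayBestEffortAsync(RetiredActorTarget target, CancellationToken ct)
{
    if (string.IsNullOrWhiteSpace(target.SourceStreamId))
        return;

    try
    {
        await _streamProvider
            .GetStream(target.SourceStreamId)
            .RemoveRelayAsync(target.ActorId, ct)
            .ConfigureAwait(false);
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        // The relay may already be gone (e.g. removed by the root actor's own
        // cleanup); log and keep going so destroy + stream reset still run.
        _logger.LogWarning(ex, "Incoming relay removal failed for {ActorId}; continuing cleanup.",
            target.ActorId);
    }
}
```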
Final Review: assessment of the completed generalization

After multiple review rounds, the core functionality of this PR has matured. Final assessment:

✅ Resolved issues (production-ready)

| Issue | Priority | Notes |
|---|---|---|
| `RetiredActorReadModelHelpers` cross-module dependency | P2 | Lives in the Channel.Runtime namespace but is referenced by Device/Scheduled |
| Dynamic discovery has no cache | P2 | Once the catalog stream is reset, a re-entering pod cannot discover dynamic targets |
| Lease has no heartbeat | P3 | 300s timeout; with many targets another pod may steal the lease |
| `IEventStore` used directly in specs | P3 | Violates layering, but acceptable for a cleanup scenario |
| `IHostedService` parallel startup | P3 | ASP.NET Core parallel startup; cleanup not guaranteed to precede module startup |
📋 Test coverage

- ✅ `RetiredActorCleanupHostedServiceTests` (551 lines) covers:
  - retired actor cleanup (relay + stream reset)
  - current-type skip
  - boundary matching (`GAgentProxy` does not match)
  - continued stream cleanup for already-destroyed actors
  - catalog dynamic discovery (skill-runner/workflow-agent)
  - read-model deletion and fault tolerance
  - multi-spec concurrent execution
- ✅ `MainnetHostCompositionTests` verifies the cleanup hosted service registration order
- ✅ `GarnetEventStoreIntegrationTests` verifies `ResetStreamAsync`
- ✅ `OrleansDistributedCoverageTests` verifies `ClearStateAsync`

🎯 Conclusion

Approve — this PR solves the startup-blocking problem caused by the ChannelRuntime retirement, and the generalized architecture provides an extensible framework for future retirement cleanups.

Recommend handling the P2/P3 optimizations in follow-up iterations after merging, in particular:

- move `RetiredActorReadModelHelpers` to the Foundation layer
- add a caching mechanism for dynamic discovery results (or document the limitation)
- Restore the catalog discovery gate that the refactor accidentally dropped: ScheduledRetiredActorSpec.DiscoverDynamicTargetsAsync now runs only when the catalog itself still looks retired (matches a retired runtime-type token, or runtime type unavailable but stream still has events). Warm clusters no longer replay agent-registry-store on every startup.
- Read generated agent ids from UserAgentCatalogDocument first, then merge catalog upsert events. The previous event-only walk dropped entries that had been swallowed by snapshot + compaction; the read-model survives that case.
- Hoist the type-token matching logic onto RetiredActorTarget so the hosted service and spec gate share one boundary-aware matcher.
- Tests: catalog walk skipped when runtime type is current; ids discovered from the read model when the catalog event stream has been compacted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```csharp
string? cursor = null;
do
{
    var result = await reader.QueryAsync(
```
This read-model supplementary discovery should be best-effort. Right now a non-cancellation exception thrown by reader.QueryAsync(...) bubbles straight from DiscoverDynamicTargetsAsync up to RunSpecAsync, and the hosted service's outer layer only swallows OperationCanceledException. So if the projection store is temporarily unavailable, the query is unsupported, or mapping throws, the startup cleanup itself takes the pod down — and neither the event-stream fallback nor the static target cleanup that follows will run. This read-model path exists to plug the snapshot+compaction hole; catch non-cancellation exceptions, log a warning, and continue with DiscoverFromCatalogEventsAsync. That way clusters without compaction can still be cleaned, and the compaction scenario merely degrades discovery rather than blocking startup.
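Concretely, the degradation could look like this — a sketch; `DiscoverFromReadModelAsync` / `DiscoverFromCatalogEventsAsync` are the two discovery paths quoted elsewhere in this thread, and the follow-up commit implements essentially this shape:

```csharp
var agentIds = new HashSet<string>(StringComparer.Ordinal);

// Best-effort: a flaky projection store must not take down pod startup.
try
{
    await foreach (var actorId in DiscoverFromReadModelAsync(services, ct).ConfigureAwait(false))
        agentIds.Add(actorId);
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
    logger.LogWarning(ex, "Read-model discovery failed; falling back to catalog events only.");
}

// Always runs: covers entries the read model missed (or all of them on failure).
foreach (var actorId in await DiscoverFromCatalogEventsAsync(eventStore, ct).ConfigureAwait(false))
    agentIds.Add(actorId);
```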
Fix Review: Address spec-driven cleanup review comments ✅

Commit

✅ Fixed issues

1. The catalog discovery gate is restored

This avoids warm clusters replaying

2. Dual discovery path (read model + event stream)

```csharp
// Read from the read model first (survives snapshot+compaction)
await foreach (var actorId in DiscoverFromReadModelAsync(services, ct))
    agentIds.Add(actorId);

// Then merge the not-yet-projected event stream entries
foreach (var actorId in await DiscoverFromCatalogEventsAsync(eventStore, ct))
    agentIds.Add(actorId);
```

This solves the earlier problem of the event-only walk losing entries after compaction.

3. Type-matching logic hoisted onto RetiredActorTarget

✅ New test coverage
| Issue | Priority | Status |
|---|---|---|
| `RetiredActorReadModelHelpers` cross-module dependency | P2 | Not fixed (still in the Channel.Runtime namespace) |
| Lease has no heartbeat (300s timeout) | P3 | Not fixed |
| `IHostedService` parallel startup | P3 | Not fixed |

Conclusion

This fix resolves the regression introduced by the generalization refactor (the catalog discovery gate was accidentally removed) and the data loss in the compaction scenario. The dual discovery path is well designed and the test coverage is solid.

LGTM, good to merge. Recommend moving RetiredActorReadModelHelpers to the Foundation layer in a follow-up iteration.
Review: Spec-driven refactor + latest fixes

This refactor turns the hardcoded cleanup into a generic

Architecture design — ✅ excellent

All points from previous reviews — ✅ carried over intact

New highlights

Issues to note

1. Every startup runs every spec — an implicit performance assumption

The marker went from

For 3 specs × 6 hardcoded targets + the catalog gate probe, that is ~7 grain activations per startup. Acceptable. But if specs multiply or target lists grow, this cost scales linearly.

Suggestion: this is not a blocking issue, but the runbook could record the expected startup cost to make future startup-latency investigations easier.

2.

```csharp
// ScheduledRetiredActorSpec line ~175
foreach (var doc in result.Items)
{
    if (IsGeneratedUserAgent(doc.Id, doc.AgentType))
        yield return doc.Id.Trim();
}
```

This queries

Suggestion: confirm

3. This static helper is used by

If

Suggestion: move

Summary

This is a very good refactor. The architectural generalization is done right, and test coverage is comprehensive (9 test cases including compaction fallback, substring protection, and multi-spec). None of the 3 notes above is a blocking issue; they can be handled in a follow-up. LGTM.
Review: spec-driven refactor

The spec-driven architecture is the right direction. One remaining issue:

Incoming relay removal still not best-effort

In

```csharp
// ❌ no error handling
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}

// ✅ has error handling
await CleanupOutgoingRelaysBestEffortAsync(spec, target.ActorId, ct).ConfigureAwait(false);
```

The outgoing path is wrapped in try-catch; the incoming path is not. When a root actor is cleaned first (removing its outgoing relay to the projection scope via

Fix: wrap the incoming removal in try-catch, matching the outgoing pattern, or skip it when

Minor: shared helper location
Final Review: Current State Assessment

Latest commit on the PR:

Checklist

Remaining risks (non-blocking)

Verdict

Approve — core functionality is complete, the architecture is extensible, and test coverage is solid. Risks 1 and 2 above are extreme edge cases; a follow-up PR could address them by

The current PR is safe to merge.
The previous fix routed ScheduledRetiredActorSpec discovery through the UserAgentCatalogDocument read model first to survive snapshot+compaction, but a non-cancellation exception from QueryAsync would propagate to the hosted service's StartAsync and abort pod startup — also skipping the event-stream fallback and every static target cleanup that follows.

Wrap the read-model probe in a try/catch (excluding cancellation), log a warning, and fall through to the catalog-event walk. Read-model remains the fix for compacted catalogs; transient projection-store failures merely degrade the discovery instead of blocking the host.

Adds a test that injects a throwing IProjectionDocumentReader and asserts the static and event-stream-derived targets are still cleaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review: latest commit (7997680)

The

The incoming relay removal issue in

```csharp
// Still no try-catch
if (!string.IsNullOrWhiteSpace(target.SourceStreamId))
{
    await _streamProvider
        .GetStream(target.SourceStreamId)
        .RemoveRelayAsync(target.ActorId, ct)
        .ConfigureAwait(false);
}
```

This is the only remaining non-best-effort destructive path in the cleanup flow. If it's intentional (the
Review: best-effort read-model discovery

The change is small and correct:

This addresses exactly what I raised in my last review.

LGTM, ship it.
CleanupTargetAsync still issued RemoveRelayAsync against the parent SourceStreamId without exception handling, while every other relay / read-model / outgoing-stream call in the cleanup pipeline is already best-effort. A transient stream-provider failure on the incoming-relay removal would have aborted the destroy + ResetStreamAsync the cleanup exists to perform.

Wrap the call in CleanupIncomingRelayBestEffortAsync (matches the existing CleanupOutgoingRelaysBestEffortAsync / CleanupReadModels… pattern) — log a warning and continue.

Adds StartAsync_ShouldStillDestroyActor_WhenIncomingRelayRemovalThrows, which injects a throwing parent stream and asserts the projection-scope actor is still destroyed and its event stream reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two non-blocking review followups.

- Reject non-FQ tokens at the RetiredActorTarget call site. The matcher intentionally treats '.' as part of the token (boundary set is '[' / ']' / ',' / ' '), so a bare token like "GAgent" would match every type ending with that suffix in any namespace. Throw ArgumentException at construction so specs cannot ship a foot-gun.
- Distinguish the orphaned-event-stream recovery path from the normal retired-type-match path on cleanup logs via a structured cleanupReason field (retired-type-match | orphaned-event-stream), so dashboards can count the recovery path separately.

Adds RetiredActorTargetTests covering FQ acceptance, bare/empty/whitespace token rejection, and the boundary-aware proxy-vs-actual match assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem

Persisted runtime actors from the retired `Aevatar.GAgents.ChannelRuntime` assembly can be activated during pod startup before the new registration projection startup path rebuilds them. Orleans then fails to resolve old agent/materialization-context type names and the host can abort startup.

Solution

`RetiredChannelRuntimeActorCleanupHostedService` in Mainnet host composition. It takes a one-time event-store marker lease, probes persisted actor runtime types, removes retired root/projection actors, removes stale relays, clears stale read models, resets retired actor streams, and writes a completion marker.

Impact Paths

- `src/Aevatar.Mainnet.Host.Api/Hosting/Migration/*`
- `src/Aevatar.Foundation.*/*EventStore*`
- `src/Aevatar.Mainnet.Host.Api/Hosting/MainnetHostBuilderExtensions.cs`
- `docs/operations/2026-04-28-retired-channelruntime-cleanup-runbook.md`

Validation

- `dotnet test test/Aevatar.Foundation.Core.Tests/Aevatar.Foundation.Core.Tests.csproj --nologo --filter "ResetStreamAsync"`
- `dotnet test test/Aevatar.Hosting.Tests/Aevatar.Hosting.Tests.csproj --nologo --filter "RetiredChannelRuntimeActorCleanupHostedServiceTests|AddAevatarMainnetHost_ShouldRunRetiredChannelRuntimeCleanup"`
- `bash tools/ci/test_stability_guards.sh`
- `bash tools/ci/query_projection_priming_guard.sh`