Reset Orleans pub/sub rendezvous state during retired-actor cleanup #501
Conversation
Stale entries left in Orleans's PubSubRendezvousGrain Redis state survive GAgent + event-stream cleanup. The next silo wave then fails RegisterAsStreamProducer with InconsistentStateException, blocking the projection pipeline that depends on that stream — the bug behind the Lark bot relay 401 (the ChannelBotRegistrationDocument projection never materialized → canonical scope id unresolvable).

Adds an IStreamPubSubMaintenance abstraction with an Orleans+Redis implementation that deletes the rendezvous Redis key. The retired-actor cleanup invokes it after destroy + event-stream reset for every cleaned actor, and ProjectionScopeActorRuntime now self-heals on type mismatch (destroy → reset pub/sub → recreate) instead of throwing. Also extends ScheduledRetiredActorSpec to target the new user-agent-catalog-read-model scope key so mid-migration deploys auto-recover.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```csharp
        typeof(TScopeAgent).FullName);

await _runtime.DestroyAsync(actorId, ct).ConfigureAwait(false);
if (_streamPubSubMaintenance != null)
```
Blocking concern: this reset only runs after an existing actor is found with a stale runtime type; the missing-actor branch above still creates directly without clearing pub/sub. That leaves the exact already-broken state from the PR description unrecovered: an earlier cleanup can have destroyed the actor and reset the event stream, but left the PubSubRendezvousGrain Redis key behind. On the next deploy, RetiredActorCleanupHostedService returns before calling ResetActorStreamPubSubAsync because runtimeTypeName is empty and HasEventStreamAsync is false, then EnsureExistsAsync sees the actor as missing and calls CreateAsync directly. That first CreateAsync can still hit the stale RegisterAsStreamProducer ETag conflict, so a restart after this PR may not recover production without manual Redis cleanup.
Please cover the pub/sub-only orphan path as well, e.g. reset pub/sub before CreateAsync when the projection scope actor is missing, or make retired cleanup explicitly reset pub/sub for declared targets even when type + event stream are both gone. Add a regression test for runtime type unavailable + event stream version 0 + stale pub/sub state.
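The orphan path the reviewer describes can be sketched as follows; this is an illustrative model only (the actor-id, key names, and helper signatures are assumptions, not the repo's API). It shows why resetting pub/sub before the first `CreateAsync` recovers an actor whose state and stream were already cleaned but whose rendezvous key was left behind:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simulated Redis state: the actor + event stream are gone, but a stale
// rendezvous key survives from an earlier cleanup.
var staleRendezvousKeys = new HashSet<string> { "pubsub/scope-a" };
var log = new List<string>();

Task ResetPubSubAsync(string actorId)
{
    // Best-effort delete of the rendezvous key; a no-op when already absent.
    staleRendezvousKeys.Remove("pubsub/" + actorId);
    log.Add("pubsub-reset:" + actorId);
    return Task.CompletedTask;
}

Task CreateAsync(string actorId)
{
    // Models RegisterAsStreamProducer rejecting a create on a stale etag.
    if (staleRendezvousKeys.Contains("pubsub/" + actorId))
        throw new InvalidOperationException("stale rendezvous etag");
    log.Add("create:" + actorId);
    return Task.CompletedTask;
}

async Task EnsureExistsAsync(string actorId, bool actorExists)
{
    if (!actorExists)
    {
        // Cover the pub/sub-only orphan: reset before the first create.
        await ResetPubSubAsync(actorId);
        await CreateAsync(actorId);
    }
}

// Actor + event stream already cleaned, but the Redis key was left behind.
await EnsureExistsAsync("scope-a", actorExists: false);
Console.WriteLine(string.Join(", ", log));
// -> pubsub-reset:scope-a, create:scope-a
```

Without the reset-before-create, the same scenario throws on the first `CreateAsync`, which is the unrecovered state described above.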
LGTM ✅ Code review results:

Verification:

```text
Build succeeded.
/Users/chronoai/.paseo/worktrees/32qtkk8z/feeble-chipmunk/test/Aevatar.GAgentService.Tests/Aevatar.GAgentService.Tests.csproj : warning NU1902: Package 'OpenTelemetry.Api' 1.15.0 has a known moderate severity vulnerability, GHSA-g94r-2vxg-569j [/Users/chronoai/.paseo/worktrees/32qtkk8z/feeble-chipmunk/aevatar.slnx]
Time Elapsed 00:00:17.71
```

✅ 0 errors

Recommend merging; this resolves the Orleans PubSubRendezvousGrain state-leak issue in production.
eanzhao left a comment
Review Summary
Root cause analysis is correct: RetiredActorCleanupHostedService was destroying the actor + resetting event streams but leaving behind Orleans PubSubRendezvousGrain Redis state with a stale etag, which blocks subsequent RegisterAsStreamProducer with InconsistentStateException. The fix is cleanly structured and well-tested.
What works well
- `IStreamPubSubMaintenance` abstraction — clean separation between the cleanup concern and the specific backend (Orleans+Redis). The optional registration (`TryAddSingleton` only under Garnet) means in-memory/mock builds won't break.
- Two-pronged fix — (a) `RetiredActorCleanupHostedService` resets pub/sub state after destroy/stream reset for all cleaned actors, and (b) `ProjectionScopeActorRuntime.EnsureExistsAsync` self-heals stale-type actors by destroy → pub/sub reset → recreate instead of throwing `InvalidOperationException`. Both paths are needed for full coverage.
- Redis key format pinned via reflection test — `OrleansRedisStreamPubSubMaintenanceTests.ComputeOrleansDefaultStorageKey` invokes `RedisGrainStorage.DefaultGetStorageKey` via reflection. This ensures any future Orleans upgrade that changes the key format surfaces as a test failure rather than a silent state leak.
- Mid-migration `ScheduledRetiredActorSpec` entry — the new spec target at the `DurableProjectionKind` scope key with old `ChannelRuntime` type markers handles the exact scenario where a mid-migration deploy orphaned an actor with the new key but the old type binding. Well commented.
- Graceful degradation — `IStreamPubSubMaintenance` is optional everywhere. `CleanupStreamPubSubBestEffortAsync` catches all non-cancellation exceptions. Self-heal in `EnsureExistsAsync` falls through to destroy + recreate even without pub/sub maintenance.
- Test coverage is thorough — 10 new tests across 3 test projects covering: type-mismatch self-heal, pub/sub reset integration in the cleanup service, pub/sub reset failure tolerance, Redis key format verification, and Redis connectivity failure handling.
Verification
- Build: 0 errors
- `Aevatar.CQRS.Projection.Core.Tests`: 111 pass, 0 fail
- `Aevatar.Foundation.Runtime.Hosting.Tests`: 169 pass, 0 fail
- `Aevatar.Hosting.Tests`: 49 pass, 0 fail
- Architecture guards: all pass
The approach should resolve the cascading failure chain (stale pub/sub etag → projection pipeline blocked → read model missing → relay 401 → Lark bot reply undelivered).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b79fa9bac8
```csharp
await _runtime.DestroyAsync(actorId, ct).ConfigureAwait(false);
if (_streamPubSubMaintenance != null)
    await _streamPubSubMaintenance.ResetActorStreamPubSubAsync(actorId, ct).ConfigureAwait(false);

_ = await _runtime.CreateAsync<TScopeAgent>(actorId, ct).ConfigureAwait(false);
```
Reset scope stream before recreating mismatched actor
EnsureExistsAsync destroys and recreates a mismatched projection scope actor but never clears that actor’s durable event stream. In this codebase, DestroyAsync only purges actor state/cache and does not call IEventStoreMaintenance.ResetStreamAsync, so any queued events from the previous type incarnation can be replayed into the newly created scope and re-trigger startup failures during type-migration recovery. The retired-actor cleanup path already performs stream reset explicitly; this self-heal path should do the same before CreateAsync.
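A minimal sketch of the ordering this comment suggests (stream reset between destroy and recreate), using stand-in delegates; all helper names here are assumptions for illustration, not the repo's API:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var log = new List<string>();

// Stand-ins that record the order in which the real operations would run.
Task DestroyAsync(string id)     { log.Add("destroy:" + id);      return Task.CompletedTask; }
Task ResetStreamAsync(string id) { log.Add("stream-reset:" + id); return Task.CompletedTask; }
Task ResetPubSubAsync(string id) { log.Add("pubsub-reset:" + id); return Task.CompletedTask; }
Task CreateAsync(string id)      { log.Add("create:" + id);       return Task.CompletedTask; }

async Task SelfHealAsync(string actorId)
{
    await DestroyAsync(actorId);
    // Clear the durable event stream so queued events from the previous type
    // incarnation are not replayed into the recreated scope.
    await ResetStreamAsync(actorId);
    await ResetPubSubAsync(actorId);
    await CreateAsync(actorId);
}

await SelfHealAsync("scope-a");
Console.WriteLine(string.Join(" -> ", log));
// -> destroy:scope-a -> stream-reset:scope-a -> pubsub-reset:scope-a -> create:scope-a
```

This mirrors the sequence the retired-actor cleanup path already performs, applied to the self-heal branch.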
eanzhao left a comment
Overall
LGTM. This is a well-scoped fix for a real production issue. The IStreamPubSubMaintenance abstraction keeps the infra-specific key-formatting isolated, and the self-heal path in ProjectionScopeActorRuntime correctly orders destroy -> pub/sub reset -> recreate.
Minor points
- Key-format coupling: the `pubsubrendezvousgrain` type string and the `{grainId}/{serviceId}` Redis key format are tightly coupled to Orleans 10.0.x internals. The reflection-based test guard is good, but consider adding an inline comment in `OrleansRedisStreamPubSubMaintenance` warning that upgrading Orleans requires re-verifying this format.
- `ProjectionScopeActorRuntime` self-heal timing: in `EnsureExistsAsync`, after `DestroyAsync` and before `CreateAsync`, there is no delay or retry. If `ResetActorStreamPubSubAsync` races with a concurrent silo trying to `RegisterAsStreamProducer` for the same actor, the concurrent silo could re-create the stale rendezvous entry before our `CreateAsync`. This is probably acceptable because (a) the scope actor ID is unique per silo and (b) the cleanup is single-writer for a given scope key, but worth keeping in mind.
- Logging: the `LogWarning` in `CleanupStreamPubSubBestEffortAsync` when pub/sub reset fails is appropriate, but you might also want to log when `_streamPubSubMaintenance == null` and the backend is actually Redis (to catch a missing DI registration). Not blocking.
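For orientation, a sketch of the key shape the reflection test pins. The `{grainId}/{serviceId}` format comes from this PR's description; the composed grain-id string below is a made-up example, not Orleans's actual implementation:

```csharp
using System;

// Hypothetical helper mirroring the "{grainId}/{serviceId}" shape the PR pins.
// Both arguments here are illustrative values, not real production ids.
static string ComputeStorageKey(string grainId, string serviceId)
    => $"{grainId}/{serviceId}";

var key = ComputeStorageKey(
    "pubsubrendezvousgrain/projection.durable.scope:user-agent-catalog-read-model:agent-registry-store",
    "example-service-id");
Console.WriteLine(key);
```

Because this shape is an Orleans-internal detail, any key computed by hand like this must stay in lockstep with `RedisGrainStorage.DefaultGetStorageKey`, which is exactly what the reflection-based test enforces.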
Verification
- Tests cover happy path, failure path, mid-migration scope keys, and reflection-pinned key format.
- Architecture guards passing is good.
Approve with minor comments.
eanzhao left a comment
Review Summary
Overall this is a well-targeted fix for the production InconsistentStateException issue. The root cause analysis is solid, the self-heal approach is the right call vs the original fail-fast, and test coverage is thorough.
I have two concerns worth addressing before merge:
1. Self-heal path in ProjectionScopeActorRuntime is not defensive against pub/sub reset failures
File: src/Aevatar.CQRS.Projection.Core/Orchestration/ProjectionScopeActorRuntime.cs
```csharp
await _runtime.DestroyAsync(actorId, ct).ConfigureAwait(false);
if (_streamPubSubMaintenance != null)
    await _streamPubSubMaintenance.ResetActorStreamPubSubAsync(actorId, ct).ConfigureAwait(false);

_ = await _runtime.CreateAsync<TScopeAgent>(actorId, ct).ConfigureAwait(false);
```

The current OrleansRedisStreamPubSubMaintenance implementation catches all exceptions internally (returns false), so this is safe today. However:
- The interface contract `IStreamPubSubMaintenance` does not guarantee non-throwing behavior — a future implementation could throw.
- If `ResetActorStreamPubSubAsync` throws here, the actor is destroyed but not recreated — strictly worse than the pre-PR state (type mismatch, but the actor still existed).
- Meanwhile, `RetiredActorCleanupHostedService.CleanupStreamPubSubBestEffortAsync` already wraps the same call in try-catch for exactly this reason. The self-heal path should follow the same pattern.
Suggested fix:
```csharp
await _runtime.DestroyAsync(actorId, ct).ConfigureAwait(false);
if (_streamPubSubMaintenance != null)
{
    try
    {
        await _streamPubSubMaintenance
            .ResetActorStreamPubSubAsync(actorId, ct)
            .ConfigureAwait(false);
    }
    catch (OperationCanceledException) when (ct.IsCancellationRequested) { throw; }
    catch (Exception ex)
    {
        _logger.LogWarning(ex,
            "Pub/sub state reset failed during self-heal for {ActorId}; proceeding with recreate.",
            actorId);
    }
}

_ = await _runtime.CreateAsync<TScopeAgent>(actorId, ct).ConfigureAwait(false);
```

This ensures the recreate always runs regardless of the pub/sub reset outcome, matching the best-effort semantics already established in RetiredActorCleanupHostedService.
2. Test EnsureExistsAsync_ShouldDestroyResetAndRecreate_WhenActorTypeIsStale doesn't verify operation ordering including pub/sub
File: test/Aevatar.CQRS.Projection.Core.Tests/ProjectionScopeActorRuntimeTests.cs:93
The test asserts:
```csharp
runtime.OperationLog.Should().Equal("destroy:" + actorId, "create:" + actorId);
pubSub.ResetActorIds.Should().Equal(actorId);
```

These are two independent assertions. The OperationLog only tracks runtime operations (destroy/create), not pub/sub. So the test proves:
- destroy happened
- pub/sub reset happened
- create happened
- destroy before create
But it does not prove pub/sub reset happened between destroy and create. A bug that reordered the sequence to (pub/sub reset → destroy → create) or (destroy → create → pub/sub reset) would still pass this test.
Since the PR description specifically calls out ordering as critical ("Otherwise the recreated actor's RegisterAsStreamProducer would still see the stale etag from the previous incarnation"), the test should capture all three operations in a single ordered log.
Suggested fix: Capture pub/sub calls in the OperationLog (or use a unified List<string> across all mock objects) so you can assert:
```csharp
operationLog.Should().Equal(
    "destroy:" + actorId,
    "pubsub-reset:" + actorId,
    "create:" + actorId);
```

Minor observations (non-blocking)
- The `InternalsVisibleTo` additions for test projects are fine for now but should be tracked if the internal surface grows.
- The reflection-based key format validation in `OrleansRedisStreamPubSubMaintenanceTests` is a great safety net for Orleans upgrades.
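The single-shared-log pattern suggested for the ordering test can be sketched like this; the fakes and the actor id are stand-ins for illustration, not the repo's test doubles:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var operationLog = new List<string>();   // ONE list shared by all fakes
var actorId = "projection.scope:demo";

// Runtime fake and pub/sub fake append to the same log, so the ordering
// between them becomes observable in a single assertion.
Task DestroyAsync(string id)     { operationLog.Add("destroy:" + id);      return Task.CompletedTask; }
Task ResetPubSubAsync(string id) { operationLog.Add("pubsub-reset:" + id); return Task.CompletedTask; }
Task CreateAsync(string id)      { operationLog.Add("create:" + id);       return Task.CompletedTask; }

// The self-heal sequence under test.
await DestroyAsync(actorId);
await ResetPubSubAsync(actorId);
await CreateAsync(actorId);

// A single ordered check now catches any reordering of the three steps,
// which two independent per-fake lists cannot.
Console.WriteLine(string.Join(", ", operationLog));
```

With two separate lists, a sequence like pub/sub-reset → destroy → create produces the same per-list contents and passes; the shared log makes that reordering fail.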
Self-heal in ProjectionScopeActorRuntime.EnsureExistsAsync now wraps the pub/sub reset call in try-catch, matching the best-effort policy in RetiredActorCleanupHostedService. A future IStreamPubSubMaintenance impl that throws would otherwise leave the actor destroyed but not recreated; strictly worse than the pre-self-heal type-mismatch state, where the actor at least existed.

The self-heal test now uses a single shared operation log across the runtime and pub/sub fakes, so it asserts the full destroy → pubsub-reset → create sequence. The previous two-list approach could miss a reordering bug (pub/sub reset happening before destroy or after create) since each assertion ran in isolation. Adds a sibling test covering the new defensive path: pub/sub reset throws, actor still recreates.

Also strengthens the Orleans-coupling comment on OrleansRedisStreamPubSubMaintenance.ResetActorStreamPubSubAsync to call out the reflection-based test guard so future Orleans bumps know where to re-verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff            @@
##               dev     #501    +/-  ##
=========================================
+ Coverage    71.51%   71.55%   +0.03%
=========================================
  Files         1235     1236       +1
  Lines        89467    89569     +102
  Branches     11705    11712       +7
=========================================
+ Hits         63985    64088     +103
+ Misses       20898    20897       -1
  Partials      4584     4584
```
... and 3 files with indirect coverage changes
Summary
- `IStreamPubSubMaintenance` abstraction + Orleans+Redis implementation that deletes the `PubSubRendezvousGrain` Redis key after retired-actor destroy. Pinned to Orleans 10.0.x's actual `RedisGrainStorage.DefaultGetStorageKey` format via reflection in tests.
- `RetiredActorCleanupHostedService` now invokes pub/sub reset after `DestroyAsync` + event-stream reset for every cleaned actor (best-effort; tolerant of missing impl + transient failures).
- `ProjectionScopeActorRuntime.EnsureExistsAsync` self-heals on type mismatch — destroy → pub/sub reset → recreate — instead of throwing an `InvalidOperationException` that blocks projection startup forever.
- `ScheduledRetiredActorSpec.Targets` adds the new `user-agent-catalog-read-model` scope key so mid-migration deploys with the old `ChannelRuntime.UserAgentCatalogMaterializationContext` type marker still bound auto-recover.

Why
Production silo logs (`aevatar-console-backend-6595d78756-2wm8z`) show:

Followed by:
Cause chain:
1. `RetiredActorCleanupHostedService` cleared `channel-bot-registration-store` GAgent state + event store, but Orleans's `PubSubRendezvousGrain` (separate `PubSubStore` Redis key) still holds the previous silo's etag.
2. `RegisterAsStreamProducer` → `InconsistentStateException`.
3. `ChannelBotRegistrationDocument` never materializes.
4. `NyxIdRelayScopeResolver.ListByNyxAgentApiKeyIdAsync` returns empty → relay 401 → Lark bot reply never delivered.
5. The `UserAgentCatalogStartupService` 5x failure with `Actor 'projection.durable.scope:user-agent-catalog-read-model:agent-registry-store' is not a 'ProjectionMaterializationScopeGAgent<Scheduled.UserAgentCatalogMaterializationContext>' projection scope actor` is the same class of bug at the new scope key.

Test plan
- `dotnet build aevatar.slnx --nologo` — succeeds, 0 errors
- `dotnet test test/Aevatar.Foundation.Runtime.Hosting.Tests/...` — 117 tests pass (incl. 3 new `OrleansRedisStreamPubSubMaintenanceTests`)
- `dotnet test test/Aevatar.Hosting.Tests/...` — 49 tests pass (incl. 3 new cleanup tests covering pub/sub reset, throw-tolerance, mid-migration scope key)
- `dotnet test test/Aevatar.CQRS.Projection.Core.Tests/...` — 85 tests pass (incl. 4 new self-heal tests)
- `dotnet test test/Aevatar.Architecture.Tests/...` — 105 tests pass
- `bash tools/ci/architecture_guards.sh` — all guards pass through workflow_binding_boundary_guard.
- `playground_asset_drift_guard.sh` fails locally with stale demos/wwwroot assets vs cli playground build — pre-existing on this machine, unrelated to this PR (no frontend files touched).
- Redis key format (`{grainId}/{serviceId}`) verified by invoking Orleans 10.0.1's actual `RedisGrainStorage.DefaultGetStorageKey` via reflection — the test pins the format so future Orleans updates surface as a test failure rather than a silent stale-state leak.

Rollout
After merge + deploy, restart the affected silo. `RetiredActorCleanupHostedService` will sweep `projection.durable.scope:user-agent-catalog-read-model:agent-registry-store` and any orphaned pub/sub state during startup; subsequent `RegisterAsStreamProducer` calls write from a clean slate. The Lark bot reply path recovers on the next webhook. No manual Redis surgery required.

🤖 Generated with Claude Code