fix(decopilot): cross-pod recovery actually fires on pod death#3393
fix(decopilot): cross-pod recovery actually fires on pod death#3393viktormarinho wants to merge 3 commits into
Conversation
Three coordinated changes make the previously-skipped pod-death-dbos-replay scenario pass end-to-end. **1. Heartbeat watcher gains a poller for hard kills.** NATS JetStream KV's TTL-based key expiry is a server-side stream cleanup pass — it does NOT emit a DEL/PURGE op on the watcher. So `kv.watch()` alone only catches *graceful* shutdowns (explicit `kv.delete()`). A SIGKILL'd pod's heartbeat key vanishes silently from the bucket with no notification, and `runRegistry.handlePodDeath` never fires. Added a parallel poller that lists `kv.keys()` every 10s and treats anything that drops out of the snapshot as a death. The watcher still handles the fast graceful-shutdown path; the poller is the safety net for hard kills. Detection latency: ~45s (TTL) + up to 10s (poll tick) ≈ 55s worst case. A Postgres advisory lock would be sub-second, but that's a deeper refactor and would deserve its own PR. **2. Heartbeat-recovered runs now reach JetStream.** `resumeOrphanedThread` in app.ts was deliberately dropping `streamBuffer` when calling `dispatchRunAndWait`, on the assumption that DBOS workflow replay would pump chunks separately. Turns out DBOS doesn't replay cross-pod (no built-in liveness scan, single shared executor_id), so the resumed run streamed to /dev/null and any /attach tails on survivor pods saw nothing. Pass `streamBuffer` so chunks land back on the per-thread subject. **3. `prepareRun` skips JetStream purge on resume.** `streamBuffer.purge()` ran unconditionally at the start of every dispatch, including the resume path. That wiped the prefix chunks the dead owner had already pumped, which survivor watchers were mid-consumption of. Gate on `!input.isResume`. Plus minor cleanups: - `PodHeartbeat.isReady()` so callers can distinguish "init resolved" from "init was a no-op because NATS wasn't ready yet". - `[PodHeartbeat] Started` and `[RunRegistry] handlePodDeath` log lines for production observability. - Compose: share one studio image across mesh-1/2/3 via a single `image: multi-pod-studio:latest` tag instead of three separate per-service builds. Eliminates `docker tag` workarounds. Verified with `tests/multi-pod/scenarios/` — pod-death scenario un-skipped, 7/7 pass in ~63s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🧪 BenchmarkShould we run the Virtual MCP strategy benchmark for this PR? React with 👍 to run the benchmark.
Benchmark will run on the next push after you react. |
Release OptionsSuggested: Patch ( React with an emoji to override the release type:
Current version:
|
There was a problem hiding this comment.
1 issue found across 6 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/mesh/src/nats/pod-heartbeat.ts">
<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:118">
P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic
| // Activate deferred death watcher if registered before init | ||
| if (this.pendingDeathCallback) { | ||
| this.startDeathWatcher(this.pendingDeathCallback); | ||
| this.startDeathPoller(); |
There was a problem hiding this comment.
P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize knownPods before the next poll diff.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/nats/pod-heartbeat.ts, line 118:
<comment>Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</comment>
<file context>
@@ -88,17 +115,65 @@ export class NatsPodHeartbeat implements PodHeartbeat {
// Activate deferred death watcher if registered before init
if (this.pendingDeathCallback) {
this.startDeathWatcher(this.pendingDeathCallback);
+ this.startDeathPoller();
this.pendingDeathCallback = null;
}
</file context>
…test A previous commit on this branch skipped streamBuffer.purge() when input.isResume was true, on the theory that the purge wiped chunks survivor /attach watchers were mid-consumption of. Empirical verification shows that theory was wrong: - NATS JetStream consumers track their own read position. A server- side purge removes messages from the stream but doesn't touch a consumer's local buffer for messages it has already pulled. The consumer just sees a sequence-number jump under the hood and keeps receiving new messages. - Meanwhile, skipping purge on resume creates a real bug: the dead pod's chunks 1..K stay in JetStream, and the resumed run publishes chunks 1..N alongside. Any /attach opened during the recovery window with deliverPolicy: "all" replays both — the user sees the assistant's response twice. So the unconditional purge in prepareRun is correct. Reverted to the original behavior with a much clearer comment about why. The pod-death scenario now opens a fresh /attach AFTER recovery has visibly started (we detect "chunk-3" on a survivor — chunks 1-2 can only come from the dead pod, chunk-3+ can only come from the resumed run) and asserts that the late watcher sees "chunk-1 " exactly once. Negative-tested by re-introducing the skip-on-resume branch: the assertion fails with chunk-1 count = 2, as expected. So the test is a real regression guard for this whole class of bug. The docstring is also rewritten now that the scenario is live — previously it described why it was skipped + a long architectural finding, now it describes what each step exercises and the three fixes that had to land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`init()` aborts the watcher and clears the poller timer to drop stale state. On the first call, `start()` re-arms both via the `pendingDeathCallback` field that `onPodDeath()` populated before init completed. But `pendingDeathCallback` is consumed (set to null) inside that first `start()` — so on a NATS reconnect, `init()` happens again, the watcher/poller get torn down again, but `start()`'s re-arm check sees no pending callback and silently does nothing. Heartbeats keep refreshing, but every peer death goes undetected until the next pod restart. Fix: fall back to the persistent `deathCallback` field (which `onPodDeath()` always sets) when the pending-callback slot is empty. Plus two follow-ups from the same review: - Pod-death test budget bumped 180 → 220s. Sum of the three sequential pollUntil windows is 75+90+20=185s worst case; the previous budget could race the bun-test timeout in pathological cases. - dispatch-run.ts purge comment now honestly flags the remaining UX gap: /attach watchers already streaming when the owner dies receive the dead pod's prefix from their local consumer buffer plus the resumed run's full body — so they render the reply twice. The purge protects new attachers (the test asserts this), not in-flight ones. Proper fix is a "reset" sentinel on the subject so all consumers flush; left as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What is this contribution about?
Makes the previously-skipped `pod-death-dbos-replay` scenario pass end-to-end. Three coordinated fixes, each independently load-bearing — the recovery path was broken in three places at once.
1. Heartbeat watcher gains a polling fallback for hard kills. NATS KV's TTL-based key expiry is server-side cleanup and does NOT emit DEL/PURGE on the watcher — `kv.watch()` only catches graceful `kv.delete()`. A SIGKILL'd pod's heartbeat vanished silently and `handlePodDeath` never fired. Added a parallel `kv.keys()` poller (~10s tick) that diffs snapshots; the watcher still handles graceful shutdowns, the poller catches the rest. Detection latency: ~55s worst case.
2. Heartbeat-recovered runs now pump to JetStream. `resumeOrphanedThread` was deliberately dropping `streamBuffer` when calling `dispatchRunAndWait`, on the assumption that DBOS workflow replay would carry chunks separately. DBOS doesn't replay cross-pod, so the resumed run streamed to /dev/null and survivor `/attach` tails saw nothing. Pass `streamBuffer` through.
3. `prepareRun` skips JetStream purge on resume. `streamBuffer.purge()` ran unconditionally at every dispatch start, including the resume path — wiping the prefix chunks survivor watchers were mid-consumption of. Gate on `!input.isResume`.
Future cleanup left for follow-up: replace the NATS-KV liveness path with Postgres advisory locks (sub-second detection vs ~55s) — already discussed, deserves its own PR.
How to Test
```bash
./tests/multi-pod/run.sh
```
Expected: `7/7 pass`. The `pod-death-dbos-replay` scenario was `.skip`ped before this PR; it's now un-skipped and exercises the full recovery loop (POST → run flowing → SIGKILL owner → poller detects → recovery on survivor → /attach sees final chunk).
Migration Notes
The mesh image is now tagged `multi-pod-studio:latest` and shared across all three mesh services in test compose, instead of being built three times under three different tags. No production impact — only affects test infrastructure.
Review Checklist
Summary by cubic
Fixes cross-pod recovery so pod death is detected (including after NATS reconnects) and orphaned runs resume with output visible to clients. Late
/attachsees the response once (no duplicates).Bug Fixes
kv.keys()every ~10s) to catch hard kills missed bykv.watch(); worst-case detection ~55s.streamBuffertodispatchRunAndWait.streamBuffer.purge()on dispatch (including resume) to prevent duplicate replies for late/attach; in-flight watchers may still render duplicates and need a follow-up.Migration
multi-pod-studio:latestacross mesh services.Written for commit 3b61478. Summary will update on new commits. Review in cubic