fix(decopilot): cross-pod recovery actually fires on pod death by viktormarinho · Pull Request #3393 · decocms/studio

viktormarinho · 2026-05-17T23:35:18Z

What is this contribution about?

Makes the previously-skipped `pod-death-dbos-replay` scenario pass end-to-end. Three coordinated fixes, each independently load-bearing — the recovery path was broken in three places at once.

1. Heartbeat watcher gains a polling fallback for hard kills. NATS KV's TTL-based key expiry is server-side cleanup and does NOT emit DEL/PURGE on the watcher — `kv.watch()` only catches graceful `kv.delete()`. A SIGKILL'd pod's heartbeat vanished silently and `handlePodDeath` never fired. Added a parallel `kv.keys()` poller (~10s tick) that diffs snapshots; the watcher still handles graceful shutdowns, the poller catches the rest. Detection latency: ~55s worst case.

2. Heartbeat-recovered runs now pump to JetStream. `resumeOrphanedThread` was deliberately dropping `streamBuffer` when calling `dispatchRunAndWait`, on the assumption that DBOS workflow replay would carry chunks separately. DBOS doesn't replay cross-pod, so the resumed run streamed to /dev/null and survivor `/attach` tails saw nothing. Pass `streamBuffer` through.

3. `prepareRun` skips JetStream purge on resume. `streamBuffer.purge()` ran unconditionally at every dispatch start, including the resume path — wiping the prefix chunks survivor watchers were mid-consumption of. Gate on `!input.isResume`.

Future cleanup left for follow-up: replace the NATS-KV liveness path with Postgres advisory locks (sub-second detection vs ~55s) — already discussed, deserves its own PR.

How to Test

```bash
./tests/multi-pod/run.sh
```

Expected: `7/7 pass`. The `pod-death-dbos-replay` scenario was `.skip`ped before this PR; it's now un-skipped and exercises the full recovery loop (POST → run flowing → SIGKILL owner → poller detects → recovery on survivor → /attach sees final chunk).

Migration Notes

The mesh image is now tagged `multi-pod-studio:latest` and shared across all three mesh services in test compose, instead of being built three times under three different tags. No production impact — only affects test infrastructure.

Review Checklist

PR title is clear and descriptive
Changes are tested and working (7/7 pass locally including the previously-skipped pod-death scenario)
No breaking changes — heartbeat polling is additive; `streamBuffer` was always optional in deps; the `!isResume` guard is a strict subset of previous behavior

Summary by cubic

Fixes cross-pod recovery so pod death is detected (including after NATS reconnects) and orphaned runs resume with output visible to clients. Late /attach sees the response once (no duplicates).

Bug Fixes
- Heartbeat adds a KV poller (kv.keys() every ~10s) to catch hard kills missed by kv.watch(); worst-case detection ~55s.
- Heartbeat now re-arms the watcher and poller after NATS reconnects, so death detection doesn’t silently stop.
- Resumed runs publish to JetStream by passing streamBuffer to dispatchRunAndWait.
- Keep unconditional streamBuffer.purge() on dispatch (including resume) to prevent duplicate replies for late /attach; in-flight watchers may still render duplicates and need a follow-up.
Migration
- No production changes. Test compose shares one image tag multi-pod-studio:latest across mesh services.

^{Written for commit 3b61478. Summary will update on new commits. Review in cubic}

Three coordinated changes make the previously-skipped pod-death-dbos-replay scenario pass end-to-end. **1. Heartbeat watcher gains a poller for hard kills.** NATS JetStream KV's TTL-based key expiry is a server-side stream cleanup pass — it does NOT emit a DEL/PURGE op on the watcher. So `kv.watch()` alone only catches *graceful* shutdowns (explicit `kv.delete()`). A SIGKILL'd pod's heartbeat key vanishes silently from the bucket with no notification, and `runRegistry.handlePodDeath` never fires. Added a parallel poller that lists `kv.keys()` every 10s and treats anything that drops out of the snapshot as a death. The watcher still handles the fast graceful-shutdown path; the poller is the safety net for hard kills. Detection latency: ~45s (TTL) + up to 10s (poll tick) ≈ 55s worst case. A Postgres advisory lock would be sub-second, but that's a deeper refactor and would deserve its own PR. **2. Heartbeat-recovered runs now reach JetStream.** `resumeOrphanedThread` in app.ts was deliberately dropping `streamBuffer` when calling `dispatchRunAndWait`, on the assumption that DBOS workflow replay would pump chunks separately. Turns out DBOS doesn't replay cross-pod (no built-in liveness scan, single shared executor_id), so the resumed run streamed to /dev/null and any /attach tails on survivor pods saw nothing. Pass `streamBuffer` so chunks land back on the per-thread subject. **3. `prepareRun` skips JetStream purge on resume.** `streamBuffer.purge()` ran unconditionally at the start of every dispatch, including the resume path. That wiped the prefix chunks the dead owner had already pumped, which survivor watchers were mid-consumption of. Gate on `!input.isResume`. Plus minor cleanups: - `PodHeartbeat.isReady()` so callers can distinguish "init resolved" from "init was a no-op because NATS wasn't ready yet". - `[PodHeartbeat] Started` and `[RunRegistry] handlePodDeath` log lines for production observability. - Compose: share one studio image across mesh-1/2/3 via a single `image: multi-pod-studio:latest` tag instead of three separate per-service builds. Eliminates `docker tag` workarounds. Verified with `tests/multi-pod/scenarios/` — pod-death scenario un-skipped, 7/7 pass in ~63s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-17T23:35:27Z

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction	Action
👍	Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

github-actions · 2026-05-17T23:35:27Z

Release Options

Suggested: Patch (2.330.3) — based on fix: prefix

React with an emoji to override the release type:

Reaction	Type	Next Version
👍	Prerelease	`2.330.3-alpha.1`
🎉	Patch	`2.330.3`
❤️	Minor	`2.331.0`
🚀	Major	`3.0.0`

Current version: 2.330.2

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

cubic-dev-ai

1 issue found across 6 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:118">
P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic}

cubic-dev-ai · 2026-05-17T23:38:48Z

    // Activate deferred death watcher if registered before init
    if (this.pendingDeathCallback) {
      this.startDeathWatcher(this.pendingDeathCallback);
+      this.startDeathPoller();


P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize knownPods before the next poll diff.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/nats/pod-heartbeat.ts, line 118: <comment>Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</comment> <file context> @@ -88,17 +115,65 @@ export class NatsPodHeartbeat implements PodHeartbeat { // Activate deferred death watcher if registered before init if (this.pendingDeathCallback) { this.startDeathWatcher(this.pendingDeathCallback); + this.startDeathPoller(); this.pendingDeathCallback = null; } </file context>

…test A previous commit on this branch skipped streamBuffer.purge() when input.isResume was true, on the theory that the purge wiped chunks survivor /attach watchers were mid-consumption of. Empirical verification shows that theory was wrong: - NATS JetStream consumers track their own read position. A server- side purge removes messages from the stream but doesn't touch a consumer's local buffer for messages it has already pulled. The consumer just sees a sequence-number jump under the hood and keeps receiving new messages. - Meanwhile, skipping purge on resume creates a real bug: the dead pod's chunks 1..K stay in JetStream, and the resumed run publishes chunks 1..N alongside. Any /attach opened during the recovery window with deliverPolicy: "all" replays both — the user sees the assistant's response twice. So the unconditional purge in prepareRun is correct. Reverted to the original behavior with a much clearer comment about why. The pod-death scenario now opens a fresh /attach AFTER recovery has visibly started (we detect "chunk-3" on a survivor — chunks 1-2 can only come from the dead pod, chunk-3+ can only come from the resumed run) and asserts that the late watcher sees "chunk-1 " exactly once. Negative-tested by re-introducing the skip-on-resume branch: the assertion fails with chunk-1 count = 2, as expected. So the test is a real regression guard for this whole class of bug. The docstring is also rewritten now that the scenario is live — previously it described why it was skipped + a long architectural finding, now it describes what each step exercises and the three fixes that had to land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`init()` aborts the watcher and clears the poller timer to drop stale state. On the first call, `start()` re-arms both via the `pendingDeathCallback` field that `onPodDeath()` populated before init completed. But `pendingDeathCallback` is consumed (set to null) inside that first `start()` — so on a NATS reconnect, `init()` happens again, the watcher/poller get torn down again, but `start()`'s re-arm check sees no pending callback and silently does nothing. Heartbeats keep refreshing, but every peer death goes undetected until the next pod restart. Fix: fall back to the persistent `deathCallback` field (which `onPodDeath()` always sets) when the pending-callback slot is empty. Plus two follow-ups from the same review: - Pod-death test budget bumped 180 → 220s. Sum of the three sequential pollUntil windows is 75+90+20=185s worst case; the previous budget could race the bun-test timeout in pathological cases. - dispatch-run.ts purge comment now honestly flags the remaining UX gap: /attach watchers already streaming when the owner dies receive the dead pod's prefix from their local consumer buffer plus the resumed run's full body — so they render the reply twice. The purge protects new attachers (the test asserts this), not in-flight ones. Proper fix is a "reset" sentinel on the subject so all consumers flush; left as a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 17, 2026

View reviewed changes

viktormarinho and others added 2 commits May 17, 2026 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(decopilot): cross-pod recovery actually fires on pod death#3393

fix(decopilot): cross-pod recovery actually fires on pod death#3393
viktormarinho wants to merge 3 commits into
mainfrom
viktormarinho/pod-death-recovery

viktormarinho commented May 17, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

github-actions Bot commented May 17, 2026

Uh oh!

github-actions Bot commented May 17, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

viktormarinho commented May 17, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is this contribution about?

How to Test

Migration Notes

Review Checklist

Summary by cubic

Uh oh!

github-actions Bot commented May 17, 2026

🧪 Benchmark

Uh oh!

github-actions Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Release Options

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

viktormarinho commented May 17, 2026 •

edited by cubic-dev-ai Bot

Loading

github-actions Bot commented May 17, 2026 •

edited

Loading