Skip to content

fix(decopilot): cross-pod recovery actually fires on pod death#3393

Open
viktormarinho wants to merge 3 commits into
mainfrom
viktormarinho/pod-death-recovery
Open

fix(decopilot): cross-pod recovery actually fires on pod death#3393
viktormarinho wants to merge 3 commits into
mainfrom
viktormarinho/pod-death-recovery

Conversation

@viktormarinho
Copy link
Copy Markdown
Contributor

@viktormarinho viktormarinho commented May 17, 2026

What is this contribution about?

Makes the previously-skipped `pod-death-dbos-replay` scenario pass end-to-end. Three coordinated fixes, each independently load-bearing — the recovery path was broken in three places at once.

1. Heartbeat watcher gains a polling fallback for hard kills. NATS KV's TTL-based key expiry is server-side cleanup and does NOT emit DEL/PURGE on the watcher — `kv.watch()` only catches graceful `kv.delete()`. A SIGKILL'd pod's heartbeat vanished silently and `handlePodDeath` never fired. Added a parallel `kv.keys()` poller (~10s tick) that diffs snapshots; the watcher still handles graceful shutdowns, the poller catches the rest. Detection latency: ~55s worst case.

2. Heartbeat-recovered runs now pump to JetStream. `resumeOrphanedThread` was deliberately dropping `streamBuffer` when calling `dispatchRunAndWait`, on the assumption that DBOS workflow replay would carry chunks separately. DBOS doesn't replay cross-pod, so the resumed run streamed to /dev/null and survivor `/attach` tails saw nothing. Pass `streamBuffer` through.

3. `prepareRun` skips JetStream purge on resume. `streamBuffer.purge()` ran unconditionally at every dispatch start, including the resume path — wiping the prefix chunks survivor watchers were mid-consumption of. Gate on `!input.isResume`.

Future cleanup left for follow-up: replace the NATS-KV liveness path with Postgres advisory locks (sub-second detection vs ~55s) — already discussed, deserves its own PR.

How to Test

```bash
./tests/multi-pod/run.sh
```

Expected: `7/7 pass`. The `pod-death-dbos-replay` scenario was `.skip`ped before this PR; it's now un-skipped and exercises the full recovery loop (POST → run flowing → SIGKILL owner → poller detects → recovery on survivor → /attach sees final chunk).

Migration Notes

The mesh image is now tagged `multi-pod-studio:latest` and shared across all three mesh services in test compose, instead of being built three times under three different tags. No production impact — only affects test infrastructure.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working (7/7 pass locally including the previously-skipped pod-death scenario)
  • No breaking changes — heartbeat polling is additive; `streamBuffer` was always optional in deps; the `!isResume` guard is a strict subset of previous behavior

Summary by cubic

Fixes cross-pod recovery so pod death is detected (including after NATS reconnects) and orphaned runs resume with output visible to clients. Late /attach sees the response once (no duplicates).

  • Bug Fixes

    • Heartbeat adds a KV poller (kv.keys() every ~10s) to catch hard kills missed by kv.watch(); worst-case detection ~55s.
    • Heartbeat now re-arms the watcher and poller after NATS reconnects, so death detection doesn’t silently stop.
    • Resumed runs publish to JetStream by passing streamBuffer to dispatchRunAndWait.
    • Keep unconditional streamBuffer.purge() on dispatch (including resume) to prevent duplicate replies for late /attach; in-flight watchers may still render duplicates and need a follow-up.
  • Migration

    • No production changes. Test compose shares one image tag multi-pod-studio:latest across mesh services.

Written for commit 3b61478. Summary will update on new commits. Review in cubic

Three coordinated changes make the previously-skipped
pod-death-dbos-replay scenario pass end-to-end.

**1. Heartbeat watcher gains a poller for hard kills.**

NATS JetStream KV's TTL-based key expiry is a server-side stream
cleanup pass — it does NOT emit a DEL/PURGE op on the watcher. So
`kv.watch()` alone only catches *graceful* shutdowns (explicit
`kv.delete()`). A SIGKILL'd pod's heartbeat key vanishes silently
from the bucket with no notification, and `runRegistry.handlePodDeath`
never fires. Added a parallel poller that lists `kv.keys()` every 10s
and treats anything that drops out of the snapshot as a death. The
watcher still handles the fast graceful-shutdown path; the poller is
the safety net for hard kills.

Detection latency: ~45s (TTL) + up to 10s (poll tick) ≈ 55s worst case.
A Postgres advisory lock would be sub-second, but that's a deeper
refactor and would deserve its own PR.

**2. Heartbeat-recovered runs now reach JetStream.**

`resumeOrphanedThread` in app.ts was deliberately dropping
`streamBuffer` when calling `dispatchRunAndWait`, on the assumption
that DBOS workflow replay would pump chunks separately. Turns out
DBOS doesn't replay cross-pod (no built-in liveness scan, single
shared executor_id), so the resumed run streamed to /dev/null and
any /attach tails on survivor pods saw nothing. Pass `streamBuffer`
so chunks land back on the per-thread subject.

**3. `prepareRun` skips JetStream purge on resume.**

`streamBuffer.purge()` ran unconditionally at the start of every
dispatch, including the resume path. That wiped the prefix chunks
the dead owner had already pumped, which survivor watchers were
mid-consumption of. Gate on `!input.isResume`.

Plus minor cleanups:
- `PodHeartbeat.isReady()` so callers can distinguish "init resolved"
  from "init was a no-op because NATS wasn't ready yet".
- `[PodHeartbeat] Started` and `[RunRegistry] handlePodDeath` log lines
  for production observability.
- Compose: share one studio image across mesh-1/2/3 via a single
  `image: multi-pod-studio:latest` tag instead of three separate
  per-service builds. Eliminates `docker tag` workarounds.

Verified with `tests/multi-pod/scenarios/` — pod-death scenario
un-skipped, 7/7 pass in ~63s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

Reaction Action
👍 Run quick benchmark (10 & 128 tools)

Benchmark will run on the next push after you react.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

Release Options

Suggested: Patch (2.330.3) — based on fix: prefix

React with an emoji to override the release type:

Reaction Type Next Version
👍 Prerelease 2.330.3-alpha.1
🎉 Patch 2.330.3
❤️ Minor 2.331.0
🚀 Major 3.0.0

Current version: 2.330.2

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/mesh/src/nats/pod-heartbeat.ts">

<violation number="1" location="apps/mesh/src/nats/pod-heartbeat.ts:118">
P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Re-trigger cubic

// Activate deferred death watcher if registered before init
if (this.pendingDeathCallback) {
this.startDeathWatcher(this.pendingDeathCallback);
this.startDeathPoller();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize knownPods before the next poll diff.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/mesh/src/nats/pod-heartbeat.ts, line 118:

<comment>Starting the poller in parallel with the watcher can emit duplicate pod-death callbacks for graceful deletes, because watcher notifications don’t synchronize `knownPods` before the next poll diff.</comment>

<file context>
@@ -88,17 +115,65 @@ export class NatsPodHeartbeat implements PodHeartbeat {
     // Activate deferred death watcher if registered before init
     if (this.pendingDeathCallback) {
       this.startDeathWatcher(this.pendingDeathCallback);
+      this.startDeathPoller();
       this.pendingDeathCallback = null;
     }
</file context>

viktormarinho and others added 2 commits May 17, 2026 21:08
…test

A previous commit on this branch skipped streamBuffer.purge() when
input.isResume was true, on the theory that the purge wiped chunks
survivor /attach watchers were mid-consumption of. Empirical
verification shows that theory was wrong:

- NATS JetStream consumers track their own read position. A server-
  side purge removes messages from the stream but doesn't touch a
  consumer's local buffer for messages it has already pulled. The
  consumer just sees a sequence-number jump under the hood and keeps
  receiving new messages.

- Meanwhile, skipping purge on resume creates a real bug: the dead
  pod's chunks 1..K stay in JetStream, and the resumed run publishes
  chunks 1..N alongside. Any /attach opened during the recovery
  window with deliverPolicy: "all" replays both — the user sees the
  assistant's response twice.

So the unconditional purge in prepareRun is correct. Reverted to the
original behavior with a much clearer comment about why.

The pod-death scenario now opens a fresh /attach AFTER recovery has
visibly started (we detect "chunk-3" on a survivor — chunks 1-2 can
only come from the dead pod, chunk-3+ can only come from the resumed
run) and asserts that the late watcher sees "chunk-1 " exactly once.
Negative-tested by re-introducing the skip-on-resume branch: the
assertion fails with chunk-1 count = 2, as expected. So the test is
a real regression guard for this whole class of bug.

The docstring is also rewritten now that the scenario is live —
previously it described why it was skipped + a long architectural
finding, now it describes what each step exercises and the three
fixes that had to land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`init()` aborts the watcher and clears the poller timer to drop stale
state. On the first call, `start()` re-arms both via the
`pendingDeathCallback` field that `onPodDeath()` populated before init
completed. But `pendingDeathCallback` is consumed (set to null) inside
that first `start()` — so on a NATS reconnect, `init()` happens again,
the watcher/poller get torn down again, but `start()`'s re-arm check
sees no pending callback and silently does nothing. Heartbeats keep
refreshing, but every peer death goes undetected until the next pod
restart.

Fix: fall back to the persistent `deathCallback` field (which
`onPodDeath()` always sets) when the pending-callback slot is empty.

Plus two follow-ups from the same review:

- Pod-death test budget bumped 180 → 220s. Sum of the three sequential
  pollUntil windows is 75+90+20=185s worst case; the previous budget
  could race the bun-test timeout in pathological cases.

- dispatch-run.ts purge comment now honestly flags the remaining UX
  gap: /attach watchers already streaming when the owner dies receive
  the dead pod's prefix from their local consumer buffer plus the
  resumed run's full body — so they render the reply twice. The purge
  protects new attachers (the test asserts this), not in-flight ones.
  Proper fix is a "reset" sentinel on the subject so all consumers
  flush; left as a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant