fix(sync-service): bound publication manager cast issuance under shape-arrival bursts (#4396) by erik-the-implementer · Pull Request #4537 · electric-sql/electric

erik-the-implementer · 2026-06-09T12:58:57Z

Summary

RelationTracker was minting a fresh Configurator.configure_publication cast on every add_shape/remove_shape while a publication-configuration submission was already in flight, because update_needed?/1's submitted != committed disjunct stays true for the entire window between "cast sent" and "all relations committed". The Configurator's own coalescing/diffing already bounds the actual SQL, so only the cast count was unbounded — during the 2026-04-28 Ohm deploy the publication_manager_configurator mailbox reached 705,925 queued messages.

This PR adds an in_flight_relation_filters snapshot to RelationTracker:

Set to the submitted filter set in update_publication/1 whenever a cast is issued.
update_needed?/1's committed-lag retry disjunct is now gated on is_nil(in_flight), so a burst of add_shape/remove_shape calls can no longer mint one redundant cast each. The prepared_changed? disjunct is untouched, so a genuinely new relation still issues exactly one fresh cast.
in_flight is cleared on every terminal signal from the Configurator: full commit (clear_in_flight_if_committed/1 in the {:ok, :configured} / {:ok, :dropped} handlers), per-relation error, and global configuration error.
The existing 60s publication_refresh_period inactivity timeout (handle_info(:timeout) → update_publication directly) remains the fallback retry if a chain dies without delivering any result.

Note on clearing `in_flight` on a per-relation error

The issue's written sketch only clears on full-commit + global error. This PR also clears on a per-relation error result, which is required to preserve the existing "should continue to fail with same error" retry semantics. It does not reopen the storm: the production incident delivered no results at all (the chain was stuck on admin-pool locks), so in_flight would correctly stay set there; clearing happens only once an error result is actually delivered, which means the chain made progress. Cast issuance therefore stays bounded by external request rate / the 60s timer, never by a single in-flight submission.

Test plan

New describe "cast issuance" tests in publication_manager_test.exs, all driving the real PublicationManager GenServer tree and intercepting casts:

A burst of same-relation add_shape calls while a submission is in flight issues exactly one Configurator cast.
A new relation added during the in-flight window still issues a cast (narrow-gate guard).
A global configuration error re-arms the retry path (subsequent add re-casts).
A per-relation configuration error re-arms the retry path.
Existing "should continue to fail with same error" still passes (no regression).
mix test test/electric/replication/publication_manager_test.exs → 26 tests, 0 failures.
mix format --check-formatted clean; mix compile --warnings-as-errors clean on the touched files.

Changeset: @core/sync-service patch.

…a submission is in flight (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rrors (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ublication-config cast suppression (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-09T13:00:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.44%. Comparing base (2e4feee) to head (924522f).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4537      +/-   ##
==========================================
- Coverage   56.45%   56.44%   -0.01%     
==========================================
  Files         358      358              
  Lines       39081    39081              
  Branches    10973    10974       +1     
==========================================
- Hits        22064    22061       -3     
- Misses      16946    16948       +2     
- Partials       71       72       +1

Flag	Coverage Δ
packages/agents	`70.75% <ø> (ø)`
packages/agents-mcp	`77.54% <ø> (ø)`
packages/agents-mobile	`66.92% <ø> (ø)`
packages/agents-runtime	`79.98% <ø> (ø)`
packages/agents-server	`73.95% <ø> (-0.03%)`	⬇️
packages/agents-server-ui	`6.21% <ø> (ø)`
packages/electric-ax	`46.42% <ø> (ø)`
packages/experimental	`87.73% <ø> (ø)`
packages/react-hooks	`86.48% <ø> (ø)`
packages/start	`82.83% <ø> (ø)`
packages/typescript-client	`91.71% <ø> (-0.12%)`	⬇️
packages/y-electric	`56.05% <ø> (ø)`
typescript	`56.44% <ø> (-0.01%)`	⬇️
unit-tests	`56.44% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

claude · 2026-06-09T13:05:16Z

Claude Code Review

Summary

This PR fixes the unbounded publication_manager_configurator mailbox growth from #4396 by adding an in_flight_relation_filters snapshot to RelationTracker and gating the "committed lags submitted" retry disjunct in update_needed?/1 on is_nil(in_flight). The change is tightly scoped, correct, well-commented, and well-tested. I recommend merging.

What's Working Well

The core safety invariant is preserved. The prepared_changed? disjunct in update_needed?/1 is left completely ungated — only the committed-lag retry is suppressed while a submission is in flight. This is exactly the right cut: a genuinely new relation (or any change to the target set) always issues a fresh cast, while pure retries against an already-in-flight filter set are coalesced to one. The fix targets the cast-issuance rate without touching correctness of convergence.
in_flight is cleared on every terminal signal, not just full commit: clear_in_flight_if_committed/1 on {:ok, :configured}/{:ok, :dropped}, plus unconditional clears on per-relation error (relation_tracker.ex:310) and global error (relation_tracker.ex:313). This avoids a permanently-stuck gate.
Clearing on per-relation error is the right call and is correctly justified in the PR description. It preserves the existing "should continue to fail with same error" retry semantics, and it can't reopen the storm: an error result means the chain actually ran (made progress), so re-arming is bounded by the error-delivery rate, not by request volume. The production incident delivered no results (stuck on admin-pool locks), so in_flight correctly stays set in that scenario.
Tests are thorough and exercise the real GenServer tree. The four new tests cover the single-cast-while-in-flight path, the new-relation guard (verifying prepared_changed? still fires), and both error re-arm paths. Patching configure_publication to capture-without-forwarding is a clean way to hold committed behind submitted for the whole in-flight window. Changeset present (@core/sync-service patch).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

1. Document the recovery dependency on Configurator restart / the 60s timer.

With the gate in place, recovery from a Configurator that dies mid-chain without delivering a result or a :configuration_error no longer happens on "the next add_shape" (as it did before this PR). Instead it relies on two existing mechanisms:

The :one_for_one supervisor restarts the Configurator, whose init → fetch_initial_filters → {:configure_filters, ...} path re-diffs PG and re-notifies all target relations (to_preserve as {:ok, :configured}), so committed converges back up to in_flight and clear_in_flight_if_committed/1 fires. This is robust and is exactly the resilience documented in supervisor.ex's moduledoc.
The 60s publication_refresh_period timeout (handle_info(:timeout) → update_publication) as the backstop.

This is not a regression in practice — in the original incident the Configurator was alive but blocked on locks, and pre-PR re-casts just coalesced into scheduled_filters against the same stuck process, so they didn't aid recovery either. But it's worth a one-line comment near the gate noting that silent-death recovery now leans on the Configurator's restart re-notification, since that coupling is no longer obvious from RelationTracker alone.

2. handle_info(:timeout) is reset by every message, so the 60s backstop only fires after 60s of true mailbox quiet.

The GenServer receive-timeout is reset on any received message. Under sustained add_shape/remove_shape traffic that doesn't change prepared (e.g. repeated shapes on already-tracked relations), the :timeout fallback never fires. In the edge cases where in_flight can linger as a non-nil MapSet without a result ever arriving — e.g. an empty-target submission produces relation_actions == [] and therefore no result cast (configurator.ex:211), or the can_update_publication? == false branch never notifies to_drop relations (configurator.ex:184-202) so committed stays a superset of in_flight — the gate stays set and the backstop may be deferred by ongoing traffic. None of these strand a waiter (empty target has none; to_preserve/error paths still reply), and prepared_changed? self-corrects on the next real change, so this is harmless today. Flagging it only because the design now relies on prepared_changed? rather than the timer to break out of these states; if a future change adds a target-set state that can't be reached via a prepared change, the lingering in_flight would matter.

3. (Optional) Normalize in_flight to nil on the empty-target / no-result path. Minor: in the no-result edge cases above, in_flight can be left as a non-nil empty/partial MapSet. The :configured/:dropped handlers already converge it for the common paths; the only gap is "no result is ever produced." Not worth special-casing unless #2 ever becomes load-bearing.

Issue Conformance

Excellent issue-to-PR alignment. #4396 is unusually well-specified (root cause, observed mailbox depth of 705,925, exact line references, and a concrete fix sketch). The implementation follows the sketch faithfully and the PR description explicitly calls out and justifies the one deliberate deviation (also clearing in_flight on a per-relation error, which the issue sketch omitted). The issue's "Secondary hardening" items (yield between perform_relation_actions chunks, per-action DB timeout, mid-chain scheduled_filters refresh) are correctly scoped out of this PR and recommended to land separately — none are needed for the cast-rate fix, and configurator.ex:216's per-relation tail-recursion is left untouched as intended.

Previous Review Status

N/A — first review of this PR.

Review iteration: 1 | 2026-06-09

alco and others added 3 commits June 9, 2026 14:20

fix(sync-service): suppress redundant publication-config casts while …

8e7a30a

…a submission is in flight (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(sync-service): re-arm publication-config retries after terminal e…

66e6cb9

…rrors (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(sync-service): changeset + per-relation-error re-arm test for p…

924522f

…ublication-config cast suppression (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

erik-the-implementer added the claude label Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync-service): bound publication manager cast issuance under shape-arrival bursts (#4396)#4537

fix(sync-service): bound publication manager cast issuance under shape-arrival bursts (#4396)#4537
erik-the-implementer wants to merge 3 commits into
mainfrom
alco/pub-mgr-cast-suppression

erik-the-implementer commented Jun 9, 2026

Uh oh!

codecov Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

claude Bot commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erik-the-implementer commented Jun 9, 2026

Summary

Note on clearing in_flight on a per-relation error

Test plan

Uh oh!

codecov Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Jun 9, 2026

Claude Code Review

Summary

What's Working Well

Issues Found

Critical (Must Fix)

Important (Should Fix)

Suggestions (Nice to Have)

Issue Conformance

Previous Review Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Note on clearing `in_flight` on a per-relation error

codecov Bot commented Jun 9, 2026 •

edited

Loading