fix(sync-service): bound publication manager cast issuance under shape-arrival bursts (#4396)#4537
fix(sync-service): bound publication manager cast issuance under shape-arrival bursts (#4396)#4537erik-the-implementer wants to merge 3 commits into
Conversation
…a submission is in flight (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rrors (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ublication-config cast suppression (#4396) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4537 +/- ##
==========================================
- Coverage 56.45% 56.44% -0.01%
==========================================
Files 358 358
Lines 39081 39081
Branches 10973 10974 +1
==========================================
- Hits 22064 22061 -3
- Misses 16946 16948 +2
- Partials 71 72 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Claude Code ReviewSummaryThis PR fixes the unbounded What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)None. Suggestions (Nice to Have)1. Document the recovery dependency on Configurator restart / the 60s timer. With the gate in place, recovery from a Configurator that dies mid-chain without delivering a result or a
This is not a regression in practice — in the original incident the Configurator was alive but blocked on locks, and pre-PR re-casts just coalesced into 2. The GenServer receive-timeout is reset on any received message. Under sustained 3. (Optional) Normalize Issue ConformanceExcellent issue-to-PR alignment. #4396 is unusually well-specified (root cause, observed mailbox depth of 705,925, exact line references, and a concrete fix sketch). The implementation follows the sketch faithfully and the PR description explicitly calls out and justifies the one deliberate deviation (also clearing Previous Review StatusN/A — first review of this PR. Review iteration: 1 | 2026-06-09 |
Summary
Fixes #4396.
RelationTrackerwas minting a freshConfigurator.configure_publicationcast on everyadd_shape/remove_shapewhile a publication-configuration submission was already in flight, becauseupdate_needed?/1'ssubmitted != committeddisjunct stays true for the entire window between "cast sent" and "all relations committed". The Configurator's own coalescing/diffing already bounds the actual SQL, so only the cast count was unbounded — during the 2026-04-28 Ohm deploy thepublication_manager_configuratormailbox reached 705,925 queued messages.This PR adds an
in_flight_relation_filterssnapshot toRelationTracker:update_publication/1whenever a cast is issued.update_needed?/1's committed-lag retry disjunct is now gated onis_nil(in_flight), so a burst ofadd_shape/remove_shapecalls can no longer mint one redundant cast each. Theprepared_changed?disjunct is untouched, so a genuinely new relation still issues exactly one fresh cast.in_flightis cleared on every terminal signal from the Configurator: full commit (clear_in_flight_if_committed/1in the{:ok, :configured}/{:ok, :dropped}handlers), per-relation error, and global configuration error.publication_refresh_periodinactivity timeout (handle_info(:timeout)→update_publicationdirectly) remains the fallback retry if a chain dies without delivering any result.Note on clearing
in_flighton a per-relation errorThe issue's written sketch only clears on full-commit + global error. This PR also clears on a per-relation error result, which is required to preserve the existing
"should continue to fail with same error"retry semantics. It does not reopen the storm: the production incident delivered no results at all (the chain was stuck on admin-pool locks), soin_flightwould correctly stay set there; clearing happens only once an error result is actually delivered, which means the chain made progress. Cast issuance therefore stays bounded by external request rate / the 60s timer, never by a single in-flight submission.Test plan
New
describe "cast issuance"tests inpublication_manager_test.exs, all driving the realPublicationManagerGenServer tree and intercepting casts:add_shapecalls while a submission is in flight issues exactly one Configurator cast."should continue to fail with same error"still passes (no regression).mix test test/electric/replication/publication_manager_test.exs→ 26 tests, 0 failures.mix format --check-formattedclean;mix compile --warnings-as-errorsclean on the touched files.Changeset:
@core/sync-servicepatch.