Skip to content

feat(consumer): bound Shape.Consumer heap growth via spawn opts + adaptive GC#4539

Open
erik-the-implementer wants to merge 10 commits into
mainfrom
consumer-heap-gc
Open

feat(consumer): bound Shape.Consumer heap growth via spawn opts + adaptive GC#4539
erik-the-implementer wants to merge 10 commits into
mainfrom
consumer-heap-gc

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Closes #4476

Summary

Bounds runaway old-heap growth of Shape.Consumer processes (issue #4476). On a production node with 5,043 consumers, ~36 GB was unreclaimable garbage (7–9 MB heaps each holding ~8 KB of live state) because consumers inherit the BEAM default fullsweep_after: 65535 and rarely hibernate, so a full sweep is effectively never triggered.

Two complementary, independently-tunable levers:

  1. Per-process spawn opts (env-driven). Consumer, Consumer.Snapshotter, and Consumer.Materializer now pass spawn_opt: Electric.StackConfig.spawn_opts(stack_id, key) (keys :consumer, :consumer_snapshotter, :consumer_materializer), mirroring ShapeLogCollector. This lets ELECTRIC_PROCESS_SPAWN_OPTS set fullsweep_after per process type. No baked-in defaults — purely env-configured.

  2. Adaptive GC, opt-in, runtime-tunable. After processing a transaction fragment (including during the startup buffer drain), the consumer runs :erlang.garbage_collect() only if its heap exceeds consumer_gc_heap_threshold (bytes; nil = off, the default). The threshold is read from StackConfig on every fragment, so it can be changed live from IEx — see Electric.Shapes.Consumer.set_gc_heap_threshold/2 and set_gc_heap_threshold_all_stacks/1.

Request-handler (GET /v1/shape) processes are intentionally untouched — they already expose fullsweep_after via ELECTRIC_TWEAKS_HANDLER_FULLSWEEP_AFTER and force GC before long-poll blocking.

Config

Env var Meaning Default
ELECTRIC_PROCESS_SPAWN_OPTS JSON map; now accepts consumer/consumer_snapshotter/consumer_materializer keys (e.g. {"consumer":{"fullsweep_after":4}}) %{}
ELECTRIC_CONSUMER_GC_HEAP_THRESHOLD adaptive-GC heap threshold in bytes; nil disables nil

Testing

  • New unit tests for over_heap_threshold?/2 (word/byte conversion, strict > boundary).
  • Integration tests verify GC fires at a tiny threshold (incl. during buffer drain via GC trace events), does not fire at a large threshold, and is a no-op when unset — each empirically confirmed to fail if the GC code is removed.
  • Config-propagation tests assert ELECTRIC_CONSUMER_GC_HEAP_THRESHOLD reaches StackConfig.
  • Full sweep of consumer / config / stack_supervisor / shape_log_collector suites passes; mix compile --warnings-as-errors and mix format --check-formatted clean.

Rollout / rollback

Inert by default: new spawn-opts keys do nothing unless named in ELECTRIC_PROCESS_SPAWN_OPTS, and adaptive GC is off unless the threshold is set. The threshold can be toggled live from IEx without a deploy. Electric Cloud will set fullsweep_after: 4 for the consumer family via env (separate stratovolt change).

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.63%. Comparing base (2e4feee) to head (1b866ed).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4539      +/-   ##
==========================================
+ Coverage   56.45%   56.63%   +0.17%     
==========================================
  Files         358      359       +1     
  Lines       39081    39329     +248     
  Branches    10973    11048      +75     
==========================================
+ Hits        22064    22274     +210     
- Misses      16946    16983      +37     
- Partials       71       72       +1     
Flag Coverage Δ
packages/agents 70.53% <ø> (-0.23%) ⬇️
packages/agents-mcp 77.54% <ø> (ø)
packages/agents-mobile 66.92% <ø> (ø)
packages/agents-runtime 80.11% <ø> (+0.13%) ⬆️
packages/agents-server 73.92% <ø> (-0.06%) ⬇️
packages/agents-server-ui 6.21% <ø> (ø)
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.71% <ø> (-0.12%) ⬇️
packages/y-electric 56.05% <ø> (ø)
typescript 56.63% <ø> (+0.17%) ⬆️
unit-tests 56.63% <ø> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

This PR bounds runaway old-heap growth in Shape.Consumer processes (issue #4476) via two opt-in, inert-by-default levers: per-process spawn_opt (for fullsweep_after etc. via ELECTRIC_PROCESS_SPAWN_OPTS) and a runtime-tunable adaptive GC that fires after a transaction fragment once the consumer heap exceeds ELECTRIC_CONSUMER_GC_HEAP_THRESHOLD. Iteration 2 resolves the one Important issue from the previous review — the synchronous GC on the replication critical path is now bounded by hysteresis and documented. This is a clean, production-minded change; I have no remaining blocking concerns.

What's Working Well

  • The critical-path concern is properly addressed. @gc_min_interval_ms (1 s) plus the new last_forced_gc_at state field cap forced full sweeps to at most once per second per consumer, even when a busy consumer sits persistently above the threshold or drains many buffered fragments back-to-back. The nil fast-path still short-circuits before any process_info/time call. The trade-off is now spelled out in both the moduledoc on set_gc_heap_threshold/2 and the maybe_garbage_collect/1 comment, steering operators toward a conservative (high) threshold and toward the lower-risk fullsweep_after lever for steady state.
  • Clean separation of decision from effect. should_force_gc?/5 is a pure predicate (heap-over-threshold AND interval-elapsed) with min_interval_ms injectable, so the hysteresis is unit-tested deterministically without wall-clock dependence — covering the never-GC'd case (last_gc_at == nil), the under-interval case, the elapsed case, and the exact-boundary (>=) case.
  • Previous suggestions picked up. over_heap_threshold?/2, should_force_gc?/5, and the IEx helpers all now carry @spec; the buffer-drain GC path and the all-stacks success-counting fix landed in earlier commits.
  • Strong overall test coverage — boundary semantics of over_heap_threshold?/2 (strict >, word/byte conversion), positive/negative/default-off GC tests, a trace-based test proving GC fires during the buffer drain, spawn-opt propagation into the live consumer process, and config-propagation into tweaks.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None remaining. The iteration-1 Important issue (synchronous GC on the publish path with no rate limit) is resolved by the hysteresis + documentation in consumer.ex:572-604 / consumer.ex:615-641.

Suggestions (Nice to Have)

  • @gc_min_interval_ms is compile-time only. The heap threshold is runtime-tunable from IEx, but the hysteresis interval is a fixed module attribute, so it can't be loosened/tightened live alongside the threshold during an incident. 1 s is a sensible default and probably fine to leave hardcoded — just flagging the asymmetry in case live tuning of the interval is ever wanted.
  • Hysteresis means heap can still grow within the 1 s window during a large drain. If a buffered drain pushes through many large-payload fragments inside a single second, only the first over-threshold fragment triggers a sweep, so the heap can climb until the interval elapses. This is the correct trade-off (protecting publish latency wins over tighter heap bounding for an opt-in backstop) and is well-documented; no change needed — noting it so the bound is understood as "≈ threshold + one second of allocations," not a hard cap.
  • Negative env threshold still only rejected at stack init. ELECTRIC_CONSUMER_GC_HEAP_THRESHOLD parses as :integer (runtime.exs:290), so a negative value passes env parsing and is only caught later by the {:or, [:non_neg_integer, nil]} NimbleOptions schema, failing stack startup. Fail-fast is acceptable; a custom parser would give a clearer boundary error. (Carried over, unchanged — not a blocker.)

Issue Conformance

PR correctly declares Closes #4476, and the linked issue is now available in context. The implementation matches the diagnosed root cause precisely: consumers inheriting fullsweep_after: 65535, rarely hibernating, accumulating unreclaimable old-heap garbage. The issue is well-specified (per-node evidence, heap/live-state ratios, proposed fix). Scope is tight and matches the stated goal — Consumer, Snapshotter, and Materializer all get the spawn-opt lever as the issue suggested; request-handler processes are intentionally and correctly left untouched. No scope creep.

Previous Review Status

Iteration 1 → 2. Resolved:

  • Important: synchronous GC on the replication critical path — now bounded by @gc_min_interval_ms hysteresis (should_force_gc?/5 + last_forced_gc_at) and documented with operator guidance.
  • Suggestion: @spec on the public IEx helpers and over_heap_threshold?/2.

Still open (both minor, by-design): per-fragment StackConfig.lookup on the hot path when enabled (deliberate, for live tunability), and negative-threshold validation deferred to stack init.


Review iteration: 2 | 2026-06-09

@alco alco self-assigned this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shape consumers accumulate multi-MB heaps of unreclaimed floating garbage (default fullsweep_after, rarely hibernate)

2 participants