fix(phlower): cap invocation detail for high-rate tasks, prevent OOM on flush failure#23

Merged
webjunkie merged 1 commit into main from fix/detail-cap-and-flush-safety
Apr 28, 2026

@webjunkie webjunkie commented Apr 28, 2026

Detail cap for high-rate tasks: for tasks running above 500 invocations/min (DETAIL_RATE_THRESHOLD), args/kwargs/traceback are not stored for successful invocations. Failures and retries always keep full detail regardless of rate. Aggregates are unaffected.
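
The detail-cap rule can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RateTracker`, `should_store_detail`, `build_record`, the dict-shaped record, and the `"SUCCESS"`/`"FAILURE"` state strings are all hypothetical names assumed for the example. Only `DETAIL_RATE_THRESHOLD` and its default of 500/min come from the description.

```python
from collections import deque

DETAIL_RATE_THRESHOLD = 500  # invocations per minute (default named in the PR)


class RateTracker:
    """Hypothetical helper: sliding 60-second invocation count per task name."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = {}  # task name -> deque of event timestamps

    def record(self, task_name, now):
        """Register one invocation at time `now`; return the count in the window."""
        events = self._events.setdefault(task_name, deque())
        events.append(now)
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


def should_store_detail(state, rate_per_min, threshold=DETAIL_RATE_THRESHOLD):
    """Failures and retries always keep detail; successes only at or below the threshold."""
    if state != "SUCCESS":  # assumed state name
        return True
    return rate_per_min <= threshold


def build_record(task_name, state, args, kwargs, traceback, rate_per_min):
    """Always emit the core fields; attach heavy detail only when allowed."""
    record = {"task": task_name, "state": state}
    if should_store_detail(state, rate_per_min):
        record.update(args=args, kwargs=kwargs, traceback=traceback)
    return record
```

In this sketch, a successful invocation of a task running at 900/min yields a record with only the core fields, while a failure of the same task keeps `args`, `kwargs`, and `traceback`.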

Flush failure safety: on a SQLite flush error (e.g. disk full), drained records are now removed from self.invocations instead of being retained permanently. Previously this caused ~1.5 GB/hr RSS growth until OOM, because records were popped from the pending queue but never removed from self.invocations.

Context: with a 17 GB SQLite file on a 20 GB PVC, WAL spikes during the hourly purge briefly pushed the disk to 100%, which failed subsequent flushes and triggered the RSS leak. The detail cap reduces write volume for the heaviest tasks, and the flush safety prevents memory blowup when writes do fail.

…on flush failure

Two changes to prevent disk-pressure-driven OOM cycles:

1. Skip storing args/kwargs/traceback for successful invocations of tasks
   exceeding DETAIL_RATE_THRESHOLD (default 500/min). The core invocation
   row (timestamps, runtime, worker, state) is always written — only the
   heavy detail fields are omitted. Failures and retries always keep full
   detail. Reduces detail table growth proportional to the highest-rate
   tasks without losing aggregate accuracy or invocation visibility.

2. On SQLite flush failure, remove the drained records from self.invocations
   instead of retaining them. Previously, disk-full errors caused records to
   leak in memory permanently (popped from _sqlite_pending but never removed
   from invocations), driving RSS growth at ~1.5 GB/hr until OOM.
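
The second change can be illustrated with a small sketch. It is not the actual implementation: the `InvocationStore` class, its `add` method, the table schema, and the use of a list of task ids for `_sqlite_pending` are assumptions made for the example. Only the names `invocations` and `_sqlite_pending` and the drop-on-failure behaviour come from the commit message.

```python
import sqlite3


class InvocationStore:
    """Hypothetical in-memory store that periodically flushes to SQLite."""

    def __init__(self, conn):
        self.conn = conn
        self.invocations = {}       # task id -> record dict
        self._sqlite_pending = []   # task ids waiting to be flushed

    def add(self, task_id, record):
        self.invocations[task_id] = record
        self._sqlite_pending.append(task_id)

    def flush(self):
        """Write pending records; on failure, drop them from memory too."""
        # Drain the pending queue up front, as the commit message describes.
        drained, self._sqlite_pending = self._sqlite_pending, []
        rows = [
            (tid, self.invocations[tid]["state"])
            for tid in drained
            if tid in self.invocations
        ]
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.executemany(
                    "INSERT INTO invocations (task_id, state) VALUES (?, ?)",
                    rows,
                )
        except sqlite3.Error:
            # The fix: drained records are no longer tracked by the pending
            # queue, so remove them here or they stay in memory forever.
            for tid in drained:
                self.invocations.pop(tid, None)
            return False
        # How records are evicted after a successful flush is out of scope here.
        return True
```

The trade-off in this sketch is that records drained during a failed flush are lost rather than retried, which bounds memory use while the disk is full.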
@webjunkie webjunkie requested a review from Copilot April 28, 2026 05:45
@webjunkie

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Swish!


@webjunkie webjunkie merged commit b3868e5 into main Apr 28, 2026
10 checks passed
@webjunkie webjunkie deleted the fix/detail-cap-and-flush-safety branch April 28, 2026 06:46