Skip to content

refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57

Open
prestwich wants to merge 2 commits intoswanny/fix-cold-reader-deadlockfrom
prestwich/cold-permit-refactor
Open

refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57
prestwich wants to merge 2 commits intoswanny/fix-cold-reader-deadlockfrom
prestwich/cold-permit-refactor

Conversation

@prestwich
Copy link
Copy Markdown
Member

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

  1. Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
  2. Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a purpose. The channel buffer is 4× the in-flight cap.
  3. No handler-side operation deadline. At 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure PR fix(cold): replace read-arm TaskTracker backpressure with Semaphore #56 fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

  • reads_above_concurrency_cap_do_not_deadlock (carried over)
  • write_after_saturating_reads_makes_progress (carried over)
  • fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
  • cancel_during_reader_backpressure_shuts_down (new)
  • cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
  • operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

  • cargo test -p signet-cold (conformance + 6 concurrency tests)
  • cargo test --workspace
  • cargo +nightly fmt -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo clippy --workspace --all-targets --no-default-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
  • Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

prestwich and others added 2 commits April 17, 2026 00:09
Moves semaphore permit acquisition to `ColdStorageHandle` so permits
travel with read requests into the channel. The task runner splits
into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers,
  wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses
  `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select
  so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read
channel is sized to match permit count, so `try_send` from a caller
holding a permit is guaranteed to have capacity.

New `ColdStorageError::Timeout` is returned to callers whose handler
exceeds the deadline; dropping the handler future releases its permit
back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend`
helper and four new regression cases: fairness under saturation,
cancel during reader backpressure, cancel during write drain, and
operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously.
With the cold task's dispatcher and writer now running on separate
subtasks, there is no biased ordering between a fire-and-forget
write and a subsequent read — the tests that assumed one were
relying on an implementation detail that production code already
polls around (see `components/crates/node-tests/src/context.rs`).

Add a `wait_for_cold_height` helper matching the production pattern
and use it in the two tests that issued a read immediately after
`append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prestwich prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15
@prestwich
Copy link
Copy Markdown
Member Author

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant