refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline by prestwich · Pull Request #57 · init4tech/storage

prestwich · 2026-04-17T04:10:55Z

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a purpose. The channel buffer is 4× the in-flight cap.
No handler-side operation deadline. At 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure PR fix(cold): replace read-arm TaskTracker backpressure with Semaphore #56 fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

reads_above_concurrency_cap_do_not_deadlock (carried over)
write_after_saturating_reads_makes_progress (carried over)
fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
cancel_during_reader_backpressure_shuts_down (new)
cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

cargo test -p signet-cold (conformance + 6 concurrency tests)
cargo test --workspace
cargo +nightly fmt -- --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo clippy --workspace --all-targets --no-default-features -- -D warnings
RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

Moves semaphore permit acquisition to `ColdStorageHandle` so permits travel with read requests into the channel. The task runner splits into two concurrent subtasks: - **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers, wrapping each in a per-request deadline (default 5s). - **Writer**: consumes writes sequentially. Drain-before-write uses `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select so shutdown cannot hang on a stuck reader. The semaphore is now the single backpressure mechanism. The read channel is sized to match permit count, so `try_send` from a caller holding a permit is guaranteed to have capacity. New `ColdStorageError::Timeout` is returned to callers whose handler exceeds the deadline; dropping the handler future releases its permit back to the pool, so a stuck backend call self-heals. Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend` helper and four new regression cases: fairness under saturation, cancel during reader backpressure, cancel during write drain, and operation-deadline permit release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`UnifiedStorage::append_blocks` dispatches to cold asynchronously. With the cold task's dispatcher and writer now running on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read — the tests that assumed one were relying on an implementation detail that production code already polls around (see `components/crates/node-tests/src/context.rs`). Add a `wait_for_cold_height` helper matching the production pattern and use it in the two tests that issued a read immediately after `append_blocks`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

prestwich · 2026-04-17T04:15:36Z

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

prestwich and others added 2 commits April 17, 2026 00:09

prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57

refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57
prestwich wants to merge 2 commits intoswanny/fix-cold-reader-deadlockfrom
prestwich/cold-permit-refactor

prestwich commented Apr 17, 2026

Uh oh!

prestwich commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

prestwich commented Apr 17, 2026

Summary

Design

Tests

Behavioral note

Test plan

Uh oh!

prestwich commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant