refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline#57
Open
prestwich wants to merge 2 commits intoswanny/fix-cold-reader-deadlockfrom
Open
Conversation
Moves semaphore permit acquisition to `ColdStorageHandle` so permits travel with read requests into the channel. The task runner splits into two concurrent subtasks: - **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers, wrapping each in a per-request deadline (default 5s). - **Writer**: consumes writes sequentially. Drain-before-write uses `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select so shutdown cannot hang on a stuck reader. The semaphore is now the single backpressure mechanism. The read channel is sized to match permit count, so `try_send` from a caller holding a permit is guaranteed to have capacity. New `ColdStorageError::Timeout` is returned to callers whose handler exceeds the deadline; dropping the handler future releases its permit back to the pool, so a stuck backend call self-heals. Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend` helper and four new regression cases: fairness under saturation, cancel during reader backpressure, cancel during write drain, and operation-deadline permit release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously. With the cold task's dispatcher and writer now running on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read — the tests that assumed one were relying on an implementation detail that production code already polls around (see `components/crates/node-tests/src/context.rs`). Add a `wait_for_cold_height` helper matching the production pattern and use it in the two tests that issued a read immediately after `append_blocks`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
Author
|
[Claude Code] @rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands. @Fraser999 @Evalir requesting review. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:
acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.Design
See
docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.mdon the design discussion thread (PR #56 comment).Permit-attached messages.
ColdStorageHandleacquires a semaphore permit before sending; the permit travels inPermittedReadRequestand is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count sotry_sendon the handle side is infallible (modulo shutdown).Split task runner.
run_dispatcherpullsPermittedReadRequests and spawns handlers.run_writerconsumes writes sequentially, drains viaacquire_many_owned(64)wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.Per-request deadline.
ColdStorageTask::with_read_deadline(Duration)(default 5s) wraps each non-stream handler intokio::time::timeout. On expiry the caller receivesColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.Tests
crates/cold/tests/concurrency.rsexpanded with aGatedBackendhelper that blocks every read call on a test-controlled semaphore:reads_above_concurrency_cap_do_not_deadlock(carried over)write_after_saturating_reads_makes_progress(carried over)fairness_write_serves_before_later_readers(new) — verifies tokio FIFO fairness keeps the writer ahead of later readerscancel_during_reader_backpressure_shuts_down(new)cancel_during_write_drain_shuts_down(new) — would fail without the cancel-select on the writer's drainoperation_deadline_releases_permit(new) — verifies Timeout is returned and the permit rejoins the poolBehavioral note
UnifiedStorage::append_blocksdispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code atcomponents/crates/node-tests/src/context.rs:380-393already polls for cold to catch up; two in-repo unit tests (append_and_read_back,drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.Test plan
cargo test -p signet-cold(conformance + 6 concurrency tests)cargo test --workspacecargo +nightly fmt -- --checkcargo clippy --workspace --all-targets --all-features -- -D warningscargo clippy --workspace --all-targets --no-default-features -- -D warningsRUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-depssignet-cold+signet-storage, bumpinit4tech/node-components, rebuildsignet-sidecar:latest, redeploy to devmainnet, confirm no backpressure-induced crashes over a full day.🤖 Generated with Claude Code