feat(network): trigger_sync on BFT SyncNeeded event (backlog #4)#178
Merged
Conversation
Today the `NodeEvent::SyncNeeded` handler only logged the event — it
didn't actually request any blocks. BFT RoundStatus gossip tells us a
peer is at a higher height, but we'd wait up to 30s for the periodic
`sync_interval` tick before actually asking for blocks. That's exactly
the gap we hit
on testnet all day today when a validator fell behind post-restart —
the others kept BFT-stalling at the lagged peer's height while sync
trickled in too slowly.
Changes:
- `SwarmCommand::TriggerSync` — new command, fires an immediate
`GetBlocks { from_height: our_height + 1 }` to the first verified
peer. Same payload as the 30s periodic tick, just on demand.
- `LibP2pNode::trigger_sync()` — public handle exposing the command.
- main.rs `NodeEvent::SyncNeeded` handler — calls it. Took an extra
`lp2p.clone()` so the handle survives into the tokio::spawn closure.
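The command-plus-handle pattern above can be sketched as follows. This is a minimal illustration, not the PR's code: only `SwarmCommand::TriggerSync` and `trigger_sync()` come from the PR; the struct shape is assumed, and a `std::sync::mpsc` channel stands in for the node's real command channel into the swarm loop.

```rust
use std::sync::mpsc::{channel, Sender};

// Hypothetical simplification of the node's command enum; the real
// swarm loop carries other variants as well.
#[derive(Debug, PartialEq)]
enum SwarmCommand {
    /// Ask the swarm loop to send `GetBlocks { from_height: our_height + 1 }`
    /// to the first verified peer — same payload as the periodic sync tick.
    TriggerSync,
}

struct LibP2pNode {
    cmd_tx: Sender<SwarmCommand>,
}

impl LibP2pNode {
    /// Public handle: enqueue an immediate sync request on demand.
    fn trigger_sync(&self) {
        // Ignore send errors: if the swarm loop is gone, the node is shutting down.
        let _ = self.cmd_tx.send(SwarmCommand::TriggerSync);
    }
}

fn main() {
    let (tx, rx) = channel();
    let node = LibP2pNode { cmd_tx: tx };
    node.trigger_sync();
    // The swarm loop would receive the command and issue GetBlocks here.
    assert_eq!(rx.recv().unwrap(), SwarmCommand::TriggerSync);
}
```

The point of routing through a command enum rather than calling into the swarm directly is that the swarm loop stays the single owner of network state; callers only ever enqueue.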
Doesn't fix the deeper state-divergence case (rogue blocks, trie
mismatches) — that still needs the rsync workaround documented in
the session handoff. But for the common 'a peer restarted and is a
few blocks behind' case, closing the 30s-to-notice window keeps the
chain advancing instead of nil-skipping for half a minute.
Next follow-up: if trigger_sync fires and the peer responds with
blocks that fail add_block_from_peer validation (rogue chain), log
CRITICAL and fall back to a backoff-reconnect path. That's the true
auto-resync; this PR is the first step.
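The planned fallback is not in this PR; as a rough sketch under stated assumptions, the follow-up might look like the loop below. `BlockResult`, `validate`, and `sync_with_backoff` are all hypothetical names standing in for the real `add_block_from_peer` path.

```rust
use std::time::Duration;

/// Hypothetical outcome of add_block_from_peer-style validation.
enum BlockResult {
    Accepted,
    Rejected,
}

// Stub standing in for real peer-block validation; always fails here
// to exercise the backoff path.
fn validate(_height: u64) -> BlockResult {
    BlockResult::Rejected
}

/// Sketch of the planned fallback: on repeated validation failures
/// (rogue chain), log CRITICAL and back off before reconnecting.
fn sync_with_backoff(max_attempts: u32) -> bool {
    let mut delay = Duration::from_millis(1);
    for attempt in 1..=max_attempts {
        match validate(attempt as u64) {
            BlockResult::Accepted => return true,
            BlockResult::Rejected => {
                eprintln!(
                    "CRITICAL: peer block failed validation (attempt {attempt}); \
                     backing off {delay:?} before reconnect"
                );
                std::thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
    false // give up: the manual chain.db rsync workaround still applies
}

fn main() {
    assert!(!sync_with_backoff(3));
}
```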
Summary
First step on the auto-resync track (backlog #4). Today the
`NodeEvent::SyncNeeded` handler in main.rs only logged the event — it
didn't actually request any blocks. BFT RoundStatus gossip would tell
us a peer was at a higher height, and we'd wait up to 30s for the
periodic `sync_interval` tick to do anything about it. Exactly the gap
we hit on testnet all day today when a validator fell behind
post-restart: chain stalled while the lagged peer slowly trickled
blocks in via the 30s tick.
Changes:
- `SwarmCommand::TriggerSync` — new command; fires an immediate
  `GetBlocks { from_height: our_height + 1 }` to the first verified peer.
- `LibP2pNode::trigger_sync()` — public handle.
- main.rs `NodeEvent::SyncNeeded` handler calls it; an extra
  `lp2p.clone()` so the handle survives into the spawned event-loop
  closure.
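The handler-side wiring can be sketched like this. Only `NodeEvent::SyncNeeded`, `trigger_sync()`, and the `lp2p.clone()` move come from the PR; the event loop here is an illustrative stand-in (a plain thread and `std::sync::mpsc` instead of the real `tokio::spawn` task).

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread;

#[derive(Debug)]
enum NodeEvent {
    SyncNeeded { peer_height: u64 },
}

#[derive(Debug, PartialEq)]
enum SwarmCommand {
    TriggerSync,
}

struct LibP2pNode {
    cmd_tx: Sender<SwarmCommand>,
}

impl LibP2pNode {
    fn trigger_sync(&self) {
        let _ = self.cmd_tx.send(SwarmCommand::TriggerSync);
    }
}

fn main() {
    let (cmd_tx, cmd_rx) = channel();
    let (evt_tx, evt_rx) = channel();
    let lp2p = Arc::new(LibP2pNode { cmd_tx });

    // The extra clone mirrors the PR's `lp2p.clone()`: the handle must be
    // moved into the event-loop task (tokio::spawn in the real code, a
    // thread here) while main keeps its own copy.
    let lp2p_for_loop = Arc::clone(&lp2p);
    let handle = thread::spawn(move || {
        for evt in evt_rx {
            match evt {
                NodeEvent::SyncNeeded { peer_height } => {
                    eprintln!("peer ahead at height {peer_height}; requesting blocks now");
                    lp2p_for_loop.trigger_sync();
                }
            }
        }
    });

    evt_tx.send(NodeEvent::SyncNeeded { peer_height: 42 }).unwrap();
    drop(evt_tx); // close the channel so the event loop exits
    handle.join().unwrap();
    assert_eq!(cmd_rx.recv().unwrap(), SwarmCommand::TriggerSync);
}
```

This is why the clone is needed at all: the spawned closure takes ownership of everything it captures, so without a second handle the main task would lose its reference to the node.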
Scope
This closes the "peer restarted and is a few blocks behind" case.
It still doesn't close the deeper state-divergence case (rogue
blocks, trie mismatches) — those need the manual chain.db rsync
workaround documented in the session handoff. True auto-resync
(fallback-reconnect when peer blocks fail `add_block_from_peer`
validation) is the follow-up.
Test plan
- `cargo clippy --workspace --tests -- -D warnings` — clean.
- `cargo test --workspace` — 38 suites pass.
- `libp2p trigger_sync: requested blocks from ... starting at N` log
  within 1s of SyncNeeded, not 30s.