
feat(network): trigger_sync on BFT SyncNeeded event (backlog #4)#178

Merged
satyakwok merged 1 commit into main from feat/auto-sync-trigger
Apr 20, 2026

Conversation

@satyakwok
Member

Summary

First step on the auto-resync track (backlog #4). Today the
NodeEvent::SyncNeeded handler in main.rs only logged the event; it
never actually requested any blocks. BFT RoundStatus gossip would
tell us a peer was at a higher height, and we would then wait up to
30s for the periodic sync_interval tick to do anything about it.
That is exactly the gap we hit on testnet today when a validator
fell behind after a restart: the chain stalled while the lagged peer
slowly trickled blocks in via the 30s tick.

Changes:

  • SwarmCommand::TriggerSync — new command; fires an immediate
    GetBlocks { from_height: our_height + 1 } to the first verified
    peer.
  • LibP2pNode::trigger_sync() — public handle.
  • main.rs — event handler now calls it. Needed an extra
    lp2p.clone() so the handle survives into the spawned event-loop
    closure.

Scope

This closes the common "a peer restarted and is a few blocks behind"
case; cutting the 30s-to-notice window keeps the chain advancing
instead of nil-skipping for half a minute. It still doesn't close
the deeper state-divergence case (rogue blocks, trie mismatches);
those need the manual chain.db rsync workaround documented in the
session handoff. True auto-resync (fallback-reconnect when peer
blocks fail add_block_from_peer validation) is the follow-up.

Test plan

  • `cargo clippy --workspace --tests -- -D warnings`: clean
  • `cargo test --workspace`: all 38 suites pass
  • CI green
  • Deploy on testnet, deliberately restart one validator, and
    confirm the `libp2p trigger_sync: requested blocks from ...
    starting at N` log appears within 1s of SyncNeeded, not 30s.


Next follow-up: if trigger_sync fires and the peer responds with
blocks that fail add_block_from_peer validation (rogue chain), log
CRITICAL and fall back to a backoff-reconnect path. That's the true
auto-resync; this PR is the first step.
@satyakwok satyakwok merged commit c5be2a8 into main Apr 20, 2026
2 checks passed
@satyakwok satyakwok deleted the feat/auto-sync-trigger branch April 20, 2026 14:39
