
feat(network): trigger_sync on BFT SyncNeeded event (backlog #4)#178

Merged
satyakwok merged 1 commit into main from feat/auto-sync-trigger
Apr 20, 2026

Conversation

@satyakwok
Member

Summary

First step on the auto-resync track (backlog #4). Today the
NodeEvent::SyncNeeded handler in main.rs only logged the event; it
never actually requested any blocks. BFT RoundStatus gossip would
tell us a peer was at a higher height, and we would then wait up to
30s for the periodic sync_interval tick to do anything about it.
That is exactly the gap we hit on testnet today when a validator
fell behind after a restart: the chain stalled while the lagged peer
slowly trickled blocks in via the 30s tick.

Changes:

  • SwarmCommand::TriggerSync — new command; fires an immediate
    GetBlocks { from_height: our_height + 1 } to the first verified
    peer.
  • LibP2pNode::trigger_sync() — public handle.
  • main.rs — event handler now calls it. Needed an extra
    lp2p.clone() so the handle survives into the spawned event-loop
    closure.

Scope

This closes the common "a peer restarted and is a few blocks behind"
case; cutting the 30s-to-notice window keeps the chain advancing
instead of nil-skipping for half a minute. It still doesn't close
the deeper state-divergence case (rogue blocks, trie mismatches);
those need the manual chain.db rsync workaround documented in the
session handoff. True auto-resync (fallback-reconnect when peer
blocks fail add_block_from_peer validation) is the follow-up.

Test plan

  • `cargo clippy --workspace --tests -- -D warnings`: clean
  • `cargo test --workspace`: all 38 suites pass
  • CI green
  • Deploy on testnet, deliberately restart one validator, and
    confirm the `libp2p trigger_sync: requested blocks from ...
    starting at N` log appears within 1s of SyncNeeded, not 30s.


Next follow-up: if trigger_sync fires and the peer responds with
blocks that fail add_block_from_peer validation (rogue chain), log
CRITICAL and fall back to a backoff-reconnect path. That's the true
auto-resync; this PR is the first step.
@satyakwok satyakwok merged commit c5be2a8 into main Apr 20, 2026
2 checks passed
@satyakwok satyakwok deleted the feat/auto-sync-trigger branch April 20, 2026 14:39
