fix(bft): proposer rebroadcasts proposal up to 3× at 3s intervals (#1d) by satyakwok · Pull Request #175 · sentrix-labs/sentrix

satyakwok · 2026-04-20T12:53:56Z

Summary

Closes the real root cause of the 2026-04-20 testnet livelock. PR
#174's 10→20s propose_timeout bump alone didn't help — tally logs
kept firing every ~25 seconds.

Cross-validator trace at a stuck height:

T+0.0s  proposer broadcasts Proposal
T+0.0s  2 peers in verified_peers receive it → prevote block
T+12s   prevote_timeout fires on those 2 → nil-precommit
        (needed 3-of-4 supermajority; only 2 votes for block)
T+21s   third peer finally hits propose_timeout, nil-prevotes
        too late — other 2 already in precommit phase
→ round skipped, repeat forever
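
The quorum arithmetic behind the trace can be sketched as follows. This is a minimal illustration that assumes four equal-weight validators and the usual "more than two thirds" BFT rule; the function names are hypothetical, and the real tally may be stake-weighted rather than a simple vote count.

```rust
/// Smallest number of equal-weight votes that is strictly more than 2/3
/// of the validator set. Assumption: one validator, one vote.
fn supermajority_threshold(validators: u32) -> u32 {
    validators * 2 / 3 + 1
}

/// True when `votes` for the same value reach the supermajority threshold.
fn has_supermajority(votes: u32, validators: u32) -> bool {
    votes >= supermajority_threshold(validators)
}
```

With 4 validators the threshold is 3, so the 2 block prevotes in the trace cannot carry the round on their own.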

Longer propose_timeout widened the gap between fast and slow peers
instead of closing it. Proper fix: proposer retries the broadcast
every 3s up to 3 times while still in Propose phase, so peers that
just reconnected (and weren't in verified_peers at initial send)
still receive the proposal before their own propose-phase expires.

3s × 3 retries = up to 9s of retry window inside the 20s propose
phase. Plenty of headroom for reconnect stragglers, without spamming
the network.
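
As a rough check of that arithmetic, the sketch below lists the earliest offsets at which each retry can fire and confirms the last one lands inside the propose phase. The constant and function names are hypothetical; only the values (3s, 3 retries, 20s) come from this PR. Because retries are driven by the tick loop, real offsets may be slightly later than these lower bounds.

```rust
use std::time::Duration;

// Values from the PR description; the names are illustrative only.
const REBROADCAST_INTERVAL: Duration = Duration::from_secs(3);
const MAX_REBROADCASTS: u32 = 3;
const PROPOSE_TIMEOUT: Duration = Duration::from_secs(20);

/// Earliest offsets, measured from the initial broadcast, at which each
/// retry can fire.
fn rebroadcast_offsets(interval: Duration, max_retries: u32) -> Vec<Duration> {
    (1..=max_retries).map(|i| interval * i).collect()
}

/// True when the last retry can still fire before the propose phase expires.
fn retries_fit(interval: Duration, max_retries: u32, propose_timeout: Duration) -> bool {
    interval * max_retries < propose_timeout
}
```

The offsets come out at 3s, 6s and 9s, which leaves 11s of the 20s propose phase unused by retries.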

Implementation

- validator loop tracks `proposal_broadcast_at` + `proposal_rebroadcast_count`
- tick loop checks: if we are proposer AND phase == Propose AND we
  have a `proposed_block` AND last broadcast ≥ 3s ago AND count < 3
  → re-sign + re-broadcast, bump count.
- reset both on new height; reset both at each late-round propose
  site (skip/timeout re-proposals).
- emits `BFT #1d: rebroadcast proposal...` per retry so the fix is
  observable in journalctl.
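
The gating condition above could look roughly like the sketch below. It is not the code from this PR: the `Phase` enum, the `RebroadcastState` struct and its method names are assumptions made for illustration, and only the two field names and the 3s / 3-retry limits come from the description. Signing and network sending are left out.

```rust
use std::time::{Duration, Instant};

const REBROADCAST_INTERVAL: Duration = Duration::from_secs(3);
const MAX_REBROADCASTS: u32 = 3;

// Hypothetical phase enum; the real consensus state has its own type.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Phase {
    Propose,
    Prevote,
    Precommit,
}

/// The two fields the validator loop tracks, per the PR description.
#[derive(Debug, Default)]
struct RebroadcastState {
    proposal_broadcast_at: Option<Instant>,
    proposal_rebroadcast_count: u32,
}

impl RebroadcastState {
    /// Called at every propose site: records the send time, zeroes the count.
    fn on_propose(&mut self, now: Instant) {
        self.proposal_broadcast_at = Some(now);
        self.proposal_rebroadcast_count = 0;
    }

    /// Called on a new height: forget the previous proposal entirely.
    fn reset(&mut self) {
        *self = Self::default();
    }

    /// Tick-loop check: every condition must hold for a retry to fire.
    fn should_rebroadcast(
        &self,
        is_proposer: bool,
        phase: Phase,
        has_proposed_block: bool,
        now: Instant,
    ) -> bool {
        let due = match self.proposal_broadcast_at {
            Some(at) => now.saturating_duration_since(at) >= REBROADCAST_INTERVAL,
            None => false,
        };
        is_proposer
            && phase == Phase::Propose
            && has_proposed_block
            && due
            && self.proposal_rebroadcast_count < MAX_REBROADCASTS
    }

    /// Called after a retry is sent: restart the timer, bump the count.
    fn on_rebroadcast(&mut self, now: Instant) {
        self.proposal_broadcast_at = Some(now);
        self.proposal_rebroadcast_count += 1;
    }
}
```

Keeping the check as a pure function of the tracked state and the current time makes the retry cap and the phase gate easy to unit-test without a running network.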

Test plan

- `cargo clippy --workspace --tests -- -D warnings` clean
- `cargo test --workspace`: all 38 suites pass
- CI green
- Deploy to testnet; bake for 1 hour; check fire rate of
  `BFT #1d:` logs (should drop sharply) and look for
  `BFT #1d: rebroadcast proposal` lines (proves retry is
  actually firing).

Related

- Diagnosis: #171 (debug(bft): log block_hash + nil-skip tally for #1d investigation) added the tally logging that exposed the shape.
- Partial-fix attempt: #174 (fix(bft): bump propose timeout 10→20s to close #1d livelock window) bumped the timeout; this PR completes
  what that one couldn't.
- Follow-up (not in this PR): peer-subscribe on newly-verified
  peers so the proposer rebroadcasts the moment a peer joins
  mid-round, instead of waiting for the retry timer.

v2.1.3's PROPOSE_TIMEOUT bump 10s → 20s didn't close #1d: the testnet tally logs (`BFT #1d: ... tally=[nil=4.5B]`) kept firing at ~72 times per 30 min. Cross-validator trace at the stuck heights showed the real shape:

T+0.0s  proposer broadcasts Proposal
T+0.0s  2 peers in verified_peers receive it, prevote block
T+12s   prevote_timeout fires on those 2 peers → nil-precommit
        (supermajority-for-block needed 3, only 2 arrived)
T+21s   third peer finally hits propose_timeout, nil-prevotes
        way too late; other 2 are already in precommit phase
→ round skipped, repeat

Longer propose_timeout widened the gap between fast and slow peers instead of closing it.

Real fix: proposer retries the broadcast every few seconds while still in Propose phase, so peers that just reconnected (and weren't in `verified_peers` at the initial send) can still receive the proposal before their own propose-phase expires.

Implementation:

- validator loop tracks `proposal_broadcast_at` + `rebroadcast_count`
- every tick, if we are proposer AND phase == Propose AND have a proposed block AND ≥ 3s since last broadcast AND count < 3: reconstruct Proposal from stored block, re-sign, re-broadcast, bump count.
- reset both on new height; reset both at each propose site (late-round skip/timeout re-proposals) via the ones that set `proposed_block = Some(block)`.
- emits an `info!("BFT #1d: rebroadcast proposal...")` per retry so the fix is observable in journalctl.

3s / 3 tries gives a ceiling of 9s of retries inside the 20s propose window: plenty of room for a peer that takes a few seconds to reconnect, without spamming.

Root-cause sequel of #171 (logging) + #174 (timeout bump). Left further hardening as a #1d follow-up: peer-subscribe on newly-verified peers so the proposer gets notified the moment a new peer joins mid-round, instead of waiting for the retry timer.

satyakwok merged commit 51633c5 into main Apr 20, 2026
2 checks passed

satyakwok deleted the fix/bft-1d-proposer-rebroadcast branch April 20, 2026 13:01

satyakwok mentioned this pull request Apr 20, 2026

fix(bft): extend #1d rebroadcast 3×3s → 7×2s + cover Prevote — v2.1.4 #176

Merged

6 tasks
