Add validator sync-lag duty gate #689

@tcoratger

Validators on lagging nodes keep signing duties against stale heads. Their attestations land in fork choice as weight on the wrong subtree, pulling the network away from the canonical head. This issue adds a sync-lag gate on validator duties: skip attestation and proposal when the local head trails wall clock by more than SYNC_LAG_THRESHOLD slots.

Companion to #688 (BlocksByRange), which addresses catch-up speed itself. The two are independent and can land in either order; together they address the devnet stalls.

Problem

src/lean_spec/subspecs/validator/service.py has no sync-lag check. A node 800 slots behind continues to attest and propose against its stale head.

Consequences:

  • Stale-head attestations deposit LMD-GHOST weight on the wrong subtree, slowing convergence on the canonical head.
  • This bites lean harder than beacon. Faster finality means fork choice has less time to absorb noise.
  • Lagging validators waste their own attestations and accrue inclusion-distance / wrong-head penalties.

A node can be SyncState.SYNCED per the state machine but still many slots behind wall clock during a brief network hiccup, validator restart, or partition. The right signal is wall-clock lag against the local head, not the binary sync-state flag.
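As a rough illustration of the signal described above, the lag can be derived from wall-clock time and the head slot alone. This is a minimal sketch: `GENESIS_TIME`, the value of `SECONDS_PER_SLOT`, and both helper names are assumptions made for the example, not values or functions taken from the spec.

```python
# Hypothetical constants for illustration only; the real spec values may differ.
SECONDS_PER_SLOT = 4
GENESIS_TIME = 1_700_000_000


def wall_clock_slot(now: float, genesis_time: int = GENESIS_TIME,
                    seconds_per_slot: int = SECONDS_PER_SLOT) -> int:
    """Slot the wall clock says the network is in; 0 before genesis."""
    if now < genesis_time:
        return 0
    return int(now - genesis_time) // seconds_per_slot


def head_lag(now: float, head_slot: int) -> int:
    """Slots by which the local head trails wall clock (never negative)."""
    return max(0, wall_clock_slot(now) - head_slot)
```

A node whose sync state machine reports synced can still show a large `head_lag` after a restart or partition, which is the case the binary flag misses.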

Proposed gate

Skip both attestation and proposal duties when local head lags wall clock by more than SYNC_LAG_THRESHOLD = 4 slots.

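# src/lean_spec/subspecs/validator/service.py
from typing import Final

# Store and Slot are the spec's existing types; their imports are omitted here.

# Maximum number of slots the head may trail wall clock before duties are skipped.
SYNC_LAG_THRESHOLD: Final[int] = 4

def is_synced_for_duties(store: Store, wall_clock_slot: Slot) -> bool:
    """Return False if the node is too far behind to safely sign duties."""
    head_slot = store.blocks[store.head].slot
    # Head at or ahead of wall clock (clock skew): trust the chain, do not gate.
    if wall_clock_slot <= head_slot:
        return True
    # Head behind wall clock: allow duties only within the threshold (inclusive).
    return (wall_clock_slot - head_slot) <= SYNC_LAG_THRESHOLD

  • Applies to attestation and proposal. Both pollute fork choice or chain history when produced from a stale view.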
  • Decision uses store.head, not justified or finalized. The gate is about whether the validator's view of current head is fresh.
  • wall_clock_slot < head_slot (clock skew) does NOT gate. Trust the chain over the wall clock in that direction.
  • Threshold of 4 slots is a starting value. Generous enough to absorb normal gossip jitter, tight enough to silence validators on materially-stale nodes.
  • On skip, emit a structured log distinguishing "skipped, unsynced" from "no duty this slot" so operators can attribute missed duties correctly.

Implementation checklist

Stage 1 — Helper and constant

  • Add SYNC_LAG_THRESHOLD: Final[int] = 4 to validator/config.py (create if missing) or validator/registry.py.
  • Add is_synced_for_duties(store, wall_clock_slot) helper in validator/service.py.

Stage 2 — Gate the duties

  • Gate attestation duty entry point with early return on not is_synced_for_duties(...).
  • Gate proposal duty entry point with the same check.
  • Structured log on each skip: include head_slot, wall_clock_slot, lag.
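The Stage 2 items could look roughly like the sketch below. It uses plain integers for slots and a generic `sign` callback in place of the real duty entry points. The wrapper name `run_duty_if_synced`, the logger name, and the `reason` field are all hypothetical, chosen only to show the early return and the three required log fields.

```python
import logging
from typing import Callable, Final, Optional

logger = logging.getLogger("validator.duties")  # hypothetical logger name

SYNC_LAG_THRESHOLD: Final[int] = 4


def run_duty_if_synced(
    duty: str,
    head_slot: int,
    wall_clock_slot: int,
    sign: Callable[[], bytes],
) -> Optional[bytes]:
    """Run `sign` only if the head is fresh enough; otherwise log and skip."""
    # Clock skew (head ahead of wall clock) counts as zero lag and never gates.
    lag = max(0, wall_clock_slot - head_slot)
    if lag > SYNC_LAG_THRESHOLD:
        # Structured fields let operators tell "skipped, unsynced" apart
        # from "no duty this slot".
        logger.warning(
            "duty skipped: node unsynced",
            extra={
                "duty": duty,
                "reason": "skipped_unsynced",
                "head_slot": head_slot,
                "wall_clock_slot": wall_clock_slot,
                "lag": lag,
            },
        )
        return None
    return sign()
```

Both the attestation and the proposal entry points would call the same check before any signing happens, so a gated slot never produces a signature.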

Stage 3 — Operator visibility

  • Counter for skipped-due-to-lag attestations and proposals (separate from "no duty this slot").
  • Surface the counter in the existing observability subspec.
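One possible shape for the Stage 3 counter is sketched below. `DutyOutcome` and `DutyMetrics` are hypothetical names; how the counts are exported through the observability subspec is left open, since that interface is not described here.

```python
from collections import Counter
from enum import Enum


class DutyOutcome(str, Enum):
    """Why a duty slot ended the way it did (hypothetical labels)."""
    SIGNED = "signed"
    NO_DUTY = "no_duty"
    SKIPPED_SYNC_LAG = "skipped_sync_lag"


class DutyMetrics:
    """In-memory counts keyed by (duty, outcome)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, duty: str, outcome: DutyOutcome) -> None:
        self._counts[(duty, outcome)] += 1

    def count(self, duty: str, outcome: DutyOutcome) -> int:
        # Counter returns 0 for keys that were never recorded.
        return self._counts[(duty, outcome)]
```

Keeping `SKIPPED_SYNC_LAG` separate from `NO_DUTY` is what lets an operator attribute a missed attestation to lag rather than to the duty schedule.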

Test plan

tests/lean_spec/subspecs/validator/test_service.py

  • is_synced_for_duties returns True when wall_clock_slot - head_slot <= 4
  • is_synced_for_duties returns False when wall_clock_slot - head_slot > 4
  • is_synced_for_duties returns True when wall_clock_slot < head_slot (clock skew edge case)
  • is_synced_for_duties boundary: lag == 4 is allowed; lag == 5 is gated
  • Attestation entry point early-returns without signing when gate is False
  • Proposal entry point early-returns without signing when gate is False
  • Gate clears in the slot after the head advances to within the threshold; cover transitions in both directions (gated to ungated and back)
  • Skip emits the structured log with head_slot, wall_clock_slot, lag
  • Skip increments the lag-skip counter; "no duty" path does not
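A few of the helper-level cases above might be written as follows. This is a sketch only: `FakeStore` and `FakeBlock` are stand-ins invented for the example (the real tests would use the spec's `Store` and `Slot`), and the helper is repeated so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Final

SYNC_LAG_THRESHOLD: Final[int] = 4


@dataclass
class FakeBlock:
    slot: int


@dataclass
class FakeStore:
    """Stand-in exposing only the two attributes the helper reads."""
    head: str
    blocks: dict[str, FakeBlock] = field(default_factory=dict)


def is_synced_for_duties(store: FakeStore, wall_clock_slot: int) -> bool:
    head_slot = store.blocks[store.head].slot
    if wall_clock_slot <= head_slot:
        return True
    return (wall_clock_slot - head_slot) <= SYNC_LAG_THRESHOLD


def store_at(head_slot: int) -> FakeStore:
    return FakeStore(head="h", blocks={"h": FakeBlock(slot=head_slot)})


def test_boundary_lag_four_allowed_five_gated() -> None:
    assert is_synced_for_duties(store_at(100), 104)
    assert not is_synced_for_duties(store_at(100), 105)


def test_clock_skew_does_not_gate() -> None:
    assert is_synced_for_duties(store_at(100), 90)


def test_gate_transitions_both_directions() -> None:
    assert not is_synced_for_duties(store_at(100), 110)  # lag 10: gated
    assert is_synced_for_duties(store_at(107), 110)      # lag 3: cleared
    assert not is_synced_for_duties(store_at(107), 120)  # lag 13: gated again
```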

tests/consensus/ (spec fixtures)

  • fork_choice_test: validator under gate produces no attestation; canonical head unchanged by its absence
  • fork_choice_test: same scenario with gate disabled would have produced a stale-head attestation (negative control documenting the bug this fixes)

Out of scope

  • No fork-choice-level filter dropping attestations from clearly-unsynced peers. Defense-in-depth follow-up; track separately if needed.
  • No change to SyncState semantics. The gate is independent of the state machine.
  • No changes to slashing detection or doppelganger protection.
  • Catch-up speed is tracked in #688 (Add BlocksByRange req/resp protocol).

Open questions

  • Threshold of 4: stay flat, or scale with SECONDS_PER_SLOT and finality cadence?
  • Should the gate also apply to aggregate-attestation duty (aggregator selection) and sync-committee-style duties if added later?
  • Should there be a hysteresis band (e.g., gate at lag > 4 but only un-gate at lag <= 2) to avoid flapping near the threshold?
  • Do we surface the gate state via the validator API for external monitoring, or is the log + counter sufficient?
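To make the hysteresis question concrete, the band could be expressed as a small stateful gate like the one below. This is an exploratory sketch, not part of the proposal: the class name is hypothetical, and the 4 and 2 bounds are simply the example numbers from the question.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    """Gate at lag > gate_above; un-gate only once lag <= ungate_at_or_below."""
    gate_above: int = 4
    ungate_at_or_below: int = 2
    gated: bool = False

    def update(self, head_slot: int, wall_clock_slot: int) -> bool:
        """Update state for this slot and return True if duties may run."""
        lag = max(0, wall_clock_slot - head_slot)
        if self.gated:
            if lag <= self.ungate_at_or_below:
                self.gated = False
        elif lag > self.gate_above:
            self.gated = True
        return not self.gated
```

Unlike the stateless helper, a validator hovering at lag 3 to 5 stays gated here until it has clearly caught up, at the cost of carrying state between slots.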

References

  • Validator service: src/lean_spec/subspecs/validator/service.py.
  • Sync state machine (independent of this gate): src/lean_spec/subspecs/sync/states.py.
  • Companion proposal: Add BlocksByRange req/resp protocol #688.
  • Beacon prior art: client-side is_syncing checks before attesting (Lighthouse, Prysm, Teku) — informal, not in the consensus spec itself.
