fix(consensus): address 4 tier-1 stability issues from audit #652

Open
keanji-x wants to merge 8 commits into main from fix/audit-tier1-stability-fixes

keanji-x (Contributor) commented Apr 14, 2026

Summary

Fixes 4 production stability bugs identified in the gravity-audit:

Closes Galxe/gravity-audit#62
Closes Galxe/gravity-audit#54
Closes Galxe/gravity-audit#55
Closes Galxe/gravity-audit#65

These are the "Tier 1" issues from audit triage — bugs that can be triggered without any malicious behavior (user submitting large tx, disk hiccup, task leak, component crash).

Changes

| File | Fix |
| --- | --- |
| `quorum_store/batch_generator.rs` | Handle the `num_batch_txns == 0` case: drain the oversized tx, warn, continue |
| `pipeline/buffer_manager.rs` | Add a `Some(Err(e))` arm to the persisting_phase select |
| `pipeline/buffer_manager.rs` | Add an `Instant::now()` + 30s deadline to the reset spin loop |
| `quorum_store/network_listener.rs` | Replace `.expect()` on channel sends with an `.is_err()` check + `error!()` + `break` |
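
To make the `batch_generator.rs` fix concrete, here is a minimal, standard-library-only sketch of the idea. It is not the code from this PR: `Txn`, `pack_batches`, and the byte limit passed in are hypothetical stand-ins, and only the `num_batch_txns == 0` branch mirrors what the change describes.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a transaction; only its serialized size matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Txn {
    pub id: u64,
    pub bytes: usize,
}

/// Packs queued transactions into batches of at most `max_batch_bytes` each.
/// Returns the batches plus any transactions skipped for being oversized.
pub fn pack_batches(
    queue: &mut VecDeque<Txn>,
    max_batch_bytes: usize,
) -> (Vec<Vec<Txn>>, Vec<Txn>) {
    let mut batches = Vec::new();
    let mut skipped = Vec::new();

    while !queue.is_empty() {
        // Count how many transactions from the head of the queue fit in one batch.
        let mut used = 0usize;
        let mut num_batch_txns = 0usize;
        for txn in queue.iter() {
            if used + txn.bytes > max_batch_bytes {
                break;
            }
            used += txn.bytes;
            num_batch_txns += 1;
        }

        if num_batch_txns == 0 {
            // The head transaction alone exceeds the limit. Without this branch
            // nothing is ever drained and the loop spins forever.
            if let Some(oversized) = queue.pop_front() {
                eprintln!(
                    "warn: skipping oversized txn {} ({} bytes > limit {})",
                    oversized.id, oversized.bytes, max_batch_bytes
                );
                skipped.push(oversized);
            }
            continue;
        }

        batches.push(queue.drain(..num_batch_txns).collect());
    }

    (batches, skipped)
}
```

With a 64-byte limit and queued sizes 40, 500, 30, 30, the sketch yields two batches and one skipped transaction; without the zero-count branch the 500-byte entry would stay at the head indefinitely.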

Test plan

  • Verify existing consensus tests pass
  • Manual review of each fix for correctness
  • The pre-existing disable_lifo_slot build error in upstream gravity-aptos dependency is unrelated

🤖 Generated with Claude Code

keanji-x and others added 2 commits April 14, 2026 09:02
… changes

Sync genesis config with upstream gravity_chain_core_contracts (15 commits
behind). The Solidity contracts introduced breaking ABI changes that caused
the cluster genesis scripts and all test configs to generate incompatible
validator_genesis.json.

Breaking changes addressed:
- Rename autoEvictThreshold (uint256) → autoEvictThresholdPct (uint64):
  eviction criterion changed from absolute proposal count to success-rate
  percentage (0-100)
- Remove minimumProposalStake from StakingConfig.initialize(): field
  deprecated with storage gap for hardfork compatibility
- Add genesisTimestampSecs optional field support in aggregator

Additional fixes:
- Fix stale default callback address in aggregate_genesis.py (625F2018 →
  625F4001)
- Fix default consensusConfig from empty 0x00 to correct BCS-encoded value
- Add missing trusted_source_id to genesis.toml.example and vfn test config
- Fix bridge test using outdated callback address
- Patch genesis.sh to make genesis-tool a standalone workspace when running
  inside git worktrees

Verified: full init → genesis → deploy → start cycle passes with upstream
contracts, node produces blocks and RPC responds correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- #62: Fix infinite loop in `push_bucket_to_batches` when an oversized
  transaction (exceeding `sender_max_batch_bytes`) heads the queue.
  The oversized tx is now skipped with a warning instead of spinning forever.

- #54: Handle persist-phase errors in `buffer_manager` tokio::select!.
  Previously `Some(Err(_))` was silently consumed; now it is logged
  so operators can diagnose pipeline stalls.

- #55: Add 30-second timeout to `BufferManager::reset()` spin loop.
  Prevents permanent deadlock if a task is leaked and `ongoing_tasks`
  never reaches zero.

- #65: Replace `.expect()` on channel sends in quorum store
  `network_listener` with graceful error handling. A dropped receiver
  now logs an error and breaks the loop instead of panic-cascading
  across all quorum store components.
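
As an illustration of the #55 item above, the following is a rough sketch of a deadline-bounded spin loop. It is an assumption-laden simplification: the real `BufferManager::reset()` is presumably async, whereas this version uses a blocking `std::thread::sleep`, and `wait_for_tasks` is a hypothetical name. Only the `ongoing_tasks` counter and the deadline idea come from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Waits until `ongoing_tasks` reaches zero or `timeout` elapses.
/// Returns `true` if the counter drained, `false` if the deadline was hit.
pub fn wait_for_tasks(ongoing_tasks: &AtomicU64, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;

    while ongoing_tasks.load(Ordering::SeqCst) > 0 {
        if Instant::now() >= deadline {
            // A leaked task would otherwise keep this loop alive forever.
            eprintln!(
                "error: reset timed out with {} task(s) still outstanding",
                ongoing_tasks.load(Ordering::SeqCst)
            );
            return false;
        }
        // Back off briefly instead of burning a core while polling.
        thread::sleep(Duration::from_millis(1));
    }

    true
}
```

The commit uses a 30-second deadline; the caller in this sketch would pass `Duration::from_secs(30)` and decide what to do when `false` comes back.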

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
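
The #65 change in the commit above (swapping `.expect()` on a channel send for an error check and a `break`) can be sketched with a standard-library channel. This is only an analogy: the actual `network_listener` most likely uses async channels, and `forward_all` is a hypothetical helper introduced for illustration.

```rust
use std::sync::mpsc::Sender;

/// Forwards each message to `tx`, stopping at the first failed send.
/// Returns the number of messages that were delivered.
pub fn forward_all(messages: Vec<String>, tx: &Sender<String>) -> usize {
    let mut delivered = 0;

    for msg in messages {
        // With `.expect(...)` a dropped receiver would panic here and take
        // the calling task down with it.
        if tx.send(msg).is_err() {
            eprintln!("error: receiver dropped; stopping listener loop");
            break;
        }
        delivered += 1;
    }

    delivered
}
```

Once the receiver is gone every later send fails too, so breaking out of the loop on the first error loses nothing that could still have been delivered.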
keanji-x and others added 3 commits April 14, 2026 09:55
The error logging fix for #54 prevents silent error consumption, but the
pipeline can still stall because committed blocks are already popped from
the buffer. Add a TODO noting that a pipeline reset should be considered
for full recovery.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
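
A small sketch of the error arm discussed in this follow-up commit, written as a plain `match` rather than a `tokio::select!` so that it stays self-contained. `PersistOutcome`, `handle_persist_response`, and the `u64` round / `String` error types are all hypothetical; the point is only that `Some(Err(_))` gets its own arm and is logged rather than dropped.

```rust
/// Hypothetical classification of one response from the persisting phase.
#[derive(Debug, PartialEq)]
pub enum PersistOutcome {
    Advanced(u64),
    Failed(String),
    Closed,
}

/// Handles one polled response. `None` models the channel being closed.
pub fn handle_persist_response(resp: Option<Result<u64, String>>) -> PersistOutcome {
    match resp {
        Some(Ok(round)) => PersistOutcome::Advanced(round),
        Some(Err(e)) => {
            // Previously this case fell through silently; logging it lets
            // operators see why the pipeline stopped making progress.
            eprintln!("error: persisting phase failed: {}", e);
            // TODO (as in the commit): logging alone does not recover the
            // pipeline; a reset may be needed for full recovery.
            PersistOutcome::Failed(e)
        }
        None => PersistOutcome::Closed,
    }
}
```

Returning a distinct `Failed` value is a design choice of this sketch, not something the PR states; it simply gives a caller a place to hook in the recovery the TODO mentions.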
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>