fix(consensus): address 4 tier-1 stability issues from audit#652
Open
… changes

Sync genesis config with upstream gravity_chain_core_contracts (15 commits behind). The Solidity contracts introduced breaking ABI changes that caused the cluster genesis scripts and all test configs to generate incompatible validator_genesis.json.

Breaking changes addressed:
- Rename autoEvictThreshold (uint256) → autoEvictThresholdPct (uint64): eviction criterion changed from absolute proposal count to success-rate percentage (0-100)
- Remove minimumProposalStake from StakingConfig.initialize(): field deprecated with storage gap for hardfork compatibility
- Add genesisTimestampSecs optional field support in aggregator

Additional fixes:
- Fix stale default callback address in aggregate_genesis.py (625F2018 → 625F4001)
- Fix default consensusConfig from empty 0x00 to correct BCS-encoded value
- Add missing trusted_source_id to genesis.toml.example and vfn test config
- Fix bridge test using outdated callback address
- Patch genesis.sh to make genesis-tool a standalone workspace when running inside git worktrees

Verified: full init → genesis → deploy → start cycle passes with upstream contracts, node produces blocks and RPC responds correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- #62: Fix infinite loop in `push_bucket_to_batches` when an oversized transaction (exceeding `sender_max_batch_bytes`) heads the queue. The oversized tx is now skipped with a warning instead of spinning forever.
- #54: Handle persist-phase errors in `buffer_manager` tokio::select!. Previously `Some(Err(_))` was silently consumed; now it is logged so operators can diagnose pipeline stalls.
- #55: Add 30-second timeout to `BufferManager::reset()` spin loop. Prevents permanent deadlock if a task is leaked and `ongoing_tasks` never reaches zero.
- #65: Replace `.expect()` on channel sends in quorum store `network_listener` with graceful error handling. A dropped receiver now logs an error and breaks the loop instead of panic-cascading across all quorum store components.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
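For reviewers who want the shape of the #62 fix without opening the diff, here is a minimal sketch, not the actual `push_bucket_to_batches` code. `Txn`, `pack_batches`, and `max_batch_bytes` are hypothetical names standing in for the real types and for `sender_max_batch_bytes`, and `eprintln!` stands in for the project's logging macro. The point is the empty-batch branch: when nothing fits, the head of the queue must be the oversized item, so it is dropped and the loop moves on.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a transaction; only its serialized size matters here.
#[derive(Debug, Clone, PartialEq)]
struct Txn {
    id: u64,
    bytes: usize,
}

/// Greedily packs `queue` into batches of at most `max_batch_bytes` each.
/// Returns the batches plus the ids of transactions skipped as oversized.
fn pack_batches(mut queue: VecDeque<Txn>, max_batch_bytes: usize) -> (Vec<Vec<Txn>>, Vec<u64>) {
    let mut batches = Vec::new();
    let mut skipped = Vec::new();

    while !queue.is_empty() {
        let mut batch = Vec::new();
        let mut used = 0usize;

        // Take the longest prefix of the queue that fits in the byte budget.
        while let Some(size) = queue.front().map(|t| t.bytes) {
            if used + size > max_batch_bytes {
                break;
            }
            used += size;
            batch.push(queue.pop_front().unwrap());
        }

        if batch.is_empty() {
            // Nothing fit, so the head alone exceeds the limit. Without this
            // branch the outer loop would never make progress.
            let oversized = queue.pop_front().unwrap();
            eprintln!(
                "warn: skipping oversized txn {} ({} bytes > {} byte limit)",
                oversized.id, oversized.bytes, max_batch_bytes
            );
            skipped.push(oversized.id);
            continue;
        }

        batches.push(batch);
    }

    (batches, skipped)
}
```

Every iteration of the outer loop now removes at least one transaction from the queue, which is what guarantees termination.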
This was referenced Apr 14, 2026
The error logging fix for #54 prevents silent error consumption, but the pipeline can still stall because committed blocks are already popped from the buffer. Add a TODO noting that a pipeline reset should be considered for full recovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
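To make the #54 change concrete, the following is a rough sketch under stated assumptions rather than the real `buffer_manager` code. A plain `match` stands in for the `tokio::select!` arm, the response is assumed to be an `Option<Result<u64, String>>` carrying a committed round, and `PersistOutcome` and `handle_persist_response` are hypothetical names. It shows the error case getting its own arm instead of being swallowed.

```rust
/// Hypothetical summary of what the caller should do after a persist response.
#[derive(Debug, PartialEq)]
enum PersistOutcome {
    Committed(u64),
    Failed(String),
    ChannelClosed,
}

/// Handles one response from the (assumed) persisting phase.
/// `highest_committed_round` is only advanced on success.
fn handle_persist_response(
    resp: Option<Result<u64, String>>,
    highest_committed_round: &mut u64,
) -> PersistOutcome {
    match resp {
        Some(Ok(round)) => {
            *highest_committed_round = (*highest_committed_round).max(round);
            PersistOutcome::Committed(round)
        }
        // The arm that was effectively missing: surface the error so a
        // stalled pipeline can be diagnosed from the logs.
        Some(Err(e)) => {
            eprintln!("error: persisting phase returned error: {}", e);
            // As the TODO in the commit notes, logging alone does not
            // recover the pipeline; a reset may still be needed.
            PersistOutcome::Failed(e)
        }
        None => PersistOutcome::ChannelClosed,
    }
}
```

Returning an explicit outcome is a choice made for this sketch; it gives a caller somewhere to hang the pipeline reset that the TODO describes.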
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ByteYue
approved these changes
Apr 15, 2026
nekomoto911
approved these changes
Apr 15, 2026
nekomoto911
approved these changes
Apr 15, 2026
nekomoto911
approved these changes
Apr 17, 2026
Summary
Fixes 4 production stability bugs identified in the gravity-audit:
- Infinite loop in `push_bucket_to_batches` when oversized tx heads the queue → skip with warning
- `tokio::select!` silently drops persist-phase errors → log the error
- `BufferManager::reset()` spin loop has no timeout → 30s deadline to prevent deadlock
- `.expect()` on channel sends cascades panic across quorum store → graceful error + break

Closes Galxe/gravity-audit#62
Closes Galxe/gravity-audit#54
Closes Galxe/gravity-audit#55
Closes Galxe/gravity-audit#65
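As an illustration of the `.expect()` replacement tracked in gravity-audit#65, here is a small sketch that uses `std::sync::mpsc` instead of the channel type in `network_listener` (an assumption made so the example needs no external crates). `forward_all` is a hypothetical name, and `eprintln!` stands in for `error!()`.

```rust
use std::sync::mpsc::Sender;

/// Forwards messages to `tx` until the input is exhausted or the receiver
/// has been dropped. Returns how many messages were delivered.
fn forward_all(incoming: impl IntoIterator<Item = String>, tx: &Sender<String>) -> usize {
    let mut forwarded = 0;
    for msg in incoming {
        // `tx.send(msg).expect(..)` would panic here if the receiver is gone.
        // Checking the result lets this loop shut down on its own instead.
        if tx.send(msg).is_err() {
            eprintln!("error: receiver dropped, stopping listener loop");
            break;
        }
        forwarded += 1;
    }
    forwarded
}
```

A failed send means the receiving component has already gone away, so exiting the loop quietly avoids turning one crash into several.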
These are the "Tier 1" issues from audit triage — bugs that can be triggered without any malicious behavior (user submitting large tx, disk hiccup, task leak, component crash).
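The task-leak case can be sketched as a bounded wait. This is an illustrative approximation of the `BufferManager::reset()` change, not the real code: `wait_for_tasks` is a hypothetical name, `ongoing_tasks` is assumed to be an atomic counter, and `thread::sleep` stands in for whatever yielding the async code does between polls.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `ongoing_tasks` until it reaches zero or `timeout` elapses.
/// Returns `true` if all tasks drained, `false` if the deadline was hit.
fn wait_for_tasks(ongoing_tasks: &AtomicU64, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while ongoing_tasks.load(Ordering::SeqCst) > 0 {
        if Instant::now() >= deadline {
            // A leaked task would otherwise keep this loop spinning forever.
            eprintln!(
                "error: reset timed out with {} task(s) still ongoing",
                ongoing_tasks.load(Ordering::SeqCst)
            );
            return false;
        }
        thread::sleep(Duration::from_millis(10));
    }
    true
}
```

With the 30-second value from this PR the call would look like `wait_for_tasks(&counter, Duration::from_secs(30))`; what the caller does after a timeout is outside the scope of this sketch.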
Changes
- `quorum_store/batch_generator.rs`: handle the `num_batch_txns == 0` case: drain oversized tx, warn, continue
- `pipeline/buffer_manager.rs`: add `Some(Err(e))` arm to persisting_phase select
- `pipeline/buffer_manager.rs`: add `Instant::now() + 30s` deadline to reset spin loop
- `quorum_store/network_listener.rs`: `.expect()` → `.is_err()` + `error!()` + `break`

Test plan
- `disable_lifo_slot` build error in upstream `gravity-aptos` dependency is unrelated

🤖 Generated with Claude Code