Implement Devnet-4 metrics from leanMetrics PR #29#289
Conversation
Add block production metrics (building time, payload aggregation time, aggregated payload count, success/failure counters), gossip message size histograms (block, attestation, aggregation), node sync status gauge (idle/syncing/synced), and update committee signature aggregation histogram buckets to wider range.
🤖 Codex Code Review
No consensus-critical changes to fork choice, attestation validation, STF, XMSS verification, or SSZ handling stood out in this diff beyond the observability issues above. I could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Kimi Code ReviewOverall this is a solid metrics PR with good use of RAII timing guards and proper error path coverage. However, there are several correctness and maintainability issues to address: Critical Issues
Correctness & Maintainability
Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Claude Code ReviewNow I have all the context needed. Here is my review: PR #289 Review — Devnet-4 MetricsOverviewThis PR adds Prometheus instrumentation for block production timing/outcomes, node sync status, and gossip message sizes. The implementation is clean and follows existing patterns. A few correctness and scope issues are worth addressing. Correctness Issues1.
The metric description says "Time taken to build a block," but gossip publication is included. A slow network path will inflate this histogram and make it misleading. Either rename the metric to 2. Naming mismatch on The block in 3.
4. Sync status only transitions when a block is received
Minor Issues5. Gossip size histograms are not eagerly registered
6. std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()This panics if the system clock is set before 1970, which is unrealistic in production but is still a panic path in consensus code. Use 7.
What Looks Good
Suggested Priority
Automated review by Claude (Anthropic) · sonnet · custom prompt |
Greptile SummaryThis PR adds Devnet-4 monitoring metrics: block production histograms/counters, a labelled sync-status gauge, and gossip message-size histograms, plus a bucket range update for the existing committee-aggregation histogram. Implementation correctly follows the codebase's Confidence Score: 5/5Safe to merge; only one P2 style comment about timing-scope labelling. All findings are P2 (style/clarity). No logic errors, data loss, or runtime failures were identified. The sync-status gauge correctly uses saturating_sub for pre-genesis safety, genesis_time is u64 so the *1000 multiplication is safe, and all metric exit paths are correctly instrumented. Gossip metrics are consistently lazily registered, matching the existing p2p pattern. crates/blockchain/src/lib.rs — the timing scope of lean_block_building_time_seconds is wider than its description implies.
|
| Filename | Overview |
|---|---|
| crates/blockchain/src/lib.rs | Adds timing guard for block building, success/failure counters on all exit paths, and sync-status update based on head-slot vs wall-clock slot; timing scope for lean_block_building_time_seconds is broader than the metric description suggests (includes process_block + publish_block) |
| crates/blockchain/src/metrics.rs | Adds 5 block-production metrics (2 histograms, 1 aggregated-payload histogram, 2 counters), 1 sync-status IntGaugeVec, and updates existing aggregation bucket range; all registered via init() and follow existing LazyLock/TimingGuard patterns |
| crates/blockchain/src/store.rs | Wraps build_block() call with payload-aggregation timing guard and observes aggregated-payload count after successful build; correctly only records count on success path |
| crates/net/p2p/src/gossipsub/handler.rs | Observes decompressed message size for block, aggregation, and attestation gossip types immediately after successful decompression, consistent with measuring uncompressed payload size |
| crates/net/p2p/src/metrics.rs | Adds three gossip-size histograms (block, attestation, aggregation) with appropriate bucket ranges; lazily initialized on first use, consistent with existing p2p metrics pattern (no explicit init()) |
Sequence Diagram
sequenceDiagram
participant BC as BlockChainServer
participant Metrics as blockchain/metrics
participant Store as store.rs
participant KM as KeyManager
participant P2P as P2PServer
BC->>Metrics: time_block_building() [_timing guard START]
BC->>Store: produce_block_with_signatures()
activate Store
Store->>Metrics: time_block_building_payload_aggregation() [guard START]
Store->>Store: build_block()
Store->>Metrics: [payload aggregation guard DROP → records duration]
Store->>Metrics: observe_block_aggregated_payloads(signatures.len())
Store-->>BC: (Block, signatures, checkpoints) OR Err
deactivate Store
alt build failed
BC->>Metrics: inc_block_building_failures()
else success
BC->>KM: sign_block_root()
alt sign failed
BC->>Metrics: inc_block_building_failures()
else signed
BC->>BC: process_block() [state transition + storage]
BC->>Metrics: update_head_slot / set_node_sync_status
alt process failed
BC->>Metrics: inc_block_building_failures()
else processed
BC->>Metrics: inc_block_building_success()
BC->>P2P: publish_block()
end
end
end
BC->>Metrics: [_timing guard DROP → records lean_block_building_time_seconds]
Note over P2P,Metrics: On gossip receive
P2P->>P2P: decompress_message()
P2P->>Metrics: observe_gossip_block_size(uncompressed.len())
P2P->>P2P: observe_gossip_attestation_size / observe_gossip_aggregation_size
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 221
Comment:
**Timing scope wider than metric name implies**
`_timing` is dropped at the end of `propose_block`, so `lean_block_building_time_seconds` captures not only block construction and signing but also the full state transition inside `process_block()` (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip `publish_block` call. The nested `lean_block_building_payload_aggregation_time_seconds` already isolates the `build_block()` phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.
Consider either tightening the timing scope to end before `process_block` / `publish_block`, or updating the metric description to `"Time taken to propose a block (build + sign + process + publish)"` to make the breadth explicit.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Implement Devnet-4 metrics from leanMetr..." | Re-trigger Greptile
| fn propose_block(&mut self, slot: u64, validator_id: u64) { | ||
| info!(%slot, %validator_id, "We are the proposer for this slot"); | ||
|
|
||
| let _timing = metrics::time_block_building(); |
There was a problem hiding this comment.
Timing scope wider than metric name implies
_timing is dropped at the end of propose_block, so lean_block_building_time_seconds captures not only block construction and signing but also the full state transition inside process_block() (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip publish_block call. The nested lean_block_building_payload_aggregation_time_seconds already isolates the build_block() phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.
Consider either tightening the timing scope to end before process_block / publish_block, or updating the metric description to "Time taken to propose a block (build + sign + process + publish)" to make the breadth explicit.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 221
Comment:
**Timing scope wider than metric name implies**
`_timing` is dropped at the end of `propose_block`, so `lean_block_building_time_seconds` captures not only block construction and signing but also the full state transition inside `process_block()` (which includes SSZ processing, fork-choice updates, and storage writes) and the gossip `publish_block` call. The nested `lean_block_building_payload_aggregation_time_seconds` already isolates the `build_block()` phase, so the outer metric's scope may surprise operators trying to attribute latency — the description in the metric string says "Time taken to build a block" but it measures the entire proposal pipeline.
Consider either tightening the timing scope to end before `process_block` / `publish_block`, or updating the metric description to `"Time taken to propose a block (build + sign + process + publish)"` to make the breadth explicit.
How can I resolve this? If you propose a fix, please make it concise.| } | ||
|
|
||
| /// Set the node sync status. Sets the given status label to 1 and all others to 0. | ||
| pub fn set_node_sync_status(status: &str) { |
There was a problem hiding this comment.
We should use an enum for this
There was a problem hiding this comment.
Done, added a SyncStatus enum with Idle/Syncing/Synced variants.
| let genesis_time_ms = self.store.config().genesis_time * 1000; | ||
| let now_ms = std::time::SystemTime::now() | ||
| .duration_since(std::time::UNIX_EPOCH) | ||
| .unwrap() | ||
| .as_millis() as u64; | ||
| let current_slot = now_ms.saturating_sub(genesis_time_ms) / MILLISECONDS_PER_SLOT; | ||
| let status = if head_slot >= current_slot { | ||
| "synced" | ||
| } else { | ||
| "syncing" | ||
| }; |
There was a problem hiding this comment.
This can be simplified to self.store.time() / INTERVALS_PER_SLOT < self.store.head_slot()
There was a problem hiding this comment.
Good call, replaced the manual wall clock calculation with self.store.time() / INTERVALS_PER_SLOT.
Address PR #289 review comments: replace raw &str sync status with a SyncStatus enum, and replace manual SystemTime wall clock calculation with store.time() / INTERVALS_PER_SLOT.
Motivation
Implements the metrics defined in leanEthereum/leanMetrics#29 for Devnet-4 monitoring: block production, gossip message sizes, sync status, and updated histogram buckets.
Description
Block Production Metrics (
blockchain/metrics.rs→ instrumented inlib.rs+store.rs)lean_block_building_time_secondslean_block_building_payload_aggregation_time_secondslean_block_aggregated_payloadslean_block_building_success_totallean_block_building_failures_totalpropose_block()is wrapped with a timing guard for total block building time, and increments success/failure counters on each exit path.produce_block_with_signatures()times thebuild_block()call specifically for payload aggregation, and observes the aggregated payload count.Sync Status (
blockchain/metrics.rs→ tracked inlib.rs)lean_node_sync_statusidleat startup (before first tick).syncingorsyncedafter every block, by comparinghead_slotagainst the wall clock slot.Gossip Message Size Metrics (
p2p/metrics.rs→ instrumented ingossipsub/handler.rs)lean_gossip_block_size_byteslean_gossip_attestation_size_byteslean_gossip_aggregation_size_bytesModified Existing Metric
lean_committee_signatures_aggregation_time_seconds: buckets updated from[0.005..1.0]to[0.05..4.0]to capture longer aggregation times in Devnet-4.How to Test
make fmt— passesmake lint— passes (clippy clean)make test— spec test failures are pre-existing (fixture format mismatch, unrelated to this PR)/metricsendpoint (port 5054)