
feat(upload): instrument per-chunk retries and store wall-clock #87

Merged
jacderida merged 1 commit into WithAutonomi:main from jacderida:feat-upload-retry-instrumentation
May 12, 2026

Conversation

@jacderida
Contributor

Summary

Track per-chunk store-RPC attempt counts and wall-clock durations through the upload pipeline so testnet runs can distinguish client-side quorum/retry cost from network or storage cost when investigating upload slowdowns.

Motivation: a recent testnet run showed aggregate upload throughput halving over 12 hours before any node failures, with large-file uploaders degrading ~5x and small-file uploaders only ~1.5x. The file-size sensitivity strongly implicates per-chunk client-side cost (CLOSE_GROUP quorum, where the slowest peer dominates), but neither chunk-retry counts nor per-chunk wall-clock durations were observable. This change adds both.

Changes

  • ant-core WaveResult and FileUploadResult gain chunk_attempts_total, store_durations_ms, and retries_histogram fields. FileUploadResult is #[non_exhaustive] so this is non-breaking.
  • A new WaveAggregateStats helper in batch.rs folds multiple WaveResults into one upload-level summary; threaded through batch_upload_chunks_with_events, upload_waves_single, upload_waves_merkle, and merkle_upload_chunks.
  • ant-cli --json output for file upload exposes the new fields so downstream tooling can record them without log parsing.
  • One structured info!("chunk_store_wave_complete", ...) log line per wave with p50/p95/max store durations and per-round retry counts, for log-based analysis.
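
To illustrate the aggregation described above, here is a minimal sketch of folding per-wave results into one upload-level summary and deriving p50/p95/max from the collected durations. The field names follow this PR, but the field types, the `fold` method, the `percentile` and `summarize` helpers, and the nearest-rank percentile method are all assumptions for illustration; the actual code in batch.rs may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-wave result. Field names follow the PR; types are assumed.
#[derive(Debug, Default, Clone)]
struct WaveResult {
    chunk_attempts_total: u64,
    store_durations_ms: Vec<u64>,
    /// Maps "number of retries" to "stored chunks that needed that many".
    retries_histogram: BTreeMap<u32, u64>,
}

/// Hypothetical upload-level accumulator (name from the PR, shape assumed).
#[derive(Debug, Default)]
struct WaveAggregateStats {
    chunk_attempts_total: u64,
    store_durations_ms: Vec<u64>,
    retries_histogram: BTreeMap<u32, u64>,
}

impl WaveAggregateStats {
    /// Merge one wave into the running summary.
    fn fold(&mut self, wave: &WaveResult) {
        self.chunk_attempts_total += wave.chunk_attempts_total;
        self.store_durations_ms
            .extend_from_slice(&wave.store_durations_ms);
        for (&retries, &count) in &wave.retries_histogram {
            *self.retries_histogram.entry(retries).or_insert(0) += count;
        }
    }
}

/// Nearest-rank percentile over an ascending-sorted slice; None when empty.
/// `pct` is expected to be in 0..=100.
fn percentile(sorted: &[u64], pct: u32) -> Option<u64> {
    if sorted.is_empty() {
        return None;
    }
    // rank = ceil(pct / 100 * n), clamped to at least 1
    let rank = (pct as usize * sorted.len() + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// Returns (p50, p95, max) in milliseconds, or None if there are no samples.
fn summarize(durations_ms: &[u64]) -> Option<(u64, u64, u64)> {
    let mut sorted = durations_ms.to_vec();
    sorted.sort_unstable();
    Some((
        percentile(&sorted, 50)?,
        percentile(&sorted, 95)?,
        *sorted.last()?,
    ))
}
```

Under these assumptions, folding every wave and then calling `summarize` on the merged `store_durations_ms` gives upload-level percentiles, while calling it on a single wave's durations gives the per-wave values of the kind reported in the log line.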

Test plan

  • cargo check --workspace clean
  • cargo clippy --workspace --tests -- -D warnings clean
  • Existing ant-core batch tests pass (cargo test -p ant-core --lib data::client::batch)
  • Run an upload locally with ant --json file upload --public <file> and verify the new JSON fields appear
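
Beyond checking that the fields appear, the reported values can be sanity-checked against each other. The sketch below assumes a hypothetical `UploadStats` struct mirroring the new fields (the real JSON shape and types may differ) and a hypothetical `check_consistency` helper. It also assumes that each stored chunk contributes exactly one duration sample and that a chunk needing N retries made N + 1 attempts.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of the new fields; not the actual ant-core type.
struct UploadStats {
    chunks_stored: u64,
    chunk_attempts_total: u64,
    store_durations_ms: Vec<u64>,
    /// Maps "number of retries" to "stored chunks that needed that many".
    retries_histogram: BTreeMap<u32, u64>,
}

/// Check that the instrumentation fields agree with each other.
fn check_consistency(s: &UploadStats) -> Result<(), String> {
    // Every stored chunk took at least one attempt.
    if s.chunk_attempts_total < s.chunks_stored {
        return Err("chunk_attempts_total < chunks_stored".into());
    }
    // The histogram counts stored chunks, so its buckets sum to chunks_stored.
    let hist_total: u64 = s.retries_histogram.values().sum();
    if hist_total != s.chunks_stored {
        return Err("retries_histogram does not sum to chunks_stored".into());
    }
    // Assumption: one duration sample per stored chunk.
    if s.store_durations_ms.len() as u64 != s.chunks_stored {
        return Err("store_durations_ms length != chunks_stored".into());
    }
    // Assumption: a chunk with N retries made N + 1 attempts, so the
    // histogram implies a lower bound on the attempt total.
    let implied: u64 = s
        .retries_histogram
        .iter()
        .map(|(&retries, &count)| (u64::from(retries) + 1) * count)
        .sum();
    if s.chunk_attempts_total < implied {
        return Err("chunk_attempts_total below histogram-implied minimum".into());
    }
    Ok(())
}
```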

Generated with Claude Code

Track per-chunk attempt counts and store-RPC wall-clock through the
upload pipeline so testnet runs can identify when slowdowns are
client-side quorum/retry cost vs network or storage cost.

Surface on FileUploadResult and ant-cli --json output:
- chunk_attempts_total: sum of store-RPC attempts (>= chunks_stored)
- store_durations_ms: per-chunk wall-clock from first attempt to success
- retries_histogram: how many stored chunks needed N retries

Also emit a structured "chunk_store_wave_complete" info log per wave
with p50/p95/max durations and per-round retry counts, for log-based
analysis without --json parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida force-pushed the feat-upload-retry-instrumentation branch from a300fea to fd6121b on May 12, 2026 20:18
@jacderida merged commit a416ff3 into WithAutonomi:main on May 12, 2026
11 of 12 checks passed