feat(client): resumable merkle upload (auto-load cached payment receipt)#84

Open
grumbach wants to merge 6 commits into WithAutonomi:fix/stability-improvements from grumbach:grumbach/resumable-merkle-upload

Conversation

@grumbach
Contributor

Summary

Persists the merkle batch payment receipt to disk between the on-chain payment and the chunk-store phase. If the store phase fails (network flake, slow close-K, client crash, machine reboot), the next invocation transparently loads the cached receipt and resumes from where it left off, instead of forcing the user to pay again.

This is the next merkle fix on the path to "merkle works on prod", layered on top of:

  • ant-node #89 — storer-side closeness K=32 + timeout 240s
  • ant-node #91 — clock-skew tolerance 60s → 300s
  • Eviction trio: saorsa-core #114 / ant-node #90 / ant-client #77
  • Mick's stability bundle — saorsa-transport #79, saorsa-core #115, ant-protocol #5, ant-node #93, ant-client #78
  • ant-client #83 — merkle store-response timeout 10s → 270s

This PR depends on Mick's stability bundle merging first (it's based off fix/stability-improvements).

Motivation

Live prod merkle upload of a 730 MB file on 2026-05-12 paid 2.78 ANT on-chain, stored 28 of 178 chunks, then died with `Stored on 1 peers, need 4` on a chunk whose close-K had 4 unresponsive storers. The on-chain payment was unrecoverable because the proofs and merkle salts only lived in process memory. Re-uploading the same file would have required paying another 2.78 ANT.

Prod nodes will keep being intermittently slow / dead for the foreseeable future, so the right answer is to make merkle uploads survive a partial failure instead of trying to make every node always respond in time. (Mick's stability bundle already takes a swing at the latter.)

Design

A new private module ant-core/src/data/client/cached_merkle is responsible for:

  • try_save(file_path, &MerkleBatchPaymentResult) — called immediately after the on-chain payment confirms, before the first chunk PUT. Writes JSON to <data_dir>/payments/<timestamp>_<file_hash>.
  • try_load_for_file(file_path) — called at the top of every merkle upload. If a non-expired cached receipt exists, the upload skips the pay phase entirely. Logs "Found previous merkle upload attempt for <path>, resuming with payment cached at <path>".
  • try_delete_for_file(file_path) — called after a fully successful upload. The cache is per-file, so a future re-upload of the same path pays anew.
  • cleanup_outdated() — called opportunistically on every load. Reaps receipts whose merkle_payment_timestamp is past the 7-day on-chain expiry.

Filename = {timestamp}_{file_hash} where timestamp is the payment timestamp used on-chain (so expiry matches the contract) and file_hash is DefaultHasher of the source path (collisions are content-validated against current chunk addresses, so a stale hit is detected and dropped).
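The key derivation above can be sketched roughly as follows. The function name `cache_file_name` and the exact string format are illustrative, not the module's actual API; note that `DefaultHasher` output is deterministic within a build but not guaranteed stable across Rust releases, which is another reason the content-validation step matters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Illustrative sketch: derive the cache filename from the on-chain payment
// timestamp plus a DefaultHasher of the source path, as described above.
fn cache_file_name(file_path: &Path, payment_timestamp: u64) -> String {
    let mut hasher = DefaultHasher::new();
    file_path.hash(&mut hasher);
    // Timestamp first, so expiry checks can parse it without opening the file.
    format!("{}_{}", payment_timestamp, hasher.finish())
}
```

Because the timestamp is the leading component, `cleanup_outdated()` can decide expiry from the filename alone.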

All error paths in the cache module are best-effort: serialization failures, IO errors, missing dirs all degrade to "no resume". A busted cache directory never breaks an upload.
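The best-effort rule can be sketched like this (hypothetical helper; the real module wraps serialization and IO the same way, mapping every failure to "no resume"):

```rust
use std::fs;
use std::path::Path;

// Sketch of the "degrade to no-resume" rule: every IO failure maps to None,
// so a busted cache directory can never fail the upload itself.
fn try_load_bytes(path: &Path) -> Option<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Some(bytes),
        Err(err) => {
            // Log and continue; the caller treats None as a plain cache miss.
            eprintln!("merkle receipt cache read failed ({err}); paying fresh");
            None
        }
    }
}
```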

Behaviour change

  • First successful upload of a file: identical to today, plus a brief disk write between pay and store, and a brief disk delete after store. Both are wrapped in try_* so failures are logged but not fatal.
  • Upload that fails after payment: the receipt sits in <data_dir>/payments/. The user can leave it (and re-try the upload) or delete it manually if they want a fresh pay. Logged path is in the warning message so it's discoverable.
  • Re-upload after a failure: the run logs "Found previous merkle upload attempt..." and skips straight to store. No CLI flag, no opt-in.
  • File modified between failure and retry: the cached receipt's proofs won't cover the new chunk addresses; the upload detects the mismatch, deletes the stale receipt, and pays fresh.
  • Receipt older than 7 days: garbage-collected on next load; no resume possible.
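The "file modified between failure and retry" decision reduces to a mismatch check between the receipt's proof coverage and the current chunk addresses. A minimal sketch, with placeholder types (chunk addresses shown as opaque ids):

```rust
// Placeholder outcome type for illustration only.
#[derive(PartialEq, Debug)]
enum ResumeDecision {
    UseCachedReceipt,
    DeleteStaleAndPayFresh,
}

// Illustrative check: the cached receipt's proofs must cover exactly the
// chunk addresses of the file as it exists now, otherwise it is stale.
fn decide_resume(cached_addrs: &[u64], current_addrs: &[u64]) -> ResumeDecision {
    if cached_addrs == current_addrs {
        ResumeDecision::UseCachedReceipt
    } else {
        ResumeDecision::DeleteStaleAndPayFresh
    }
}
```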

Tests

  • Unit: file_hash_key stability, expired/fresh filename detection, roundtrip save → load → delete, malformed filenames not auto-deleted.
  • Live prod: 60 MB / 19-chunk merkle upload completed successfully (0.125 ANT, ~2 min) on this build; payments dir was empty before and after, confirming the save+delete cycle.
  • Pre-payment failure path (merkle candidate quote timeout): observed in a separate run — no cache file was written because the MerkleBatchPaymentResult was never constructed. Correct behaviour: nothing to resume, no garbage on disk.

Future work

  • The fallback-to-single-node path on InsufficientPeers does not currently use the cache (single-node uploads are per-chunk and re-quoting is cheap). If this becomes a hotspot we can plug it into the same module.
  • The cache directory is per-machine. Cross-machine resume would need the user to copy the JSON file between machines. Out of scope.
  • We don't yet track which individual chunks were stored in a previous attempt — on resume, every chunk's PUT is re-attempted. Storers return AlreadyExists cheaply if they already have the chunk, but the wave still spends bandwidth re-PUTting the payload. A future iteration could persist the set of stored addresses alongside the receipt.

grumbach added 6 commits May 12, 2026 16:47
…KUP_TIMEOUT+padding

Bump DEFAULT_STORE_TIMEOUT_SECS from 10s to 270s for merkle batch PUTs.

The storer-side merkle payment verifier runs an iterative DHT lookup
with CLOSENESS_LOOKUP_TIMEOUT = 240s (ant-node, post-PR #89). The
old 10s client-side timeout fired long before the storer could
finish verifying, with three downstream costs:

1. The storer keeps working on a chunk the client has already
   discarded, wasting CPU and bandwidth.
2. The client re-targets a different close-K member and may
   double-store the same chunk on a different peer set.
3. Cross-region close-K membership (sgp1 / syd1 storers serving
   a lon1 client) makes this happen on virtually every merkle
   chunk, not just a tail.

Set client timeout = storer timeout + 30s padding (store-response
RTT + storer-local LMDB put/fsync + clock skew tolerance).

Invariant: client store-response timeout >= node
CLOSENESS_LOOKUP_TIMEOUT + padding. Re-validate if either side's
value changes.
…anged

Adversarial review of the previous bulk timeout bump (270s for
everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed
GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated
chunk_get_timeout_secs field. After both land, the three timeout
regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
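The config split can be sketched as follows (struct shape and field comments assumed from the commit message; the real ClientConfig carries many more fields):

```rust
// Sketch of the timeout separation: each knob governs exactly one path.
struct ClientConfig {
    store_timeout_secs: u64,        // chunk GET path and any direct readers
    merkle_store_timeout_secs: u64, // merkle PUT path only
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            store_timeout_secs: 10,         // unchanged, matches current main
            merkle_store_timeout_secs: 270, // storer lookup 240s + 30s padding
        }
    }
}
```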
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous
calls are rate-limited at 60/hour shared per IP; macOS runners hit this
regularly and fail every E2E and Merkle E2E job with
`curl: (56) ... 403`.

Passing the workflow's GITHUB_TOKEN would authenticate the call,
raising the cap to 1,000/hour per token (the same fix Mick's PR
WithAutonomi#78 will want). But setting GITHUB_TOKEN on
foundry-toolchain@v1 didn't help: foundryup itself does not read
$GITHUB_TOKEN before calling api.github.com to resolve the nightly
tag, so macOS runners on shared egress IPs still hit the 60/h
anonymous rate limit and 403 on every install.

Pin to v1.3.6 (last stable as of 2026-05-12) and curl the release
tarball directly from the GitHub Releases CDN. Release assets are
served from a CDN unaffected by the API rate limit. Unpacks
anvil/forge/cast/chisel to /usr/local/bin and verifies versions.
When a merkle batch upload fails partway through (network flake, slow
close-K, client crash), the on-chain payment becomes unrecoverable:
the proofs needed to re-attempt the store only lived in process
memory, so the user has to pay again from scratch.

This change persists the MerkleBatchPaymentResult to disk
immediately after the on-chain payment confirms, then re-loads it on
the next upload of the same file path. The cache is keyed by a hash
of the source path; a successful upload deletes the cache, a partial
failure leaves it for the next attempt to pick up. Files older than
the on-chain payment expiration (7 days) are GC'd opportunistically.

The library handles save/load/delete transparently — no CLI flag and
no app-level change needed. If the cached receipt doesn't match the
current file content (file edited between attempts), the cache is
discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to
MerkleBatchPaymentResult and threading the on-chain payment
timestamp through.

The new module also handles its own failure modes defensively: any
IO/serialization error is logged but never bubbled up to break the
upload itself. Cache misses are silent.