feat(client): resumable merkle upload (auto-load cached payment receipt)#84

Open
grumbach wants to merge 6 commits into WithAutonomi:fix/stability-improvements from grumbach:grumbach/resumable-merkle-upload

Conversation

@grumbach
Contributor

Summary

Persists the merkle batch payment receipt to disk between the on-chain payment and the chunk-store phase. If the store phase fails (network flake, slow close-K, client crash, machine reboot), the next invocation transparently loads the cached receipt and resumes from where it left off, instead of forcing the user to pay again.

This is the next merkle fix on the path to "merkle works on prod", layered on top of:

  • ant-node #89 — storer-side closeness K=32 + timeout 240s
  • ant-node #91 — clock-skew tolerance 60s → 300s
  • Eviction trio: saorsa-core #114 / ant-node #90 / ant-client #77
  • Mick's stability bundle — saorsa-transport #79, saorsa-core #115, ant-protocol #5, ant-node #93, ant-client #78
  • ant-client #83 — merkle store-response timeout 10s → 270s

This PR depends on Mick's stability bundle merging first (it's based off fix/stability-improvements).

Motivation

Live prod merkle upload of a 730 MB file on 2026-05-12 paid 2.78 ANT on-chain, stored 28 of 178 chunks, then died with `Stored on 1 peers, need 4` on a chunk whose close-K had 4 unresponsive storers. The on-chain payment was unrecoverable because the proofs and merkle salts only lived in process memory. Re-uploading the same file would have required paying another 2.78 ANT.

Prod nodes will keep being intermittently slow / dead for the foreseeable future, so the right answer is to make merkle uploads survive a partial failure instead of trying to make every node always respond in time. (Mick's stability bundle already takes a swing at the latter.)

Design

A new private module ant-core/src/data/client/cached_merkle is responsible for:

  • try_save(file_path, &MerkleBatchPaymentResult) — called immediately after the on-chain payment confirms, before the first chunk PUT. Writes JSON to <data_dir>/payments/<timestamp>_<file_hash>.
  • try_load_for_file(file_path) — called at the top of every merkle upload. If a non-expired cached receipt exists, the upload skips the pay phase entirely. Logs "Found previous merkle upload attempt for <path>, resuming with payment cached at <path>".
  • try_delete_for_file(file_path) — called after a fully successful upload. The cache is per-file, so a future re-upload of the same path pays anew.
  • cleanup_outdated() — called opportunistically on every load. Reaps receipts whose merkle_payment_timestamp is past the 7-day on-chain expiry.

Filename = {timestamp}_{file_hash} where timestamp is the payment timestamp used on-chain (so expiry matches the contract) and file_hash is DefaultHasher of the source path (collisions are content-validated against current chunk addresses, so a stale hit is detected and dropped).
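The key derivation above can be sketched roughly as follows. The function name `cache_file_name` and the exact string format are illustrative, not the module's actual API; note that `DefaultHasher` output is deterministic within a build but not guaranteed stable across Rust releases, which is another reason the content-validation step matters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Illustrative sketch: derive the cache filename from the on-chain payment
// timestamp plus a DefaultHasher of the source path, as described above.
fn cache_file_name(file_path: &Path, payment_timestamp: u64) -> String {
    let mut hasher = DefaultHasher::new();
    file_path.hash(&mut hasher);
    // Timestamp first, so expiry checks can parse it without opening the file.
    format!("{}_{}", payment_timestamp, hasher.finish())
}
```

Because the timestamp is the leading component, `cleanup_outdated()` can decide expiry from the filename alone.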

All error paths in the cache module are best-effort: serialization failures, IO errors, missing dirs all degrade to "no resume". A busted cache directory never breaks an upload.
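The best-effort rule can be sketched like this (hypothetical helper; the real module wraps serialization and IO the same way, mapping every failure to "no resume"):

```rust
use std::fs;
use std::path::Path;

// Sketch of the "degrade to no-resume" rule: every IO failure maps to None,
// so a busted cache directory can never fail the upload itself.
fn try_load_bytes(path: &Path) -> Option<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Some(bytes),
        Err(err) => {
            // Log and continue; the caller treats None as a plain cache miss.
            eprintln!("merkle receipt cache read failed ({err}); paying fresh");
            None
        }
    }
}
```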

Behaviour change

  • First successful upload of a file: identical to today, plus a brief disk write between pay and store, and a brief disk delete after store. Both are wrapped in try_* so failures are logged but not fatal.
  • Upload that fails after payment: the receipt sits in <data_dir>/payments/. The user can leave it (and re-try the upload) or delete it manually if they want a fresh pay. Logged path is in the warning message so it's discoverable.
  • Re-upload after a failure: the run logs "Found previous merkle upload attempt..." and skips straight to store. No CLI flag, no opt-in.
  • File modified between failure and retry: the cached receipt's proofs won't cover the new chunk addresses; the upload detects the mismatch, deletes the stale receipt, and pays fresh.
  • Receipt older than 7 days: garbage-collected on next load; no resume possible.
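The "file modified between failure and retry" decision reduces to a mismatch check between the receipt's proof coverage and the current chunk addresses. A minimal sketch, with placeholder types (chunk addresses shown as opaque ids):

```rust
// Placeholder outcome type for illustration only.
#[derive(PartialEq, Debug)]
enum ResumeDecision {
    UseCachedReceipt,
    DeleteStaleAndPayFresh,
}

// Illustrative check: the cached receipt's proofs must cover exactly the
// chunk addresses of the file as it exists now, otherwise it is stale.
fn decide_resume(cached_addrs: &[u64], current_addrs: &[u64]) -> ResumeDecision {
    if cached_addrs == current_addrs {
        ResumeDecision::UseCachedReceipt
    } else {
        ResumeDecision::DeleteStaleAndPayFresh
    }
}
```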

Tests

  • Unit: file_hash_key stability, expired/fresh filename detection, roundtrip save → load → delete, malformed filenames not auto-deleted.
  • Live prod: 60 MB / 19-chunk merkle upload completed successfully (0.125 ANT, ~2 min) on this build; payments dir was empty before and after, confirming the save+delete cycle.
  • Pre-payment failure path (merkle candidate quote timeout): observed in a separate run — no cache file was written because the MerkleBatchPaymentResult was never constructed. Correct behaviour: nothing to resume, no garbage on disk.

Future work

  • The fallback-to-single-node path on InsufficientPeers does not currently use the cache (single-node uploads are per-chunk and re-quoting is cheap). If this becomes a hotspot we can plug it into the same module.
  • The cache directory is per-machine. Cross-machine resume would need the user to copy the JSON file between machines. Out of scope.
  • We don't yet track which individual chunks were stored in a previous attempt — on resume, every chunk's PUT is re-attempted. Storers return AlreadyExists cheaply if they already have the chunk, but the wave still spends bandwidth re-PUTting the payload. A future iteration could persist the set of stored addresses alongside the receipt.

grumbach added 6 commits May 12, 2026 16:47
…KUP_TIMEOUT+padding

Bump DEFAULT_STORE_TIMEOUT_SECS from 10s to 270s for merkle batch PUTs.

The storer-side merkle payment verifier runs an iterative DHT lookup
with CLOSENESS_LOOKUP_TIMEOUT = 240s (ant-node, post-PR #89). The
old 10s client-side timeout fired long before the storer could
finish verifying, with three downstream costs:

1. The storer keeps working on a chunk the client has already
   discarded, wasting CPU and bandwidth.
2. The client re-targets a different close-K member and may
   double-store the same chunk on a different peer set.
3. Cross-region close-K membership (sgp1 / syd1 storers serving
   a lon1 client) makes this happen on virtually every merkle
   chunk, not just a tail.

Set client timeout = storer timeout + 30s padding (store-response
RTT + storer-local LMDB put/fsync + clock skew tolerance).

Invariant: client store-response timeout >= node
CLOSENESS_LOOKUP_TIMEOUT + padding. Re-validate if either side's
value changes.
…anged

Adversarial review of the previous bulk timeout bump (270s for
everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed
GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated
chunk_get_timeout_secs field. After both land, the three timeout
regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
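The config split can be sketched as follows (struct shape and field comments assumed from the commit message; the real ClientConfig carries many more fields):

```rust
// Sketch of the timeout separation: each knob governs exactly one path.
struct ClientConfig {
    store_timeout_secs: u64,        // chunk GET path and any direct readers
    merkle_store_timeout_secs: u64, // merkle PUT path only
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            store_timeout_secs: 10,         // unchanged, matches current main
            merkle_store_timeout_secs: 270, // storer lookup 240s + 30s padding
        }
    }
}
```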
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous
calls are rate-limited at 60/hour shared per IP; macOS runners hit this
regularly and fail every E2E and Merkle E2E job with
`curl: (56) ... 403`.

Passing the workflow's GITHUB_TOKEN would authenticate the call,
raising the cap to 1,000/hour per token (the same fix Mick's PR
WithAutonomi#78 will want). But setting GITHUB_TOKEN on
foundry-toolchain@v1 didn't help: foundryup itself does not read
$GITHUB_TOKEN before calling api.github.com to resolve the nightly
tag, so macOS runners on shared egress IPs still hit the 60/h
anonymous rate limit and 403 on every install.

Pin to v1.3.6 (last stable as of 2026-05-12) and curl the release
tarball directly from the GitHub Releases CDN. Release assets are
served from a CDN unaffected by the API rate limit. Unpacks
anvil/forge/cast/chisel to /usr/local/bin and verifies versions.
When a merkle batch upload fails partway through (network flake, slow
close-K, client crash), the on-chain payment becomes unrecoverable:
the proofs needed to re-attempt the store only lived in process
memory, so the user has to pay again from scratch.

This change persists the MerkleBatchPaymentResult to disk
immediately after the on-chain payment confirms, then re-loads it on
the next upload of the same file path. The cache is keyed by a hash
of the source path; a successful upload deletes the cache, a partial
failure leaves it for the next attempt to pick up. Files older than
the on-chain payment expiration (7 days) are GC'd opportunistically.

The library handles save/load/delete transparently — no CLI flag and
no app-level change needed. If the cached receipt doesn't match the
current file content (file edited between attempts), the cache is
discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to
MerkleBatchPaymentResult and threading the on-chain payment
timestamp through.

The new module also handles its own failure modes defensively: any
IO/serialization error is logged but never bubbled up to break the
upload itself. Cache misses are silent.