feat(client): resumable merkle upload (auto-load cached payment receipt) #84
Open
grumbach wants to merge 6 commits into
…KUP_TIMEOUT+padding

Bump DEFAULT_STORE_TIMEOUT_SECS from 10s to 270s for merkle batch PUTs. The storer-side merkle payment verifier runs an iterative DHT lookup with CLOSENESS_LOOKUP_TIMEOUT = 240s (ant-node, post-PR #89). The old 10s client-side timeout fired long before the storer could finish verifying, with three downstream costs:

1. The storer keeps working on a chunk the client has already discarded, wasting CPU and bandwidth.
2. The client re-targets a different close-K member and may double-store the same chunk on a different peer set.
3. Cross-region close-K membership (sgp1 / syd1 storers serving a lon1 client) makes this happen on virtually every merkle chunk, not just a tail.

Set the client timeout to the storer timeout + 30s of padding (store-response RTT + storer-local LMDB put/fsync + clock-skew tolerance).

Invariant: client store-response timeout >= node CLOSENESS_LOOKUP_TIMEOUT + padding. Re-validate if either side's value changes.
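The invariant above can be pinned as a minimal sketch. The constant names follow the commit message; the `invariant_holds` helper and its location are illustrative, not the PR's actual regression test.

```rust
// Sketch of the timeout invariant described above. The real constants live
// in ant-node / ant-core; names and values here mirror the commit message.
const CLOSENESS_LOOKUP_TIMEOUT_SECS: u64 = 240; // node-side iterative DHT lookup
const TIMEOUT_PADDING_SECS: u64 = 30; // store-response RTT + LMDB put/fsync + clock skew
const DEFAULT_STORE_TIMEOUT_SECS: u64 = 270; // client-side merkle PUT timeout

// The client must wait at least as long as the storer's lookup plus padding,
// otherwise it abandons chunks the storer is still verifying.
fn invariant_holds() -> bool {
    DEFAULT_STORE_TIMEOUT_SECS >= CLOSENESS_LOOKUP_TIMEOUT_SECS + TIMEOUT_PADDING_SECS
}

fn main() {
    assert!(invariant_holds());
}
```

A check like this, kept next to the client constant, fails the build review the moment either side's timeout drifts.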
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent. This commit:

- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matching current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs, which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
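The field split can be sketched roughly as follows. The const and field names come from the commit message; the `is_merkle` parameter and the shape of `store_response_timeout_for_proof` are simplified placeholders (the real method presumably dispatches on a proof type, not a bool).

```rust
// Hypothetical shape of the timeout split described above.
const DEFAULT_STORE_TIMEOUT_SECS: u64 = 10;         // GET path, unchanged
const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270; // merkle PUT path only

struct ClientConfig {
    store_timeout_secs: u64,        // read by the chunk GET path
    merkle_store_timeout_secs: u64, // read only by the merkle PUT path
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            store_timeout_secs: DEFAULT_STORE_TIMEOUT_SECS,
            merkle_store_timeout_secs: DEFAULT_MERKLE_STORE_TIMEOUT_SECS,
        }
    }
}

impl ClientConfig {
    // Only merkle proofs see the long timeout; everything else keeps the
    // old value, so bumping the merkle knob can't change GET behavior.
    fn store_response_timeout_for_proof(&self, is_merkle: bool) -> u64 {
        if is_merkle {
            self.merkle_store_timeout_secs
        } else {
            self.store_timeout_secs
        }
    }
}

fn main() {
    let cfg = ClientConfig::default();
    assert_eq!(cfg.store_response_timeout_for_proof(true), 270);
    assert_eq!(cfg.store_response_timeout_for_proof(false), 10);
}
```

Routing through a single accessor keeps the "which knob governs which path" question answerable in one place, which is what the strengthened regression test pins.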
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`. Passing the workflow's GITHUB_TOKEN authenticates the call, raising the cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
Setting GITHUB_TOKEN on foundry-toolchain@v1 didn't help: foundryup itself does not read $GITHUB_TOKEN before calling api.github.com to resolve the nightly tag, so macOS runners on shared egress IPs still hit the 60/h anonymous rate limit and 403 every install. Pin to v1.3.6 (last stable as of 2026-05-12) and curl the release tarball directly from the GitHub Releases CDN. Release assets are served from a CDN unaffected by the API rate limit. Unpacks anvil/forge/cast/chisel to /usr/local/bin and verifies versions.
When a merkle batch upload fails partway through (network flake, slow close-K, client crash), the on-chain payment is lost, but the proofs needed to re-attempt the store are lost too — the user has to pay again from scratch.

This change persists the MerkleBatchPaymentResult to disk immediately after the on-chain payment confirms, then re-loads it on the next upload of the same file path. The cache is keyed by a hash of the source path; a successful upload deletes the cache, while a partial failure leaves it for the next attempt to pick up. Files older than the on-chain payment expiration (7 days) are GC'd opportunistically. The library handles save/load/delete transparently — no CLI flag and no app-level change needed. If the cached receipt doesn't match the current file content (file edited between attempts), the cache is discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to MerkleBatchPaymentResult and threading the on-chain payment timestamp through. The new module also handles its own failure modes defensively: any IO/serialization error is logged but never bubbled up to break the upload itself. Cache misses are silent.
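The "logged but never bubbled up" contract can be sketched in miniature. `try_load` is a stand-in for the real module's loader; the real code reads a serialized MerkleBatchPaymentResult rather than raw bytes.

```rust
use std::fs;
use std::path::Path;

// Hypothetical sketch of the best-effort contract described above: any IO
// error degrades to "no resume" (None) instead of failing the upload.
fn try_load(cache_path: &Path) -> Option<Vec<u8>> {
    match fs::read(cache_path) {
        Ok(bytes) => Some(bytes),
        Err(e) => {
            // A missing or busted cache must never break the upload itself:
            // log and fall through to paying fresh.
            eprintln!("merkle receipt cache unavailable at {:?}: {e}", cache_path);
            None
        }
    }
}

fn main() {
    // Missing file: silent-ish miss, upload proceeds with a fresh payment.
    assert!(try_load(Path::new("/nonexistent/merkle_receipt")).is_none());

    // Present file: receipt bytes come back for resume.
    let p = std::env::temp_dir().join("merkle_receipt_demo");
    fs::write(&p, b"{}").unwrap();
    assert_eq!(try_load(&p), Some(b"{}".to_vec()));
    let _ = fs::remove_file(&p);
}
```

Returning `Option` instead of `Result` at the module boundary makes it structurally impossible for a cache error to abort an upload.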
Summary
Persists the merkle batch payment receipt to disk between the on-chain payment and the chunk-store phase. If the store phase fails (network flake, slow close-K, client crash, machine reboot), the next invocation transparently loads the cached receipt and resumes from where it left off, instead of forcing the user to pay again.
This is the next merkle fix on the path to "merkle works on prod". It depends on Mick's stability bundle merging first (it's based off fix/stability-improvements).

Motivation
Live prod merkle upload of a 730 MB file on 2026-05-12 paid 2.78 ANT on-chain, stored 28 of 178 chunks, then died with `Stored on 1 peers, need 4` on a chunk whose close-K had 4 unresponsive storers. The on-chain payment was unrecoverable because the proofs and merkle salts only lived in process memory. Re-uploading the same file would have required paying another 2.78 ANT.

Prod nodes will keep being intermittently slow or dead for the foreseeable future, so the right answer is to make merkle uploads survive a partial failure instead of trying to make every node always respond in time. (Mick's stability bundle already takes a swing at the latter.)
Design
A new private module `ant-core/src/data/client/cached_merkle` is responsible for:

- `try_save(file_path, &MerkleBatchPaymentResult)` — called immediately after the on-chain payment confirms, before the first chunk PUT. Writes JSON to `<data_dir>/payments/<timestamp>_<file_hash>`.
- `try_load_for_file(file_path)` — called at the top of every merkle upload. If a non-expired cached receipt exists, the upload skips the pay phase entirely. Logs `"Found previous merkle upload attempt for <path>, resuming with payment cached at <path>"`.
- `try_delete_for_file(file_path)` — called after a fully successful upload. The cache is per-file, so a future re-upload of the same path pays anew.
- `cleanup_outdated()` — called opportunistically on every load. Reaps receipts whose `merkle_payment_timestamp` is past the 7-day on-chain expiry.

Filename = `{timestamp}_{file_hash}`, where `timestamp` is the payment timestamp used on-chain (so expiry matches the contract) and `file_hash` is `DefaultHasher` of the source path (collisions are content-validated against current chunk addresses, so a stale hit is detected and dropped).

All error paths in the cache module are best-effort: serialization failures, IO errors, and missing dirs all degrade to "no resume". A busted cache directory never breaks an upload.
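The filename scheme and expiry check described above can be sketched as follows. `cache_file_name` and `is_expired` are illustrative helper names, not the module's actual API; note that `DefaultHasher` output is deterministic within a process but not guaranteed stable across Rust releases, so the content-validation fallback mentioned above matters.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// 7-day on-chain payment expiry, mirroring the contract-side window.
const PAYMENT_EXPIRY_SECS: u64 = 7 * 24 * 60 * 60;

// {timestamp}_{file_hash}: timestamp is the on-chain payment timestamp
// (so filename expiry matches contract expiry), file_hash keys the cache
// to one source path.
fn cache_file_name(payment_timestamp: u64, source_path: &str) -> String {
    let mut h = DefaultHasher::new();
    source_path.hash(&mut h);
    format!("{}_{:x}", payment_timestamp, h.finish())
}

// Used by the opportunistic GC: a receipt older than the on-chain expiry
// can never be redeemed, so it is safe to reap.
fn is_expired(payment_timestamp: u64, now: u64) -> bool {
    now.saturating_sub(payment_timestamp) > PAYMENT_EXPIRY_SECS
}

fn main() {
    let a = cache_file_name(1_777_000_000, "/videos/demo.mp4");
    let b = cache_file_name(1_777_000_000, "/videos/demo.mp4");
    assert_eq!(a, b); // same path, same run: same cache key
    assert!(a.starts_with("1777000000_"));
    assert!(!is_expired(100, 100 + PAYMENT_EXPIRY_SECS)); // exactly at the edge: keep
    assert!(is_expired(100, 101 + PAYMENT_EXPIRY_SECS)); // past it: reap
}
```

Putting the payment timestamp in the filename lets `cleanup_outdated()` decide expiry from a directory listing alone, without deserializing each receipt.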
Behaviour change
- All cache operations are `try_*`, so failures are logged but not fatal.
- A partial failure leaves the receipt in `<data_dir>/payments/`. The user can leave it (and re-try the upload) or delete it manually if they want a fresh pay. The path is logged in the warning message so it's discoverable.
- On resume, the client logs `"Found previous merkle upload attempt..."` and skips straight to store. No CLI flag, no opt-in.

Tests
- Unit tests cover `file_hash_key` stability, expired/fresh filename detection, the roundtrip save → load → delete, and malformed filenames not being auto-deleted.
- The case where a `MerkleBatchPaymentResult` was never constructed: correct behaviour is nothing to resume and no garbage on disk.

Future work
- The `InsufficientPeers` path does not currently use the cache (single-node uploads are per-chunk and re-quoting is cheap). If this becomes a hotspot we can plug it into the same module.
- On resume, storers can answer `AlreadyExists` cheaply if they already have the chunk, but the wave still spends bandwidth re-PUTting the payload. A future iteration could persist the set of stored addresses alongside the receipt.