
fix(client): match DEFAULT_STORE_TIMEOUT_SECS to storer CLOSENESS_LOOKUP_TIMEOUT+padding #83

Open

grumbach wants to merge 5 commits into WithAutonomi:main from grumbach:grumbach/fix-merkle-store-response-timeout

Conversation

@grumbach (Contributor) commented May 12, 2026

Summary

Adds merkle_store_timeout_secs (default 270 s) to ClientConfig and routes the merkle batch PUT response timeout through it.
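A minimal sketch of the config change, assuming a plain struct with a hand-written Default impl; only the new field and DEFAULT_MERKLE_STORE_TIMEOUT_SECS come from this PR, and the rest of ClientConfig is elided:

```rust
/// Default for the new knob; matches ant-node's CLOSENESS_LOOKUP_TIMEOUT
/// (240 s) plus 30 s of padding (see "Reasoning for 270 s" below).
pub const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270;

pub struct ClientConfig {
    /// Store-response timeout for merkle batch PUTs, in seconds.
    pub merkle_store_timeout_secs: u64,
    // ...other fields elided...
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            merkle_store_timeout_secs: DEFAULT_MERKLE_STORE_TIMEOUT_SECS,
        }
    }
}
```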

The storer-side merkle payment verifier runs an iterative DHT lookup with CLOSENESS_LOOKUP_TIMEOUT = 240 s (ant-node, post-PR #89). The old 10 s default fired long before the storer could finish verifying, with three downstream costs:

  1. The storer keeps working on a chunk the client has already discarded, wasting CPU and bandwidth.
  2. The client re-targets a different close-K member and may double-store the same chunk on a different peer set.
  3. Cross-region close-K membership (sgp1 / syd1 storers serving a lon1 client) makes this happen on virtually every merkle chunk, not just a tail.

This is the fifth merkle failure mode, orthogonal to A+G (#89), clock-skew (#91), and the eviction trio (#114/#90/#77).

Scope: merkle PUT only

This PR intentionally does not touch the non-merkle PUT path (already uses the STORE_RESPONSE_TIMEOUT const in chunk.rs, currently 30 s) or the chunk GET path (currently reads store_timeout_secs, which stays at 10 s — same as main).

The GET-path rename is Mick's PR #78, which adds a dedicated chunk_get_timeout_secs field. After both PRs land, the three timeout regions (merkle PUT / non-merkle PUT / GET) are cleanly separated:

| Path | Knob | Default |
| --- | --- | --- |
| merkle PUT response | merkle_store_timeout_secs (this PR) | 270 s |
| non-merkle PUT response | STORE_RESPONSE_TIMEOUT const | 30 s (10 s post Mick #78) |
| chunk GET response | chunk_get_timeout_secs (Mick #78) | 10 s |

Reasoning for 270 s

Client timeout = storer timeout + 30 s padding (store-response RTT + storer-local LMDB put / fsync + clock-skew tolerance). 240 + 30 = 270.

Invariant: client merkle store-response timeout >= node CLOSENESS_LOOKUP_TIMEOUT + padding. Pinned by the new default_merkle_store_timeout_satisfies_storer_invariant regression test so it cannot silently re-break.
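A sketch of the invariant that test pins; mirroring the storer-side constant locally is an assumption about how the real test is written, while the numbers and the test name come from this PR:

```rust
const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270; // from the sketch above
const STORER_CLOSENESS_LOOKUP_TIMEOUT_SECS: u64 = 240; // mirrors ant-node's CLOSENESS_LOOKUP_TIMEOUT
const PADDING_SECS: u64 = 30; // store-response RTT + LMDB put/fsync + clock-skew tolerance

#[test]
fn default_merkle_store_timeout_satisfies_storer_invariant() {
    // 270 >= 240 + 30 holds exactly; bumping either side forces a rethink here.
    assert!(
        DEFAULT_MERKLE_STORE_TIMEOUT_SECS
            >= STORER_CLOSENESS_LOOKUP_TIMEOUT_SECS + PADDING_SECS
    );
}
```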

Empirical validation

Ran a 210-node 7-region testnet (lon1, fra1, sgp1, nyc1, ams3, tor1, syd1) on a branch combining main + Mick's fix/stability-improvements + the eviction trio (#114, #90, #77):

| merkle_store_timeout_secs | Multi-region merkle uploads |
| --- | --- |
| 10 s (current main) | 0 / 22 pass (every chunk times out) |
| 270 s (this PR) | 9 / 9 pass (every iteration completes) |

Failure signature on the unpatched side:

```
Failed to store chunk on <peer-id>: timeout: Timeout waiting for store response after 10s
[...]
Error: Upload failed: 2/16 stored, 1 failed:
  insufficient peers: Stored on 1 peers, need 4
```

Findings doc in the saorsa repo under notes/anselme-testnet-2026-05-12/findings.md.

Tests

  • single_node_proof_uses_store_response_timeout (existing)
  • unknown_proof_uses_store_response_timeout (existing)
  • merkle_proof_uses_configured_store_timeout (existing)
  • default_merkle_store_timeout_satisfies_storer_invariant — NEW: pins the invariant that merkle_store_timeout_secs >= 240 + 30
  • non_merkle_put_ignores_merkle_timeout_value — NEW: pins that non-merkle proof tags ignore whatever value is passed for the merkle timeout
  • All 248 ant-core lib tests pass
  • 210-node 7-region anselme-testnet: 9/9 merkle uploads pass post-fix vs 0/22 pre-fix

Future work / not in this PR

  • A --merkle-store-timeout-secs CLI flag for ad-hoc tuning on residential / very-far-region setups. Not added here to keep the diff tight; can follow up (a hedged sketch follows this list).
  • Field rename of store_timeout_secs to its actual role (chunk GET) is Mick's Split chunk GET timeout from store timeout #78.
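For reference, a sketch of what that follow-up flag could look like with clap's derive API; the struct and flag names are assumptions modeled on the config field this PR adds, nothing here is in the diff:

```rust
use clap::Parser;

/// Hypothetical CLI surface for the follow-up flag.
#[derive(Parser)]
struct Cli {
    /// Override the merkle batch PUT store-response timeout (seconds).
    #[arg(long, default_value_t = 270)]
    merkle_store_timeout_secs: u64,
}
```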

fix(client): match DEFAULT_STORE_TIMEOUT_SECS to storer CLOSENESS_LOOKUP_TIMEOUT+padding

Bump DEFAULT_STORE_TIMEOUT_SECS from 10s to 270s for merkle batch PUTs.

The storer-side merkle payment verifier runs an iterative DHT lookup
with CLOSENESS_LOOKUP_TIMEOUT = 240s (ant-node, post-PR #89). The
old 10s client-side timeout fired long before the storer could
finish verifying, with three downstream costs:

1. The storer keeps working on a chunk the client has already
   discarded, wasting CPU and bandwidth.
2. The client re-targets a different close-K member and may
   double-store the same chunk on a different peer set.
3. Cross-region close-K membership (sgp1 / syd1 storers serving
   a lon1 client) makes this happen on virtually every merkle
   chunk, not just a tail.

Set client timeout = storer timeout + 30s padding (store-response
RTT + storer-local LMDB put/fsync + clock skew tolerance).

Invariant: client store-response timeout >= node
CLOSENESS_LOOKUP_TIMEOUT + padding. Re-validate if either side's
value changes.
grumbach added 4 commits May 12, 2026 16:06
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
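A compact, self-contained sketch of that routing plus the strengthened test; the ProofTag enum and the function signature are illustrative assumptions, while the 30 s constant, the proof kinds, and the test name come from this PR and its test list:

```rust
use std::time::Duration;

/// Non-merkle PUTs keep using this const (chunk.rs), untouched by this PR.
const STORE_RESPONSE_TIMEOUT: Duration = Duration::from_secs(30);

/// Illustrative stand-in for the real proof-tag type; the PR's tests name
/// single-node, unknown, and merkle proof kinds.
enum ProofTag {
    SingleNode,
    Unknown,
    Merkle,
}

/// Only merkle proofs read the new ClientConfig field; everything else is
/// routed to the existing constant.
fn store_response_timeout_for_proof(merkle_store_timeout_secs: u64, proof: &ProofTag) -> Duration {
    match proof {
        ProofTag::Merkle => Duration::from_secs(merkle_store_timeout_secs),
        ProofTag::SingleNode | ProofTag::Unknown => STORE_RESPONSE_TIMEOUT,
    }
}

#[test]
fn non_merkle_put_ignores_merkle_timeout_value() {
    // Even a pathological merkle timeout must not leak into non-merkle proofs.
    let absurd_secs = 999_999;
    assert_eq!(
        store_response_timeout_for_proof(absurd_secs, &ProofTag::SingleNode),
        STORE_RESPONSE_TIMEOUT
    );
    assert_eq!(
        store_response_timeout_for_proof(absurd_secs, &ProofTag::Unknown),
        STORE_RESPONSE_TIMEOUT
    );
}
```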
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`.

The first attempt was to pass the workflow's GITHUB_TOKEN, which authenticates the call and raises the cap to 1000/hour per token (the same fix Mick's PR WithAutonomi#78 will want). Setting GITHUB_TOKEN on foundry-toolchain@v1 didn't help, though: foundryup itself never reads $GITHUB_TOKEN before calling api.github.com to resolve the nightly tag, so macOS runners on shared egress IPs still hit the 60/h anonymous rate limit and 403 on every install.

Instead, pin to v1.3.6 (the last stable release as of 2026-05-12) and curl the release tarball directly from the GitHub Releases CDN, which is unaffected by the API rate limit. The step unpacks anvil/forge/cast/chisel to /usr/local/bin and verifies versions.
