
fix(client): match DEFAULT_STORE_TIMEOUT_SECS to storer CLOSENESS_LOOKUP_TIMEOUT+padding #83

Open

grumbach wants to merge 5 commits into WithAutonomi:main from grumbach:grumbach/fix-merkle-store-response-timeout

Conversation

@grumbach (Contributor) commented May 12, 2026

Summary

Adds merkle_store_timeout_secs (default 270 s) to ClientConfig and routes the merkle batch PUT response timeout through it.
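A minimal sketch of the config change, assuming a plain struct with a hand-written Default impl; only the new field and DEFAULT_MERKLE_STORE_TIMEOUT_SECS come from this PR, and the rest of ClientConfig is elided:

```rust
/// Default for the new knob; matches ant-node's CLOSENESS_LOOKUP_TIMEOUT
/// (240 s) plus 30 s of padding (see "Reasoning for 270 s" below).
pub const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270;

pub struct ClientConfig {
    /// Store-response timeout for merkle batch PUTs, in seconds.
    pub merkle_store_timeout_secs: u64,
    // ...other fields elided...
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            merkle_store_timeout_secs: DEFAULT_MERKLE_STORE_TIMEOUT_SECS,
        }
    }
}
```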

The storer-side merkle payment verifier runs an iterative DHT lookup with CLOSENESS_LOOKUP_TIMEOUT = 240 s (ant-node, post-PR #89). The old 10 s default fired long before the storer could finish verifying, with three downstream costs:

  1. The storer keeps working on a chunk the client has already discarded, wasting CPU and bandwidth.
  2. The client re-targets a different close-K member and may double-store the same chunk on a different peer set.
  3. Cross-region close-K membership (sgp1 / syd1 storers serving a lon1 client) makes this happen on virtually every merkle chunk, not just a tail.

This is the fifth merkle failure mode, orthogonal to A+G (#89), clock-skew (#91), and the eviction trio (#114/#90/#77).

Scope: merkle PUT only

This PR intentionally does not touch the non-merkle PUT path (already uses the STORE_RESPONSE_TIMEOUT const in chunk.rs, currently 30 s) or the chunk GET path (currently reads store_timeout_secs, which stays at 10 s — same as main).

The GET-path rename is Mick's PR #78, which adds a dedicated chunk_get_timeout_secs field. After both PRs land, the three timeout regions (merkle PUT / non-merkle PUT / GET) are cleanly separated:

| Path | Knob | Default |
| --- | --- | --- |
| merkle PUT response | merkle_store_timeout_secs (this PR) | 270 s |
| non-merkle PUT response | STORE_RESPONSE_TIMEOUT const | 30 s (10 s post Mick #78) |
| chunk GET response | chunk_get_timeout_secs (Mick #78) | 10 s |

Reasoning for 270 s

Client timeout = storer timeout + 30 s padding (store-response RTT + storer-local LMDB put / fsync + clock-skew tolerance). 240 + 30 = 270.

Invariant: client merkle store-response timeout >= node CLOSENESS_LOOKUP_TIMEOUT + padding. Pinned by the new default_merkle_store_timeout_satisfies_storer_invariant regression test so it cannot silently re-break.
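A sketch of the invariant that test pins; mirroring the storer-side constant locally is an assumption about how the real test is written, while the numbers and the test name come from this PR:

```rust
const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270; // from the sketch above
const STORER_CLOSENESS_LOOKUP_TIMEOUT_SECS: u64 = 240; // mirrors ant-node's CLOSENESS_LOOKUP_TIMEOUT
const PADDING_SECS: u64 = 30; // store-response RTT + LMDB put/fsync + clock-skew tolerance

#[test]
fn default_merkle_store_timeout_satisfies_storer_invariant() {
    // 270 >= 240 + 30 holds exactly; bumping either side forces a rethink here.
    assert!(
        DEFAULT_MERKLE_STORE_TIMEOUT_SECS
            >= STORER_CLOSENESS_LOOKUP_TIMEOUT_SECS + PADDING_SECS
    );
}
```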

Empirical validation

Ran a 210-node 7-region testnet (lon1, fra1, sgp1, nyc1, ams3, tor1, syd1) on a branch combining main + Mick's fix/stability-improvements + the eviction trio (#114, #90, #77):

| merkle_store_timeout_secs | Multi-region merkle uploads |
| --- | --- |
| 10 s (current main) | 0 / 22 pass (every chunk times out) |
| 270 s (this PR) | 9 / 9 pass (every iteration completes) |

Failure signature on the unpatched side:

```
Failed to store chunk on <peer-id>: timeout: Timeout waiting for store response after 10s
[...]
Error: Upload failed: 2/16 stored, 1 failed:
  insufficient peers: Stored on 1 peers, need 4
```

Findings doc in the saorsa repo under notes/anselme-testnet-2026-05-12/findings.md.

Tests

  • single_node_proof_uses_store_response_timeout (existing)
  • unknown_proof_uses_store_response_timeout (existing)
  • merkle_proof_uses_configured_store_timeout (existing)
  • default_merkle_store_timeout_satisfies_storer_invariant — NEW: pins the invariant that merkle_store_timeout_secs >= 240 + 30
  • non_merkle_put_ignores_merkle_timeout_value — NEW: pins that non-merkle proof tags ignore whatever value is passed for the merkle timeout
  • All 248 ant-core lib tests pass
  • 210-node 7-region anselme-testnet: 9/9 merkle uploads pass post-fix vs 0/22 pre-fix

Future work / not in this PR

  • A --merkle-store-timeout-secs CLI flag for ad-hoc tuning on residential / very-far-region setups. Not added here to keep the diff tight; can follow up (a hedged sketch follows this list).
  • Field rename of store_timeout_secs to its actual role (chunk GET) is Mick's Split chunk GET timeout from store timeout #78.
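For reference, a sketch of what that follow-up flag could look like with clap's derive API; the struct and flag names are assumptions modeled on the config field this PR adds, nothing here is in the diff:

```rust
use clap::Parser;

/// Hypothetical CLI surface for the follow-up flag.
#[derive(Parser)]
struct Cli {
    /// Override the merkle batch PUT store-response timeout (seconds).
    #[arg(long, default_value_t = 270)]
    merkle_store_timeout_secs: u64,
}
```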

fix(client): match DEFAULT_STORE_TIMEOUT_SECS to storer CLOSENESS_LOOKUP_TIMEOUT+padding

Bump DEFAULT_STORE_TIMEOUT_SECS from 10s to 270s for merkle batch PUTs.

The storer-side merkle payment verifier runs an iterative DHT lookup
with CLOSENESS_LOOKUP_TIMEOUT = 240s (ant-node, post-PR #89). The
old 10s client-side timeout fired long before the storer could
finish verifying, with three downstream costs:

1. The storer keeps working on a chunk the client has already
   discarded, wasting CPU and bandwidth.
2. The client re-targets a different close-K member and may
   double-store the same chunk on a different peer set.
3. Cross-region close-K membership (sgp1 / syd1 storers serving
   a lon1 client) makes this happen on virtually every merkle
   chunk, not just a tail.

Set client timeout = storer timeout + 30s padding (store-response
RTT + storer-local LMDB put/fsync + clock skew tolerance).

Invariant: client store-response timeout >= node
CLOSENESS_LOOKUP_TIMEOUT + padding. Re-validate if either side's
value changes.
grumbach added 4 commits May 12, 2026 16:06
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
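A compact, self-contained sketch of that routing plus the strengthened test; the ProofTag enum and the function signature are illustrative assumptions, while the 30 s constant, the proof kinds, and the test name come from this PR and its test list:

```rust
use std::time::Duration;

/// Non-merkle PUTs keep using this const (chunk.rs), untouched by this PR.
const STORE_RESPONSE_TIMEOUT: Duration = Duration::from_secs(30);

/// Illustrative stand-in for the real proof-tag type; the PR's tests name
/// single-node, unknown, and merkle proof kinds.
enum ProofTag {
    SingleNode,
    Unknown,
    Merkle,
}

/// Only merkle proofs read the new ClientConfig field; everything else is
/// routed to the existing constant.
fn store_response_timeout_for_proof(merkle_store_timeout_secs: u64, proof: &ProofTag) -> Duration {
    match proof {
        ProofTag::Merkle => Duration::from_secs(merkle_store_timeout_secs),
        ProofTag::SingleNode | ProofTag::Unknown => STORE_RESPONSE_TIMEOUT,
    }
}

#[test]
fn non_merkle_put_ignores_merkle_timeout_value() {
    // Even a pathological merkle timeout must not leak into non-merkle proofs.
    let absurd_secs = 999_999;
    assert_eq!(
        store_response_timeout_for_proof(absurd_secs, &ProofTag::SingleNode),
        STORE_RESPONSE_TIMEOUT
    );
    assert_eq!(
        store_response_timeout_for_proof(absurd_secs, &ProofTag::Unknown),
        STORE_RESPONSE_TIMEOUT
    );
}
```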
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`.

The first attempt was to pass the workflow's GITHUB_TOKEN, which authenticates the call and raises the cap to 1000/hour per token (the same fix Mick's PR WithAutonomi#78 will want). Setting GITHUB_TOKEN on foundry-toolchain@v1 didn't help, though: foundryup itself never reads $GITHUB_TOKEN before calling api.github.com to resolve the nightly tag, so macOS runners on shared egress IPs still hit the 60/h anonymous rate limit and 403 on every install.

Instead, pin to v1.3.6 (the last stable release as of 2026-05-12) and curl the release tarball directly from the GitHub Releases CDN, which is unaffected by the API rate limit. The step unpacks anvil/forge/cast/chisel to /usr/local/bin and verifies versions.
