Split chunk GET timeout from store timeout by mickvandijke · Pull Request #78 · WithAutonomi/ant-client

mickvandijke · 2026-05-09T21:38:13Z

Summary

Add an independent chunk_get_timeout_secs client config value and hidden --chunk-get-timeout-secs CLI flag.
Keep store_timeout_secs scoped to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.
Reduce the non-Merkle chunk PUT store-response timeout from 30 seconds to 10 seconds.

Testing

Not run; PR opened from the existing committed branch state.

PUTs and GETs have different payload directions and performance profiles, so a single shared timeout was a poor fit. Adds `chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a matching `--chunk-get-timeout-secs` CLI flag, while keeping `store_timeout_secs` for the PUT path. Also bumps the non-merkle store-response timeout from 5s to 10s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…bump related package versions in Cargo.lock

Semver: patch

Previously `cleanup_stale` skipped any spill dir younger than 24h, even if its lockfile was already releasable. The lockfile is the actual correctness gate: a releasable lock means the owning `ChunkSpill` is dropped or the owning process is gone. The age guard only ever needed to cover the sub-millisecond TOCTOU window between `create_dir` and `try_lock_exclusive` inside `ChunkSpill::new`. The 24h policy was hiding a real leak on hosts where `ant` exits non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run on those paths, so the dir is left in `~/.local/share/ant/spill/` with its lock released. The next upload would not reap it. Under a systemd restart loop, hundreds of `spill_*` dirs accumulate per hour — each holding the encrypted chunks of one upload (= upload file size) — and fill the disk well before the 24h grace expires. Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the lockfile. No other behaviour changes; the lockfile + symlink guard already covered the safety surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`chunk_put_with_proof` was deep-copying every chunk's content into a fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a `ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller caps concurrent chunks at 64, peak in-flight chunk content reached ~64 × 5 × 4 MB = 2.5 GB just from these copies — enough to OOM-kill `ant` on a 4 GB client VM partway through a 300 MB-or-larger upload. Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`, with 168 MB of that consumed via `ChunkMessage::encode` → `chunk_put_to_close_group` — i.e. the per-recipient copy + encode loop. ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes` (see jacderida/ant-protocol#1), so the caller can pass the refcounted `Bytes` straight through. Each peer's spawned task now shares the single 4 MB backing buffer instead of holding an independent copy. Wire format is unchanged. The patch.crates-io stanza pins ant-protocol to the perf branch commit so the build resolves the Bytes-typed field. The pin should be removed once a crates.io release containing the protocol change is published. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

grumbach

Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.

One thing to flag: the non-merkle STORE_RESPONSE_TIMEOUT drop from 30s to 10s in this PR is fine for the non-merkle path (fail fast + retry, as the description says), but the merkle path uses the same store_timeout_secs config and ends up bound by it.

I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.

Either:

merge #78 first; my #83 rebases on top and bumps the default further.
merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.

Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).

…anged Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent. This commit: - Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const - Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270) - Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field - Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged - Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const) - Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value Coordinates with Mick's PR WithAutonomi#78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.

… limit foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`. Passing the workflow's GITHUB_TOKEN authenticates the call, raising the cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.

mickvandijke and others added 2 commits May 9, 2026 23:40

fix(chunk): reduce store-response timeout for non-merkle PUTs

2ff8cc9

mickvandijke force-pushed the fix/stability-improvements branch 2 times, most recently from 0d760b8 to 8bcdc8b Compare May 10, 2026 16:41

mickvandijke added 2 commits May 10, 2026 22:35

chore: update Git branches for ant-protocol and saorsa dependencies, …

7fdeb20

…bump related package versions in Cargo.lock

fix(tests): ignore message transport provenance

cdb0945

Semver: patch

mickvandijke force-pushed the fix/stability-improvements branch from 8bcdc8b to cdb0945 Compare May 10, 2026 20:35

jacderida and others added 2 commits May 11, 2026 17:32

grumbach approved these changes May 12, 2026

View reviewed changes

grumbach mentioned this pull request May 12, 2026

fix(client): match DEFAULT_STORE_TIMEOUT_SECS to storer CLOSENESS_LOOKUP_TIMEOUT+padding #83

Open

7 tasks

grumbach mentioned this pull request May 12, 2026

feat(client): resumable merkle upload (auto-load cached payment receipt) #84

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split chunk GET timeout from store timeout#78

Split chunk GET timeout from store timeout#78
mickvandijke wants to merge 6 commits into
mainfrom
fix/stability-improvements

mickvandijke commented May 9, 2026

Uh oh!

grumbach left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mickvandijke commented May 9, 2026

Summary

Testing

Uh oh!

grumbach left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants