Skip to content

Split chunk GET timeout from store timeout#78

Open
mickvandijke wants to merge 6 commits into
mainfrom
fix/stability-improvements
Open

Split chunk GET timeout from store timeout#78
mickvandijke wants to merge 6 commits into
mainfrom
fix/stability-improvements

Conversation

@mickvandijke
Copy link
Copy Markdown
Contributor

Summary

  • Add an independent chunk_get_timeout_secs client config value and hidden --chunk-get-timeout-secs CLI flag.
  • Keep store_timeout_secs scoped to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.
  • Reduce the non-Merkle chunk PUT store-response timeout from 30 seconds to 10 seconds.

Testing

  • Not run; PR opened from the existing committed branch state.

mickvandijke and others added 2 commits May 9, 2026 23:40
PUTs and GETs have different payload directions and performance
profiles, so a single shared timeout was a poor fit. Adds
`chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a
matching `--chunk-get-timeout-secs` CLI flag, while keeping
`store_timeout_secs` for the PUT path. Also bumps the non-merkle
store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke force-pushed the fix/stability-improvements branch 2 times, most recently from 0d760b8 to 8bcdc8b Compare May 10, 2026 16:41
@mickvandijke mickvandijke force-pushed the fix/stability-improvements branch from 8bcdc8b to cdb0945 Compare May 10, 2026 20:35
jacderida and others added 2 commits May 11, 2026 17:32
Previously `cleanup_stale` skipped any spill dir younger than 24h, even
if its lockfile was already releasable. The lockfile is the actual
correctness gate: a releasable lock means the owning `ChunkSpill` is
dropped or the owning process is gone. The age guard only ever needed
to cover the sub-millisecond TOCTOU window between `create_dir` and
`try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits
non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run
on those paths, so the dir is left in `~/.local/share/ant/spill/` with
its lock released. The next upload would not reap it. Under a systemd
restart loop, hundreds of `spill_*` dirs accumulate per hour — each
holding the encrypted chunks of one upload (= upload file size) — and
fill the disk well before the 24h grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the
lockfile. No other behaviour changes; the lockfile + symlink guard
already covered the safety surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`chunk_put_with_proof` was deep-copying every chunk's content into a
fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a
`ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these
per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller
caps concurrent chunks at 64, peak in-flight chunk content reached
~64 × 5 × 4 MB = 2.5 GB just from these copies — enough to OOM-kill
`ant` on a 4 GB client VM partway through a 300 MB-or-larger upload.

Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB
VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`,
with 168 MB of that consumed via `ChunkMessage::encode` →
`chunk_put_to_close_group` — i.e. the per-recipient copy + encode loop.

ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes`
(see jacderida/ant-protocol#1), so the caller
can pass the refcounted `Bytes` straight through. Each peer's spawned
task now shares the single 4 MB backing buffer instead of holding an
independent copy. Wire format is unchanged.

The patch.crates-io stanza pins ant-protocol to the perf branch commit
so the build resolves the Bytes-typed field. The pin should be removed
once a crates.io release containing the protocol change is published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@grumbach grumbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.

One thing to flag: the non-merkle STORE_RESPONSE_TIMEOUT drop from 30s to 10s in this PR is fine for the non-merkle path (fail fast + retry, as the description says), but the merkle path uses the same store_timeout_secs config and ends up bound by it.

I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.

Either:

  • merge #78 first; my #83 rebases on top and bumps the default further.
  • merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.

Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).

grumbach added a commit to grumbach/ant-client that referenced this pull request May 12, 2026
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
grumbach added a commit to grumbach/ant-client that referenced this pull request May 12, 2026
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous
calls are rate-limited at 60/hour shared per IP; macOS runners hit this
regularly and fail every E2E and Merkle E2E job with
`curl: (56) ... 403`.

Passing the workflow's GITHUB_TOKEN authenticates the call, raising the
cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants