Split chunk GET timeout from store timeout #78
Conversation
PUTs and GETs have different payload directions and performance profiles, so a single shared timeout was a poor fit. Adds `chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a matching `--chunk-get-timeout-secs` CLI flag, while keeping `store_timeout_secs` for the PUT path. Also bumps the non-merkle store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
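A minimal sketch of the resulting config shape (the two field names and the 10s GET default come from this PR; the `Default` impl, the PUT default, and the helper are illustrative assumptions):

```rust
use std::time::Duration;

/// Sketch only: field names follow the PR description; defaults and
/// the helper method are assumptions for illustration.
pub struct ClientConfig {
    /// Pre-existing knob, now scoped to the chunk store/PUT path.
    pub store_timeout_secs: u64,
    /// New: chunk GETs get their own timeout (default 10s).
    pub chunk_get_timeout_secs: u64,
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            store_timeout_secs: 10,     // assumed
            chunk_get_timeout_secs: 10, // documented 10s default
        }
    }
}

impl ClientConfig {
    pub fn chunk_get_timeout(&self) -> Duration {
        Duration::from_secs(self.chunk_get_timeout_secs)
    }
}
```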
force-pushed from 0d760b8 to 8bcdc8b
…bump related package versions in Cargo.lock
Semver: patch
force-pushed from 8bcdc8b to cdb0945
Previously `cleanup_stale` skipped any spill dir younger than 24h, even if its lockfile was already releasable. The lockfile is the actual correctness gate: a releasable lock means the owning `ChunkSpill` is dropped or the owning process is gone. The age guard only ever needed to cover the sub-millisecond TOCTOU window between `create_dir` and `try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run on those paths, so the dir is left in `~/.local/share/ant/spill/` with its lock released. The next upload would not reap it. Under a systemd restart loop, hundreds of `spill_*` dirs accumulate per hour — each holding the encrypted chunks of one upload (= upload file size) — and fill the disk well before the 24h grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the lockfile. No other behaviour changes; the lockfile + symlink guard already covered the safety surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
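Roughly, the reaping rule now looks like the sketch below. The 30-second grace and the lock-gating order mirror the commit message; the lockfile name, directory layout, and the `fs2` crate are assumptions.

```rust
use std::{
    fs,
    path::Path,
    time::{Duration, SystemTime},
};

use fs2::FileExt; // assumed lock crate; any advisory try-lock works

const TOCTOU_GRACE: Duration = Duration::from_secs(30);

/// Age only covers the create_dir/try_lock_exclusive window inside
/// ChunkSpill::new; a releasable lockfile is the real reap signal.
/// (Filtering to `spill_*` names is elided for brevity.)
fn cleanup_stale(spill_root: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(spill_root)? {
        let dir = entry?.path();
        let age = fs::metadata(&dir)?
            .modified()
            .ok()
            .and_then(|m| SystemTime::now().duration_since(m).ok())
            .unwrap_or_default();
        if age < TOCTOU_GRACE {
            continue; // inside the TOCTOU window: leave it alone
        }
        // Lock releasable => owning ChunkSpill dropped or process dead.
        if let Ok(lock) = fs::File::open(dir.join("lockfile")) {
            if lock.try_lock_exclusive().is_ok() {
                fs::remove_dir_all(&dir)?;
            }
        }
    }
    Ok(())
}
```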
`chunk_put_with_proof` was deep-copying every chunk's content into a fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a `ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller caps concurrent chunks at 64, peak in-flight chunk content reached ~64 × 5 × 4 MB ≈ 1.25 GB just from these copies — enough to OOM-kill `ant` on a 4 GB client VM partway through a 300 MB-or-larger upload.

Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`, with 168 MB of that consumed via `ChunkMessage::encode` → `chunk_put_to_close_group` — i.e. the per-recipient copy + encode loop.

ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes` (see jacderida/ant-protocol#1), so the caller can pass the refcounted `Bytes` straight through. Each peer's spawned task now shares the single 4 MB backing buffer instead of holding an independent copy. Wire format is unchanged.

The patch.crates-io stanza pins ant-protocol to the perf branch commit so the build resolves the Bytes-typed field. The pin should be removed once a crates.io release containing the protocol change is published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
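A condensed sketch of the change, assuming a tokio runtime; `ChunkPutRequest` is simplified to the one field that changed, and the loop stands in for the real per-recipient tasks:

```rust
use bytes::Bytes;

struct ChunkPutRequest {
    // Was Vec<u8>, built per recipient via Bytes::to_vec().
    content: Bytes,
}

fn chunk_put_to_close_group(content: Bytes, recipients: usize) {
    for _ in 0..recipients {
        // Bytes::clone only bumps a refcount: every spawned task now
        // shares the single ~4 MB backing buffer instead of owning a copy.
        let req = ChunkPutRequest {
            content: content.clone(),
        };
        tokio::spawn(async move {
            // encode + send to one recipient (elided)
            drop(req);
        });
    }
}
```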
grumbach left a comment
Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.
One thing to flag: the non-merkle STORE_RESPONSE_TIMEOUT move to 10s in this PR is fine for that path (fail fast + retry), but the merkle path uses the same store_timeout_secs config and ends up bound by it.
I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.
Either:
- merge #78 first; my #83 rebases on top and bumps the default further.
- merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.
Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads `store_timeout_secs`. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated `DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270` const
- Adds `merkle_store_timeout_secs: u64` to `ClientConfig` (default 270)
- Routes only the merkle PUT path (`store_response_timeout_for_proof`) to the new field
- Leaves `DEFAULT_STORE_TIMEOUT_SECS` at 10 (matches current main behavior); the chunk GET path keeps reading `store_timeout_secs` unchanged
- Updates doc comments to be honest about what each knob actually governs (`store_timeout_secs` now governs only the GET path and any direct readers, not non-merkle PUTs, which use the `STORE_RESPONSE_TIMEOUT` const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78, which adds a dedicated `chunk_get_timeout_secs` field. After both land, the three timeout regions (merkle PUT / non-merkle PUT / GET) will be cleanly separated.
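A sketch of that routing, under the assumption that some `ProofKind`-style discriminator separates the two PUT paths (the const and field names come from the commit; everything else is illustrative):

```rust
use std::time::Duration;

/// Non-merkle PUTs keep reading this hardcoded const (10s in this PR).
const STORE_RESPONSE_TIMEOUT: Duration = Duration::from_secs(10);
const DEFAULT_MERKLE_STORE_TIMEOUT_SECS: u64 = 270;

struct ClientConfig {
    merkle_store_timeout_secs: u64,
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            merkle_store_timeout_secs: DEFAULT_MERKLE_STORE_TIMEOUT_SECS,
        }
    }
}

/// Hypothetical discriminator for the two PUT paths.
enum ProofKind {
    Merkle,
    NonMerkle,
}

fn store_response_timeout_for_proof(cfg: &ClientConfig, kind: ProofKind) -> Duration {
    match kind {
        // Only the merkle PUT path is routed to the new config field.
        ProofKind::Merkle => Duration::from_secs(cfg.merkle_store_timeout_secs),
        // Non-merkle PUTs keep the hardcoded constant.
        ProofKind::NonMerkle => STORE_RESPONSE_TIMEOUT,
    }
}
```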
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous calls are rate-limited at 60/hour shared per IP; macOS runners hit this regularly and fail every E2E and Merkle E2E job with `curl: (56) ... 403`. Passing the workflow's GITHUB_TOKEN authenticates the call, raising the cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
Summary
- `chunk_get_timeout_secs` client config value and hidden `--chunk-get-timeout-secs` CLI flag.
- `store_timeout_secs` scoped to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.

Testing