fix(spill): reap orphan dirs on lock release, not after 24h#81
Open
jacderida wants to merge 1 commit into
Previously `cleanup_stale` skipped any spill dir younger than 24h, even if its lockfile was already releasable. The lockfile is the actual correctness gate: a releasable lock means the owning `ChunkSpill` is dropped or the owning process is gone. The age guard only ever needed to cover the sub-millisecond TOCTOU window between `create_dir` and `try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run on those paths, so the dir is left in `~/.local/share/ant/spill/` with its lock released. The next upload would not reap it. Under a systemd restart loop, hundreds of `spill_*` dirs accumulate per hour, each holding the encrypted chunks of one upload (= upload file size), and fill the disk well before the 24h grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the lockfile. No other behaviour changes; the lockfile + symlink guard already covered the safety surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
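The reap policy described in the commit message can be sketched as a pure decision function. This is a minimal illustration, not the code in `ant-core`: `SpillDirState` and `should_reap` are hypothetical names, the 30-second constant is taken from the PR description, and the assumption that the symlink guard simply skips symlinked entries is mine.

```rust
use std::time::Duration;

/// Grace period meant to cover only the window between `create_dir`
/// and `try_lock_exclusive` (value taken from the PR description).
const SPILL_STALE_GRACE_SECS: u64 = 30;

/// Hypothetical summary of what a cleanup pass observed for one `spill_*` entry.
struct SpillDirState {
    /// Time since the directory was created.
    age: Duration,
    /// True if the cleanup pass could take the exclusive lock on the lockfile,
    /// i.e. no live owner is holding it.
    lock_releasable: bool,
    /// True if the entry is a symlink (assumed to be skipped by the symlink guard).
    is_symlink: bool,
}

/// Hypothetical decision helper: the lockfile is the primary gate,
/// the age check only protects a directory that has not been locked yet.
fn should_reap(state: &SpillDirState) -> bool {
    if state.is_symlink {
        return false; // never follow or remove symlinked entries
    }
    if state.age < Duration::from_secs(SPILL_STALE_GRACE_SECS) {
        return false; // may be a brand-new dir whose owner has not locked it yet
    }
    state.lock_releasable
}

fn main() {
    // An orphan left behind two minutes ago by a killed process.
    let orphan = SpillDirState {
        age: Duration::from_secs(120),
        lock_releasable: true,
        is_symlink: false,
    };
    println!("reap orphan: {}", should_reap(&orphan));

    // An in-flight upload that has been running for an hour and still holds its lock.
    let in_flight = SpillDirState {
        age: Duration::from_secs(3600),
        lock_releasable: false,
        is_symlink: false,
    };
    println!("reap in-flight: {}", should_reap(&in_flight));
}
```

Under the old 24h policy the first case would have been skipped purely on age; with the lockfile as the gate, age only matters for the first 30 seconds.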
mickvandijke
approved these changes
May 11, 2026
Summary
`ChunkSpill::cleanup_stale` previously skipped any `spill_*` directory younger than 24 hours, even if its lockfile was already releasable. This change reduces that age guard to 30 seconds (TOCTOU window only) and treats the lockfile as the primary correctness signal: if the lock is releasable, the owning process is gone and the dir is safe to reap.

Why
`Drop` for `ChunkSpill` removes the directory on graceful exit. It does not run on SIGKILL, kernel OOM-kill, panic abort, or any other non-unwinding termination, all of which release the lockfile but leave the dir behind.

The 24h age guard meant those orphans persisted for a day before any subsequent upload would reap them. On a host stuck in a systemd restart loop (e.g. an uploader VM that OOMs every cycle), hundreds of `spill_*` dirs accumulate per hour, each holding the full encrypted-chunk set of one upload (~= upload file size). On a recent testnet this filled an 80 GB root partition within ~5 hours and turned a transient memory issue into a permanent disk-exhaustion failure that took uploaders offline.

Observed on the affected hosts:
`spill_*` dirs retained

What changes
- `SPILL_MAX_AGE_SECS` (24h) → `SPILL_STALE_GRACE_SECS` (30s)

The lockfile + symlink guards remain unchanged, so the safety surface for in-flight uploads is the same as before: a live `ChunkSpill` holds an exclusive lock on `.lock` for its entire lifetime.

Test plan
- `cargo check -p ant-core` ✅ passes
- `e2e_huge_file` tests still pass
- Kill an `ant` process mid-upload with `SIGKILL`; start a new upload; verify the orphan is reaped at startup
- Run two concurrent `ant` uploads; verify neither reaps the other's in-flight dir (lockfile blocks)

🤖 Generated with Claude Code
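The leak described under Why can be reproduced in miniature. The sketch below uses a hypothetical `SpillGuard` type as a stand-in for `ChunkSpill` (it is not the real type and takes no lock), and uses `std::mem::forget` to simulate a termination path on which `Drop` never runs. Unlike a real SIGKILL, `forget` does not involve the kernel releasing a file lock; it only shows that the directory outlives its owner when the destructor is skipped.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `ChunkSpill`: owns a directory and removes it on drop.
struct SpillGuard {
    dir: PathBuf,
}

impl SpillGuard {
    fn new(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf() })
    }
}

impl Drop for SpillGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because drop cannot return them.
        let _ = fs::remove_dir_all(&self.dir);
    }
}

/// Returns (graceful dir still exists, orphaned dir still exists).
fn demo(base: &Path) -> io::Result<(bool, bool)> {
    let graceful = base.join("spill_graceful");
    let orphaned = base.join("spill_orphaned");

    {
        let _guard = SpillGuard::new(&graceful)?;
    } // `_guard` goes out of scope here, so `Drop` runs and removes the dir

    let guard = SpillGuard::new(&orphaned)?;
    std::mem::forget(guard); // destructor is skipped, as on SIGKILL or abort

    let result = (graceful.exists(), orphaned.exists());
    let _ = fs::remove_dir_all(&orphaned); // tidy up the simulated orphan
    Ok(result)
}

fn main() -> io::Result<()> {
    let base = std::env::temp_dir().join(format!("spill_demo_{}", std::process::id()));
    fs::create_dir_all(&base)?;
    let (graceful_left, orphan_left) = demo(&base)?;
    println!("graceful dir left behind: {graceful_left}");
    println!("orphaned dir left behind: {orphan_left}");
    fs::remove_dir_all(&base)?;
    Ok(())
}
```

Because nothing inside the dying process can be relied on to clean up, the reaping has to happen in the next process at startup, which is why the lockfile check in `cleanup_stale` carries the correctness burden.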