fix(spill): reap orphan dirs on lock release, not after 24h by jacderida · Pull Request #81 · WithAutonomi/ant-client

jacderida · 2026-05-11T15:06:02Z

Summary

ChunkSpill::cleanup_stale previously skipped any spill_* directory younger than 24 hours, even if its lockfile was already releasable. This change reduces that age guard to 30 seconds (TOCTOU window only) and treats the lockfile as the primary correctness signal: if the lock is releasable, the owning process is gone and the dir is safe to reap.

Why

Drop for ChunkSpill removes the directory on graceful exit. It does not run on SIGKILL, kernel OOM-kill, panic abort, or any other non-unwinding termination — all of which release the lockfile but leave the dir behind.
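
The difference between the two exit paths can be illustrated with a minimal sketch. `SpillDirGuard` and `demo_drop_vs_no_drop` are hypothetical names used only for illustration, not the real `ChunkSpill` code, and `std::mem::forget` stands in for a termination that never runs destructors (SIGKILL, OOM-kill, abort).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `ChunkSpill`: owns a directory and removes it on drop.
struct SpillDirGuard {
    path: PathBuf,
}

impl SpillDirGuard {
    fn new(path: &Path) -> io::Result<Self> {
        fs::create_dir_all(path)?;
        Ok(Self { path: path.to_path_buf() })
    }
}

impl Drop for SpillDirGuard {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because drop cannot return them.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Returns (exists_after_graceful_drop, exists_after_skipped_drop).
fn demo_drop_vs_no_drop(base: &Path) -> io::Result<(bool, bool)> {
    let graceful = base.join("spill_graceful");
    let orphan = base.join("spill_orphan");

    // Graceful path: the guard goes out of scope and Drop removes the dir.
    {
        let _guard = SpillDirGuard::new(&graceful)?;
    }
    let graceful_exists = graceful.exists();

    // Non-graceful path, simulated: mem::forget skips Drop, much as SIGKILL
    // or an abort would, so the directory is left behind.
    let guard = SpillDirGuard::new(&orphan)?;
    std::mem::forget(guard);
    let orphan_exists = orphan.exists();

    // Tidy up the simulated orphan so the demo leaves nothing behind.
    fs::remove_dir_all(&orphan)?;
    Ok((graceful_exists, orphan_exists))
}
```

Called with a scratch directory, the function should report that the gracefully dropped dir is gone while the "forgotten" one still exists, which is the orphan case this PR targets.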

The 24h age guard meant those orphans persisted for a day before any subsequent upload would reap them. On a host stuck in a systemd restart loop (e.g. an uploader VM that OOMs every cycle), hundreds of spill_* dirs accumulate per hour, each holding the full encrypted-chunk set of one upload (~= upload file size). On a recent testnet this filled an 80 GB root partition within ~5 hours and turned a transient memory issue into a permanent disk-exhaustion failure that took uploaders offline.

Observed on the affected hosts:

File size	`spill_*` dirs retained	Total
300 MB	245	72 GB
500 MB	147	72 GB
1000 MB	73	72 GB
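
As a rough sanity check on the table, the retained counts look consistent with about 72 GB of usable space divided by the upload size, rounded down. This reading is an assumption, not something stated by the measurements themselves; it also assumes 1 GB = 1024 MB, and `dirs_that_fit` is a hypothetical helper.

```rust
/// Number of whole spill dirs of `file_mb` megabytes that fit in `capacity_gb`.
/// Assumption: 1 GB = 1024 MB, and each dir is about the size of the uploaded file.
fn dirs_that_fit(capacity_gb: u64, file_mb: u64) -> u64 {
    // Integer division rounds down: a partially written dir does not count.
    capacity_gb * 1024 / file_mb
}
```

Under those assumptions, `dirs_that_fit(72, 300)`, `dirs_that_fit(72, 500)` and `dirs_that_fit(72, 1000)` give 245, 147 and 73, matching the observed rows.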

What changes

SPILL_MAX_AGE_SECS (24h) → SPILL_STALE_GRACE_SECS (30s)
The check at the cleanup site flips from "old enough to reap" to "past the TOCTOU window"
Doc comments updated to reflect that the lockfile is the primary safety mechanism

The lockfile + symlink guards remain unchanged — the safety surface is unchanged for in-flight uploads, since a live ChunkSpill holds an exclusive lock on .lock for its entire lifetime.
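
The resulting reap decision can be sketched as a pure function. This is a simplified model under stated assumptions, not the code in `cleanup_stale`: `SpillEntry` and `should_reap` are hypothetical, the symlink guard is assumed to mean "never reap a symlinked entry", and whether the real comparison is strict at exactly 30 seconds is not known from this description.

```rust
use std::time::Duration;

/// Old guard from the PR description: 24 hours.
const SPILL_MAX_AGE_SECS: u64 = 24 * 60 * 60;
/// New guard from the PR description: 30 seconds, covering only the TOCTOU window.
const SPILL_STALE_GRACE_SECS: u64 = 30;

/// Hypothetical snapshot of what cleanup knows about one `spill_*` entry.
struct SpillEntry {
    /// Assumed meaning of the symlink guard: symlinked entries are skipped.
    is_symlink: bool,
    /// Time since the directory was created.
    age: Duration,
    /// True if an exclusive lock on `.lock` could be taken, i.e. no live owner.
    lock_acquirable: bool,
}

/// Decide whether an entry may be removed, given a grace period in seconds.
fn should_reap(entry: &SpillEntry, grace_secs: u64) -> bool {
    if entry.is_symlink {
        return false;
    }
    // Entries younger than the grace period may still be between
    // `create_dir` and taking their lock, so leave them alone.
    if entry.age < Duration::from_secs(grace_secs) {
        return false;
    }
    // Primary signal: a lock that can be acquired means the owner is gone.
    entry.lock_acquirable
}
```

For a five-minute-old orphan whose lock is free, this model returns `false` with `SPILL_MAX_AGE_SECS` and `true` with `SPILL_STALE_GRACE_SECS`, while an entry whose lock is still held is never reaped under either value.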

Test plan

cargo check -p ant-core ✅ passes
Existing e2e_huge_file tests still pass
Manual: kill an ant process mid-upload with SIGKILL; start a new upload; verify the orphan is reaped at startup
Manual: start two concurrent ant uploads; verify neither reaps the other's in-flight dir (lockfile blocks)

🤖 Generated with Claude Code

Previously `cleanup_stale` skipped any spill dir younger than 24h, even if its lockfile was already releasable. The lockfile is the actual correctness gate: a releasable lock means the owning `ChunkSpill` is dropped or the owning process is gone. The age guard only ever needed to cover the sub-millisecond TOCTOU window between `create_dir` and `try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run on those paths, so the dir is left in `~/.local/share/ant/spill/` with its lock released. The next upload would not reap it. Under a systemd restart loop, hundreds of `spill_*` dirs accumulate per hour, each holding the encrypted chunks of one upload (~= upload file size), and fill the disk well before the 24h grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the lockfile. No other behaviour changes; the lockfile + symlink guard already covered the safety surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mickvandijke approved these changes May 11, 2026

jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:fix/spill-cleanup-on-orphan-lock-release
