
fix(spill): reap orphan dirs on lock release, not after 24h (#81)

Open
jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:fix/spill-cleanup-on-orphan-lock-release


Summary

ChunkSpill::cleanup_stale previously skipped any spill_* directory younger than 24 hours, even if its lockfile was already releasable (that is, no live process still held the lock). This change reduces that age guard to 30 seconds, enough to cover the TOCTOU window only, and treats the lockfile as the primary correctness signal: if the lock is releasable, the owning ChunkSpill has been dropped or the owning process is gone, and the dir is safe to reap.

Why

Drop for ChunkSpill removes the directory on graceful exit. It does not run on SIGKILL, kernel OOM-kill, panic abort, or any other non-unwinding termination — all of which release the lockfile but leave the dir behind.
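The difference between the two exit paths can be illustrated with a minimal sketch. `SpillDir` and `demo_drop_vs_no_drop` are hypothetical names, not the `ChunkSpill` implementation, and `std::mem::forget` is used here only as a stand-in for SIGKILL, OOM-kill, or abort, since all of them skip the destructor.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `ChunkSpill`: owns a directory and removes it on drop.
struct SpillDir {
    path: PathBuf,
}

impl SpillDir {
    fn new(path: &Path) -> io::Result<Self> {
        fs::create_dir_all(path)?;
        Ok(Self { path: path.to_path_buf() })
    }
}

impl Drop for SpillDir {
    fn drop(&mut self) {
        // Best-effort cleanup; runs only on normal scope exit or unwinding.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Returns (dir exists after graceful drop, dir exists after skipped drop).
fn demo_drop_vs_no_drop(base: &Path) -> io::Result<(bool, bool)> {
    let graceful = base.join("spill_graceful");
    let orphan = base.join("spill_orphan");

    {
        let _guard = SpillDir::new(&graceful)?;
        // `_guard` leaves scope here, so `Drop` removes the directory.
    }

    let guard = SpillDir::new(&orphan)?;
    // Skip the destructor, as a non-unwinding termination would.
    std::mem::forget(guard);

    let result = (graceful.exists(), orphan.exists());
    // Tidy up the deliberately leaked directory.
    fs::remove_dir_all(&orphan)?;
    Ok(result)
}
```

Called on an empty scratch directory, this returns `(false, true)`: the gracefully dropped dir is gone, while the one whose destructor never ran is left behind for a later cleanup pass to find.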

The 24h age guard meant those orphans persisted for a day before any subsequent upload would reap them. On a host stuck in a systemd restart loop (e.g. an uploader VM that OOMs every cycle), hundreds of spill_* dirs accumulate per hour, each holding the full encrypted-chunk set of one upload (roughly the size of the uploaded file). On a recent testnet this filled an 80 GB root partition within ~5 hours and turned a transient memory issue into a permanent disk-exhaustion failure that took uploaders offline.

Observed on the affected hosts:

| File size | spill_* dirs retained | Total |
|-----------|-----------------------|-------|
| 300 MB    | 245                   | 72 GB |
| 500 MB    | 147                   | 72 GB |
| 1000 MB   | 73                    | 72 GB |
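As a rough cross-check of these figures, the retained total is just file size times directory count. The helper below is a hypothetical illustration that assumes decimal units (1 GB = 1000 MB); the source does not say which convention the reported totals use.

```rust
/// Total retained size in GB for `dirs` spill directories of `file_mb` MB each.
/// Assumes decimal units (1 GB = 1000 MB), which the source does not specify.
fn retained_gb(file_mb: u64, dirs: u64) -> f64 {
    (file_mb * dirs) as f64 / 1000.0
}
```

The three rows come out at 73.5, 73.5, and 73.0 GB, which is in line with the roughly 72 GB reported on each host; the small gap is presumably down to rounding or unit convention.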

What changes

  • SPILL_MAX_AGE_SECS (24h) → SPILL_STALE_GRACE_SECS (30s)
  • The check at the cleanup site flips from "old enough to reap" to "past the TOCTOU window"
  • Doc comments updated to reflect that the lockfile is the primary safety mechanism

The lockfile and symlink guards are untouched, so in-flight uploads are exactly as safe as before: a live ChunkSpill holds an exclusive lock on .lock for its entire lifetime.
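The resulting policy can be sketched as a pure decision function. Only the constant name and its 30-second value come from this PR; the `Decision` enum, the `decide` function, the order of the two checks, and the strict `<` comparison are assumptions made for illustration.

```rust
use std::time::Duration;

/// Grace period covering the window between creating the dir and locking it
/// (30 s per this PR).
const SPILL_STALE_GRACE_SECS: u64 = 30;

/// Hypothetical outcome type; not part of the actual code.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Reap,
    SkipTooYoung,
    SkipLocked,
}

/// `age` is how old the spill dir is; `lock_acquired` is whether the cleanup
/// pass managed to take the exclusive lock on the dir's `.lock` file.
fn decide(age: Duration, lock_acquired: bool) -> Decision {
    // Inside the grace window the owner may not have locked the dir yet.
    if age < Duration::from_secs(SPILL_STALE_GRACE_SECS) {
        return Decision::SkipTooYoung;
    }
    // Past the window, the lock is the primary gate.
    if lock_acquired {
        Decision::Reap
    } else {
        Decision::SkipLocked
    }
}
```

Under the old 24h guard, an hour-old orphan with a free lock would have been skipped; in this sketch `decide(Duration::from_secs(3600), true)` yields `Decision::Reap`, while a dir whose lock is still held is skipped at any age.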

Test plan

  • cargo check -p ant-core ✅ passes
  • Existing e2e_huge_file tests still pass
  • Manual: kill an ant process mid-upload with SIGKILL; start a new upload; verify the orphan is reaped at startup
  • Manual: start two concurrent ant uploads; verify neither reaps the other's in-flight dir (lockfile blocks)
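The two manual checks could also be approximated in an automated test against a scratch directory. The sketch below is not the real `cleanup_stale`: `sweep_orphans` is a hypothetical function, the lock attempt is abstracted behind a caller-supplied `lock_is_free` probe, and using the modification time as the dir's age is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{Duration, SystemTime};

/// Sweep `root` for orphaned `spill_*` directories and return those removed.
/// `lock_is_free` is a hypothetical probe standing in for the real attempt
/// to take an exclusive lock on each dir's `.lock` file.
fn sweep_orphans(
    root: &Path,
    grace: Duration,
    lock_is_free: impl Fn(&Path) -> bool,
) -> io::Result<Vec<PathBuf>> {
    let mut reaped = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        let is_spill = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.starts_with("spill_"));
        if !is_spill {
            continue;
        }
        // Symlink guard: `symlink_metadata` does not follow links, so a
        // link named `spill_*` is never treated as a directory to delete.
        let meta = fs::symlink_metadata(&path)?;
        if !meta.file_type().is_dir() {
            continue;
        }
        // Grace window: skip dirs that may not have taken their lock yet.
        let age = SystemTime::now()
            .duration_since(meta.modified()?)
            .unwrap_or(Duration::ZERO);
        if age < grace {
            continue;
        }
        // Primary gate: reap only when no live owner holds the lock.
        if lock_is_free(&path.join(".lock")) {
            fs::remove_dir_all(&path)?;
            reaped.push(path);
        }
    }
    Ok(reaped)
}
```

A test can create one dir whose probe reports a free lock and one whose probe reports a held lock, then assert that only the first is removed. That mirrors the SIGKILL and concurrent-upload cases without needing real processes, though it does not replace the manual checks against actual file locks.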

🤖 Generated with Claude Code

Previously `cleanup_stale` skipped any spill dir younger than 24h, even
if its lockfile was already releasable. The lockfile is the actual
correctness gate: a releasable lock means the owning `ChunkSpill` is
dropped or the owning process is gone. The age guard only ever needed
to cover the sub-millisecond TOCTOU window between `create_dir` and
`try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits
non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run
on those paths, so the dir is left in `~/.local/share/ant/spill/` with
its lock released. The next upload would not reap it until it was 24h
old. Under a systemd restart loop, hundreds of `spill_*` dirs
accumulate per hour, each holding the encrypted chunks of one upload
(roughly the upload file size), and fill the disk well before the 24h
grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the
lockfile. No other behaviour changes; the lockfile + symlink guard
already covered the safety surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>