fix: guard against data loss in repair, migrate, and CLI rebuild#935
fix: guard against data loss in repair, migrate, and CLI rebuild#935shaun0927 wants to merge 2 commits intoMemPalace:developfrom
Conversation
- repair.py: wrap upsert loop in try/except; restore from backup on failure instead of leaving a partially rebuilt collection - migrate.py: replace non-atomic rmtree+move with rename-aside swap so a crash between the two calls does not destroy both copies - cli.py: use offset += len(batch["ids"]) with empty-batch guard instead of fixed offset += batch_size to prevent skipping drawers
- repair.py: define backup_path before the conditional block so it is always in scope when the except handler references it - migrate.py: restore old palace from .old if both os.rename and shutil.move fail during the swap step
mvalentsev
left a comment
There was a problem hiding this comment.
Finding 1 (repair crash safety) overlaps with open PR #239 (rusel95), which rewrites repair to a migrate-to-fresh-palace strategy: reads all data via get(), writes to a temp palace, verifies counts match, then atomically swaps directories. Original palace preserved as .backup with auto-restore on failure.
Findings 2 and 3 look clean to me.
|
Correction to my earlier review: Finding 2 (migrate swap): PR #890 touches the same swap code path in migrate.py, adding _replace_palace_dir() with retry logic. Different focus (Windows handle cleanup) but related. Finding 3 (cli.py offset): PRs #239 and #632 both rewrite the repair code path entirely (migrate-to-fresh strategy and nuke-rebuild respectively). If either lands, the extraction loop with the offset bug no longer exists. |
|
Hi, In migrate(), if the cross-device fallback (shutil.move) partially creates the destination directory and then fails, the rollback os.rename(stale_path, palace_path) will also fail because palace_path already exists, leaving a partial migrated palace at the target and the original only at .old. Severity: action required | Category: reliability How to fix: Harden rollback and cleanup Agent prompt to fix - you can give this to your LLM of choice:
Found by Qodo code review |
Summary
Three data-integrity fixes that honor CLAUDE.md's design principle: "A crash mid-operation must leave the existing palace untouched."
Refs: #934 (Findings 1, 2, 3)
What changed
mempalace/repair.py— restore from backup on rebuild failurerebuild_indexcallsdelete_collectionthen upserts drawers in a loop. If the upsert crashes partway (OOM, ChromaDB error), the remaining drawers are lost. The backup copy exists but the code never restores from it.mempalace/migrate.py— atomic palace swapshutil.rmtree(palace_path)followed byshutil.move(temp_palace, palace_path)has a window where both copies are gone. Replaced with a rename-aside pattern:mempalace/cli.py— defensive batch extraction offsetcmd_repairextraction loop usedoffset += batch_size(fixed 5000). Changed tooffset += len(batch["ids"])with an empty-batch guard, matching the pattern used inexporter.py:70.Test plan
ruff check mempalace/repair.py mempalace/migrate.py mempalace/cli.py— passedruff format --check mempalace/repair.py mempalace/migrate.py mempalace/cli.py— passed