Skip to content

fix: guard against data loss in repair, migrate, and CLI rebuild#935

Open
shaun0927 wants to merge 2 commits intoMemPalace:developfrom
shaun0927:fix/repair-crash-safety
Open

fix: guard against data loss in repair, migrate, and CLI rebuild#935
shaun0927 wants to merge 2 commits intoMemPalace:developfrom
shaun0927:fix/repair-crash-safety

Conversation

@shaun0927
Copy link
Copy Markdown

Summary

Three data-integrity fixes that honor CLAUDE.md's design principle: "A crash mid-operation must leave the existing palace untouched."

Refs: #934 (Findings 1, 2, 3)

What changed

mempalace/repair.py — restore from backup on rebuild failure

rebuild_index calls delete_collection then upserts drawers in a loop. If the upsert crashes partway (OOM, ChromaDB error), the remaining drawers are lost. The backup copy exists but the code never restores from it.

# Before — no recovery on failure
backend.delete_collection(palace_path, COLLECTION_NAME)
new_col = backend.create_collection(palace_path, COLLECTION_NAME)
for i in range(0, len(all_ids), batch_size):
    new_col.upsert(...)  # crash here = partial data loss

# After — restore backup on failure
try:
    for i in range(0, len(all_ids), batch_size):
        new_col.upsert(...)
except Exception as e:
    shutil.copy2(backup_path, sqlite_path)  # restore pre-repair state
    raise

mempalace/migrate.py — atomic palace swap

shutil.rmtree(palace_path) followed by shutil.move(temp_palace, palace_path) has a window where both copies are gone. Replaced with a rename-aside pattern:

# Before — non-atomic, crash between = total loss
shutil.rmtree(palace_path)
shutil.move(temp_palace, palace_path)

# After — rename old aside first, then swap
os.rename(palace_path, palace_path + ".old")
try:
    os.rename(temp_palace, palace_path)
except OSError:
    shutil.move(temp_palace, palace_path)  # cross-filesystem fallback
shutil.rmtree(palace_path + ".old", ignore_errors=True)

mempalace/cli.py — defensive batch extraction offset

cmd_repair extraction loop used offset += batch_size (fixed 5000). Changed to offset += len(batch["ids"]) with an empty-batch guard, matching the pattern used in exporter.py:70.

# Before
offset += batch_size

# After
if not batch["ids"]:
    break
offset += len(batch["ids"])

Test plan

  • ruff check mempalace/repair.py mempalace/migrate.py mempalace/cli.py — passed
  • ruff format --check mempalace/repair.py mempalace/migrate.py mempalace/cli.py — passed

- repair.py: wrap upsert loop in try/except; restore from backup on
  failure instead of leaving a partially rebuilt collection
- migrate.py: replace non-atomic rmtree+move with rename-aside swap
  so a crash between the two calls does not destroy both copies
- cli.py: use offset += len(batch["ids"]) with empty-batch guard
  instead of fixed offset += batch_size to prevent skipping drawers
@igorls igorls added bug Something isn't working storage labels Apr 16, 2026
- repair.py: define backup_path before the conditional block so it is
  always in scope when the except handler references it
- migrate.py: restore old palace from .old if both os.rename and
  shutil.move fail during the swap step
Copy link
Copy Markdown
Contributor

@mvalentsev mvalentsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finding 1 (repair crash safety) overlaps with open PR #239 (rusel95), which rewrites repair to a migrate-to-fresh-palace strategy: reads all data via get(), writes to a temp palace, verifies counts match, then atomically swaps directories. Original palace preserved as .backup with auto-restore on failure.

Findings 2 and 3 look clean to me.

@mvalentsev
Copy link
Copy Markdown
Contributor

Correction to my earlier review:

Finding 2 (migrate swap): PR #890 touches the same swap code path in migrate.py, adding _replace_palace_dir() with retry logic. Different focus (Windows handle cleanup) but related.

Finding 3 (cli.py offset): PRs #239 and #632 both rewrite the repair code path entirely (migrate-to-fresh strategy and nuke-rebuild respectively). If either lands, the extraction loop with the offset bug no longer exists.

@qodo-ai-reviewer
Copy link
Copy Markdown

Hi, In migrate(), if the cross-device fallback (shutil.move) partially creates the destination directory and then fails, the rollback os.rename(stale_path, palace_path) will also fail because palace_path already exists, leaving a partial migrated palace at the target and the original only at .old.

Severity: action required | Category: reliability

How to fix: Harden rollback and cleanup

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.migrate.migrate() can leave the palace in an inconsistent state if the cross-device swap fallback (shutil.move) fails after partially creating palace_path. The rollback then tries os.rename(stale_path, palace_path) which will fail if palace_path already exists.

Issue Context

The new logic correctly avoids the “both missing” window, but needs to handle the case where shutil.move(temp_palace, palace_path) partially copies and creates palace_path before raising.

Fix Focus Areas

  • mempalace/migrate.py[235-246]

What to change

  • In the except Exception: rollback path:
    • If palace_path exists (directory or file), remove it first (e.g., shutil.rmtree(palace_path, ignore_errors=False) / os.remove for file), then restore stale_path into place.
    • Consider wrapping rollback in its own try/except to emit a clear recovery message that names both paths (stale_path and partially-copied palace_path).
  • Optionally, use os.replace for same-filesystem renames, and distinguish cross-device errors (EXDEV) from other OSErrors to avoid masking real rename failures.

Found by Qodo code review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants