fix: guard against data loss in repair, migrate, and CLI rebuild by shaun0927 · Pull Request #935 · MemPalace/mempalace

shaun0927 · 2026-04-16T03:11:43Z

Summary

Three data-integrity fixes that honor CLAUDE.md's design principle: "A crash mid-operation must leave the existing palace untouched."

Refs: #934 (Findings 1, 2, 3)

What changed

`mempalace/repair.py` — restore from backup on rebuild failure

rebuild_index calls delete_collection then upserts drawers in a loop. If the upsert crashes partway (OOM, ChromaDB error), the remaining drawers are lost. The backup copy exists but the code never restores from it.

# Before — no recovery on failure
backend.delete_collection(palace_path, COLLECTION_NAME)
new_col = backend.create_collection(palace_path, COLLECTION_NAME)
for i in range(0, len(all_ids), batch_size):
    new_col.upsert(...)  # crash here = partial data loss

# After — restore backup on failure
try:
    for i in range(0, len(all_ids), batch_size):
        new_col.upsert(...)
except Exception as e:
    shutil.copy2(backup_path, sqlite_path)  # restore pre-repair state
    raise

`mempalace/migrate.py` — atomic palace swap

shutil.rmtree(palace_path) followed by shutil.move(temp_palace, palace_path) has a window where both copies are gone. Replaced with a rename-aside pattern:

# Before — non-atomic, crash between = total loss
shutil.rmtree(palace_path)
shutil.move(temp_palace, palace_path)

# After — rename old aside first, then swap
os.rename(palace_path, palace_path + ".old")
try:
    os.rename(temp_palace, palace_path)
except OSError:
    shutil.move(temp_palace, palace_path)  # cross-filesystem fallback
shutil.rmtree(palace_path + ".old", ignore_errors=True)

`mempalace/cli.py` — defensive batch extraction offset

cmd_repair extraction loop used offset += batch_size (fixed 5000). Changed to offset += len(batch["ids"]) with an empty-batch guard, matching the pattern used in exporter.py:70.

# Before
offset += batch_size

# After
if not batch["ids"]:
    break
offset += len(batch["ids"])

Test plan

ruff check mempalace/repair.py mempalace/migrate.py mempalace/cli.py — passed
ruff format --check mempalace/repair.py mempalace/migrate.py mempalace/cli.py — passed

- repair.py: wrap upsert loop in try/except; restore from backup on failure instead of leaving a partially rebuilt collection - migrate.py: replace non-atomic rmtree+move with rename-aside swap so a crash between the two calls does not destroy both copies - cli.py: use offset += len(batch["ids"]) with empty-batch guard instead of fixed offset += batch_size to prevent skipping drawers

- repair.py: define backup_path before the conditional block so it is always in scope when the except handler references it - migrate.py: restore old palace from .old if both os.rename and shutil.move fail during the swap step

mvalentsev

Finding 1 (repair crash safety) overlaps with open PR #239 (rusel95), which rewrites repair to a migrate-to-fresh-palace strategy: reads all data via get(), writes to a temp palace, verifies counts match, then atomically swaps directories. Original palace preserved as .backup with auto-restore on failure.

Findings 2 and 3 look clean to me.

mvalentsev · 2026-04-16T06:06:05Z

Correction to my earlier review:

Finding 2 (migrate swap): PR #890 touches the same swap code path in migrate.py, adding _replace_palace_dir() with retry logic. Different focus (Windows handle cleanup) but related.

Finding 3 (cli.py offset): PRs #239 and #632 both rewrite the repair code path entirely (migrate-to-fresh strategy and nuke-rebuild respectively). If either lands, the extraction loop with the offset bug no longer exists.

qodo-ai-reviewer · 2026-04-16T13:32:06Z

Hi, In migrate(), if the cross-device fallback (shutil.move) partially creates the destination directory and then fails, the rollback os.rename(stale_path, palace_path) will also fail because palace_path already exists, leaving a partial migrated palace at the target and the original only at .old.

Severity: action required | Category: reliability

How to fix: Harden rollback and cleanup

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

mempalace.migrate.migrate() can leave the palace in an inconsistent state if the cross-device swap fallback (shutil.move) fails after partially creating palace_path. The rollback then tries os.rename(stale_path, palace_path) which will fail if palace_path already exists.

Issue Context

The new logic correctly avoids the “both missing” window, but needs to handle the case where shutil.move(temp_palace, palace_path) partially copies and creates palace_path before raising.

Fix Focus Areas

mempalace/migrate.py[235-246]

What to change

In the except Exception: rollback path:

If palace_path exists (directory or file), remove it first (e.g., shutil.rmtree(palace_path, ignore_errors=False) / os.remove for file), then restore stale_path into place.

Consider wrapping rollback in its own try/except to emit a clear recovery message that names both paths (stale_path and partially-copied palace_path).

Optionally, use os.replace for same-filesystem renames, and distinguish cross-device errors (EXDEV) from other OSErrors to avoid masking real rename failures.

Found by Qodo code review

shaun0927 requested review from bensig and milla-jovovich as code owners April 16, 2026 03:11

igorls added bug Something isn't working storage labels Apr 16, 2026

fix: harden repair backup scope and migrate swap rollback

fb1cf53

- repair.py: define backup_path before the conditional block so it is always in scope when the except handler references it - migrate.py: restore old palace from .old if both os.rename and shutil.move fail during the swap step

mvalentsev reviewed Apr 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: guard against data loss in repair, migrate, and CLI rebuild#935

fix: guard against data loss in repair, migrate, and CLI rebuild#935
shaun0927 wants to merge 2 commits intoMemPalace:developfrom
shaun0927:fix/repair-crash-safety

shaun0927 commented Apr 16, 2026

Uh oh!

mvalentsev left a comment

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

qodo-ai-reviewer commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

What to change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

shaun0927 commented Apr 16, 2026

Summary

What changed

mempalace/repair.py — restore from backup on rebuild failure

mempalace/migrate.py — atomic palace swap

mempalace/cli.py — defensive batch extraction offset

Test plan

Uh oh!

mvalentsev left a comment

Choose a reason for hiding this comment

Uh oh!

mvalentsev commented Apr 16, 2026

Uh oh!

qodo-ai-reviewer commented Apr 16, 2026

Issue description

Issue Context

Fix Focus Areas

What to change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

`mempalace/repair.py` — restore from backup on rebuild failure

`mempalace/migrate.py` — atomic palace swap

`mempalace/cli.py` — defensive batch extraction offset