2 changes: 1 addition & 1 deletion docs/_index.md
@@ -54,7 +54,7 @@ As of May 2026, SQLRite has:
- Full-text search + hybrid retrieval (Phase 8 complete): FTS5-style inverted index with BM25 ranking + `fts_match` / `bm25_score` scalar functions + `try_fts_probe` optimizer hook + on-disk persistence with on-demand v4 → v5 file-format bump (8a-8c), a worked hybrid-retrieval example combining BM25 with vector cosine via raw arithmetic (8d), and a `bm25_search` MCP tool symmetric with `vector_search` (8e). See [`docs/fts.md`](fts.md).
- SQL surface + DX follow-ups (Phase 9 complete, v0.2.0 → v0.9.1): DDL completeness — `DEFAULT`, `DROP TABLE` / `DROP INDEX`, `ALTER TABLE` (9a); free-list + manual `VACUUM` (9b) + auto-VACUUM (9c); `IS NULL` / `IS NOT NULL` (9d); `GROUP BY` + aggregates + `DISTINCT` + `LIKE` + `IN` (9e); four flavors of `JOIN` — INNER, LEFT, RIGHT, FULL OUTER (9f); prepared statements + `?` parameter binding with a per-connection LRU plan cache (9g); HNSW probe widened to cosine + dot via `WITH (metric = …)` (9h); `PRAGMA` dispatcher with the `auto_vacuum` knob (9i)
- Benchmarks against SQLite + DuckDB (Phase 10 complete, SQLR-4 / SQLR-16): twelve-workload bench harness with a pluggable `Driver` trait, criterion-driven, pinned-host runs published. See [`docs/benchmarks.md`](benchmarks.md).
- Phase 11 (concurrent writes via MVCC + `BEGIN CONCURRENT`, SQLR-22) is in flight. **11.1 → 11.7: shipped.** Engine + SDK error propagation: `Connection` is `Send + Sync`; `Connection::connect()` mints sibling handles. `sqlrite::mvcc` exposes `MvccClock`, `ActiveTxRegistry`, `MvStore`, `ConcurrentTx`. WAL header v1 → v2 persists the clock high-water mark. `PRAGMA journal_mode = mvcc;` opts a database into MVCC. `BEGIN CONCURRENT` writes commit-validate against `MvStore` and abort with `SQLRiteError::Busy`. Reads via `Statement::query` see the BEGIN-time snapshot. Per-commit GC + `vacuum_mvcc()` bound the version chain growth. C FFI / Python / Node / Go all propagate `Busy` / `BusySnapshot` as typed retryable errors. **11.8 multi-handle SDK shape: shipped on this branch.** The FFI's `sqlrite_connect_sibling`, Python's `Connection.connect()`, and Node's `db.connect()` mint sibling handles that share backing state — closes the end-to-end gap from 11.7 where `BusyError` was reachable but not exerciseable through any SDK. Plan: [`docs/concurrent-writes-plan.md`](concurrent-writes-plan.md).
- Phase 11 (concurrent writes via MVCC + `BEGIN CONCURRENT`, SQLR-22) is in flight. **11.1 → 11.9: shipped.** Engine + SDK error propagation: `Connection` is `Send + Sync`; `Connection::connect()` mints sibling handles. `sqlrite::mvcc` exposes `MvccClock`, `ActiveTxRegistry`, `MvStore`, `ConcurrentTx`, and the `MvccCommitBatch` / `MvccLogRecord` WAL codec. WAL header v1 → v2 persisted the clock high-water mark; **v2 → v3 (11.9)** adds typed MVCC log-record frames. `PRAGMA journal_mode = mvcc;` opts a database into MVCC. `BEGIN CONCURRENT` writes commit-validate against `MvStore`, abort with `SQLRiteError::Busy`, and now also append an MVCC log-record frame to the WAL — covered by the same fsync as the legacy page commit. Reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest committed `commit_ts`, so the MVCC conflict-detection window survives a process restart. Reads via `Statement::query` see the BEGIN-time snapshot. Per-commit GC + `vacuum_mvcc()` bound version-chain growth. C FFI / Python / Node / Go all propagate `Busy` / `BusySnapshot` as typed retryable errors; the FFI's `sqlrite_connect_sibling`, Python's `Connection.connect()`, and Node's `db.connect()` mint sibling handles that share backing state. Plan: [`docs/concurrent-writes-plan.md`](concurrent-writes-plan.md).
- A fully-automated release pipeline that ships every product to its registry on every release with one human action — Rust engine + `sqlrite-ask` + `sqlrite-mcp` to crates.io, Python wheels to PyPI (`sqlrite`), Node.js + WASM to npm (`@joaoh82/sqlrite` + `@joaoh82/sqlrite-wasm`), Go module via `sdk/go/v*` git tag, plus C FFI tarballs, MCP binary tarballs, and unsigned desktop installers as GitHub Release assets (Phase 6 complete)

See the [Roadmap](roadmap.md) for the full phase plan.
8 changes: 5 additions & 3 deletions docs/concurrent-writes-plan.md
@@ -270,9 +270,11 @@ Goal: more than one `Connection` can target the same `Database` within a process

### Phase 10.5 — Checkpoint + crash recovery

- Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today.
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not.
> **Status (roadmap 11.9 — May 2026):** The crash-recovery half landed in roadmap Phase 11.9. WAL format is bumped to v3; commits append a typed `MvccCommitBatch` frame before the legacy save's fsync; reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest `commit_ts`. The checkpoint-drain half — folding MVCC log records into pager-level updates and re-enabling the `Mvcc → Wal` journal-mode downgrade — is the remaining slice and stays parked for a follow-up.

- ~~Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.~~ *Deferred — see status note above.*
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today. **(Shipped — 11.9.)**
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not. **(Partially shipped — 11.9 covers the clean-drop case, which exercises the same recovery codepath; a real OS-kill test is parked with the checkpoint-drain follow-up. See the sketch below.)**
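
A sketch of the clean-drop test shape that shipped. Helper names (`Connection::open`, `execute`, the `query` return shape) are assumptions; the real cases live in `src/connection.rs`:

```rust
use sqlrite::Connection;

#[test]
fn committed_mvcc_tx_survives_reopen() {
    let path = "recovery-test.sqlrite";
    {
        let conn = Connection::open(path).unwrap();
        conn.execute("PRAGMA journal_mode = mvcc;").unwrap();
        conn.execute("CREATE TABLE t (id INTEGER, v TEXT);").unwrap();
        conn.execute("BEGIN CONCURRENT;").unwrap();
        conn.execute("INSERT INTO t VALUES (1, 'committed');").unwrap();
        conn.execute("COMMIT;").unwrap();
    } // dropped without a checkpoint: the WAL is the only durable record

    let conn = Connection::open(path).unwrap(); // replay seeds MvStore + clock
    let rows = conn.query("SELECT v FROM t WHERE id = 1;").unwrap();
    assert_eq!(rows.len(), 1); // the committed transaction is visible
}
```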

### Phase 10.6 — Garbage collection

57 changes: 57 additions & 0 deletions docs/design-decisions.md
@@ -252,6 +252,63 @@ Each statement inside the transaction runs against the working `tables` clone via

---

### 12g. MVCC commits piggyback on the legacy save's fsync (Phase 11.9)

**Decision.** `BEGIN CONCURRENT` commits now leave a typed
[`MvccCommitBatch`](../src/mvcc/log.rs) frame in the WAL *before*
the legacy `save_database` runs. The MVCC frame uses
`page_num = MVCC_FRAME_MARKER (u32::MAX)` and `commit_page_count =
None`, so it is **not** fsync'd on its own. The legacy save then
appends its page commits and ends with the existing page-0 commit
frame (which *is* fsync'd). That single fsync flushes everything
buffered behind it, covering the MVCC frame and the page commits
in one durability boundary.
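
A sketch of that ordering, using a hypothetical `append_frame`
signature in place of the real WAL API; the actual sequencing lives
in `commit_concurrent` and `save_database`:

```rust
// Hypothetical WAL shape — illustrates the ordering only, not the real API.
struct Wal { /* file handle, write buffer, … */ }

impl Wal {
    /// `commit_page_count = None` buffers the frame without fsync;
    /// `Some(n)` marks a commit barrier and fsyncs the file.
    fn append_frame(&mut self, page_num: u32, body: &[u8], commit_page_count: Option<u32>) {
        let _ = (page_num, body, commit_page_count); // sketch: no real I/O
    }
}

const MVCC_FRAME_MARKER: u32 = u32::MAX;

fn commit(wal: &mut Wal, mvcc_body: &[u8], pages: &[(u32, Vec<u8>)], page0: &[u8]) {
    // 1. MVCC log record first: buffered behind the commit barrier, no fsync.
    wal.append_frame(MVCC_FRAME_MARKER, mvcc_body, None);
    // 2. Legacy page frames from save_database: also buffered.
    for (num, bytes) in pages {
        wal.append_frame(*num, bytes, None);
    }
    // 3. Page-0 commit frame: the single fsync, sealing everything above it.
    wal.append_frame(0, page0, Some(pages.len() as u32 + 1));
}
```

The only invariant that matters is that the MVCC frame sits behind
the commit barrier that actually fsyncs.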

**Why piggyback rather than fsync per MVCC frame.** Each fsync is
the dominant cost of a small commit (often >90% of wall time on
SSDs). Dual fsyncs would double the cost of a `BEGIN CONCURRENT`
commit relative to a legacy commit for no correctness gain — the
two writes already need to be atomic with each other (a crash that
keeps one but loses the other would either resurrect uncommitted
state in `MvStore` or hide a durable legacy update from the
in-memory MVCC index). Sharing the boundary makes the atomicity
free: torn-write recovery already drops dirty frames past the last
commit barrier, and that recovery treats the MVCC frame as just
another dirty frame waiting for its commit barrier.

**Why the marker is `u32::MAX`.** Page numbers are bounded by the
file's `page_count`, which sits well below `u32::MAX` for any
realistic database (the maximum page-addressable file size at
4 KiB pages is 16 TiB). Choosing the sentinel from outside the
legal range keeps the discriminator a single integer comparison
on the existing frame-header layout — no new flag field, no
binary-incompatible header.

**Why the clock is seeded from `max(header.clock_high_water,
max(commit_ts in WAL))`.** The WAL header's `clock_high_water`
field is only persisted on checkpoint (which fsyncs the
truncated WAL). Between checkpoints, the in-memory header is
ahead of the on-disk header — and an unclean process exit drops
that in-memory lead. The MVCC frames themselves are durable, and
each carries its `commit_ts`, so the replay walks the recovered
batches and takes the higher of the two seeds. Without this
maxing step a crash between commits and checkpoint could let a
post-reopen transaction hand out a `begin_ts` *below* an
already-committed version's `end` — an immediate snapshot-isolation
violation.
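
The seeding rule itself is one `max` over the two durable sources.
A minimal sketch, with illustrative names; the real logic lives in
`replay_mvcc_into_db`:

```rust
// Sketch of the clock-seeding rule; whether the engine adds exactly 1
// or advances the clock some other way is an implementation detail.
fn seed_clock(header_clock_high_water: u64, replayed_commit_ts: &[u64]) -> u64 {
    let replayed_max = replayed_commit_ts.iter().copied().max().unwrap_or(0);
    // The next begin_ts must land strictly above every committed version.
    header_clock_high_water.max(replayed_max) + 1
}

fn main() {
    assert_eq!(seed_clock(7, &[9, 12]), 13);  // WAL frames ahead of the header
    assert_eq!(seed_clock(20, &[9, 12]), 21); // header ahead (fresh checkpoint)
}
```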

**What 11.9 deferred.** The checkpoint half of plan-doc Phase
10.5: draining `MvStore` versions back into the pager so a
WAL truncate doesn't lose them, and re-enabling the `Mvcc → Wal`
journal-mode downgrade. The legacy save mirror still covers
durability of the visible row state on the read path, so this
gap is foundation work — not a correctness regression — and
the existing per-commit GC bounds in-memory chain growth.

**Plan-doc reference.** [`concurrent-writes-plan.md`](concurrent-writes-plan.md) §3.3 (durability model), §4.6 (WAL log records), §10.5 (checkpoint integration — partially shipped, see note in plan doc).

---

## Query execution

### 13. `NULL`-as-false in `WHERE` clauses
80 changes: 77 additions & 3 deletions docs/file-format.md
@@ -329,6 +329,7 @@ A second file alongside the `.sqlrite`, named `<stem>.sqlrite-wal`, records page
│ 8 │ 4 │ format version (u32 LE) │
│ │ │ 1 = pre-Phase-11 │
│ │ │ 2 = Phase 11.2 — adds clock_high_water │
│ │ │ 3 = Phase 11.9 — adds MVCC log-record frames │
│ 12 │ 4 │ page size (u32 LE) = 4096 │
│ 16 │ 4 │ salt (u32 LE) — rolled each checkpoint │
│ 20 │ 4 │ checkpoint seq (u32 LE) — increments per ckpt │
@@ -343,9 +344,21 @@ A second file alongside the `.sqlrite`, named `<stem>.sqlrite-wal`, records page
were reserved-zero in v1, so a pre-Phase-11 WAL opens cleanly: the
parser interprets the zeros as `clock_high_water = 0`, which is
indistinguishable from "fresh checkpoint, clock has never advanced."
The next checkpoint rewrites the header at v2 — there's no offline
upgrade step. Forward versions we don't recognize (e.g. v3) error
out with a clean diagnostic rather than misinterpreting the bytes.
The next checkpoint rewrites the header at the current version —
there's no offline upgrade step. Forward versions we don't recognise
error out with a clean diagnostic rather than misinterpreting the
bytes.

**v2 → v3 compatibility.** v3 doesn't change the header layout at
all — only the set of frame kinds the body stream can carry. A v2
reader on a v3 file would still parse every frame correctly *except*
that it would not recognise the MVCC-marker frames and would skip
them silently as if they were unknown page numbers (the page-number
field reads `u32::MAX`). We bump the header anyway so v2 readers
emit the usual "unsupported WAL format version" diagnostic on a v3
WAL, surfacing the mismatch instead of silently losing MVCC
durability. The current build accepts v1..=v3 on open and writes v3
on every new WAL.
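
A sketch of the resulting open-time gate, with assumed constant
names:

```rust
// Sketch: accept v1..=v3 on open, fail loudly on anything newer.
const WAL_FORMAT_MIN: u32 = 1;
const WAL_FORMAT_CURRENT: u32 = 3; // written on every new WAL

fn check_wal_format(version: u32) -> Result<(), String> {
    if (WAL_FORMAT_MIN..=WAL_FORMAT_CURRENT).contains(&version) {
        Ok(()) // v1/v2 replay simply finds zero MVCC frames
    } else {
        Err(format!("unsupported WAL format version {version}"))
    }
}
```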

### Frames

@@ -367,6 +380,67 @@ Each frame is `FRAME_HEADER_SIZE + PAGE_SIZE` = **4112 bytes**:
└────────┴────────┴─────────────────────────────────────────────────┘
```

### MVCC log-record frames (Phase 11.9)

When the database is in `journal_mode = mvcc`, a successful `BEGIN
CONCURRENT` commit appends one extra frame ahead of the legacy page
frames: an MVCC log record that captures the resolved write-set.
The frame uses the same 4112-byte envelope but is distinguished by
the page-number field carrying the sentinel `u32::MAX`
(`MVCC_FRAME_MARKER`). Real page numbers are bounded by file size,
so the sentinel can never collide with a legitimate page frame.

The body carries:

```
┌────────┬────────┬─────────────────────────────────────────────────┐
│ offset │ length │ content │
├────────┼────────┼─────────────────────────────────────────────────┤
│ 0 │ 8 │ magic: "MVCC0001" (ASCII, no NUL) │
│ 8 │ 8 │ commit_ts (u64 LE) │
│ 16 │ 2 │ record count (u16 LE) │
│ 18 │ var. │ records — for each: │
│ │ │ 1 byte op tag (0 = Tombstone, 1 = Present) │
│ │ │ 2 + N table name (length-prefixed) │
│ │ │ 8 rowid (i64 LE) │
│ │ │ if op = 1: column count (u16 LE) + per-column │
│ │ │ (name, type tag, value) tuples │
│ ... │ ... │ zero-padded to PAGE_SIZE │
└────────┴────────┴─────────────────────────────────────────────────┘
```

Value type tags inside a record:

```
0 Null
1 Int — 8 bytes i64 LE
2 Real — 8 bytes f64 LE
3 Text — 4 + N bytes (u32 LE length, then UTF-8 bytes)
4 Bool — 1 byte (0 / 1)
5 Vector — 4 + 4*N bytes (u32 LE length, then f32 LE elements)
```

A batch must fit in the 4096-byte body; the encoder rejects
oversize batches with a typed error. Multi-frame batches (for
very large transactions) are a deferred follow-up.
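
A minimal sketch of this codec against the documented layout. Type
and function names differ from the real [`src/mvcc/log.rs`](../src/mvcc/log.rs);
the 2-byte length prefix on column names is an assumption, since the
table above only pins down the table-name prefix:

```rust
const PAGE_SIZE: usize = 4096;

enum Value { Null, Int(i64), Real(f64), Text(String), Bool(bool), Vector(Vec<f32>) }

enum Op {
    Tombstone,
    Present(Vec<(String, Value)>), // (column name, value) pairs
}

struct Record { table: String, rowid: i64, op: Op }

struct CommitBatch { commit_ts: u64, records: Vec<Record> }

// 2-byte length prefix, as documented for the table name.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u16).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

fn put_value(buf: &mut Vec<u8>, v: &Value) {
    match v {
        Value::Null => buf.push(0),
        Value::Int(i) => { buf.push(1); buf.extend_from_slice(&i.to_le_bytes()); }
        Value::Real(f) => { buf.push(2); buf.extend_from_slice(&f.to_le_bytes()); }
        Value::Text(s) => {
            buf.push(3);
            buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
            buf.extend_from_slice(s.as_bytes());
        }
        Value::Bool(b) => { buf.push(4); buf.push(*b as u8); }
        Value::Vector(xs) => {
            buf.push(5);
            buf.extend_from_slice(&(xs.len() as u32).to_le_bytes());
            for x in xs {
                buf.extend_from_slice(&x.to_le_bytes());
            }
        }
    }
}

fn encode(batch: &CommitBatch) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"MVCC0001");                    // magic
    buf.extend_from_slice(&batch.commit_ts.to_le_bytes()); // commit_ts u64 LE
    buf.extend_from_slice(&(batch.records.len() as u16).to_le_bytes());
    for r in &batch.records {
        buf.push(match &r.op { Op::Tombstone => 0, Op::Present(_) => 1 });
        put_str(&mut buf, &r.table);
        buf.extend_from_slice(&r.rowid.to_le_bytes());
        if let Op::Present(cols) = &r.op {
            buf.extend_from_slice(&(cols.len() as u16).to_le_bytes());
            for (name, value) in cols {
                put_str(&mut buf, name);
                put_value(&mut buf, value);
            }
        }
    }
    if buf.len() > PAGE_SIZE {
        return Err(format!("batch needs {} bytes; frame body is {}", buf.len(), PAGE_SIZE));
    }
    buf.resize(PAGE_SIZE, 0); // zero-pad to PAGE_SIZE
    Ok(buf)
}
```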

The MVCC frame is appended with `commit_page_count = None` (dirty)
so its own `fsync` is skipped. The very next legacy commit frame
that the same `save_database` writes will `fsync` the whole buffer
— covering both the MVCC frame and the legacy page updates in one
boundary. A crash between the two append calls drops both, which
is the right rollback semantics.

On reopen, the replay loop branches on `page_num`:
`MVCC_FRAME_MARKER` frames are decoded into `MvccCommitBatch` and
held in a pending list that is promoted onto the recovered list
each time the next legacy commit frame seals the transaction. The
`Pager` exposes the recovered batches as
`Pager::recovered_mvcc_commits()`; `pager::open_database` drains
them into `Database::mv_store` via `MvStore::push_committed` and
seeds `Database::mvcc_clock` past the highest replayed
`commit_ts`.
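
The pending-then-promote dance reduces to a small loop. A sketch
with illustrative types, keeping raw frame bodies in place of
decoded `MvccCommitBatch` values to stay self-contained:

```rust
const MVCC_FRAME_MARKER: u32 = u32::MAX;

struct Frame { page_num: u32, is_commit: bool, body: Vec<u8> }

fn recover_mvcc(frames: &[Frame]) -> Vec<Vec<u8>> {
    let mut pending = Vec::new();   // MVCC frames seen since the last commit
    let mut recovered = Vec::new(); // batches sealed by a later commit frame
    for f in frames {
        if f.page_num == MVCC_FRAME_MARKER {
            pending.push(f.body.clone());   // buffered, not yet sealed
        } else if f.is_commit {
            recovered.append(&mut pending); // commit frame seals the batch
        }
    }
    recovered // anything still pending was torn: dropped, never replayed
}
```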

### Torn-write recovery

On open the reader walks every frame from `WAL_HEADER_SIZE`, validating salt and checksum. The first invalid or incomplete frame marks the end of the usable log — its bytes and anything after stay on disk but are treated as nonexistent. Callers get a clean in-memory index of `(page → latest-committed-frame-offset)` and a `last_commit_offset` boundary.
14 changes: 11 additions & 3 deletions docs/roadmap.md
Expand Up @@ -657,7 +657,7 @@ Bounds in-memory growth of the [`MvStore`](../src/mvcc/store.rs) version chains.
- **Go SDK**: two new sentinel error values `sqlrite.ErrBusy` / `sqlrite.ErrBusySnapshot`, plus an `IsRetryable(err error) bool` helper. `wrapErr` recognises the new FFI status codes and wraps the engine message with `fmt.Errorf("…: %w", ErrBusy)`.
- **WASM SDK** — deliberately untouched (browser is single-threaded; multi-handle shape not yet exposed).

### 🚧 Phase 11.8 — Multi-handle SDK shape *(in progress, was plan-doc 11.8's other half; promoted ahead of plan-doc 11.5 again because the 11.7 retry-error machinery can't be exercised end-to-end through any SDK until siblings are reachable)*
### Phase 11.8 — Multi-handle SDK shape *(shipped; was plan-doc 11.8's other half; promoted ahead of plan-doc 11.5 again because the 11.7 retry-error machinery couldn't be exercised end-to-end through any SDK until siblings were reachable)*

Each pre-11.8 SDK `connect()` / `new Database()` built an *isolated* backing DB; the 11.7 `BusyError` / `errorKind` / `ErrBusy` plumbing was reachable but not actually triggerable from user code. This slice exposes the engine's `Connection::connect()` through every reachable language so apps can mint sibling handles that share state, and finally exercise the 11.7 retry idioms with real cross-handle conflicts.

@@ -669,9 +669,17 @@ Each pre-11.8 SDK `connect()` / `new Database()` built an *isolated* backing DB;

Each SDK gets end-to-end tests that exercise `BEGIN CONCURRENT` cross-handle conflicts: two sibling handles, two concurrent transactions on the same row, the second commit hits the SDK's typed retryable error, retry succeeds.
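
The engine-level Rust shape of that test, for reference. `Connection::open`, `execute`, and the exact error-matching ergonomics are assumptions here; `connect()`, `BEGIN CONCURRENT`, and `SQLRiteError::Busy` come from the slices above:

```rust
use sqlrite::{Connection, SQLRiteError};

fn main() -> Result<(), SQLRiteError> {
    let a = Connection::open("bank.sqlrite")?;
    a.execute("PRAGMA journal_mode = mvcc;")?;
    let b = a.connect()?; // sibling handle sharing backing state (11.8)

    a.execute("BEGIN CONCURRENT;")?;
    b.execute("BEGIN CONCURRENT;")?;
    a.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1;")?;
    b.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 1;")?;
    a.execute("COMMIT;")?; // first committer wins validation

    match b.execute("COMMIT;") {
        Err(SQLRiteError::Busy) => {
            // Typed retryable error: roll back, re-run on a fresh snapshot.
            b.execute("ROLLBACK;")?;
            b.execute("BEGIN CONCURRENT;")?;
            b.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 1;")?;
            b.execute("COMMIT;")?; // retry succeeds
        }
        other => { other?; }
    }
    Ok(())
}
```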

### Phase 11.9 — Checkpoint integration + crash recovery *(planned, plan-doc "Phase 10.5"; renumbered to follow SDK propagation because durability via the legacy `save_database` mirror already works in v0; this slice is foundation work for cross-process MVCC and column-level WAL deltas)*
### Phase 11.9 — WAL log-record durability + crash recovery *(plan-doc "Phase 10.5"; renumbered to follow SDK propagation because durability via the legacy `save_database` mirror already worked in v0)*

MVCC log-record WAL frame format (the deferred 11.4 piece). Commit appends log records pre-`save_database`. Reopen replays log records into `MvStore`. Checkpoint drains `MvStore` versions back into the pager (so `Mvcc → Wal` becomes legal once the store is empty). Crash-recovery test: kill mid-commit between log-record append and version-chain push; reopen; verify the committed transaction is visible and the half-written one is not.
MVCC commits now leave a typed log-record frame in the WAL on top of the existing page-level commit. The MVCC frame is appended before the legacy save's commit-frame fsync, so a single fsync covers both: a crash either keeps both or loses both. On reopen, the WAL replay decodes every MVCC frame and re-pushes the row versions into `MvStore`; the in-memory MVCC clock is seeded past the highest replayed `commit_ts` so post-restart transactions can never hand out a regressed `begin_ts`.

- **WAL format version bumped to v3.** v1 / v2 are still readable (replay just sees zero MVCC frames); v3 adds the MVCC frame marker (`page_num = u32::MAX`) and the body codec.
- **Frame body codec** ([`src/mvcc/log.rs`](../src/mvcc/log.rs)): `MvccCommitBatch { commit_ts, records }` encoded with magic `MVCC0001`, then `commit_ts` (u64 LE), record count (u16 LE), then per-record `(op tag, table name, rowid, optional column-value pairs)`. Everything fits in the 4 KiB frame body; the encoder surfaces a typed error if a single commit overflows (multi-frame batches are a deferred slice).
- **Append path** ([`src/connection.rs`](../src/connection.rs) `commit_concurrent`): after validation passes, the resolved write-set is encoded into a batch, appended to the WAL (no fsync), and then `save_database` runs and seals the transaction with its own fsync. The clock high-water in the WAL header is also bumped so a future checkpoint persists it.
- **Replay path** ([`src/sql/pager/mod.rs`](../src/sql/pager/mod.rs) `replay_mvcc_into_db`): drains `Pager::recovered_mvcc_commits` into `MvStore` and seeds the clock past `max(header.clock_high_water, max(commit_ts))`. Replay is unconditional — `JournalMode::Wal`-mode databases simply see zero frames.
- **Tests** ([`src/connection.rs`](../src/connection.rs)): six new cases cover round-trip persistence, multi-row batches, ROLLBACK-leaves-no-frame, legacy-commit-leaves-no-frame, multi-commit replay after an unclean close, and clock-seeding past the last `commit_ts`.

**Out of scope for 11.9** (parked for a follow-up): checkpoint draining the `MvStore` versions back into the pager (which would let `set_journal_mode(Mvcc → Wal)` succeed); a real OS-level kill-mid-commit test (the existing test uses a clean drop, which exercises the same crash-recovery codepath because the WAL is the durable record).

### Phase 11.10 — Indexes under MVCC *(deferred-by-design, plan-doc "Phase 10.7")*
