2 changes: 1 addition & 1 deletion docs/_index.md
@@ -54,7 +54,7 @@ As of May 2026, SQLRite has:
- Full-text search + hybrid retrieval (Phase 8 complete): FTS5-style inverted index with BM25 ranking + `fts_match` / `bm25_score` scalar functions + `try_fts_probe` optimizer hook + on-disk persistence with on-demand v4 → v5 file-format bump (8a-8c), a worked hybrid-retrieval example combining BM25 with vector cosine via raw arithmetic (8d), and a `bm25_search` MCP tool symmetric with `vector_search` (8e). See [`docs/fts.md`](fts.md).
- SQL surface + DX follow-ups (Phase 9 complete, v0.2.0 → v0.9.1): DDL completeness — `DEFAULT`, `DROP TABLE` / `DROP INDEX`, `ALTER TABLE` (9a); free-list + manual `VACUUM` (9b) + auto-VACUUM (9c); `IS NULL` / `IS NOT NULL` (9d); `GROUP BY` + aggregates + `DISTINCT` + `LIKE` + `IN` (9e); four flavors of `JOIN` — INNER, LEFT, RIGHT, FULL OUTER (9f); prepared statements + `?` parameter binding with a per-connection LRU plan cache (9g); HNSW probe widened to cosine + dot via `WITH (metric = …)` (9h); `PRAGMA` dispatcher with the `auto_vacuum` knob (9i)
- Benchmarks against SQLite + DuckDB (Phase 10 complete, SQLR-4 / SQLR-16): twelve-workload bench harness with a pluggable `Driver` trait, criterion-driven, pinned-host runs published. See [`docs/benchmarks.md`](benchmarks.md).
- Phase 11 (concurrent writes via MVCC + `BEGIN CONCURRENT`, SQLR-22) is in flight. **11.1 → 11.7: shipped.** Engine + SDK error propagation: `Connection` is `Send + Sync`; `Connection::connect()` mints sibling handles. `sqlrite::mvcc` exposes `MvccClock`, `ActiveTxRegistry`, `MvStore`, `ConcurrentTx`. WAL header v1 → v2 persists the clock high-water mark. `PRAGMA journal_mode = mvcc;` opts a database into MVCC. `BEGIN CONCURRENT` writes commit-validate against `MvStore` and abort with `SQLRiteError::Busy`. Reads via `Statement::query` see the BEGIN-time snapshot. Per-commit GC + `vacuum_mvcc()` bound the version chain growth. C FFI / Python / Node / Go all propagate `Busy` / `BusySnapshot` as typed retryable errors. **11.8 multi-handle SDK shape: shipped on this branch.** The FFI's `sqlrite_connect_sibling`, Python's `Connection.connect()`, and Node's `db.connect()` mint sibling handles that share backing state — closes the end-to-end gap from 11.7 where `BusyError` was reachable but not exerciseable through any SDK. Plan: [`docs/concurrent-writes-plan.md`](concurrent-writes-plan.md).
- Phase 11 (concurrent writes via MVCC + `BEGIN CONCURRENT`, SQLR-22) is in flight. **11.1 → 11.9: shipped.** Engine + SDK error propagation: `Connection` is `Send + Sync`; `Connection::connect()` mints sibling handles. `sqlrite::mvcc` exposes `MvccClock`, `ActiveTxRegistry`, `MvStore`, `ConcurrentTx`, and the `MvccCommitBatch` / `MvccLogRecord` WAL codec. WAL header v1 → v2 persisted the clock high-water mark; **v2 → v3 (11.9)** adds typed MVCC log-record frames. `PRAGMA journal_mode = mvcc;` opts a database into MVCC. `BEGIN CONCURRENT` writes commit-validate against `MvStore`, abort with `SQLRiteError::Busy`, and now also append an MVCC log-record frame to the WAL — covered by the same fsync as the legacy page commit. Reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest committed `commit_ts`, so the MVCC conflict-detection window survives a process restart. Reads via `Statement::query` see the BEGIN-time snapshot. Per-commit GC + `vacuum_mvcc()` bound version-chain growth. C FFI / Python / Node / Go all propagate `Busy` / `BusySnapshot` as typed retryable errors; the FFI's `sqlrite_connect_sibling`, Python's `Connection.connect()`, and Node's `db.connect()` mint sibling handles that share backing state. Plan: [`docs/concurrent-writes-plan.md`](concurrent-writes-plan.md).
- A fully-automated release pipeline that ships every product to its registry on every release with one human action — Rust engine + `sqlrite-ask` + `sqlrite-mcp` to crates.io, Python wheels to PyPI (`sqlrite`), Node.js + WASM to npm (`@joaoh82/sqlrite` + `@joaoh82/sqlrite-wasm`), Go module via `sdk/go/v*` git tag, plus C FFI tarballs, MCP binary tarballs, and unsigned desktop installers as GitHub Release assets (Phase 6 complete)

See the [Roadmap](roadmap.md) for the full phase plan.
8 changes: 5 additions & 3 deletions docs/concurrent-writes-plan.md
@@ -270,9 +270,11 @@ Goal: more than one `Connection` can target the same `Database` within a process

### Phase 10.5 — Checkpoint + crash recovery

- Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today.
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not.
> **Status (roadmap 11.9 — May 2026):** The crash-recovery half landed in roadmap Phase 11.9. WAL format is bumped to v3; commits append a typed `MvccCommitBatch` frame before the legacy save's fsync; reopen replays those frames into `MvStore` and seeds `MvccClock` past the highest `commit_ts`. The checkpoint-drain half — folding MVCC log records into pager-level updates and re-enabling the `Mvcc → Wal` journal-mode downgrade — is the remaining slice and stays parked for a follow-up.

- ~~Extend the checkpointer to drain MVCC log records into pager-level updates before folding the WAL into the main file.~~ *Deferred — see status note above.*
- Crash recovery: on open, replay WAL log records into `MvStore`, then replay pager-level commit frames as today. **(Shipped — 11.9.)**
- Tests: kill the process mid-MVCC-commit (between log-record append and version-chain push), reopen, verify the committed transaction is visible and the half-written one is not. **(Partially shipped — 11.9 covers the clean-drop case, which exercises the same recovery codepath; a real OS-kill test is parked with the checkpoint-drain follow-up. See the sketch below.)**
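
A sketch of the clean-drop test shape that shipped. Helper names (`Connection::open`, `execute`, the `query` return shape) are assumptions; the real cases live in `src/connection.rs`:

```rust
use sqlrite::Connection;

#[test]
fn committed_mvcc_tx_survives_reopen() {
    let path = "recovery-test.sqlrite";
    {
        let conn = Connection::open(path).unwrap();
        conn.execute("PRAGMA journal_mode = mvcc;").unwrap();
        conn.execute("CREATE TABLE t (id INTEGER, v TEXT);").unwrap();
        conn.execute("BEGIN CONCURRENT;").unwrap();
        conn.execute("INSERT INTO t VALUES (1, 'committed');").unwrap();
        conn.execute("COMMIT;").unwrap();
    } // dropped without a checkpoint: the WAL is the only durable record

    let conn = Connection::open(path).unwrap(); // replay seeds MvStore + clock
    let rows = conn.query("SELECT v FROM t WHERE id = 1;").unwrap();
    assert_eq!(rows.len(), 1); // the committed transaction is visible
}
```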

### Phase 10.6 — Garbage collection

57 changes: 57 additions & 0 deletions docs/design-decisions.md
@@ -252,6 +252,63 @@ Each statement inside the transaction runs against the working `tables` clone via

---

### 12g. MVCC commits piggyback on the legacy save's fsync (Phase 11.9)

**Decision.** `BEGIN CONCURRENT` commits now leave a typed
[`MvccCommitBatch`](../src/mvcc/log.rs) frame in the WAL *before*
the legacy `save_database` runs. The MVCC frame uses
`page_num = MVCC_FRAME_MARKER (u32::MAX)` and `commit_page_count =
None`, so it is **not** fsync'd on its own. The legacy save then
appends its page commits and ends with the existing page-0 commit
frame (which *is* fsync'd). That single fsync flushes everything
buffered behind it, covering the MVCC frame and the page commits
in one durability boundary.
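
A sketch of that ordering, using a hypothetical `append_frame`
signature in place of the real WAL API; the actual sequencing lives
in `commit_concurrent` and `save_database`:

```rust
// Hypothetical WAL shape — illustrates the ordering only, not the real API.
struct Wal { /* file handle, write buffer, … */ }

impl Wal {
    /// `commit_page_count = None` buffers the frame without fsync;
    /// `Some(n)` marks a commit barrier and fsyncs the file.
    fn append_frame(&mut self, page_num: u32, body: &[u8], commit_page_count: Option<u32>) {
        let _ = (page_num, body, commit_page_count); // sketch: no real I/O
    }
}

const MVCC_FRAME_MARKER: u32 = u32::MAX;

fn commit(wal: &mut Wal, mvcc_body: &[u8], pages: &[(u32, Vec<u8>)], page0: &[u8]) {
    // 1. MVCC log record first: buffered behind the commit barrier, no fsync.
    wal.append_frame(MVCC_FRAME_MARKER, mvcc_body, None);
    // 2. Legacy page frames from save_database: also buffered.
    for (num, bytes) in pages {
        wal.append_frame(*num, bytes, None);
    }
    // 3. Page-0 commit frame: the single fsync, sealing everything above it.
    wal.append_frame(0, page0, Some(pages.len() as u32 + 1));
}
```

The only invariant that matters is that the MVCC frame sits behind
the commit barrier that actually fsyncs.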

**Why piggyback rather than fsync per MVCC frame.** Each fsync is
the dominant cost of a small commit (often >90% of wall time on
SSDs). Dual fsyncs would double the cost of a `BEGIN CONCURRENT`
commit relative to a legacy commit for no correctness gain — the
two writes already need to be atomic with each other (a crash that
keeps one but loses the other would either resurrect uncommitted
state in `MvStore` or hide a durable legacy update from the
in-memory MVCC index). Sharing the boundary makes the atomicity
free: torn-write recovery already drops dirty frames past the last
commit barrier, and that recovery treats the MVCC frame as just
another dirty frame waiting for its commit barrier.

**Why the marker is `u32::MAX`.** Page numbers are bounded by the
file's `page_count`, which sits well below `u32::MAX` for any
realistic database (the maximum page-addressable file size at
4 KiB pages is 16 TiB). Choosing the sentinel from outside the
legal range keeps the discriminator a single integer comparison
on the existing frame-header layout — no new flag field, no
binary-incompatible header.

**Why the clock is seeded from `max(header.clock_high_water,
max(commit_ts in WAL))`.** The WAL header's `clock_high_water`
field is only persisted on checkpoint (which fsyncs the
truncated WAL). Between checkpoints, the in-memory header is
ahead of the on-disk header — and an unclean process exit drops
that in-memory lead. The MVCC frames themselves are durable, and
each carries its `commit_ts`, so the replay walks the recovered
batches and takes the higher of the two seeds. Without this
maxing step a crash between commits and checkpoint could let a
post-reopen transaction hand out a `begin_ts` *below* an
already-committed version's `end` — an immediate snapshot-isolation
violation.
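
The seeding rule itself is one `max` over the two durable sources.
A minimal sketch, with illustrative names; the real logic lives in
`replay_mvcc_into_db`:

```rust
// Sketch of the clock-seeding rule; whether the engine adds exactly 1
// or advances the clock some other way is an implementation detail.
fn seed_clock(header_clock_high_water: u64, replayed_commit_ts: &[u64]) -> u64 {
    let replayed_max = replayed_commit_ts.iter().copied().max().unwrap_or(0);
    // The next begin_ts must land strictly above every committed version.
    header_clock_high_water.max(replayed_max) + 1
}

fn main() {
    assert_eq!(seed_clock(7, &[9, 12]), 13);  // WAL frames ahead of the header
    assert_eq!(seed_clock(20, &[9, 12]), 21); // header ahead (fresh checkpoint)
}
```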

**What 11.9 deferred.** The checkpoint half of plan-doc Phase
10.5: draining `MvStore` versions back into the pager so a
WAL truncate doesn't lose them, and re-enabling the `Mvcc → Wal`
journal-mode downgrade. The legacy save mirror still covers
durability of the visible row state on the read path, so this
gap is foundation work — not a correctness regression — and
the existing per-commit GC bounds in-memory chain growth.

**Plan-doc reference.** [`concurrent-writes-plan.md`](concurrent-writes-plan.md) §3.3 (durability model), §4.6 (WAL log records), §10.5 (checkpoint integration — partially shipped, see note in plan doc).

---

## Query execution

### 13. `NULL`-as-false in `WHERE` clauses
80 changes: 77 additions & 3 deletions docs/file-format.md
@@ -329,6 +329,7 @@ A second file alongside the `.sqlrite`, named `<stem>.sqlrite-wal`, records page
│ 8 │ 4 │ format version (u32 LE) │
│ │ │ 1 = pre-Phase-11 │
│ │ │ 2 = Phase 11.2 — adds clock_high_water │
│ │ │ 3 = Phase 11.9 — adds MVCC log-record frames │
│ 12 │ 4 │ page size (u32 LE) = 4096 │
│ 16 │ 4 │ salt (u32 LE) — rolled each checkpoint │
│ 20 │ 4 │ checkpoint seq (u32 LE) — increments per ckpt │
@@ -343,9 +344,21 @@ A second file alongside the `.sqlrite`, named `<stem>.sqlrite-wal`, records page
were reserved-zero in v1, so a pre-Phase-11 WAL opens cleanly: the
parser interprets the zeros as `clock_high_water = 0`, which is
indistinguishable from "fresh checkpoint, clock has never advanced."
The next checkpoint rewrites the header at v2 — there's no offline
upgrade step. Forward versions we don't recognize (e.g. v3) error
out with a clean diagnostic rather than misinterpreting the bytes.
The next checkpoint rewrites the header at the current version —
there's no offline upgrade step. Forward versions we don't recognise
error out with a clean diagnostic rather than misinterpreting the
bytes.

**v2 → v3 compatibility.** v3 doesn't change the header layout at
all — only the set of frame kinds the body stream can carry. A v2
reader on a v3 file would still parse every frame correctly *except*
that it would not recognise the MVCC-marker frames and would skip
them silently as if they were unknown page numbers (the page-number
field reads `u32::MAX`). We bump the header anyway so v2 readers
emit the usual "unsupported WAL format version" diagnostic on a v3
WAL, surfacing the mismatch instead of silently losing MVCC
durability. The current build accepts v1..=v3 on open and writes v3
on every new WAL.
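
A sketch of the resulting open-time gate, with assumed constant
names:

```rust
// Sketch: accept v1..=v3 on open, fail loudly on anything newer.
const WAL_FORMAT_MIN: u32 = 1;
const WAL_FORMAT_CURRENT: u32 = 3; // written on every new WAL

fn check_wal_format(version: u32) -> Result<(), String> {
    if (WAL_FORMAT_MIN..=WAL_FORMAT_CURRENT).contains(&version) {
        Ok(()) // v1/v2 replay simply finds zero MVCC frames
    } else {
        Err(format!("unsupported WAL format version {version}"))
    }
}
```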

### Frames

@@ -367,6 +380,67 @@ Each frame is `FRAME_HEADER_SIZE + PAGE_SIZE` = **4112 bytes**:
└────────┴────────┴─────────────────────────────────────────────────┘
```

### MVCC log-record frames (Phase 11.9)

When the database is in `journal_mode = mvcc`, a successful `BEGIN
CONCURRENT` commit appends one extra frame ahead of the legacy page
frames: an MVCC log record that captures the resolved write-set.
The frame uses the same 4112-byte envelope but is distinguished by
the page-number field carrying the sentinel `u32::MAX`
(`MVCC_FRAME_MARKER`). Real page numbers are bounded by file size,
so the sentinel can never collide with a legitimate page frame.

The body carries:

```
┌────────┬────────┬─────────────────────────────────────────────────┐
│ offset │ length │ content │
├────────┼────────┼─────────────────────────────────────────────────┤
│ 0 │ 8 │ magic: "MVCC0001" (ASCII, no NUL) │
│ 8 │ 8 │ commit_ts (u64 LE) │
│ 16 │ 2 │ record count (u16 LE) │
│ 18 │ var. │ records — for each: │
│ │ │ 1 byte op tag (0 = Tombstone, 1 = Present) │
│ │ │ 2 + N table name (length-prefixed) │
│ │ │ 8 rowid (i64 LE) │
│ │ │ if op = 1: column count (u16 LE) + per-column │
│ │ │ (name, type tag, value) tuples │
│ ... │ ... │ zero-padded to PAGE_SIZE │
└────────┴────────┴─────────────────────────────────────────────────┘
```

Value type tags inside a record:

```
0 Null
1 Int — 8 bytes i64 LE
2 Real — 8 bytes f64 LE
3 Text — 4 + N bytes (u32 LE length, then UTF-8 bytes)
4 Bool — 1 byte (0 / 1)
5 Vector — 4 + 4*N bytes (u32 LE length, then f32 LE elements)
```

A batch must fit in the 4096-byte body; the encoder rejects
oversize batches with a typed error. Multi-frame batches (for
very large transactions) are a deferred follow-up.
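
A minimal sketch of this codec against the documented layout. Type
and function names differ from the real [`src/mvcc/log.rs`](../src/mvcc/log.rs);
the 2-byte length prefix on column names is an assumption, since the
table above only pins down the table-name prefix:

```rust
const PAGE_SIZE: usize = 4096;

enum Value { Null, Int(i64), Real(f64), Text(String), Bool(bool), Vector(Vec<f32>) }

enum Op {
    Tombstone,
    Present(Vec<(String, Value)>), // (column name, value) pairs
}

struct Record { table: String, rowid: i64, op: Op }

struct CommitBatch { commit_ts: u64, records: Vec<Record> }

// 2-byte length prefix, as documented for the table name.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u16).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

fn put_value(buf: &mut Vec<u8>, v: &Value) {
    match v {
        Value::Null => buf.push(0),
        Value::Int(i) => { buf.push(1); buf.extend_from_slice(&i.to_le_bytes()); }
        Value::Real(f) => { buf.push(2); buf.extend_from_slice(&f.to_le_bytes()); }
        Value::Text(s) => {
            buf.push(3);
            buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
            buf.extend_from_slice(s.as_bytes());
        }
        Value::Bool(b) => { buf.push(4); buf.push(*b as u8); }
        Value::Vector(xs) => {
            buf.push(5);
            buf.extend_from_slice(&(xs.len() as u32).to_le_bytes());
            for x in xs {
                buf.extend_from_slice(&x.to_le_bytes());
            }
        }
    }
}

fn encode(batch: &CommitBatch) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    buf.extend_from_slice(b"MVCC0001");                    // magic
    buf.extend_from_slice(&batch.commit_ts.to_le_bytes()); // commit_ts u64 LE
    buf.extend_from_slice(&(batch.records.len() as u16).to_le_bytes());
    for r in &batch.records {
        buf.push(match &r.op { Op::Tombstone => 0, Op::Present(_) => 1 });
        put_str(&mut buf, &r.table);
        buf.extend_from_slice(&r.rowid.to_le_bytes());
        if let Op::Present(cols) = &r.op {
            buf.extend_from_slice(&(cols.len() as u16).to_le_bytes());
            for (name, value) in cols {
                put_str(&mut buf, name);
                put_value(&mut buf, value);
            }
        }
    }
    if buf.len() > PAGE_SIZE {
        return Err(format!("batch needs {} bytes; frame body is {}", buf.len(), PAGE_SIZE));
    }
    buf.resize(PAGE_SIZE, 0); // zero-pad to PAGE_SIZE
    Ok(buf)
}
```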

The MVCC frame is appended with `commit_page_count = None` (dirty)
so its own `fsync` is skipped. The very next legacy commit frame
that the same `save_database` writes will `fsync` the whole buffer
— covering both the MVCC frame and the legacy page updates in one
boundary. A crash between the two append calls drops both, which
is the right rollback semantics.

On reopen, the replay loop branches on `page_num`:
`MVCC_FRAME_MARKER` frames are decoded into `MvccCommitBatch` and
held in a pending list that is promoted onto the recovered list
each time the next legacy commit frame seals the transaction. The
`Pager` exposes the recovered batches as
`Pager::recovered_mvcc_commits()`; `pager::open_database` drains
them into `Database::mv_store` via `MvStore::push_committed` and
seeds `Database::mvcc_clock` past the highest replayed
`commit_ts`.
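
The pending-then-promote dance reduces to a small loop. A sketch
with illustrative types, keeping raw frame bodies in place of
decoded `MvccCommitBatch` values to stay self-contained:

```rust
const MVCC_FRAME_MARKER: u32 = u32::MAX;

struct Frame { page_num: u32, is_commit: bool, body: Vec<u8> }

fn recover_mvcc(frames: &[Frame]) -> Vec<Vec<u8>> {
    let mut pending = Vec::new();   // MVCC frames seen since the last commit
    let mut recovered = Vec::new(); // batches sealed by a later commit frame
    for f in frames {
        if f.page_num == MVCC_FRAME_MARKER {
            pending.push(f.body.clone());   // buffered, not yet sealed
        } else if f.is_commit {
            recovered.append(&mut pending); // commit frame seals the batch
        }
    }
    recovered // anything still pending was torn: dropped, never replayed
}
```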

### Torn-write recovery

On open the reader walks every frame from `WAL_HEADER_SIZE`, validating salt and checksum. The first invalid or incomplete frame marks the end of the usable log — its bytes and anything after stay on disk but are treated as nonexistent. Callers get a clean in-memory index of `(page → latest-committed-frame-offset)` and a `last_commit_offset` boundary.
14 changes: 11 additions & 3 deletions docs/roadmap.md
Expand Up @@ -657,7 +657,7 @@ Bounds in-memory growth of the [`MvStore`](../src/mvcc/store.rs) version chains.
- **Go SDK**: two new sentinel error values `sqlrite.ErrBusy` / `sqlrite.ErrBusySnapshot`, plus an `IsRetryable(err error) bool` helper. `wrapErr` recognises the new FFI status codes and wraps the engine message with `fmt.Errorf("…: %w", ErrBusy)`.
- **WASM SDK** — deliberately untouched (browser is single-threaded; multi-handle shape not yet exposed).

### 🚧 Phase 11.8 — Multi-handle SDK shape *(in progress, was plan-doc 11.8's other half; promoted ahead of plan-doc 11.5 again because the 11.7 retry-error machinery can't be exercised end-to-end through any SDK until siblings are reachable)*
### Phase 11.8 — Multi-handle SDK shape *(shipped; was plan-doc 11.8's other half; promoted ahead of plan-doc 11.5 again because the 11.7 retry-error machinery couldn't be exercised end-to-end through any SDK until siblings were reachable)*

Each pre-11.8 SDK `connect()` / `new Database()` built an *isolated* backing DB; the 11.7 `BusyError` / `errorKind` / `ErrBusy` plumbing was reachable but not actually triggerable from user code. This slice exposes the engine's `Connection::connect()` through every reachable language so apps can mint sibling handles that share state, and finally exercise the 11.7 retry idioms with real cross-handle conflicts.

@@ -669,9 +669,17 @@ Each pre-11.8 SDK `connect()` / `new Database()` built an *isolated* backing DB;

Each SDK gets end-to-end tests that exercise `BEGIN CONCURRENT` cross-handle conflicts: two sibling handles, two concurrent transactions on the same row, the second commit hits the SDK's typed retryable error, retry succeeds.
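
The engine-level Rust shape of that test, for reference. `Connection::open`, `execute`, and the exact error-matching ergonomics are assumptions here; `connect()`, `BEGIN CONCURRENT`, and `SQLRiteError::Busy` come from the slices above:

```rust
use sqlrite::{Connection, SQLRiteError};

fn main() -> Result<(), SQLRiteError> {
    let a = Connection::open("bank.sqlrite")?;
    a.execute("PRAGMA journal_mode = mvcc;")?;
    let b = a.connect()?; // sibling handle sharing backing state (11.8)

    a.execute("BEGIN CONCURRENT;")?;
    b.execute("BEGIN CONCURRENT;")?;
    a.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1;")?;
    b.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 1;")?;
    a.execute("COMMIT;")?; // first committer wins validation

    match b.execute("COMMIT;") {
        Err(SQLRiteError::Busy) => {
            // Typed retryable error: roll back, re-run on a fresh snapshot.
            b.execute("ROLLBACK;")?;
            b.execute("BEGIN CONCURRENT;")?;
            b.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 1;")?;
            b.execute("COMMIT;")?; // retry succeeds
        }
        other => { other?; }
    }
    Ok(())
}
```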

### Phase 11.9 — Checkpoint integration + crash recovery *(planned, plan-doc "Phase 10.5"; renumbered to follow SDK propagation because durability via the legacy `save_database` mirror already works in v0; this slice is foundation work for cross-process MVCC and column-level WAL deltas)*
### Phase 11.9 — WAL log-record durability + crash recovery *(plan-doc "Phase 10.5"; renumbered to follow SDK propagation because durability via the legacy `save_database` mirror already worked in v0)*

MVCC log-record WAL frame format (the deferred 11.4 piece). Commit appends log records pre-`save_database`. Reopen replays log records into `MvStore`. Checkpoint drains `MvStore` versions back into the pager (so `Mvcc → Wal` becomes legal once the store is empty). Crash-recovery test: kill mid-commit between log-record append and version-chain push; reopen; verify the committed transaction is visible and the half-written one is not.
MVCC commits now leave a typed log-record frame in the WAL on top of the existing page-level commit. The MVCC frame is appended before the legacy save's commit-frame fsync, so a single fsync covers both: a crash either keeps both or loses both. On reopen, the WAL replay decodes every MVCC frame and re-pushes the row versions into `MvStore`; the in-memory MVCC clock is seeded past the highest replayed `commit_ts` so post-restart transactions can never hand out a regressed `begin_ts`.

- **WAL format version bumped to v3.** v1 / v2 are still readable (replay just sees zero MVCC frames); v3 adds the MVCC frame marker (`page_num = u32::MAX`) and the body codec.
- **Frame body codec** ([`src/mvcc/log.rs`](../src/mvcc/log.rs)): `MvccCommitBatch { commit_ts, records }` encoded with magic `MVCC0001`, then `commit_ts` (u64 LE), record count (u16 LE), then per-record `(op tag, table name, rowid, optional column-value pairs)`. Everything fits in the 4 KiB frame body; the encoder surfaces a typed error if a single commit overflows (multi-frame batches are a deferred slice).
- **Append path** ([`src/connection.rs`](../src/connection.rs) `commit_concurrent`): after validation passes, the resolved write-set is encoded into a batch, appended to the WAL (no fsync), and then `save_database` runs and seals the transaction with its own fsync. The clock high-water in the WAL header is also bumped so a future checkpoint persists it.
- **Replay path** ([`src/sql/pager/mod.rs`](../src/sql/pager/mod.rs) `replay_mvcc_into_db`): drains `Pager::recovered_mvcc_commits` into `MvStore` and seeds the clock past `max(header.clock_high_water, max(commit_ts))`. Replay is unconditional — `JournalMode::Wal`-mode databases simply see zero frames.
- **Tests** ([`src/connection.rs`](../src/connection.rs)): six new cases cover round-trip persistence, multi-row batches, ROLLBACK-leaves-no-frame, legacy-commit-leaves-no-frame, multi-commit replay after an unclean close, and clock-seeding past the last `commit_ts`.

**Out of scope for 11.9** (parked for a follow-up): checkpoint draining the `MvStore` versions back into the pager (which would let `set_journal_mode(Mvcc → Wal)` succeed); a real OS-level kill-mid-commit test (the existing test uses a clean drop, which exercises the same crash-recovery codepath because the WAL is the durable record).

### Phase 11.10 — Indexes under MVCC *(deferred-by-design, plan-doc "Phase 10.7")*
