hyperpolymath · hyperpolymath · May 13, 2026 · May 13, 2026
diff --git a/docs/architecture/TOPOLOGY.md b/docs/architecture/TOPOLOGY.md
@@ -12,7 +12,7 @@ verisimiser/
 │   ├── src/manifest/         — TOML manifest parsing (verisimiser.toml)
 │   ├── src/tier1/            — Tier 1 piggyback data types
 │   │   ├── drift.rs          — DriftReport, DriftCategory (8 categories)
-│   │   ├── provenance.rs     — ProvenanceRecord, SHA-256 hash chain
+│   │   ├── provenance.rs     — re-exports abi::ProvenanceEntry; future write-path helpers (V-L1-C1)
 │   │   └── temporal.rs       — TemporalVersion, point-in-time snapshots
 │   ├── src/tier2/            — Tier 2 overlay stubs (graph, vector, tensor, semantic, document, spatial)
 │   ├── src/intercept/        — Per-backend interception strategies

diff --git a/docs/theory/provenance-threat-model.adoc b/docs/theory/provenance-threat-model.adoc
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: PMPL-1.0-or-later
+// Copyright (c) 2026 Jonathan D.A. Jewell (hyperpolymath) <j.d.a.jewell@open.ac.uk>
+= Provenance threat model
+:toc: left
+:toclevels: 3
+:icons: font
+
+This document fixes what the Provenance concern's hash chain proves and
+what it doesn't. It binds the design choices made in V-L2-C1, V-L2-C2,
+V-L2-L1, V-L2-L2, and the ADR-0004 follow-up.
+
+Resolves: https://github.com/hyperpolymath/verisimiser/issues/25[V-L1-B1].
+
+== Scope
+
+In scope:: the `Provenance` octad concern as implemented by
+`ProvenanceEntry` in `src/abi/mod.rs` plus the sidecar table
+`verisimdb_provenance_log` plus (post V-L2-L1) the `chain_head` table.
+
+Out of scope:: denial-of-service against the sidecar; side-channels
+(timing, cache); tampering of the target database itself
+(verisimiser only sees what its interceptors intercept); retroactive
+provenance for pre-existing rows (the genesis entry for an entity
+attests its existence at the moment verisimiser started observing it,
+not before); cryptographic compromise of SHA-256.
+
+== Adversaries
+
+Four adversaries cover the relevant capability axes. Each is a
+*lattice point*; real attackers combine capabilities.
+
+[cols="1,3"]
+|===
+| Tag | Capability
+
+| **R**  | Read-only — can read both the target database and the
+sidecar. No write to either. Models: a forensic auditor;
+a leaked replica; a debugging copy on a laptop.
+
+| **SW** | Sidecar-Write — can append new rows to
+`verisimdb_provenance_log` and `verisimdb_temporal_versions` but
+**cannot delete or rewrite existing rows**. Models: a sidecar
+configured append-only (filesystem-level WORM, S3 Object Lock, or a
+database role with DELETE/UPDATE revoked; SQLite has no
+GRANT/REVOKE, so there the restriction has to come from the access
+layer); also models a buggy verisimiser daemon that double-writes.
+
+| **SR** | Sidecar-Rewrite — can rewrite or delete arbitrary rows
+in the sidecar. Models: root on the sidecar host; compromised
+application credential with full sidecar privileges; a backup
+operator restoring an older sidecar snapshot.
+
+| **CK** | Clock-skew — can write entries (via SW or SR) with
+timestamps that lie. Models: a system clock that drifts; an
+adversary who controls the clock source; coordinated backdating.
+|===
+
+== Per-adversary protection matrix
+
+For each adversary, what the chain proves about each field:
+**P** = protected (tampering detected),
+**N** = not protected,
+**C** = conditionally protected (see note).
+
+[cols="2,1,1,1,1"]
+|===
+| Field | R | SW | SR | SR+CK
+
+| Genesis existence / order        | P | P | N    | N
+| `previous_hash` of any entry     | P | P | C-1  | C-1
+| `entity_id` of any entry         | P | P | C-1  | C-1
+| `operation` of any entry         | P | P | C-1  | C-1
+| `actor` of any entry             | P | P | C-1  | C-1
+| `timestamp` of any entry         | P | P | C-1  | N (CK falsifies)
+| `before_snapshot` of any entry   | P | P | C-1  | C-1
+| `transformation` of any entry    | P | P | C-1  | C-1
+| Absence of an entry              | C-2 | C-2 | N    | N
+| Total ordering across entities   | N | N | N    | N
+|===
+
+**C-1** — under SR (or SR+CK), the adversary can rewrite an
+arbitrary suffix of the chain (recomputing hashes as they go). What's
+preserved against SR is **only the prefix up to the most-recent
+externally attested hash** (e.g. a hash periodically signed by an
+out-of-band notary, anchored to an append-only log, or published to
+a transparency service). Without an external anchor, the chain
+proves *nothing* against SR.
+
+**C-2** — absence is provable only if every legitimate append goes
+through verisimiser. Direct writes to the target database that
+bypass interception are invisible to the chain; the chain cannot
+attest to what it never saw.
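The C-1 note can be illustrated with a minimal sketch in Rust. It assumes a hypothetical `Link` view of a chain row (only `previous_hash` and the row's own `hash`) and a hypothetical helper `anchored_prefix_len`; neither name comes from the verisimiser code base. It checks linkage only. A real verifier would also recompute every hash from its preimage.

```rust
/// Hypothetical minimal view of one chain row: the hash it links to and its own hash.
pub struct Link {
    pub previous_hash: [u8; 32],
    pub hash: [u8; 32],
}

/// Returns how many leading entries are covered by `anchor`, or `None`
/// if the links are broken or the anchored hash is not in the chain.
/// Assumption: each `hash` has already been recomputed from its preimage elsewhere.
pub fn anchored_prefix_len(chain: &[Link], anchor: &[u8; 32]) -> Option<usize> {
    // Every entry after the first must point at its predecessor's hash.
    let linked = chain.windows(2).all(|w| w[1].previous_hash == w[0].hash);
    if !linked {
        return None;
    }
    // Entries up to and including the anchored one are pinned by the anchor;
    // everything after it is the suffix an SR adversary can still rewrite.
    chain.iter().position(|l| &l.hash == anchor).map(|i| i + 1)
}
```

With a three-entry chain and an anchor on the second entry's hash, the function reports a protected prefix of two entries; the third entry is only as trustworthy as the sidecar host.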
+
+== Field coverage requirement
+
+A direct consequence of C-1 / C-2 and the per-adversary matrix:
+
+[NOTE]
+====
+Every field that an auditor will rely on for forensic purposes
+**must** participate in the hash. `actor`, `before_snapshot`, and
+`transformation` are all such fields — they are the audit. If they
+are not in the preimage, the chain protects them against R and SW
+only by *coincidence* (because the row itself was hash-keyed in the
+DB), not by design.
+
+This document therefore *requires* V-L2-C1: the preimage must cover
+`previous_hash`, `entity_id`, `operation`, `actor`, `timestamp`,
+`before_snapshot`, `transformation`. Any future field added to
+`ProvenanceEntry` must either be added to the preimage or
+explicitly recorded here with a justification for its omission.
+====
+
+== Canonical encoding requirement
+
+A direct consequence of "the hash protects the field" being a
+*function*, not a relation:
+
+[NOTE]
+====
+Two distinct preimages must produce distinct hashes (collision
+resistance is SHA-256's job). Two distinct entries must produce
+distinct preimages, and one entry must have exactly one preimage
+(injective, canonical encoding is our job). The encoding must:
+
+. Domain-separate verisimiser provenance hashes from any other
+hash the system computes (`b"verisim-prov-v1\0"`).
+. Length-prefix variable-length fields so concatenation is
+unambiguous.
+. Use a canonical timestamp encoding (V-L2-C2:
+  `i64_le(secs) || u32_le(nanos)`), not a string representation that
+  admits multiple valid forms for the same instant.
+====
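The three encoding rules above can be sketched as a preimage builder. This is an illustration under stated assumptions, not the V-L2-C1 implementation: the `Entry` struct is a hypothetical stand-in for `abi::ProvenanceEntry`, the field order and the string types of `operation` and `actor` are guesses, and the `u64_le` length prefix is taken from the Open questions section. The SHA-256 step is left out so the sketch needs only the standard library.

```rust
/// Domain-separation tag quoted in the threat model.
const DOMAIN_TAG: &[u8] = b"verisim-prov-v1\0";

/// Hypothetical stand-in for `abi::ProvenanceEntry`; the real struct may differ.
pub struct Entry {
    pub previous_hash: [u8; 32],
    pub entity_id: String,
    pub operation: String,
    pub actor: String,
    pub secs: i64,
    pub nanos: u32,
    pub before_snapshot: Option<String>,
    pub transformation: Option<String>,
}

/// Append a variable-length field as `u64_le(len) || bytes`.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// Build the byte string that would be fed to SHA-256 (field order is an assumption).
pub fn preimage(e: &Entry) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(DOMAIN_TAG);
    out.extend_from_slice(&e.previous_hash); // fixed width, so no length prefix
    put_bytes(&mut out, e.entity_id.as_bytes());
    put_bytes(&mut out, e.operation.as_bytes());
    put_bytes(&mut out, e.actor.as_bytes());
    // Canonical timestamp: i64_le(secs) || u32_le(nanos).
    out.extend_from_slice(&e.secs.to_le_bytes());
    out.extend_from_slice(&e.nanos.to_le_bytes());
    // None and Some("") both encode as a zero length, per the current proposal.
    put_bytes(&mut out, e.before_snapshot.as_deref().unwrap_or("").as_bytes());
    put_bytes(&mut out, e.transformation.as_deref().unwrap_or("").as_bytes());
    out
}
```

The length prefixes are what keep `entity_id = "ab", operation = "c"` and `entity_id = "a", operation = "bc"` apart: plain concatenation would give both the bytes `abc`, while the prefixed forms differ.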
+
+== Append serialisation requirement
+
+A direct consequence of "previous_hash chains entries linearly":
+
+[NOTE]
+====
+Two writers cannot independently chain from the same `previous_hash`
+without forking the chain. Verisimiser must serialise appends
+per-entity. V-L2-L1 specifies the write-path lock; V-L2-L2 specifies
+the database UNIQUE constraint that makes forks structurally
+impossible even if the lock is bypassed.
+
+The chain is *per-entity-serial* but *cross-entity-parallel*.
+A global serial order across entities is *not* a requirement
+(see "Total ordering" in the matrix above).
+====
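The fork rule above can be modelled in memory. The sketch below assumes a hypothetical `ChainIndex` type that plays the role of the V-L2-L2 UNIQUE constraint on `(entity_id, previous_hash)`; it is not the database schema and it does not model the V-L2-L1 lock.

```rust
use std::collections::HashSet;

/// In-memory model of a UNIQUE(entity_id, previous_hash) constraint.
#[derive(Default)]
pub struct ChainIndex {
    seen: HashSet<(String, [u8; 32])>,
}

#[derive(Debug, PartialEq)]
pub enum AppendError {
    /// Another entry for this entity already chains from this hash.
    Fork,
}

impl ChainIndex {
    /// Record an append; fails if it would create a second child of `previous_hash`.
    pub fn try_append(
        &mut self,
        entity_id: &str,
        previous_hash: [u8; 32],
    ) -> Result<(), AppendError> {
        // `insert` returns false when the pair is already present.
        if self.seen.insert((entity_id.to_string(), previous_hash)) {
            Ok(())
        } else {
            Err(AppendError::Fork)
        }
    }
}
```

Two appends for the same entity from the same `previous_hash` are rejected on the second attempt, while a different entity chaining from the same hash value is accepted. That matches the per-entity-serial, cross-entity-parallel rule.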
+
+== Anchor / notary (future)
+
+Protection against SR requires an *external anchor* that the
+adversary cannot rewrite. Options, none of which this document
+mandates yet:
+
+. **Periodic notarisation** — every N minutes, sign the latest
+`chain_head` with a key not held on the sidecar host, and publish
+the signature to an out-of-band log.
+. **Transparency log** — submit each `chain_head` update to an
+external append-only log (Sigstore-style).
+. **Replication to immutable storage** — write each new entry to
+S3 Object Lock (or equivalent) as a defence in depth.
+
+The threat model leaves the choice for ADR-0005 once a deployment
+context exists.
+
+== Out-of-band assumptions
+
+. The sidecar host's clock is monotonic and within bounded skew of
+real time. Without this, all timestamps are advisory (see the CK
+adversary and the SR+CK column of the matrix).
+. Verisimiser's process integrity is assumed — a verisimiser binary
+that has been swapped for a malicious one can produce a hash chain
+that verifies against itself but attests to nothing real. Binary
+provenance is a separate concern (out of scope here).
+. SHA-256 is collision-resistant in the cryptographic sense for the
+lifetime of the audit window.
+
+== Open questions
+
+. Should `Option<String>` fields (`before_snapshot`, `transformation`)
+encode `None` vs `Some("")` distinctly? The current proposal collapses
+them: both encode as a `u64_le(0)` length prefix with no payload. If
+that stays, this document must state explicitly that the chain treats
+"no snapshot" and "empty snapshot" identically. If a future use case
+requires distinguishing them, a single sentinel byte (`0x00` for None,
+`0x01` for Some) prefixed inside the length-prefixed slot resolves it.
+. Should the chain include an explicit `chain_id` covering all of an
+entity's entries (in addition to chaining via `previous_hash`)? Cheap
+defence in depth against entity_id confusion; defer to ADR-0004.
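The first open question can be made concrete with two small encoders. Both function names are hypothetical. The tagged variant is one reading of "prefixed inside the length-prefixed slot": the sentinel byte is counted by the length prefix.

```rust
/// Current proposal: `None` and `Some("")` both become `u64_le(0)`.
pub fn encode_collapsed(field: Option<&str>) -> Vec<u8> {
    let bytes = field.unwrap_or("").as_bytes();
    let mut out = (bytes.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(bytes);
    out
}

/// Sentinel variant: the slot's payload starts with 0x00 (None) or 0x01 (Some).
pub fn encode_tagged(field: Option<&str>) -> Vec<u8> {
    let mut payload = Vec::new();
    match field {
        None => payload.push(0x00),
        Some(s) => {
            payload.push(0x01);
            payload.extend_from_slice(s.as_bytes());
        }
    }
    // The length prefix covers the sentinel byte plus the string bytes.
    let mut out = (payload.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(&payload);
    out
}
```

Under the collapsed form the two cases are byte-identical, so the chain cannot tell them apart. The tagged form costs one byte per optional field and changes every hash, so it would need a new domain tag or version if adopted later.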
+
+== Cross-references
+
+* V-L2-C1 — implements the field coverage + domain separation
+* V-L2-C2 — implements canonical timestamp encoding
+* V-L2-C3 — positive tamper-detection tests
+* V-L2-C4 — removes the wontfix test that codified the C-1 gap
+* V-L2-L1 — per-entity write-path serialisation
+* V-L2-L2 — UNIQUE INDEX(entity_id, previous_hash) defence in depth
+* V-L2-N1 — deduplicates the type used here (ProvenanceEntry vs
+  ProvenanceRecord)
+* ADR-0004 (future) — records the binding choices made here