2 changes: 1 addition & 1 deletion docs/architecture/TOPOLOGY.md
@@ -12,7 +12,7 @@ verisimiser/
│ ├── src/manifest/ — TOML manifest parsing (verisimiser.toml)
│ ├── src/tier1/ — Tier 1 piggyback data types
│ │ ├── drift.rs — DriftReport, DriftCategory (8 categories)
│ │ ├── provenance.rs — ProvenanceRecord, SHA-256 hash chain
│ │ ├── provenance.rs — re-exports abi::ProvenanceEntry; future write-path helpers (V-L1-C1)
│ │ └── temporal.rs — TemporalVersion, point-in-time snapshots
│ ├── src/tier2/ — Tier 2 overlay stubs (graph, vector, tensor, semantic, document, spatial)
│ ├── src/intercept/ — Per-backend interception strategies
202 changes: 202 additions & 0 deletions docs/theory/provenance-threat-model.adoc
@@ -0,0 +1,202 @@
// SPDX-License-Identifier: PMPL-1.0-or-later
// Copyright (c) 2026 Jonathan D.A. Jewell (hyperpolymath) <j.d.a.jewell@open.ac.uk>
= Provenance threat model
:toc: left
:toclevels: 3
:icons: font

This document fixes what the Provenance concern's hash chain proves and
what it doesn't. It binds the design choices made in V-L2-C1, V-L2-C2,
V-L2-L1, V-L2-L2, and the ADR-0004 follow-up.

Resolves: https://github.com/hyperpolymath/verisimiser/issues/25[V-L1-B1].

== Scope

In scope:: the `Provenance` octad concern as implemented by
`ProvenanceEntry` in `src/abi/mod.rs` plus the sidecar table
`verisimdb_provenance_log` plus (post V-L2-L1) the `chain_head` table.

Out of scope:: denial-of-service against the sidecar; side-channels
(timing, cache); tampering of the target database itself
(verisimiser only sees what its interceptors intercept); retroactive
provenance for pre-existing rows (the genesis entry for an entity
attests its existence at the moment verisimiser started observing it,
not before); cryptographic compromise of SHA-256.

== Adversaries

Four adversaries cover the relevant capability axes. Each is a
*lattice point*; real attackers combine capabilities.

[cols="1,3"]
|===
| Tag | Capability

| **R** | Read-only — can read both the target database and the
sidecar. No write to either. Models: a forensic auditor;
a leaked replica; a debugging copy on a laptop.

| **SW** | Sidecar-Write — can append new rows to
`verisimdb_provenance_log` and `verisimdb_temporal_versions` but
**cannot delete or rewrite existing rows**. Models: a sidecar
configured append-only (filesystem-level WORM, S3 Object Lock,
SQLite + revoked-DELETE/UPDATE permissions); also models a buggy
verisimiser daemon that double-writes.

| **SR** | Sidecar-Rewrite — can rewrite or delete arbitrary rows
in the sidecar. Models: root on the sidecar host; compromised
application credential with full sidecar privileges; a backup
operator restoring an older sidecar snapshot.

| **CK** | Clock-skew — can write entries (via SW or SR) with
timestamps that lie. Models: a system clock that drifts; an
adversary who controls the clock source; coordinated backdating.
|===

== Per-adversary protection matrix

For each adversary, what the chain proves about each field:
**P** = protected (tampering detected),
**N** = not protected,
**C** = conditionally protected (see note).

[cols="2,1,1,1,1"]
|===
| Field | R | SW | SR | SR+CK

| Genesis existence / order | P | P | N | N
| `previous_hash` of any entry | P | P | C-1 | C-1
| `entity_id` of any entry | P | P | C-1 | C-1
| `operation` of any entry | P | P | C-1 | C-1
| `actor` of any entry | P | P | C-1 | C-1
| `timestamp` of any entry | P | P | C-1 | N (CK falsifies)
| `before_snapshot` of any entry | P | P | C-1 | C-1
| `transformation` of any entry | P | P | C-1 | C-1
| Absence of an entry | C-2 | C-2 | N | N
| Total ordering across entities | N | N | N | N
|===

**C-1** — under SR (or SR+CK), the adversary can rewrite an
arbitrary suffix of the chain (recomputing hashes as they go). What's
preserved against SR is **only the prefix up to the most-recent
externally attested hash** (e.g. a hash periodically signed by an
out-of-band notary, anchored to an append-only log, or published to
a transparency service). Without an external anchor, the chain
proves *nothing* against SR.
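
The C-1 condition can be illustrated with a small sketch. It is not the verisimiser implementation: `link`, `build`, `verifies`, and `verifies_with_anchor` are hypothetical names, the genesis value `0` is an assumption, and the standard library's `DefaultHasher` stands in for SHA-256 only so that the example runs without external crates (it is not cryptographic).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy 64-bit link hash. NOT cryptographic: a stand-in for SHA-256.
fn link(previous: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    previous.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Recompute the chain from its payloads; returns one hash per entry.
fn build(payloads: &[&str]) -> Vec<u64> {
    let mut prev = 0u64; // assumed genesis value
    payloads
        .iter()
        .map(|p| {
            prev = link(prev, p);
            prev
        })
        .collect()
}

/// Internal check only: every stored hash matches its recomputation.
fn verifies(payloads: &[&str], hashes: &[u64]) -> bool {
    build(payloads).as_slice() == hashes
}

/// Internal check plus one externally attested hash at `index`.
fn verifies_with_anchor(
    payloads: &[&str],
    hashes: &[u64],
    index: usize,
    attested: u64,
) -> bool {
    verifies(payloads, hashes) && hashes.get(index) == Some(&attested)
}
```

An SR adversary who rewrites an entry and recomputes every later hash still passes `verifies`. Only `verifies_with_anchor` rejects the forgery, and only when the rewritten entry sits at or before the attested index; entries after the anchor remain rewritable.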

**C-2** — absence is provable only if every legitimate append goes
through verisimiser. Direct writes to the target database that
bypass interception are invisible to the chain; the chain cannot
attest to what it never saw.

== Field coverage requirement

A direct consequence of C-1 / C-2 and the per-adversary matrix:

[NOTE]
====
Every field that an auditor will rely on for forensic purposes
**must** participate in the hash. `actor`, `before_snapshot`, and
`transformation` are all such fields — they are the audit. If they
are not in the preimage, the chain protects them against R and SW
only by *coincidence* (because the row itself was hash-keyed in the
DB), not by design.

This document therefore *requires* V-L2-C1: the preimage must cover
`previous_hash`, `entity_id`, `operation`, `actor`, `timestamp`,
`before_snapshot`, `transformation`. Any future field added to
`ProvenanceEntry` must either be added to the preimage or
explicitly recorded here with a justification for its omission.
====

== Canonical encoding requirement

A direct consequence of "the hash protects the field" being a
*function*, not a relation:

[NOTE]
====
Two distinct preimages must produce distinct hashes (collision
resistance is SHA-256's job). Two logically equal entries must
serialise to the same preimage, and two distinct entries to distinct
preimages (canonicalisation is our job). The encoding must:

. Domain-separate verisimiser provenance hashes from any other
hash the system computes (`b"verisim-prov-v1\0"`).
. Length-prefix variable-length fields so concatenation is
unambiguous.
. Use a canonical timestamp encoding (V-L2-C2:
`i64_le(secs) || u32_le(nanos)`), not a string representation that
admits multiple valid forms for the same instant.
====
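
The three encoding rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the V-L2-C1 implementation: the `Entry` struct, `put_field`, and `preimage` are hypothetical names, the field order follows the list in the field coverage requirement, the `u64_le` length prefix is taken from the open questions section, and length-prefixing `previous_hash` is an assumption. The final SHA-256 step is omitted because the Rust standard library does not provide one; the digest would be SHA-256 over the returned bytes.

```rust
/// Hypothetical mirror of the fields the preimage must cover.
struct Entry<'a> {
    previous_hash: &'a [u8],
    entity_id: &'a str,
    operation: &'a str,
    actor: &'a str,
    secs: i64,
    nanos: u32,
    before_snapshot: Option<&'a str>,
    transformation: Option<&'a str>,
}

/// Domain-separation tag quoted in the requirement above.
const DOMAIN_TAG: &[u8] = b"verisim-prov-v1\0";

/// Append a variable-length field as u64_le(len) followed by the bytes.
fn put_field(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// Build the canonical preimage for one entry.
fn preimage(e: &Entry<'_>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(DOMAIN_TAG);
    put_field(&mut out, e.previous_hash);
    put_field(&mut out, e.entity_id.as_bytes());
    put_field(&mut out, e.operation.as_bytes());
    put_field(&mut out, e.actor.as_bytes());
    // Canonical timestamp: i64_le(secs) || u32_le(nanos), fixed width.
    out.extend_from_slice(&e.secs.to_le_bytes());
    out.extend_from_slice(&e.nanos.to_le_bytes());
    // None and Some("") collapse to a zero length, as currently proposed.
    put_field(&mut out, e.before_snapshot.unwrap_or("").as_bytes());
    put_field(&mut out, e.transformation.unwrap_or("").as_bytes());
    out
}
```

The length prefixes are what keep `entity_id = "ab", operation = "c"` from encoding to the same bytes as `entity_id = "a", operation = "bc"`; plain concatenation would make the two indistinguishable.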

== Append serialisation requirement

A direct consequence of "previous_hash chains entries linearly":

[NOTE]
====
Two writers cannot independently chain from the same `previous_hash`
without forking the chain. Verisimiser must serialise appends
per-entity. V-L2-L1 specifies the write-path lock; V-L2-L2 specifies
the database UNIQUE constraint that makes forks structurally
impossible even if the lock is bypassed.

The chain is *per-entity-serial* but *cross-entity-parallel*.
A global serial order across entities is *not* a requirement
(see "Total ordering" in the matrix above).
====
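
The per-entity serialisation described above can be sketched with an in-memory stand-in for the `chain_head` table. Everything here is illustrative: `ChainHeads`, `AppendError`, and the all-zero genesis hash are assumptions, and the real write path (V-L2-L1) and UNIQUE constraint (V-L2-L2) live in the database rather than in process memory.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Hash = [u8; 32];

/// Assumed `previous_hash` of an entity's first entry.
const GENESIS: Hash = [0u8; 32];

#[derive(Debug, PartialEq)]
enum AppendError {
    /// The caller chained from a hash that is not the entity's current head.
    StaleHead,
}

/// Hypothetical in-memory stand-in for the `chain_head` table.
struct ChainHeads {
    heads: Mutex<HashMap<String, Arc<Mutex<Hash>>>>,
}

impl ChainHeads {
    fn new() -> Self {
        ChainHeads { heads: Mutex::new(HashMap::new()) }
    }

    /// Accept the append only if `previous_hash` is the current head.
    fn append(
        &self,
        entity_id: &str,
        previous_hash: Hash,
        new_hash: Hash,
    ) -> Result<(), AppendError> {
        // The outer lock is held only long enough to find the entity's slot.
        let mut map = self.heads.lock().unwrap();
        let slot = map
            .entry(entity_id.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(GENESIS)))
            .clone();
        drop(map);

        // The per-entity lock serialises appends to this entity only.
        let mut head = slot.lock().unwrap();
        if *head != previous_hash {
            return Err(AppendError::StaleHead);
        }
        *head = new_hash;
        Ok(())
    }
}
```

Two writers that both chain from the same head cannot both succeed: the second sees a moved head and gets `StaleHead`, which plays the role the UNIQUE constraint plays in the database. Appends to different entities contend only on the brief map lookup.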

== Anchor / notary (future)

Protection against SR requires an *external anchor* that the
adversary cannot rewrite. Options, none of which this document
mandates yet:

. **Periodic notarisation** — every N minutes, sign the latest
`chain_head` with a key not held on the sidecar host, and publish
the signature to an out-of-band log.
. **Transparency log** — submit each `chain_head` update to an
external append-only log (Sigstore-style).
. **Replication to immutable storage** — write each new entry to
S3 Object Lock (or equivalent) as a defence in depth.

The threat model leaves the choice for ADR-0005 once a deployment
context exists.

== Out-of-band assumptions

. The sidecar host's clock is monotonic and within bounded skew of
real time. Without this, all timestamps are advisory (see CK in the
matrix).
. Verisimiser's process integrity is assumed — a verisimiser binary
that has been swapped for a malicious one can produce a hash chain
that verifies against itself but attests to nothing real. Binary
provenance is a separate concern (out of scope here).
. SHA-256 is collision-resistant in the cryptographic sense for the
lifetime of the audit window.

== Open questions

. Should `Option<String>` fields (`before_snapshot`, `transformation`)
encode `None` and `Some("")` distinctly? The current proposal collapses
them (both encode as a `u64_le(0)` length with no payload). If that
stands, the chain treats "no snapshot" and "empty snapshot" identically,
and this document must say so explicitly. If a future use case requires
distinguishing them, a single sentinel byte (`0x00` for `None`, `0x01`
for `Some`) placed at the start of the length-prefixed slot resolves it.
. Should the chain include an explicit `chain_id` covering all of an
entity's entries (in addition to chaining via `previous_hash`)? Cheap
defence in depth against `entity_id` confusion; defer to ADR-0004.
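
The first open question can be made concrete with a short sketch of both encodings. The function names `encode_collapsed` and `encode_tagged` are hypothetical, and the tagged layout is one reading of "a sentinel byte inside the length-prefixed slot", not a settled format.

```rust
/// Current proposal: None and Some("") both encode as u64_le(0).
fn encode_collapsed(field: Option<&str>) -> Vec<u8> {
    let bytes = field.unwrap_or("").as_bytes();
    let mut out = (bytes.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(bytes);
    out
}

/// Sentinel variant: a tag byte (0x00 = None, 0x01 = Some) leads the slot,
/// and the length prefix covers the tag plus the payload.
fn encode_tagged(field: Option<&str>) -> Vec<u8> {
    let mut slot = Vec::new();
    match field {
        None => slot.push(0x00),
        Some(s) => {
            slot.push(0x01);
            slot.extend_from_slice(s.as_bytes());
        }
    }
    let mut out = (slot.len() as u64).to_le_bytes().to_vec();
    out.extend_from_slice(&slot);
    out
}
```

Under the collapsed encoding `None` and `Some("")` yield identical bytes; under the tagged encoding they differ in the tag byte, at a cost of one byte per optional field.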

== Cross-references

* V-L2-C1 — implements the field coverage + domain separation
* V-L2-C2 — implements canonical timestamp encoding
* V-L2-C3 — positive tamper-detection tests
* V-L2-C4 — removes the wontfix test that codified the C-1 gap
* V-L2-L1 — per-entity write-path serialisation
* V-L2-L2 — UNIQUE INDEX(entity_id, previous_hash) defence in depth
* V-L2-N1 — deduplicates the type used here (ProvenanceEntry vs
ProvenanceRecord)
* ADR-0004 (future) — records the binding choices made here