Skip to content

Commit 8f504e3

Browse files
g-talbotclaude
andauthored
test: Parquet merge pipeline verification suite (#6369)
* test: add proptest for planner maturity filtering + shared test helpers Property test verifies that no merge task ever contains a split that should have been filtered: mature by ops (>= max_merge_ops), mature by size (>= target), time-matured (created_at + maturation_period <= now), or missing a window. Generates random splits across the maturity boundary and tests at the actor level. Also makes test helpers pub(super) so sibling test modules (sketch, crash, multi-round) can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add sketch split integration test for Parquet merge pipeline Verifies that sketch splits dispatch to the correct metastore RPCs (stage_sketch_splits, publish_sketch_splits) and that the merged output has ParquetSplitKind::Sketches with correct metadata. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add crash/restart and multi-round merge integration tests Crash/restart test: - Injects publish failure on 2nd call → pipeline detects failure, kills actors, respawns - Verifies list_metrics_splits called on respawn (re-seeding) - Verifies pipeline generation >= 2 (respawn occurred) - Verifies original splits eventually replaced (no data loss) Multi-round merge test: - 4 input splits → 2 round-1 merges → 1 round-2 merge - Verifies num_merge_ops=2 on final output - Verifies all original + intermediate splits replaced - Verifies MC-1: total rows preserved across lifecycle Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * spec: TLA+ specification for merge pipeline shutdown drain protocol Models the two-phase shutdown: DisconnectMergePlanner breaks feedback, RunFinalizeMergePolicyAndQuit drains cold windows, then in-flight merges complete. Safety invariants: - NoSplitLoss: every merge input is published or in-flight - NoDuplicateMerge: no split in multiple concurrent merges - FinalizeWithinBound: at most MaxFinalizeOps finalize operations - ShutdownOnlyWhenDrained: shutdown requires empty in-flight set Liveness: - ShutdownEventuallyCompletes under weak fairness Two configs: _small (hundreds of states) and full (larger space). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test: add Stateright DST model for merge pipeline with crash/restart Model checks 5 properties across ingest, merge, crash, and restart: - MC-1 lifecycle: total rows preserved (no loss/duplication) - Bounded WA: merge_ops never exceeds max_merge_ops - No duplicate merge: no split in multiple concurrent merges - No orphan after restart: all immature splits re-seeded - MP-1: level homogeneity (by construction) Small model (~instant), full model (~seconds, gated behind #[ignore]). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: nightly rustfmt import ordering across all new test files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: update TLA+ CLAUDE.md with macOS setup and spec catalog - Document the actual tla2tools.jar path from brew cask tla+-toolbox - Add "Run All Specs" one-liner for quick verification - Add spec catalog with state counts and invariant names - Remove stale references to brew formula (doesn't exist) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: replace match with `?` operator in CompleteMerge handler CI clippy lint (`clippy::question-mark`, -D warnings) rejected the explicit match-and-return-None pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * spec: extend MergePipelineShutdown to cover crash/restart and multi-lifetime Phase A of the verification-gap closure (PR #6369). The original spec only modeled orderly two-phase shutdown — no Crash action, no Restart, no row-conservation claim. The four invariants were either trivially true by construction or static set-membership, so the spec proved nothing about the failure modes that motivated it. This rewrite: - Adds Crash and Restart actions. Crash invalidates in-memory state at any point in the pipeline; uploaded-but-unpublished merge outputs become orphan_outputs. Restart re-seeds cold_windows from durable published_splits (models fetch_immature_splits). - Splits CompleteMerge into UploadMergeOutput then PublishMergeAndFeedback so a Crash between the two phases is reachable. This is what catches the "leaked but invisible" failure mode — output blob in S3 with no metastore reference. The model proves these are object-store-only garbage (LeakIsObjectStoreOnly), not durable data loss. - Adds split_rows / split_merge_ops / split_window ghost functions so RowsConserved expresses the strong "no data loss, no duplication" property: total ingested rows equal sum of rows in published_splits in every reachable state. - Adds NoOrphanWhenConnected (state invariant), RestartReSeedsAllImmature (action property) and NoPersistentOrphan (liveness leads-to). Together these capture: while connected the planner sees every immature split; every Restart correctly recovers all immature splits; in any run with restart budget remaining, orphans always eventually clear. - Extends Restart to fire after graceful shutdown too (multi-lifetime), so the cross-process recovery claim is explicit. Bounded by MaxRestarts. The replaced NoOrphanAfterRestart invariant fired during initial TLC runs — the failing trace was a *legitimate* state in production (publish during shutdown disconnect leaves the output untracked by the planner until the next process invocation). The fix wasn't to remove the invariant but to split the safety claim from the recovery claim: NoOrphanWhenConnected for the steady state, RestartReSeedsAllImmature for the recovery transition, NoPersistentOrphan for cross-lifetime liveness. Recorded as a sesh-mode rule: never silently weaken an invariant that produced a counter-example. Single config drops the small/full split — full config now runs in ~12s on a workstation (217,854 distinct states, 11 invariants + 3 temporal properties verified). The states/ directory dropped by TLC is now gitignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * spec: add merge-chain config alongside multi-lifetime primary The primary `MergePipelineShutdown.cfg` is now optimized for multi-lifetime exercise (MaxIngests=3, MaxRestarts=2) — 15,732 states in ~1s — covering the cross-process recovery claim added in the previous commit. The new `MergePipelineShutdown_chains.cfg` keeps MaxIngests=4 with MaxRestarts=1 for deeper merge-chain coverage (217,854 states, ~10s): exercises level-0 → level-1 → level-2-mature compaction so BoundedWriteAmp is checked across the full chain and concurrent in-flight merges interact. Combined run is ~11s. Both share the same invariant + property set; only the constants differ. The TLA+ run-all loop in CLAUDE.md now globs every .cfg and resolves the matching .tla, supporting any number of configs per spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: share state and predicates between Stateright model and invariants Phase B of the verification-gap closure (PR #6369). Eliminates drift between the merge-pipeline model and any production-side checks by making them evaluate the *same* Rust functions on the *same* state. quickwit-dst/src/invariants/merge_pipeline.rs (new): - MergePipelineState struct mirroring the TLA+ VARIABLES block (planner_alive, in_flight_merges, cold_windows, published_splits, splits, orphan_outputs, lifecycle counters, total_ingested_rows ghost) - Pure-function predicates corresponding 1:1 to TLA+ invariants: rows_conserved, bounded_write_amp, no_split_loss, no_duplicate_merge, no_orphan_in_planner, no_orphan_when_connected, leak_is_object_store_only, mp1_level_homogeneity, restart_re_seeds_all_immature - Helper orphan_set mirrors the TLA+ OrphanSet operator - 17 unit tests covering passing/failing paths for each predicate quickwit-dst/src/invariants/registry.rs: - Adds MP4..MP11 invariant IDs (preserved short-code naming convention) - Each maps to a TLA+ invariant from MergePipelineShutdown.tla quickwit-dst/src/models/merge_pipeline.rs: - State type is now invariants::merge_pipeline::MergePipelineState (literal sharing — no parallel definition, no conversion layer) - CompleteMerge split into UploadMergeOutput + PublishMergeAndFeedback to match the TLA+ multi-phase merge completion. A Crash between phases orphans the upload (orphan_outputs) without losing data. - Restart re-seeds cold_windows from durable published_splits (mirrors fetch_immature_splits) and resets shutdown_complete / finalize state, modelling a fresh process invocation. Bounded by max_restarts. - DrainComplete added so graceful-shutdown is a distinct terminal action (matching TLA+). - properties() calls the shared predicate functions instead of inline closures — model and runtime checks evaluate identical Rust code. - Three configs: small (fast iteration), multi_lifetime (3 lifetimes, matches MergePipelineShutdown.cfg), deep_chains (level 0→1→2 mature, matches MergePipelineShutdown_chains.cfg). - Restart's next_state debug_asserts MP-11 (restart_re_seeds_all_immature) on its post-state, encoding the action property as a runtime check. All 50 tests pass under `cargo nextest run -p quickwit-dst --features model-checking` (33 invariant unit tests + 4 merge-pipeline model configs + 13 other models). Multi-lifetime checks in 0.1s, deep-chains in 0.6s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: wire merge-pipeline invariant checks into production code Phase C of the verification-gap closure (PR #6369). Production code now evaluates the *same* invariant predicates that the Stateright model and TLA+ spec verify, with results forwarded to the existing DogStatsD recorder (already wired in quickwit-cli/src/logger.rs). Every check emits `pomsky.invariant.checked{invariant=MP-X}` and, on failure, `pomsky.invariant.violated{invariant=MP-X}` — usable directly as Datadog alert criteria. ParquetMergeExecutor (post-merge, pre-uploader): - MP-2 (HasMinimumSplits): merge has >= 2 inputs - MP-1 (LevelHomogeneity): all inputs share num_merge_ops level - MP-3 (ScopeCompatibility): all inputs share sort_fields and window - MP-4 (RowsConserved): sum of input num_rows == sum of output num_rows. Empty-output path checks input rows are all zero (otherwise data was silently dropped). The non-empty path checks the strong row-preservation property — the same MC-1 / RowsConserved invariant from the TLA+ spec. ParquetMergePlanner::new (post-restart re-seed): - MP-11 (RestartReSeedsAllImmature): every split the merge policy classifies as still-immature (and which has a window for compaction) is recorded in scoped_young_splits and known_split_ids. Filters out mature splits, time-expired splits, and pre-Phase-31 splits before comparing — these are intentional drops, not the failure mode MP-11 protects against. Cargo.toml: quickwit-indexing now depends on quickwit-dst for the shared invariant module and check_invariant! macro. Also fixes a pre-existing rustdoc warning: the parquet_merge_planner docs referenced `[`MergePlanner`]` as an intra-doc link, but the type lives in another crate and can't resolve. Changed to plain backticks. All 52 metrics_pipeline tests pass; clippy + nightly fmt + machete + cargo doc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: trace conformance — replay production events through the formal model Phase D of the verification-gap closure (PR #6369). The TLA+ spec and Stateright model are now backed by *production code that actually emits events at every modeled transition*, and a test replays those events through the same predicates the model verifies. A divergence between production and the model surfaces here as a predicate failure on real production state. This closes the loop: TLA+ → Stateright (Phase A+B) → production checks (Phase C) → trace conformance (Phase D). quickwit-dst/src/events/merge_pipeline.rs (new): - MergePipelineEvent enum with one variant per modeled action, using production types (String split IDs, Range<i64> windows). Notably no Crash variant — process death cannot emit anything; crashes are inferred during replay by the absence of events between actions and a subsequent Restart. - Pluggable observer (fn pointer) following the existing set_invariant_recorder pattern. No-op single-atomic-load when no observer is installed; production hot path stays cheap. Production emission sites (always-on, no feature flag): - ParquetMergePipeline::spawn_pipeline → Restart, after fetch_immature_splits - ParquetMergePipeline::FinishPendingMergesAndShutdownPipeline → DisconnectMergePlanner and RunFinalizeAndQuit - ParquetMergePlanner::send_merge_ops → PlanMerge - ParquetUploader::handle (post-upload, pre-publish) → UploadMergeOutput for merge outputs - Publisher::handle (Parquet publish path) → IngestSplit (when no replaced_split_ids) or PublishMergeAndFeedback (when replacing) quickwit-indexing/.../parquet_merge_pipeline_trace_conformance_test.rs (new): - Two scenarios: 1. Normal happy path: 4 splits → 2 merges → 1 mature output. Replay verifies MP-1, MP-4..MP-10 hold at every state. ~0.6s. 2. Crash mid-cascade: publish failure injected on the 2nd merge forces respawn. Trace covers Restart → PlanMerge → Upload → crash → Restart → re-seed → PlanMerge → Upload → Publish. ~1.1s. - StateMirror reconstructs MergePipelineState from events using a production-id → model-u32 interner, applies the same predicate functions used by the Stateright model. - The mock metastore tracks `staged` and `published` separately so list_metrics_splits returns only what real metastores would (not staged-but-failed). Earlier test-side bug here would have masked divergences; the trace test caught it immediately. What this catches that prior tests do not: - The crash test only verified re-seeding *happened*. The trace test verifies that the post-restart state preserves all formal safety invariants (no orphans the planner can never reclaim, no row drift, no level mixing on re-merge, etc.). - A divergence between production order/atomicity and the model's atomic actions surfaces as either a state-mirror error ("event can't be applied") or a predicate failure with the offending event in the panic message. Both tests pass with the current production code: production behavior matches the TLA+ spec for the scenarios exercised. The test infrastructure is now ready to surface real divergences if they exist, and to be extended with adversarial scenarios (multiple back-to-back crashes, finalize-during-crash, concurrency stress). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: add TODO for CS-3 invariant when compaction_start_time lands CS-3 ("splits before compaction_start_time are never compacted") is defined in the TLA+ model and the InvariantId registry, but the production `ParquetMergePolicyConfig` does not yet expose a `compaction_start_time` field. Adding a runtime check now would have nothing to filter against. Records the gap as a planner-level TODO so the check lands alongside the feature when it's implemented (filter splits in `record_splits_if_necessary` + verify via `check_invariant!`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: rename metrics_pipeline → parquet_pipeline, update CODEOWNERS The actors under quickwit-indexing/src/actors/metrics_pipeline/ are the parquet-based ingest+compaction pipeline. They currently process metrics, but logs and traces will move onto the same pipeline in later phases — the metrics-specific name has become misleading. Pure rename: - git mv metrics_pipeline → parquet_pipeline - update mod declaration + import paths (metrics_pipeline:: → parquet_pipeline::) - update CODEOWNERS path Function names that incidentally contain "metrics_pipeline" (spawn_metrics_pipeline, spawn_log_or_metrics_pipeline, test_metrics_pipeline_e2e) are NOT renamed in this commit — they describe pipeline kind ("the metrics-flavored pipeline") rather than the module, and would benefit from their own pass that also addresses the eventual log/trace generalization. CODEOWNERS also adds /docs/internals/ to the byoc-metrics path list: the architecture and verification docs there are domain-aligned with quickwit-parquet-engine + quickwit-dst (already in the list) and should be approvable by the same team. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(sesh-mode): generalize don't-weaken rule to user requirements The "STOP, Don't Weaken" section already covered formal verification specs (TLA+ invariants, Stateright properties, DST assertions). It didn't explicitly cover the spec→plan→implementation translation step, which is where silent weakening tends to slip in for English-language user requirements. Restructure the section under one umbrella ("Specs are Load-Bearing") with two subsections: - "Verification check fails" — original content verbatim - "User requirement seems hard or out-of-scope" — five concrete silent-weakening failure modes (granularity downgrade, constraint dropping, strength reduction, MUST→SHOULD, scope reframing), the protocol that mirrors the verification case, and the plan-approval-is-not-spec-approval meta-trap Lesson surfaced from PR-4: the original plan I wrote substituted column-chunk granularity for the page-level streaming the user had asked for, and got plan-document approval rather than spec approval. The new section flags that failure mode by name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(fmt): re-sort actors/mod.rs entries after parquet_pipeline rename The metrics_pipeline → parquet_pipeline rename moved the module name past `packager` in the alphabetical ordering rustfmt's imports_granularity pass enforces. Swap the two entries (both the `mod` declarations and the `pub use` re-exports) to restore order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(codeowners): open docs/internals/ and Cargo.lock to any-write-access Last-matching-rule wins. A path line with no team after it removes the codeowner requirement, so any user with write access can approve PRs that only touch those paths. - `/docs/internals/`: shared architecture + verification docs that any team working on the codebase may need to update. Gating on a single team blocks unrelated work. - `/quickwit/Cargo.lock`: churns on routine dependency bumps and doesn't carry domain-specific review value. Per CODEOWNERS semantics, /docs/internals/ no longer falls under byoc-metrics (which the previous commit had assigned it to). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(codeowners): docs/internals and Cargo.lock approvable by either team Per the GitHub docs, multiple owners on the same line means approval from any ONE of them satisfies the codeowner requirement. Listing both `quickwit-core` and `byoc-metrics` on `/docs/internals/` and `/quickwit/Cargo.lock` lets either team approve PRs scoped to those paths. Replaces the previous commit's "no required owner" formulation, which would have allowed any user with write access. The intent is to keep the review gate but distribute it across both teams that work in this repo, not to remove the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a3c79c9 commit 8f504e3

41 files changed

Lines changed: 4185 additions & 62 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.claude/skills/sesh-mode/SKILL.md

Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,88 @@ Stateright DST Tests Production Metrics
4444
(exhaustive) (simulation) (Observability)
4545
```
4646

47+
## STOP, Don't Weaken — Specs are Load-Bearing
48+
49+
Whether the load-bearing claim is a formal property (TLA+ invariant,
50+
Stateright property, DST assertion) or an English-language user
51+
requirement, the rule is the same: **MUST NOT** silently weaken it.
52+
First action is diagnosis and an explicit conversation with the user,
53+
not modification.
54+
55+
### Verification check fails
56+
57+
When a TLA+ invariant, Stateright property, or DST assertion fails, the
58+
failure has exactly two possible causes:
59+
60+
1. **The implementation/model has a real bug.** Fix the bug.
61+
2. **The property is over-strong** — it asserts something the design does not
62+
actually guarantee. *Sometimes* the right answer is to weaken or replace
63+
the property — but the failing trace was probably also revealing the
64+
real, weaker safety property the design *does* guarantee. Almost never
65+
should the answer be just "remove the property."
66+
67+
**Required protocol**:
68+
- Read the failing trace. State out loud what the property meant to claim,
69+
and what the trace shows.
70+
- If unsure whether (1) or (2) applies — **stop and ask the user**.
71+
Do not silently rewrite the property.
72+
- If (2) applies and you propose to weaken/replace, present the candidate
73+
replacement *and* the safety property the original was reaching for, then
74+
ask before changing. The replacement should usually preserve the spirit
75+
via a different formulation (action property, liveness, narrower precondition)
76+
— not just delete the constraint.
77+
78+
**Forbidden** without explicit user approval:
79+
- Renaming an invariant to make its negation trivially true
80+
- Deleting an invariant that just produced a counter-example
81+
- Adding `=> TRUE` or other no-op weakenings
82+
- Changing `\A` to `\E`, `[]` to `<>`, or similar quantifier flips, when the
83+
motive is to suppress a violation rather than to capture a different claim
84+
85+
### User requirement seems hard or out-of-scope
86+
87+
The same rule applies when translating from the user's spec into a plan,
88+
or from the plan into implementation. If a stated requirement looks hard,
89+
expensive, or "more than this PR needs," **stop and ask first** — do not
90+
decide unilaterally that "close enough" is acceptable.
91+
92+
Silent-weakening moves to watch for:
93+
- **Granularity downgrade**: user asked for page-by-page streaming;
94+
plan says "column-chunk granularity is the natural unit." The memory
95+
bound the user actually wanted is gone.
96+
- **Constraint dropping**: user said "must not OOM under load"; plan
97+
treats it as "bounded in the typical case." The qualifier was doing
98+
load-bearing work and you removed it.
99+
- **Strength reduction**: user said "byte-identical metadata"; plan
100+
says "logically equivalent metadata." Test passes; spec is gone.
101+
- **MUST → SHOULD**: user said "one GET per file"; plan allows "two in
102+
the rare retry case." May be reasonable, but only the user can
103+
authorize it.
104+
- **Reframing as out-of-scope**: user described a property as part of
105+
the goal; plan declares it a follow-up PR. If the user didn't say
106+
"split it," you don't get to.
107+
108+
**Required protocol** mirrors the verification case: quote the user's
109+
original phrasing back to yourself; if you cannot point at the source
110+
phrase that authorizes the weaker version, you are not authorized to
111+
ship it. Surface the gap explicitly: *"you asked for X; my plan does
112+
Y; here's why I considered Y; ok to weaken, or do you want X?"* The
113+
default answer is X.
114+
115+
**Plan-document approval is not spec approval.** When the plan you write
116+
diverges from the spec at write time, the user reviews the diverged
117+
plan, not the original requirement. Accountability for the divergence
118+
is on you — they approved what you wrote, not what they asked for.
119+
Re-read the user's original message before writing code.
120+
121+
### Why this matters
122+
123+
Specs — formal or English — describe the system's promise to its
124+
users. When a check fails or a requirement seems too strict, that has
125+
just told you either that the implementation is wrong, or that you've
126+
been claiming a stronger promise than you actually keep. Both deserve
127+
a conscious decision, never a silent edit.
128+
47129
## Testing Through Production Path
48130

49131
**MUST NOT** claim a feature works unless tested through the actual network stack.

.github/CODEOWNERS

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,7 @@
11
# CODEOWNERS — see https://docs.github.com/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
22
#
3-
# Last matching rule per file wins. Approval from any listed owner satisfies
4-
# the requirement. quickwit-dev is listed on every rule so it can always
5-
# approve; an additional team is listed on the metrics Parquet pipeline paths
6-
# so PRs scoped to those paths can be approved by either team.
3+
# Last matching rule per file wins. Multiple owners listed on the same line
4+
# means approval from ANY ONE of them satisfies the requirement.
75

86
# Default: quickwit-core owns everything
97
* @quickwit-oss/quickwit-core
@@ -13,4 +11,11 @@
1311
/quickwit/quickwit-datafusion/ @quickwit-oss/byoc-metrics
1412
/quickwit/quickwit-df-core/ @quickwit-oss/byoc-metrics
1513
/quickwit/quickwit-dst/ @quickwit-oss/byoc-metrics
16-
/quickwit/quickwit-indexing/src/actors/metrics_pipeline/ @quickwit-oss/byoc-metrics
14+
/quickwit/quickwit-indexing/src/actors/parquet_pipeline/ @quickwit-oss/byoc-metrics
15+
16+
# Shared paths — either team can approve. `docs/internals/` is shared
17+
# architecture + verification docs that any team working on the codebase
18+
# may need to update. `Cargo.lock` churns on routine dependency bumps
19+
# and doesn't carry domain-specific review value.
20+
/docs/internals/ @quickwit-oss/quickwit-core @quickwit-oss/byoc-metrics
21+
/quickwit/Cargo.lock @quickwit-oss/quickwit-core @quickwit-oss/byoc-metrics

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,3 +32,6 @@ qwdata
3232

3333
# GSD planning artifacts (local only)
3434
.planning
35+
36+
# TLC model checker working directory (state dumps from `tla2tools.jar`)
37+
states/

docs/internals/specs/tla/CLAUDE.md

Lines changed: 64 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -4,61 +4,91 @@ This directory contains formal TLA+ specifications for Quickwit protocols.
44

55
## Setup (One-Time)
66

7-
Install TLA+ tools to a standard location (NOT in this directory):
7+
Install TLA+ Toolbox via Homebrew cask:
88

99
```bash
10-
# Option 1: Homebrew (recommended on macOS)
11-
brew install tlaplus
10+
brew install --cask tla+-toolbox
11+
```
1212

13-
# Option 2: Download to ~/.local/lib
14-
mkdir -p ~/.local/lib
15-
curl -L -o ~/.local/lib/tla2tools.jar \
16-
"https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar"
13+
This installs the TLA+ Toolbox app to `/Applications/TLA+ Toolbox.app`,
14+
which bundles `tla2tools.jar` at:
1715

18-
# Option 3: VS Code extension (for interactive use)
19-
# Install: alygin.vscode-tlaplus
2016
```
17+
/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar
18+
```
19+
20+
No separate `tlc` CLI is installed — run TLC via `java -jar`.
2121

2222
## Running Model Checker
2323

24-
From the repository root:
24+
All commands run from the **repository root** (`quickwit-oss/quickwit/`):
2525

2626
```bash
27-
# If installed via Homebrew:
28-
tlc -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
27+
# Quick check (small config, ~seconds):
28+
java -XX:+UseParallelGC -Xmx4g \
29+
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
30+
-workers 4 \
31+
-config docs/internals/specs/tla/SomeSpec_small.cfg \
32+
docs/internals/specs/tla/SomeSpec.tla
2933

30-
# If using downloaded jar:
31-
java -XX:+UseParallelGC -Xmx4g -jar ~/.local/lib/tla2tools.jar \
34+
# Full check (larger state space, ~seconds to minutes):
35+
java -XX:+UseParallelGC -Xmx4g \
36+
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
3237
-workers 4 \
33-
-config docs/internals/specs/tla/ExampleProtocol.cfg \
34-
docs/internals/specs/tla/ExampleProtocol.tla
38+
-config docs/internals/specs/tla/SomeSpec.cfg \
39+
docs/internals/specs/tla/SomeSpec.tla
3540
```
3641

37-
### Quick vs Full Verification
38-
39-
Some specs have `_small.cfg` variants for faster iteration:
42+
### Run All Specs
4043

4144
```bash
42-
# Quick (~1s, hundreds of states)
43-
tlc -config docs/internals/specs/tla/ExampleProtocol_small.cfg docs/internals/specs/tla/ExampleProtocol.tla
44-
45-
# Full (~minutes, millions of states)
46-
tlc -workers 4 -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
45+
cd /path/to/quickwit-oss/quickwit
46+
47+
# Iterate every .cfg next to a .tla and run TLC on each pair. Some specs
48+
# ship multiple configs (e.g. one for fast iteration, one for thorough
49+
# coverage, one focused on a specific behavior); this loop handles all.
50+
for cfg in docs/internals/specs/tla/*.cfg; do
51+
base="$(basename "${cfg%.cfg}")"
52+
# Try <base>.tla as-is; if absent, strip a trailing _suffix
53+
# (e.g. MergePipelineShutdown_chains.cfg -> MergePipelineShutdown.tla).
54+
if [ -f "docs/internals/specs/tla/${base}.tla" ]; then
55+
tla="docs/internals/specs/tla/${base}.tla"
56+
else
57+
tla="docs/internals/specs/tla/${base%_*}.tla"
58+
fi
59+
echo "=== $(basename "$tla") with $(basename "$cfg") ==="
60+
java -XX:+UseParallelGC -Xmx4g \
61+
-jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
62+
-workers 4 -config "$cfg" "$tla"
63+
echo
64+
done
4765
```
4866

49-
## Available Specifications
67+
### Multiple Configs per Spec
5068

51-
| Spec | Config | Purpose |
52-
|------|--------|---------|
69+
A spec may have several `.cfg` files exploring different bounds:
70+
- `<Spec>.cfg` — primary configuration, balanced for routine verification
71+
- `<Spec>_<focus>.cfg` — alternate configurations targeting specific
72+
behaviors (e.g. `_chains` for deep merge chains, `_small` for fastest
73+
iteration). Each `.cfg` runs against the *same* `.tla` source.
5374

54-
*No specifications yet. Specs will be created as protocols are designed.*
75+
## Available Specifications
76+
77+
| Spec | Small States | Full States | Invariants |
78+
|------|-------------|-------------|------------|
79+
| ParquetDataModel | 8 | millions | DM-1..DM-5 |
80+
| SortSchema | 49,490 | millions | SS-1..SS-5 |
81+
| TimeWindowedCompaction | 938 | thousands | TW-1..3, CS-1..3, MC-1..4 |
82+
| MergePipelineShutdown (primary, multi-lifetime) || 15,732 (~1s) | RowsConserved, NoSplitLoss, NoDuplicateMerge, NoOrphanInPlanner, NoOrphanWhenConnected, LeakIsObjectStoreOnly, MP1\_LevelHomogeneity, BoundedWriteAmp, FinalizeWithinBound, ShutdownOnlyWhenDrained + RestartReSeedsAllImmature (action) + ShutdownEventuallyCompletes, NoPersistentOrphan (liveness) |
83+
| MergePipelineShutdown\_chains (deep merge chain) || 217,854 (~12s) | same invariant set; MaxIngests=4, MaxRestarts=1 |
5584

5685
## Creating New Specs
5786

5887
1. Create `NewProtocol.tla` with the specification
59-
2. Create `NewProtocol.cfg` with constants and properties to check
60-
3. Add entry to `README.md` mapping table
61-
4. Run model checker to verify
88+
2. Create `NewProtocol_small.cfg` (minimal constants for fast iteration)
89+
3. Create `NewProtocol.cfg` (larger constants for thorough checking)
90+
4. Update the table above
91+
5. Run both configs through TLC to verify
6292

6393
### Config File Template
6494

@@ -69,6 +99,8 @@ CONSTANTS
6999
Nodes = {n1, n2}
70100
MaxItems = 3
71101
102+
CHECK_DEADLOCK FALSE
103+
72104
SPECIFICATION Spec
73105
74106
INVARIANTS
@@ -81,7 +113,7 @@ PROPERTIES
81113

82114
## Cleanup
83115

84-
TLC generates trace files and state directories on errors. Clean them up with:
116+
TLC generates trace files and state directories on errors:
85117

86118
```bash
87119
rm -rf docs/internals/specs/tla/*_TTrace_*.tla docs/internals/specs/tla/*.bin docs/internals/specs/tla/states
Lines changed: 53 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,53 @@
1+
\* TLC Configuration for MergePipelineShutdown.tla — multi-lifetime focus.
2+
\*
3+
\* Optimized for exercising crash/restart and cross-process recovery: fewer
4+
\* ingests but more restarts. ~16K distinct states; finishes in ~1 second on
5+
\* a workstation.
6+
\*
7+
\* Coverage emphasis:
8+
\* - Three full process lifetimes (initial + 2 restarts)
9+
\* - Crash mid-flight (uploaded outputs become orphan_outputs)
10+
\* - Disconnect → Publish (output published but planner not informed)
11+
\* - Restart re-seeds cold_windows from durable published_splits
12+
\* - Graceful shutdown then Restart (fresh process post-shutdown)
13+
\* - NoPersistentOrphan liveness across multiple lifetimes
14+
\*
15+
\* For deeper merge-chain exercise (level 0→1→2 mature, BoundedWriteAmp on
16+
\* multi-step compaction), see MergePipelineShutdown_chains.cfg.
17+
\*
18+
\* Tuning notes:
19+
\* - MaxIngests dominates state space (~5x per increment).
20+
\* - MaxIngests=4 with MaxRestarts=2 inflates to ~3M states; liveness
21+
\* check then takes several minutes on the temporal-tableau pass.
22+
23+
CONSTANTS
24+
MaxMerges = 2
25+
MaxFinalizeOps = 1
26+
Windows = {w1, w2}
27+
MergeFactor = 2
28+
MaxIngests = 3
29+
MaxMergeOps = 2
30+
MaxCrashes = 1
31+
MaxRestarts = 2
32+
33+
CHECK_DEADLOCK FALSE
34+
35+
SPECIFICATION Spec
36+
37+
INVARIANTS
38+
TypeInvariant
39+
RowsConserved
40+
NoSplitLoss
41+
NoDuplicateMerge
42+
NoOrphanInPlanner
43+
NoOrphanWhenConnected
44+
LeakIsObjectStoreOnly
45+
MP1_LevelHomogeneity
46+
BoundedWriteAmp
47+
FinalizeWithinBound
48+
ShutdownOnlyWhenDrained
49+
50+
PROPERTIES
51+
RestartReSeedsAllImmature
52+
ShutdownEventuallyCompletes
53+
NoPersistentOrphan

0 commit comments

Comments
 (0)