quickwit-oss
diff --git a/.claude/skills/sesh-mode/SKILL.md b/.claude/skills/sesh-mode/SKILL.md
Lines changed: 82 additions & 0 deletions
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
Lines changed: 10 additions & 5 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/docs/internals/specs/tla/CLAUDE.md b/docs/internals/specs/tla/CLAUDE.md
Lines changed: 64 additions & 32 deletions
diff --git a/docs/internals/specs/tla/MergePipelineShutdown.cfg b/docs/internals/specs/tla/MergePipelineShutdown.cfg
Lines changed: 53 additions & 0 deletions
@@ -44,6 +44,88 @@ Stateright      DST Tests      Production Metrics
 (exhaustive)    (simulation)   (Observability)
 ```
 
+## STOP, Don't Weaken — Specs are Load-Bearing
+
+Whether the load-bearing claim is a formal property (TLA+ invariant,
+Stateright property, DST assertion) or an English-language user
+requirement, the rule is the same: **MUST NOT** silently weaken it.
+First action is diagnosis and an explicit conversation with the user,
+not modification.
+
+### Verification check fails
+
+When a TLA+ invariant, Stateright property, or DST assertion fails, the
+failure has exactly two possible causes:
+
+1. **The implementation/model has a real bug.** Fix the bug.
+2. **The property is over-strong** — it asserts something the design does not
+   actually guarantee. *Sometimes* the right answer is to weaken or replace
+   the property — but the failing trace was probably also revealing the
+   real, weaker safety property the design *does* guarantee. Almost never
+   should the answer be just "remove the property."
+
+**Required protocol**:
+- Read the failing trace. State out loud what the property meant to claim,
+  and what the trace shows.
+- If unsure whether (1) or (2) applies — **stop and ask the user**.
+  Do not silently rewrite the property.
+- If (2) applies and you propose to weaken/replace, present the candidate
+  replacement *and* the safety property the original was reaching for, then
+  ask before changing. The replacement should usually preserve the spirit
+  via a different formulation (action property, liveness, narrower precondition)
+  — not just delete the constraint.
+
+**Forbidden** without explicit user approval:
+- Rewriting an invariant so that it holds trivially
+- Deleting an invariant that just produced a counter-example
+- Adding `=> TRUE` or other no-op weakenings
+- Changing `\A` to `\E`, `[]` to `<>`, or similar quantifier/temporal-operator flips,
+  when the motive is to suppress a violation rather than to capture a different claim
+
+### User requirement seems hard or out-of-scope
+
+The same rule applies when translating from the user's spec into a plan,
+or from the plan into implementation. If a stated requirement looks hard,
+expensive, or "more than this PR needs," **stop and ask first** — do not
+decide unilaterally that "close enough" is acceptable.
+
+Silent-weakening moves to watch for:
+- **Granularity downgrade**: user asked for page-by-page streaming;
+  plan says "column-chunk granularity is the natural unit." The memory
+  bound the user actually wanted is gone.
+- **Constraint dropping**: user said "must not OOM under load"; plan
+  treats it as "bounded in the typical case." The qualifier was doing
+  load-bearing work and you removed it.
+- **Strength reduction**: user said "byte-identical metadata"; plan
+  says "logically equivalent metadata." Test passes; spec is gone.
+- **MUST → SHOULD**: user said "one GET per file"; plan allows "two in
+  the rare retry case." May be reasonable, but only the user can
+  authorize it.
+- **Reframing as out-of-scope**: user described a property as part of
+  the goal; plan declares it a follow-up PR. If the user didn't say
+  "split it," you don't get to.
+
+**Required protocol** mirrors the verification case: quote the user's
+original phrasing back to yourself; if you cannot point at the source
+phrase that authorizes the weaker version, you are not authorized to
+ship it. Surface the gap explicitly: *"you asked for X; my plan does
+Y; here's why I considered Y; ok to weaken, or do you want X?"* The
+default answer is X.
+
+**Plan-document approval is not spec approval.** When the plan you write
+diverges from the spec at write time, the user reviews the diverged
+plan, not the original requirement. Accountability for the divergence
+is on you — they approved what you wrote, not what they asked for.
+Re-read the user's original message before writing code.
+
+### Why this matters
+
+Specs — formal or English — describe the system's promise to its
+users. When a check fails or a requirement seems too strict, that tells
+you either that the implementation is wrong or that you've been
+claiming a stronger promise than you actually keep. Both deserve
+a conscious decision, never a silent edit.
+
 ## Testing Through Production Path
 
 **MUST NOT** claim a feature works unless tested through the actual network stack.
 
@@ -1,9 +1,7 @@
 # CODEOWNERS — see https://docs.github.com/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners
 #
-# Last matching rule per file wins. Approval from any listed owner satisfies
-# the requirement. quickwit-dev is listed on every rule so it can always
-# approve; an additional team is listed on the metrics Parquet pipeline paths
-# so PRs scoped to those paths can be approved by either team.
+# Last matching rule per file wins. When multiple owners are listed on the
+# same line, approval from ANY ONE of them satisfies the requirement.
 
 # Default: quickwit-core owns everything
 *                                                        @quickwit-oss/quickwit-core
@@ -13,4 +11,11 @@
 /quickwit/quickwit-datafusion/                           @quickwit-oss/byoc-metrics
 /quickwit/quickwit-df-core/                              @quickwit-oss/byoc-metrics
 /quickwit/quickwit-dst/                                  @quickwit-oss/byoc-metrics
-/quickwit/quickwit-indexing/src/actors/metrics_pipeline/ @quickwit-oss/byoc-metrics
+/quickwit/quickwit-indexing/src/actors/parquet_pipeline/ @quickwit-oss/byoc-metrics
+
+# Shared paths — either team can approve. `docs/internals/` is shared
+# architecture + verification docs that any team working on the codebase
+# may need to update. `Cargo.lock` churns on routine dependency bumps
+# and doesn't carry domain-specific review value.
+/docs/internals/                                         @quickwit-oss/quickwit-core @quickwit-oss/byoc-metrics
+/quickwit/Cargo.lock                                     @quickwit-oss/quickwit-core @quickwit-oss/byoc-metrics
@@ -32,3 +32,6 @@ qwdata
 
 # GSD planning artifacts (local only)
 .planning
+
+# TLC model checker working directory (state dumps from `tla2tools.jar`)
+states/
@@ -4,61 +4,91 @@ This directory contains formal TLA+ specifications for Quickwit protocols.
 
 ## Setup (One-Time)
 
-Install TLA+ tools to a standard location (NOT in this directory):
+Install TLA+ Toolbox via Homebrew cask:
 
 ```bash
-# Option 1: Homebrew (recommended on macOS)
-brew install tlaplus
+brew install --cask tla+-toolbox
+```
 
-# Option 2: Download to ~/.local/lib
-mkdir -p ~/.local/lib
-curl -L -o ~/.local/lib/tla2tools.jar \
-  "https://github.com/tlaplus/tlaplus/releases/download/v1.8.0/tla2tools.jar"
+This installs the TLA+ Toolbox app to `/Applications/TLA+ Toolbox.app`,
+which bundles `tla2tools.jar` at:
 
-# Option 3: VS Code extension (for interactive use)
-# Install: alygin.vscode-tlaplus
 ```
+/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar
+```
+
+No separate `tlc` CLI is installed — run TLC via `java -jar`.
 
 ## Running Model Checker
 
-From the repository root:
+All commands run from the **repository root** (`quickwit-oss/quickwit/`):
 
 ```bash
-# If installed via Homebrew:
-tlc -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
+# Quick check (small config, ~seconds):
+java -XX:+UseParallelGC -Xmx4g \
+  -jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
+  -workers 4 \
+  -config docs/internals/specs/tla/SomeSpec_small.cfg \
+  docs/internals/specs/tla/SomeSpec.tla
 
-# If using downloaded jar:
-java -XX:+UseParallelGC -Xmx4g -jar ~/.local/lib/tla2tools.jar \
+# Full check (larger state space, ~seconds to minutes):
+java -XX:+UseParallelGC -Xmx4g \
+  -jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
   -workers 4 \
-  -config docs/internals/specs/tla/ExampleProtocol.cfg \
-  docs/internals/specs/tla/ExampleProtocol.tla
+  -config docs/internals/specs/tla/SomeSpec.cfg \
+  docs/internals/specs/tla/SomeSpec.tla
 ```
 
-### Quick vs Full Verification
-
-Some specs have `_small.cfg` variants for faster iteration:
+### Run All Specs
 
 ```bash
-# Quick (~1s, hundreds of states)
-tlc -config docs/internals/specs/tla/ExampleProtocol_small.cfg docs/internals/specs/tla/ExampleProtocol.tla
-
-# Full (~minutes, millions of states)
-tlc -workers 4 -config docs/internals/specs/tla/ExampleProtocol.cfg docs/internals/specs/tla/ExampleProtocol.tla
+cd /path/to/quickwit-oss/quickwit
+
+# Iterate every .cfg next to a .tla and run TLC on each pair. Some specs
+# ship multiple configs (e.g. one for fast iteration, one for thorough
+# coverage, one focused on a specific behavior); this loop handles all.
+for cfg in docs/internals/specs/tla/*.cfg; do
+  base="$(basename "${cfg%.cfg}")"
+  # Try <base>.tla as-is; if absent, strip a trailing _suffix
+  # (e.g. MergePipelineShutdown_chains.cfg -> MergePipelineShutdown.tla).
+  if [ -f "docs/internals/specs/tla/${base}.tla" ]; then
+    tla="docs/internals/specs/tla/${base}.tla"
+  else
+    tla="docs/internals/specs/tla/${base%_*}.tla"
+  fi
+  echo "=== $(basename "$tla") with $(basename "$cfg") ==="
+  java -XX:+UseParallelGC -Xmx4g \
+    -jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
+    -workers 4 -config "$cfg" "$tla"
+  echo
+done
 ```
 
-## Available Specifications
+### Multiple Configs per Spec
 
-| Spec | Config | Purpose |
-|------|--------|---------|
+A spec may have several `.cfg` files exploring different bounds:
+- `<Spec>.cfg` — primary configuration, balanced for routine verification
+- `<Spec>_<focus>.cfg` — alternate configurations targeting specific
+  behaviors (e.g. `_chains` for deep merge chains, `_small` for fastest
+  iteration). Each `.cfg` runs against the *same* `.tla` source.
 
-*No specifications yet. Specs will be created as protocols are designed.*
+## Available Specifications
+
+| Spec | Small States | Full States | Invariants |
+|------|-------------|-------------|------------|
+| ParquetDataModel | 8 | millions | DM-1..DM-5 |
+| SortSchema | 49,490 | millions | SS-1..SS-5 |
+| TimeWindowedCompaction | 938 | thousands | TW-1..3, CS-1..3, MC-1..4 |
+| MergePipelineShutdown (primary, multi-lifetime) | — | 15,732 (~1s) | RowsConserved, NoSplitLoss, NoDuplicateMerge, NoOrphanInPlanner, NoOrphanWhenConnected, LeakIsObjectStoreOnly, MP1\_LevelHomogeneity, BoundedWriteAmp, FinalizeWithinBound, ShutdownOnlyWhenDrained + RestartReSeedsAllImmature (action) + ShutdownEventuallyCompletes, NoPersistentOrphan (liveness) |
+| MergePipelineShutdown\_chains (deep merge chain) | — | 217,854 (~12s) | same invariant set; MaxIngests=4, MaxRestarts=1 |
 
 ## Creating New Specs
 
 1. Create `NewProtocol.tla` with the specification
-2. Create `NewProtocol.cfg` with constants and properties to check
-3. Add entry to `README.md` mapping table
-4. Run model checker to verify
+2. Create `NewProtocol_small.cfg` (minimal constants for fast iteration)
+3. Create `NewProtocol.cfg` (larger constants for thorough checking)
+4. Update the table above
+5. Run both configs through TLC to verify
 
 ### Config File Template
 
@@ -69,6 +99,8 @@ CONSTANTS
     Nodes = {n1, n2}
     MaxItems = 3
 
+CHECK_DEADLOCK FALSE
+
 SPECIFICATION Spec
 
 INVARIANTS
@@ -81,7 +113,7 @@ PROPERTIES
 
 ## Cleanup
 
-TLC generates trace files and state directories on errors. Clean them up with:
+TLC generates trace files and state directories on errors:
 
 ```bash
 rm -rf docs/internals/specs/tla/*_TTrace_*.tla docs/internals/specs/tla/*.bin docs/internals/specs/tla/states
 
@@ -0,0 +1,53 @@
+\* TLC Configuration for MergePipelineShutdown.tla — multi-lifetime focus.
+\*
+\* Optimized for exercising crash/restart and cross-process recovery: fewer
+\* ingests but more restarts. ~16K distinct states; finishes in ~1 second on
+\* a workstation.
+\*
+\* Coverage emphasis:
+\*   - Three full process lifetimes (initial + 2 restarts)
+\*   - Crash mid-flight (uploaded outputs become orphan_outputs)
+\*   - Disconnect → Publish (output published but planner not informed)
+\*   - Restart re-seeds cold_windows from durable published_splits
+\*   - Graceful shutdown then Restart (fresh process post-shutdown)
+\*   - NoPersistentOrphan liveness across multiple lifetimes
+\*
+\* For deeper merge-chain exercise (level 0→1→2 mature, BoundedWriteAmp on
+\* multi-step compaction), see MergePipelineShutdown_chains.cfg.
+\*
+\* Tuning notes:
+\*   - MaxIngests dominates state space (~5x per increment).
+\*   - MaxIngests=4 with MaxRestarts=2 inflates to ~3M states; liveness
+\*     check then takes several minutes on the temporal-tableau pass.
+
+CONSTANTS
+    MaxMerges = 2
+    MaxFinalizeOps = 1
+    Windows = {w1, w2}
+    MergeFactor = 2
+    MaxIngests = 3
+    MaxMergeOps = 2
+    MaxCrashes = 1
+    MaxRestarts = 2
+
+CHECK_DEADLOCK FALSE
+
+SPECIFICATION Spec
+
+INVARIANTS
+    TypeInvariant
+    RowsConserved
+    NoSplitLoss
+    NoDuplicateMerge
+    NoOrphanInPlanner
+    NoOrphanWhenConnected
+    LeakIsObjectStoreOnly
+    MP1_LevelHomogeneity
+    BoundedWriteAmp
+    FinalizeWithinBound
+    ShutdownOnlyWhenDrained
+
+PROPERTIES
+    RestartReSeedsAllImmature
+    ShutdownEventuallyCompletes
+    NoPersistentOrphan
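
The config above pairs with `MergePipelineShutdown.tla`, and the `_chains` variant resolves to the same source through the suffix-stripping rule used by the loop in `docs/internals/specs/tla/CLAUDE.md`. A minimal sketch of that pairing rule as a standalone shell function follows. `spec_for_cfg` is a hypothetical helper name that this PR does not add, and the explicit error for a missing `.tla` file is an assumption layered on top of the original loop, which simply passes the path to TLC.

```bash
#!/bin/sh
# Hypothetical helper (not part of the PR): print the .tla file that a
# given TLC .cfg should be checked against.
# Rule, as in the CLAUDE.md loop: try <base>.tla first; otherwise strip one
# trailing _suffix (MergePipelineShutdown_chains.cfg -> MergePipelineShutdown.tla).
spec_for_cfg() {
  cfg="$1"
  dir="$(dirname "$cfg")"
  base="$(basename "${cfg%.cfg}")"
  if [ -f "$dir/$base.tla" ]; then
    printf '%s\n' "$dir/$base.tla"
  elif [ -f "$dir/${base%_*}.tla" ]; then
    printf '%s\n' "$dir/${base%_*}.tla"
  else
    # Assumption: fail loudly here instead of letting TLC report the bad path.
    echo "no .tla found for $cfg" >&2
    return 1
  fi
}

# Possible usage from the repository root (jar path as given in CLAUDE.md):
#   cfg=docs/internals/specs/tla/MergePipelineShutdown.cfg
#   tla="$(spec_for_cfg "$cfg")" && java -XX:+UseParallelGC -Xmx4g \
#     -jar "/Applications/TLA+ Toolbox.app/Contents/Eclipse/tla2tools.jar" \
#     -workers 4 -config "$cfg" "$tla"
```

Resolving the pair before invoking `java` keeps a typo in a config name from surfacing only as a TLC parse error several lines into the output.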