From 8e9bc4e288ee33b02b8b148dcc10a8306ff24fb5 Mon Sep 17 00:00:00 2001 From: Leo Alt Date: Wed, 11 Feb 2026 14:03:58 +0100 Subject: [PATCH 01/10] Document the RWM pipeline passes in detail Add comprehensive markdown documentation for each pass in the read-write registers pipeline: liveness analysis, register allocation, flattening (including parallel copy sequencing), and jump removal. Also add a pipeline overview document linking them together. Co-Authored-By: Claude Opus 4.6 --- src/loader/rwm/FLATTENING.md | 146 ++++++++++++++++++++++++++ src/loader/rwm/JUMP_REMOVAL.md | 48 +++++++++ src/loader/rwm/LIVENESS_ANALYSIS.md | 84 +++++++++++++++ src/loader/rwm/PIPELINE.md | 81 ++++++++++++++ src/loader/rwm/REGISTER_ALLOCATION.md | 134 +++++++++++++++++++++++ 5 files changed, 493 insertions(+) create mode 100644 src/loader/rwm/FLATTENING.md create mode 100644 src/loader/rwm/JUMP_REMOVAL.md create mode 100644 src/loader/rwm/LIVENESS_ANALYSIS.md create mode 100644 src/loader/rwm/PIPELINE.md create mode 100644 src/loader/rwm/REGISTER_ALLOCATION.md diff --git a/src/loader/rwm/FLATTENING.md b/src/loader/rwm/FLATTENING.md new file mode 100644 index 0000000..5039bde --- /dev/null +++ b/src/loader/rwm/FLATTENING.md @@ -0,0 +1,146 @@ +# Flattening Pass + +**Source:** `flattening/mod.rs`, `flattening/sequence_parallel_copies.rs` + +**Input:** `AllocatedDag` (DAG with concrete register assignments) +**Output:** `FunctionAsm` (linear sequence of assembly-like directives) + +## Purpose + +The flattening pass converts the DAG representation into a linear sequence of +instructions. By this point, all the hard decisions (register allocation, copy +minimization) have already been made. The flattening is a straightforward, linear +traversal that emits directives through the `rwm::Settings` trait. 
+ +## Algorithm + +The pass does a forward traversal over the DAG nodes, processing each one into +directives: + +### Node Types + +- **Inputs:** At the function level, emits the function label (and an exported name + alias if the function is exported). At loop level, emits nothing (the loop label + is emitted by the parent Loop node). + +- **Label:** Emits a label directive. + +- **Loop:** Copies loop inputs to where the loop body expects them (if they are not + already there), emits the loop label, then recursively processes the loop body. + A control stack tracks the allocation context for each nesting level. + +- **Br (unconditional break):** Emits the copies needed to place break inputs at + their target locations, then emits a jump. Three kinds of targets: + - **Forward label:** Jump to a label in the current block. + - **Loop back-edge:** Jump to the loop's header label, with copies to set up + the next iteration's inputs. + - **Function return:** Copy outputs to the return slots, then emit a `return` + instruction (which reads RA and FP from their known positions). + +- **BrIf / BrIfZero (conditional break):** Combines a conditional jump with the + break logic. The pass tries several strategies in order of preference: + 1. If the target is a plain jump (no copies needed) and the ISA supports the + matching condition, emit a single conditional jump. + 2. If the ISA supports the inverse condition, emit: inverse-conditional-jump to + continuation, then the full break code, then the continuation label. + 3. If only the matching condition is available, emit: conditional-jump to jump + code, then jump to continuation, then jump code, then continuation label. + +- **BrTable:** Handles multi-way branching. Emits a bounds check for the default + case, then a relative jump (jump table) into per-target jump code. Targets that + are plain jumps are inlined into the jump table; complex targets get an extra + indirection through a local label. 
+ +- **Call:** For imported functions, emits a direct imported-call directive. For + local functions, prepares the call frame: copies inputs to the expected positions, + emits the call instruction (with frame offset, RA, and FP locations), then copies + outputs from the return slots to where consumers expect them. + +- **CallIndirect:** Like a normal call, but first loads the function reference from + the table, checks the function type against the expected signature (trapping on + mismatch), then emits an indirect call. + +- **WASMOp:** Delegates directly to `emit_wasm_op` with the resolved input + registers/constants and output register. + +- **Unreachable:** Emits a trap. + +## Parallel Copy Sequencing + +A critical sub-problem in flattening is emitting register copies correctly when +multiple values need to move simultaneously (e.g., setting up a loop iteration's +inputs, or preparing function call arguments). The naive approach of emitting copies +one by one can fail when source and destination registers overlap. + +### The Problem + +Consider needing to move `r0 → r1` and `r1 → r0` (a swap). Doing them sequentially +would overwrite `r1` before reading it. More generally, the copy set forms a directed +graph that may contain: + +1. **Trees:** Leaves are destination-only; root is source-only. Safe to copy in + reverse topological order. +2. **Cycles:** Every register is both source and destination. Requires a temporary + register to break the cycle. +3. **Cycles with attached trees:** A single cycle with trees branching off. The + tree-pruning phase naturally breaks the cycle through source-swapping. + +### The Algorithm (`sequence_parallel_copies.rs`) + +**Phase 1 — Tree pruning:** +1. Find all "tree ends" (registers with no outgoing edges, i.e., destination-only). +2. For each tree end, emit the copy from its source and remove the edge. +3. Apply **source-swapping**: transfer the source's remaining outgoing edges to the + just-written destination. 
This is the key insight — since the destination now + holds the same value as the original source, it can serve as the source for + remaining copies, potentially breaking a connected cycle. +4. If the original source now has no outgoing edges but has an incoming edge, it + becomes a new tree end. Repeat until no tree ends remain. + +**Phase 2 — Cycle breaking:** +1. The remaining graph consists only of pure cycles. +2. Pick a temporary register — either reuse a destination register from Phase 1 + (since it was already written and can serve double duty before Phase 1's copies + execute), or allocate a new `Temp` register. +3. For each cycle: save one value to temp, rotate the rest, restore from temp. + +**Output ordering:** Phase 2 copies are emitted first, then Phase 1 copies in +reverse. This ensures the temporary register is not overwritten by Phase 1 copies +before it is consumed by Phase 2. + +### Correctness Guarantee + +Every destination register appears exactly once (precondition). The algorithm +produces a valid sequential ordering that achieves the same effect as executing all +copies simultaneously. At most one temporary register is needed, and it is avoided +entirely when the copy graph is acyclic or has trees attached to cycles. + +## Temporary Register Allocation + +During flattening, some operations need temporary registers (e.g., loading a +function reference for indirect calls, or the temp register for parallel copies). +The `Context` struct provides `allocate_tmp_type` which: + +1. Lazily computes the set of free register gaps at the current node by examining + the occupation map from register allocation. +2. Allocates from the first gap that fits. +3. For function call nodes, can also allocate temporaries in the callee's frame + space (after the calling convention prelude). + +## Settings Trait + +The flattening pass is parameterized by `rwm::Settings`, which provides all the +`emit_*` methods. 
This trait defines how each operation maps to the target ISA's +directives. The reference implementation is `GenericIrSetting` in +`src/interpreter/generic_ir.rs`. + +Key emission methods used by flattening: +- `emit_label`, `emit_jump`, `emit_trap` +- `emit_copy` — Single-word register copy +- `emit_conditional_jump` — Jump on a boolean condition +- `emit_conditional_jump_cmp_immediate` — Jump on comparison with immediate (for BrTable bounds) +- `emit_relative_jump` — Jump by offset (for BrTable dispatch) +- `emit_return` — Function return (restores RA/FP) +- `emit_function_call`, `emit_indirect_call` — Local and indirect calls +- `emit_imported_call` — Imported (external) function call +- `emit_wasm_op` — Generic WASM instruction emission diff --git a/src/loader/rwm/JUMP_REMOVAL.md b/src/loader/rwm/JUMP_REMOVAL.md new file mode 100644 index 0000000..7c584cd --- /dev/null +++ b/src/loader/rwm/JUMP_REMOVAL.md @@ -0,0 +1,48 @@ +# Jump Removal Pass + +**Source:** `../wom/dumb_jump_removal.rs` (shared between WOM and RWM pipelines) + +**Input:** `FunctionAsm` (`PlainFlatAsm` — linear directive sequence) +**Output:** `FunctionAsm` (`DumbJumpOptFlatAsm` — optimized directive sequence) + +## Purpose + +This is a simple peephole optimization that removes unconditional jumps whose target +is the immediately following instruction. These "dumb jumps" are an artifact of the +flattening pass, which always emits jumps for breaks even when the target label +happens to be placed right after the jump. + +## Algorithm + +The pass does a single linear scan over the directive sequence, examining each +consecutive pair of directives: + +1. For each directive, check if it is an unconditional local jump (via + `Settings::to_plain_local_jump`). +2. If it is, check whether the next directive is a label matching the jump target + (via `Settings::is_label`). +3. If both conditions hold, drop the jump — it is redundant. +4. Otherwise, keep the directive. 
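The scan described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `dumb_jump_removal.rs`: the `Directive` enum and the `remove_dumb_jumps` function are hypothetical stand-ins for the real directive type and for the `Settings::to_plain_local_jump` and `Settings::is_label` hooks.

```rust
/// Hypothetical directive type, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Directive {
    Label(u32),
    Jump(u32),
    Other(&'static str),
}

/// Drops every unconditional jump whose target is the label that
/// immediately follows it. Returns the filtered sequence together
/// with the number of jumps that were removed.
fn remove_dumb_jumps(directives: Vec<Directive>) -> (Vec<Directive>, usize) {
    let mut out = Vec::with_capacity(directives.len());
    let mut removed = 0;
    let mut iter = directives.into_iter().peekable();
    while let Some(d) = iter.next() {
        if let Directive::Jump(target) = &d {
            // Look at the next directive without consuming it.
            if iter.peek() == Some(&Directive::Label(*target)) {
                removed += 1;
                continue; // redundant jump: skip it, keep the label
            }
        }
        out.push(d);
    }
    (out, removed)
}
```

Only the jump is dropped; the label stays, since other jumps may still target it.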
+ +## Why This Happens + +The flattening pass emits jumps for every break instruction in the DAG. When a +break targets a label that happens to appear immediately after the break in the +linearized output, the resulting jump is unnecessary. This is common in patterns +like: + +``` + ; end of if-true branch + jump label_42 ; ← dumb jump, label_42 is right below +label_42: + ; continuation +``` + +The flattening pass does not attempt to detect this during emission because the +DAG structure does not guarantee any particular ordering. Instead, this cheap +post-processing pass cleans them up. + +## Statistics + +The pass returns the count of removed jumps, which is aggregated in the +`Statistics::useless_jumps_removed` counter. diff --git a/src/loader/rwm/LIVENESS_ANALYSIS.md b/src/loader/rwm/LIVENESS_ANALYSIS.md new file mode 100644 index 0000000..f8fd713 --- /dev/null +++ b/src/loader/rwm/LIVENESS_ANALYSIS.md @@ -0,0 +1,84 @@ +# Liveness Analysis Pass + +**Source:** `liveness_dag.rs` + +**Input:** `BlocklessDag` (from the common pipeline) +**Output:** `LivenessDag` (same DAG structure, annotated with `Liveness` data per block) + +## Purpose + +The liveness analysis pass takes the blockless DAG produced by the common pipeline and +annotates it with information about when each value is last used. This information is +essential for the register allocation pass that follows, enabling it to reuse registers +once their values are no longer needed. + +## What It Computes + +For each block (the function body and each loop body), the pass produces a `Liveness` +struct containing: + +### `last_usage`: HashMap<(node_index, output_index), node_index> + +Maps each value (identified by its producing node and output index) to the index of the +last node that reads it. This tells the register allocator exactly when a register can +be freed. + +For example, if node 3 produces a value at output 0, and the last node that uses it is +node 7, then `last_usage[(3, 0)] = 7`. 
After node 7, the register holding this value +can be reused by another value. + +Values that are never used by any other node have their last usage set to their own +node index (i.e., they are dead immediately after being produced). + +### `redirected_inputs`: Vec + +A sorted, deduplicated list of loop input indices that are simply forwarded unchanged +to the next iteration. This is an optimization hint for register allocation: if a loop +input is always passed through without modification, the register allocator can keep it +in the same register across all iterations, avoiding unnecessary copies at each loop +back-edge. + +## Algorithm + +The pass does a single forward traversal over the nodes in each block: + +1. **Forward scan:** For each node, iterate over its inputs. If an input references + another node's output, update `last_usage` for that output to the current node index. + Also initialize each node's own outputs with `last_usage = current_index` (dead by + default). + +2. **Recursive processing of loops:** When a `Loop` operation is encountered, the pass + recurses into the loop's sub-DAG. Before recursing, it sets up a control stack entry + to track input redirection. + +3. **Input redirection tracking:** For loop blocks, the pass tracks which inputs are + simply forwarded as-is to the next iteration. It does this by examining every break + instruction that targets the loop and checking whether each break input is a direct + reference to the corresponding loop input (node 0). The tracking accounts for nested + loops by mapping input indices through the control stack. + +## Control Stack + +The pass maintains a `VecDeque` to track nested loop contexts: + +- `is_input_redirected: Vec` — One flag per loop input, initially all `true`. + Set to `false` when any break to this loop provides a value other than the + corresponding input passed through unchanged. 
+ +- `input_map: HashMap` — Maps input indices of the current loop to + output indices of the parent block's input node. This is needed to trace redirected + inputs through nested loops. For example, if loop input 2 comes from the parent + block's input 5, then `input_map[2] = 5`. + +## Design Notes + +- The liveness information is conservative (pessimistic): `last_usage` reflects the + last usage across *all* control flow paths, not just the path currently being taken. + A TODO in the code notes that per-path liveness could yield better register allocation. + +- A TODO also suggests that this pass could potentially be merged with register + allocation itself, using a bottom-up traversal similar to the WOM flattening pass. + +- The pass handles all break variants (`Br`, `BrIf`, `BrIfZero`, `BrTable`) when + checking input redirection. For conditional breaks, only the non-condition inputs + are checked. For `BrTable`, each target's input permutation is respected. diff --git a/src/loader/rwm/PIPELINE.md b/src/loader/rwm/PIPELINE.md new file mode 100644 index 0000000..c9faf3e --- /dev/null +++ b/src/loader/rwm/PIPELINE.md @@ -0,0 +1,81 @@ +# RWM Pipeline Overview + +The read-write registers (RWM) pipeline converts a blockless DAG into a linear +sequence of assembly-like directives for machines with standard read-write registers. + +## Stages + +``` +BlocklessDag (from common pipeline) + │ + ▼ +LivenessDag liveness_dag.rs + │ Annotates each value with its last usage, and + │ detects loop inputs that are forwarded unchanged. + │ + ▼ +RegisterAllocatedDag register_allocation/ + │ Assigns concrete register numbers to all values, + │ using liveness to reuse registers and heuristics + │ to minimize copies. + │ + ▼ +PlainFlatAsm flattening/ + │ Linearizes the DAG into directives, emitting + │ copies where register assignments don't match, + │ and handling control flow, calls, and jumps. 
+ │ + ▼ +DumbJumpOptFlatAsm ../wom/dumb_jump_removal.rs + Removes unconditional jumps to the immediately + following label. +``` + +## Detailed Documentation + +Each pass has its own documentation file: + +- **[LIVENESS_ANALYSIS.md](LIVENESS_ANALYSIS.md)** — Forward analysis computing + last-usage information and loop input redirection detection. + +- **[REGISTER_ALLOCATION.md](REGISTER_ALLOCATION.md)** — Bottom-up optimistic + register allocation with hint-based placement and occupation tracking. + +- **[FLATTENING.md](FLATTENING.md)** — DAG linearization, parallel copy sequencing, + and directive emission through the Settings trait. + +- **[JUMP_REMOVAL.md](JUMP_REMOVAL.md)** — Peephole pass removing redundant + unconditional jumps. + +- **[CALLING_CONVENTION.md](CALLING_CONVENTION.md)** — Frame layout and calling + convention for stacked read-write registers. + +## Key Design Decisions + +### Bottom-Up Register Allocation + +The allocation runs in reverse node order. This means that by the time a value is +allocated, we already know where its consumers want it. The allocator can then +propose ("hint") register placements that align with consumer expectations, avoiding +copies. This is the main source of optimization in the pipeline. + +### Nested Occupation Tracking + +Loops create a nested scope for register allocation. The parent's occupied registers +are inherited as blocked ranges in the child tracker. After the loop body is +processed, registers used internally by the loop are projected back to the parent as +blocked, preventing the parent from placing long-lived values in registers that the +loop would overwrite. + +### Parallel Copy Resolution + +When multiple values need to move simultaneously (loop back-edges, function calls), +the flattening pass uses a graph-based algorithm to find a valid sequential ordering. +It handles trees with topological sorting, and breaks cycles with at most one +temporary register. 
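To make the problem concrete, here is a small sketch of one standard way to sequence a parallel copy. It is deliberately simpler than what FLATTENING.md describes for `sequence_parallel_copies.rs`: it does not do source-swapping and it never reuses a destination as the scratch register. The names `sequence_copies` and `apply`, the `(src, dst)` pair encoding, and the caller-supplied `temp` register are all assumptions made for this illustration.

```rust
/// Orders a set of simultaneous copies `(src, dst)` so that running them
/// one by one has the same effect. Every `dst` must appear at most once,
/// and `temp` must not appear in `copies`.
fn sequence_copies(copies: &[(u32, u32)], temp: u32) -> Vec<(u32, u32)> {
    // Self-copies are no-ops and would otherwise look like cycles.
    let mut pending: Vec<(u32, u32)> =
        copies.iter().copied().filter(|(s, d)| s != d).collect();
    let mut out = Vec::new();
    while !pending.is_empty() {
        // A copy is safe once no pending copy still reads its destination.
        let ready = pending
            .iter()
            .position(|&(_, d)| pending.iter().all(|&(s, _)| s != d));
        match ready {
            Some(i) => out.push(pending.remove(i)),
            None => {
                // Only cycles remain. Save one destination's old value in
                // `temp` and make its readers read `temp` instead.
                let (_, d) = pending[0];
                out.push((d, temp));
                for c in pending.iter_mut() {
                    if c.0 == d {
                        c.0 = temp;
                    }
                }
            }
        }
    }
    out
}

/// Executes a copy sequence on a register file, for checking results.
fn apply(regs: &mut [i32], seq: &[(u32, u32)]) {
    for &(src, dst) in seq {
        regs[dst as usize] = regs[src as usize];
    }
}
```

For the swap `r0 → r1`, `r1 → r0` with an extra `r0 → r2`, the sketch emits the tree copy first, then routes the cycle through `temp`. An acyclic copy set never touches `temp`.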
+ +### Separation of Concerns + +The pipeline cleanly separates liveness analysis, register allocation, and code +emission into distinct passes. This makes each pass simpler and easier to test in +isolation, at the cost of one extra traversal compared to a fused approach. diff --git a/src/loader/rwm/REGISTER_ALLOCATION.md b/src/loader/rwm/REGISTER_ALLOCATION.md new file mode 100644 index 0000000..33d903a --- /dev/null +++ b/src/loader/rwm/REGISTER_ALLOCATION.md @@ -0,0 +1,134 @@ +# Register Allocation Pass + +**Source:** `register_allocation/mod.rs`, `register_allocation/occupation_tracker.rs` + +**Input:** `LivenessDag` (DAG annotated with liveness information) +**Output:** `AllocatedDag` (same DAG structure, annotated with `Allocation` data per block) + +## Purpose + +This pass assigns concrete register numbers to every value in the DAG. It uses the +liveness information from the previous pass to reuse registers once their values are +no longer needed. The allocator also applies heuristics to minimize the number of +register-to-register copies that the flattening pass will need to emit. + +## Algorithm Overview + +The allocation is done **bottom-up** (from the last node to the first), following +execution paths independently. This reverse traversal is key: by the time we allocate +a value, we already know where its consumers expect it to be, allowing us to propose +register assignments that avoid copies. + +The main function is `optimistic_allocation`, which: + +1. Fixes the function input registers at positions 0, 1, 2, ... (tightly packed + according to word count per type). +2. Reserves space for the return address (RA) and frame pointer (FP) after + `MAX(input_words, output_words)`, per the calling convention. +3. Runs the recursive bottom-up allocation on all nodes. 
+ +### Optimistic Allocation Strategy + +The allocator is called "optimistic" because it tries to place values at hinted +locations (where their consumers expect them), falling back to the first available +gap only when the hint is unavailable. This two-phase approach avoids copies when +possible while guaranteeing correctness. + +For each node, processed in reverse order: + +- **Generic WASM operations:** Inputs and outputs are allocated wherever there is + space. No special hinting is needed. + +- **Function calls (Call/CallIndirect):** The allocator first determines the call + frame start (the first register after all currently occupied ones). Then it tries + to place each input at the exact register where the callee expects it, saving a + copy if successful. Function outputs are pre-allocated at their natural position + on the call frame. + +- **Labels:** Outputs are allocated at whatever position is available. Break + instructions targeting this label will try to match these positions. + +- **Breaks (Br/BrIf/BrIfZero):** For each break input, the allocator tries to + place it at the same register where the target (loop input, label output, or + function return slot) expects it. This is the main copy-saving mechanism. + +- **BrTable:** Each target's input permutation is processed like a regular break. + The selector input is allocated separately. + +- **Loops:** A child occupation tracker is created, inheriting the parent's blocked + registers. Two heuristics minimize copies for loop inputs: + 1. If the loop input is the last usage of a value in the outer scope, reuse its + register for the loop input. + 2. If the loop input is a "redirected input" (forwarded unchanged across + iterations, as detected by the liveness pass), force the same register + allocation, always saving a copy. + +## Occupation Tracker + +The `OccupationTracker` (`occupation_tracker.rs`) is the core data structure that +tracks which registers are occupied at each point in the program. 
+ +### Data Model + +It maintains an `IntervalMap` that maps **liveness ranges** (expressed +as node index ranges) to allocation entries. Each entry records: + +- **`AllocationType`**: What kind of allocation it is: + - `Value(ValueOrigin)` — A normal DAG value. + - `FunctionFrame` — Space reserved for a callee's frame during a function call. + - `SubBlockInternal` — Registers used inside a loop body, blocked at the parent level. + - `BlockedRegistersAtParent` — Parent-level registers inherited by a sub-tracker. + - `ExplicitlyBlocked` — Reserved registers (e.g., RA/FP slots). + +- **`reg_range: Range`** — The register range this allocation occupies. + +### Key Operations + +- **`try_allocate(origin, size)`**: Allocates a value at the first available gap. + Returns `None` if already allocated. + +- **`try_allocate_with_hint(origin, hint)`**: Tries to place a value at a specific + register range. If the hint is occupied, falls back to first-fit. Returns whether + the hint was used. + +- **`set_allocation(origin, range)`**: Forces an allocation at a specific range + (used for fixed positions like function inputs). + +- **`reserve_range(range)`**: Permanently blocks a register range (used for RA/FP). + +- **`allocate_fn_call(call_index, output_sizes)`**: Reserves a function call frame + starting after all currently occupied registers. Pre-allocates outputs at their + natural positions on the frame. + +- **`make_sub_tracker(sub_block_index, sub_liveness)`**: Creates a child tracker + for a loop body, with all registers occupied at the loop's node index blocked. + +- **`project_from_sub_tracker(sub_block_index, sub_tracker)`**: After processing a + loop, blocks the registers that the loop body used internally, preventing the + parent from overwriting them with values that need to survive across the loop. 
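The hint-then-first-fit behaviour listed above for `try_allocate_with_hint` can be sketched in a few lines. This is a simplified model under stated assumptions: it ignores liveness ranges entirely and treats the occupied registers as a plain list of ranges, whereas the real `OccupationTracker` keys its entries by node index ranges. The function names `first_fit` and `allocate_with_hint` are hypothetical.

```rust
use std::ops::Range;

/// Returns the lowest register range of `size` words that does not
/// overlap any range in `occupied`. Falls back to the end if no gap fits.
fn first_fit(occupied: &[Range<u32>], size: u32) -> Range<u32> {
    let mut sorted: Vec<Range<u32>> = occupied.to_vec();
    sorted.sort_by_key(|r| r.start);
    // Lowest register not yet known to be occupied.
    let mut cursor = 0u32;
    for r in &sorted {
        if r.start >= cursor && r.start - cursor >= size {
            return cursor..cursor + size;
        }
        // Taking the max merges overlapping or adjacent ranges on the fly.
        cursor = cursor.max(r.end);
    }
    cursor..cursor + size
}

/// Tries the hinted range first; if it overlaps anything, falls back to
/// first-fit. The boolean reports whether the hint was used.
fn allocate_with_hint(occupied: &[Range<u32>], hint: Range<u32>) -> (Range<u32>, bool) {
    let free = occupied
        .iter()
        .all(|r| r.end <= hint.start || hint.end <= r.start);
    if free {
        (hint, true)
    } else {
        let size = hint.end - hint.start;
        (first_fit(occupied, size), false)
    }
}
```

The boolean result corresponds to the "hint was used" outcome that the allocator can count as a saved copy.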
+ +### Gap-Finding Algorithm + +When a hint is not available, `allocate_where_possible` uses a simple first-fit +strategy: it consolidates all occupied ranges into sorted, non-overlapping intervals, +then scans for the first gap large enough to fit the requested size. If no gap exists, +it allocates at the end. + +## Statistics + +The pass tracks `register_copies_saved`: the total number of word-level copies that +were avoided by successfully placing values at their hinted locations. This metric +is aggregated across all functions and reported at the end of compilation. + +## Output + +The `Allocation` struct stored in the output `AllocatedDag` contains: + +- `nodes_outputs: BTreeMap>` — The concrete register range + assigned to each value. +- `occupation: Occupation` — The full register occupation map, used by the flattening + pass to find free registers for temporary allocations. +- `labels: HashMap` — Maps label IDs to their node indices, for quick + lookup when processing breaks. +- `call_frames: HashMap` — Maps call node indices to the start register + of their callee frame. From b682b79a58069be64389b2118caf78e59b2c6a12 Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:13:15 +0100 Subject: [PATCH 02/10] Update src/loader/rwm/FLATTENING.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/FLATTENING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/FLATTENING.md b/src/loader/rwm/FLATTENING.md index 5039bde..d73425b 100644 --- a/src/loader/rwm/FLATTENING.md +++ b/src/loader/rwm/FLATTENING.md @@ -100,7 +100,7 @@ graph that may contain: **Phase 2 — Cycle breaking:** 1. The remaining graph consists only of pure cycles. 2. Pick a temporary register — either reuse a destination register from Phase 1 - (since it was already written and can serve double duty before Phase 1's copies + (since its original value is never read it can serve double duty before Phase 1's copies execute), or allocate a new `Temp` register. 3. 
For each cycle: save one value to temp, rotate the rest, restore from temp. From 12d0a03ec75f31108923b688e591bf6a5b472dec Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:13:57 +0100 Subject: [PATCH 03/10] Update src/loader/rwm/FLATTENING.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/FLATTENING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/FLATTENING.md b/src/loader/rwm/FLATTENING.md index d73425b..251761e 100644 --- a/src/loader/rwm/FLATTENING.md +++ b/src/loader/rwm/FLATTENING.md @@ -105,7 +105,7 @@ graph that may contain: 3. For each cycle: save one value to temp, rotate the rest, restore from temp. **Output ordering:** Phase 2 copies are emitted first, then Phase 1 copies in -reverse. This ensures the temporary register is not overwritten by Phase 1 copies +sequence. This ensures the temporary register is not overwritten by Phase 1 copies before it is consumed by Phase 2. ### Correctness Guarantee From de684ac63d954d591d381ec0a215ba1c1dd9a7ea Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:14:24 +0100 Subject: [PATCH 04/10] Update src/loader/rwm/FLATTENING.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/FLATTENING.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/FLATTENING.md b/src/loader/rwm/FLATTENING.md index 251761e..be31e89 100644 --- a/src/loader/rwm/FLATTENING.md +++ b/src/loader/rwm/FLATTENING.md @@ -141,6 +141,6 @@ Key emission methods used by flattening: - `emit_conditional_jump_cmp_immediate` — Jump on comparison with immediate (for BrTable bounds) - `emit_relative_jump` — Jump by offset (for BrTable dispatch) - `emit_return` — Function return (restores RA/FP) -- `emit_function_call`, `emit_indirect_call` — Local and indirect calls +- `emit_function_call`, `emit_indirect_call` — Static and indirect calls - `emit_imported_call` — Imported (external) function call - `emit_wasm_op` — Generic WASM instruction emission From 
d632ccd4a0a25420c3b350ee950836144a901ab1 Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:14:51 +0100 Subject: [PATCH 05/10] Update src/loader/rwm/JUMP_REMOVAL.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/JUMP_REMOVAL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/JUMP_REMOVAL.md b/src/loader/rwm/JUMP_REMOVAL.md index 7c584cd..33958ff 100644 --- a/src/loader/rwm/JUMP_REMOVAL.md +++ b/src/loader/rwm/JUMP_REMOVAL.md @@ -9,7 +9,7 @@ This is a simple peephole optimization that removes unconditional jumps whose target is the immediately following instruction. These "dumb jumps" are an artifact of the -flattening pass, which always emits jumps for breaks even when the target label +DAG representation, where all breaks are explicit, even when the target happens to be placed right after the jump. ## Algorithm From 106a3b90ec937caa424ad8401de6efb01b49452f Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:15:16 +0100 Subject: [PATCH 06/10] Update src/loader/rwm/JUMP_REMOVAL.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/JUMP_REMOVAL.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/loader/rwm/JUMP_REMOVAL.md b/src/loader/rwm/JUMP_REMOVAL.md index 33958ff..1dbc83a 100644 --- a/src/loader/rwm/JUMP_REMOVAL.md +++ b/src/loader/rwm/JUMP_REMOVAL.md @@ -38,8 +38,7 @@ label_42: ; continuation ``` -The flattening pass does not attempt to detect this during emission because the -DAG structure does not guarantee any particular ordering. Instead, this cheap +The flattening pass does not attempt to detect this during emission. Instead, this cheap post-processing pass cleans them up. 
## Statistics From f725dddcbb55d087a9176942e1cdc9298eb4b5bb Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:15:28 +0100 Subject: [PATCH 07/10] Update src/loader/rwm/REGISTER_ALLOCATION.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/REGISTER_ALLOCATION.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/loader/rwm/REGISTER_ALLOCATION.md b/src/loader/rwm/REGISTER_ALLOCATION.md index 33d903a..2f77b63 100644 --- a/src/loader/rwm/REGISTER_ALLOCATION.md +++ b/src/loader/rwm/REGISTER_ALLOCATION.md @@ -42,8 +42,7 @@ For each node, processed in reverse order: - **Function calls (Call/CallIndirect):** The allocator first determines the call frame start (the first register after all currently occupied ones). Then it tries to place each input at the exact register where the callee expects it, saving a - copy if successful. Function outputs are pre-allocated at their natural position - on the call frame. + copy if successful. - **Labels:** Outputs are allocated at whatever position is available. Break instructions targeting this label will try to match these positions. From b7317bf6d6442b050560dc9bc21a47effe29d78c Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:15:38 +0100 Subject: [PATCH 08/10] Update src/loader/rwm/REGISTER_ALLOCATION.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/REGISTER_ALLOCATION.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/REGISTER_ALLOCATION.md b/src/loader/rwm/REGISTER_ALLOCATION.md index 2f77b63..03e3432 100644 --- a/src/loader/rwm/REGISTER_ALLOCATION.md +++ b/src/loader/rwm/REGISTER_ALLOCATION.md @@ -73,7 +73,7 @@ It maintains an `IntervalMap` that maps **liveness ranges** (expre as node index ranges) to allocation entries. Each entry records: - **`AllocationType`**: What kind of allocation it is: - - `Value(ValueOrigin)` — A normal DAG value. + - `Value(ValueOrigin)` — A value produced by a DAG node (pointed to by the `ValueOrigin` data). 
- `FunctionFrame` — Space reserved for a callee's frame during a function call. - `SubBlockInternal` — Registers used inside a loop body, blocked at the parent level. - `BlockedRegistersAtParent` — Parent-level registers inherited by a sub-tracker. From 167c83897b5d6ab56b47177d5da802017c31dce6 Mon Sep 17 00:00:00 2001 From: Leo Date: Fri, 20 Feb 2026 15:16:04 +0100 Subject: [PATCH 09/10] Update src/loader/rwm/REGISTER_ALLOCATION.md Co-authored-by: Lucas Clemente Vella --- src/loader/rwm/REGISTER_ALLOCATION.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/loader/rwm/REGISTER_ALLOCATION.md b/src/loader/rwm/REGISTER_ALLOCATION.md index 03e3432..ff93614 100644 --- a/src/loader/rwm/REGISTER_ALLOCATION.md +++ b/src/loader/rwm/REGISTER_ALLOCATION.md @@ -96,7 +96,7 @@ as node index ranges) to allocation entries. Each entry records: - **`reserve_range(range)`**: Permanently blocks a register range (used for RA/FP). - **`allocate_fn_call(call_index, output_sizes)`**: Reserves a function call frame - starting after all currently occupied registers. Pre-allocates outputs at their + starting after all currently occupied registers. Allocates unused outputs at their natural positions on the frame. - **`make_sub_tracker(sub_block_index, sub_liveness)`**: Creates a child tracker From 49e621a46880a5b28cb6965c1d50b7333af9d6ba Mon Sep 17 00:00:00 2001 From: Leo Alt Date: Fri, 20 Feb 2026 17:27:58 +0100 Subject: [PATCH 10/10] Document the common pipeline passes in detail Add documentation for each pass in the shared frontend pipeline, following the same style as the existing RWM pipeline docs. 
Co-Authored-By: Claude Opus 4.6 --- src/loader/passes/BLOCKLESS_DAG.md | 120 ++++++++++++++++++++++ src/loader/passes/BLOCK_TREE.md | 131 ++++++++++++++++++++++++ src/loader/passes/CONST_COLLAPSE.md | 73 ++++++++++++++ src/loader/passes/CONST_DEDUP.md | 87 ++++++++++++++++ src/loader/passes/DAG_CONSTRUCTION.md | 139 ++++++++++++++++++++++++++ src/loader/passes/DANGLING_REMOVAL.md | 103 +++++++++++++++++++ src/loader/passes/LOCALS_DATA_FLOW.md | 122 ++++++++++++++++++++++ src/loader/passes/PIPELINE.md | 139 ++++++++++++++++++++++++++ 8 files changed, 914 insertions(+) create mode 100644 src/loader/passes/BLOCKLESS_DAG.md create mode 100644 src/loader/passes/BLOCK_TREE.md create mode 100644 src/loader/passes/CONST_COLLAPSE.md create mode 100644 src/loader/passes/CONST_DEDUP.md create mode 100644 src/loader/passes/DAG_CONSTRUCTION.md create mode 100644 src/loader/passes/DANGLING_REMOVAL.md create mode 100644 src/loader/passes/LOCALS_DATA_FLOW.md create mode 100644 src/loader/passes/PIPELINE.md diff --git a/src/loader/passes/BLOCKLESS_DAG.md b/src/loader/passes/BLOCKLESS_DAG.md new file mode 100644 index 0000000..4ac62a2 --- /dev/null +++ b/src/loader/passes/BLOCKLESS_DAG.md @@ -0,0 +1,120 @@ +# Blockless DAG Pass + +**Source:** `blockless_dag.rs` + +**Input:** `DanglingOptDag` (optimized DAG with nested blocks and loops) +**Output:** `BlocklessDag` (flat DAG with labels; only loops retain sub-DAGs) + +## Purpose + +This is the last common pipeline pass before the backend-specific stages. It +flattens the nested block structure into a linear sequence of nodes with labels +marking jump targets. After this pass, the only nesting that remains is for +loops — each loop still has its own sub-DAG, because loops represent a separate +"frame" with its own address space in the final output. + +Non-loop blocks are fully inlined into their parent DAG, with their outputs +becoming labels that breaks can jump to. 
This makes the representation much +closer to assembly: a flat sequence of operations with forward-only jumps to +labels. + +## Key Transformation + +### Blocks Become Labels + +A non-loop block in the input DAG: +``` +Block { + kind: Block, + sub_dag: [Inputs, ..., Br(0, outputs)] +} +``` + +is inlined into the parent. The block's input node is suppressed (its outputs +are remapped to the corresponding inputs in the parent scope), and a `Label` +node is inserted where the block's outputs would be consumed. Break instructions +targeting the block become jumps to this label. + +### Loops Remain Nested + +Loop blocks keep their sub-DAG structure: +``` +Loop { + sub_dag: BlocklessDag { nodes: [...] }, + break_targets: [(depth, [target_types])] +} +``` + +The `break_targets` field records all the break targets that the loop body +uses, relative to the parent frame. This lets the backend know which external +labels/frames the loop may jump to. + +## Break Target Resolution + +In the input DAG, break targets are relative depths into the block stack. In the +blockless DAG, targets are resolved into `BreakTarget { depth, kind }`: + +- **`depth`**: The number of frame levels between the break and the target. At + the top level, depth 0 means the current function/loop frame. Inside a loop, + depth 1 means the parent frame, depth 2 the grandparent, etc. + +- **`kind`**: Either `FunctionOrLoop` (targeting the function return or a loop's + next iteration) or `Label(id)` (targeting a specific label created from an + inlined block). + +The key property: **jumps to labels are always forward** (labels appear after +the jumps that target them), while **jumps to loops go backward** (to the loop +header at the start of the loop's sub-DAG). 
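
The resolution rule above can be sketched in Rust. This is an illustration only, not the code in `blockless_dag.rs`: the `Scope` and `TargetKind` enums, the `resolve_break` function, and the outermost-first stack layout are assumptions made for the example. Only the `BreakTarget { depth, kind }` shape and the `FunctionOrLoop` / `Label(id)` kinds come from the description.

```rust
/// Hypothetical scope kinds on the block stack (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scope {
    Function,
    Loop,
    /// An inlined non-loop block, identified by the label it became.
    Block { label: u32 },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TargetKind {
    FunctionOrLoop,
    Label(u32),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BreakTarget {
    depth: u32,
    kind: TargetKind,
}

/// Resolves a relative block depth into a frame-relative target.
/// `stack` is ordered outermost-first (an assumption of this sketch);
/// `relative_depth` counts scopes outward from the innermost one.
fn resolve_break(stack: &[Scope], relative_depth: usize) -> Option<BreakTarget> {
    let target_index = stack.len().checked_sub(relative_depth + 1)?;
    let target = stack[target_index];
    // Every function or loop strictly inside the target opens a new frame,
    // so each one adds a frame level between the break and the target.
    let depth = stack[target_index + 1..]
        .iter()
        .filter(|s| matches!(s, Scope::Function | Scope::Loop))
        .count() as u32;
    let kind = match target {
        Scope::Function | Scope::Loop => TargetKind::FunctionOrLoop,
        Scope::Block { label } => TargetKind::Label(label),
    };
    Some(BreakTarget { depth, kind })
}
```

Under these assumptions, a break with relative depth 2 inside the stack `[Function, Block(1), Loop, Block(2)]` resolves to depth 1 with `Label(1)`, because exactly one loop frame sits between the break and the label.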
+ +## Example + +Input DAG (with nested block): +``` +Node 0: Inputs → [x] +Node 1: Block { + kind: Block, + sub_dag: [ + Node 0: Inputs → [x] + Node 1: i32.const 10 + Node 2: i32.gt_s ← [(0,0), (1,0)] + Node 3: br_if 0 ← [(0,0), (2,0)] ;; exit block if x > 10 + Node 4: i32.const 0 + Node 5: br 1 ← [(4,0)] ;; return 0 + ] +} → [result] +Node 2: br 0 ← [(1,0)] ;; return result +``` + +Output blockless DAG (flattened): +``` +Node 0: Inputs → [x] +Node 1: i32.const 10 +Node 2: i32.gt_s ← [(0,0), (1,0)] +Node 3: BrIf(Label(42)) ← [(0,0), (2,0)] ;; jump to label if x > 10 +Node 4: i32.const 0 +Node 5: Br(Function) ← [(4,0)] ;; return 0 +Node 6: Label { id: 42 } → [result] ;; target for the br_if +Node 7: Br(Function) ← [(6,0)] ;; return result +``` + +The block's internal input node (its node 0) was suppressed and its references +were remapped to the parent's node 0. The block itself became a label node. + +## Node Remapping + +When blocks are inlined, node indices change. The pass maintains an +`outputs_map: HashMap` that translates old +`(node, output)` pairs to new ones. For inlined block inputs, the map redirects +through the `input_mapping` to the actual source nodes in the parent. + +## Design Notes + +- Labels use unique IDs generated by a shared `AtomicU32` counter (the + `LabelGenerator`), ensuring uniqueness across all functions and all frames. + +- The pass preserves the `NodeInput::Constant` variant, passing inline + constants through unchanged. + +- Break targets are resolved relative to frame boundaries, not block nesting. + This is important because the backends allocate registers per-frame (per + function or per loop body), not per-block. 
diff --git a/src/loader/passes/BLOCK_TREE.md b/src/loader/passes/BLOCK_TREE.md new file mode 100644 index 0000000..07a9d30 --- /dev/null +++ b/src/loader/passes/BLOCK_TREE.md @@ -0,0 +1,131 @@ +# Block Tree Pass + +**Source:** `block_tree.rs` + +**Input:** Raw WASM function bytecode (`Unparsed`) +**Output:** `BlockTree` (tree of `Block` and `Instruction` elements) + +## Purpose + +This is the first pass in the compilation pipeline. It takes the raw stream of +WASM operators and parses them into a tree structure where control flow is +represented by nested blocks and loops, and instructions within each block form +a linear sequence. + +The pass also normalizes several WASM patterns into simpler, more uniform +representations that are easier for subsequent passes to handle. + +## Normalizations + +### If-Else to Block + BrIf + +WASM's `if-else-end` construct is desugared into blocks with conditional +breaks. This reduces the number of control flow constructs that later passes +need to handle. + +**If without else:** +``` +;; Original WASM ;; Normalized BlockTree +if block (params..., i32) -> (results...) + br_if_zero 0 ;; skip if_body when false +end + end +``` + +**If with else:** +``` +;; Original WASM ;; Normalized BlockTree +if block (params..., i32) -> (results...) + block (params..., i32) -> (params...) +else br_if 0 ;; skip else_body when true + +end br 1 ;; skip if_body + end + + end +``` + +The condition value is carried as an extra block input and consumed by the +conditional break at the top. + +### Return to Br + +WASM `return` is converted to a `br` targeting the outermost block (the +function body). This makes the function body just another block, simplifying +break handling. + +``` +;; Original ;; Normalized +return br +``` + +### Explicit Fallthrough Breaks + +Every block that can fall through gets an explicit `br 0` appended. 
This +guarantees that all blocks are exited via a break instruction, which simplifies +the locals data flow pass (it can assume all values leave blocks through break +inputs). + +``` +;; Original ;; Normalized +block block + i32.const 42 i32.const 42 +end br 0 ;; explicit fallthrough + end +``` + +### Loop Wrapping + +When a loop can fall through (i.e., it doesn't always branch back to the loop +header or exit via a break), an outer block is added around it. The fallthrough +becomes a break to the outer block. This ensures loops are only exited through +breaks. + +``` +;; Original ;; Normalized +loop block -> (results...) + loop (params...) +end + br 1 ;; exit to outer block + end + end +``` + +### Dead Code Removal + +After any instruction that unconditionally diverts control flow (`br`, +`br_table`, `unreachable`, or a non-fallthrough loop), all subsequent +instructions up to the next `end` or `else` are discarded. + +``` +;; Original ;; Normalized +br 0 br 0 +i32.const 1 ;; dead code removed +i32.add ;; dead code removed +``` + +### Constant Global Inlining + +`global.get` on immutable globals is replaced with the global's constant +initializer. This is done early because it enables the downstream constant +optimization passes to work with these values. + +``` +;; Original (global 0 is immutable, initialized to 42) +global.get 0 ;; Normalized: i32.const 42 +``` + +## Output Structure + +The output `BlockTree` is a `Vec` where each `Element` is either: + +- **`Instruction`**: A WASM operator, a `BrIfZero`, or a `BrTable`. +- **`Block`**: A nested block containing: + - `block_kind`: `Block` or `Loop` + - `interface_type`: The block's input and output types + - `elements`: The block's contents (recursively) + - `input_locals`, `output_locals`, `carried_locals`: Initially empty; filled + by the next pass + +At this stage, all blocks have well-defined stack-level interfaces (params and +results), but local variable flow is still implicit. 
diff --git a/src/loader/passes/CONST_COLLAPSE.md b/src/loader/passes/CONST_COLLAPSE.md new file mode 100644 index 0000000..b46e8e8 --- /dev/null +++ b/src/loader/passes/CONST_COLLAPSE.md @@ -0,0 +1,73 @@ +# Constant Collapse Pass + +**Source:** `dag/const_collapse.rs` + +**Input:** `PlainDag` (the DAG after construction) +**Output:** `ConstCollapsedDag` (same DAG, with some constant references replaced by inline constants) + +## Purpose + +This optional optimization pass identifies constant values that can be folded +into the instructions that consume them, eliminating the need for a separate +register to hold the constant. This is driven by the target ISA: if the ISA +supports immediate operands on certain instructions (e.g., RISC-V's `addi`), +the constant can be inlined directly. + +## How It Works + +The pass is gated by `Settings::get_const_collapse_processor()`. If the ISA +implementor returns `None`, no collapsing is performed and the DAG passes +through unchanged. + +If a processor function is provided, the pass walks every `WASMOp` node in the +DAG and checks whether any of its inputs reference constant nodes. For each +such node, it calls the processor with the operator and a slice of +`MaybeConstant` values describing each input: + +- **`NonConstant`**: The input is not a constant. +- **`ReferenceConstant { value, must_collapse }`**: The input references a + constant node with a known value. The processor can set `must_collapse` to + `true` to indicate the constant should be inlined. +- **`CollapsedConstant(value)`**: The input is already an inline constant + (from a previous pass; not expected in the default pipeline). + +When `must_collapse` is set to `true`, the pass replaces the `NodeInput::Reference` +with a `NodeInput::Constant`, severing the dependency on the constant node. 
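
The interaction between the pass and the processor can be sketched as follows. This is a simplified model under stated assumptions, not the actual API: constants are plain `i64` values instead of WASM values, operators are strings, and `example_processor`, `collapse_inputs`, and the `constant_of` lookup are hypothetical names. The 12-bit immediate range is likewise an assumption chosen to resemble an `addi`-style instruction.

```rust
#[derive(Debug, Clone, PartialEq)]
enum MaybeConstant {
    NonConstant,
    ReferenceConstant { value: i64, must_collapse: bool },
    CollapsedConstant(i64),
}

#[derive(Debug, Clone, PartialEq)]
enum NodeInput {
    Reference { node: usize, output: usize },
    Constant(i64),
}

/// Hypothetical ISA processor: inline the second operand of `i32.add`
/// when it fits an assumed 12-bit signed immediate.
fn example_processor(op: &str, inputs: &mut [MaybeConstant]) {
    if op != "i32.add" {
        return;
    }
    if let Some(MaybeConstant::ReferenceConstant { value, must_collapse }) = inputs.get_mut(1) {
        if (-2048..=2047).contains(&*value) {
            *must_collapse = true;
        }
    }
}

/// Collapses the inputs of one node and returns how many were inlined.
/// `constant_of` reports the known constant behind a `(node, output)` pair.
fn collapse_inputs(
    op: &str,
    inputs: &mut [NodeInput],
    constant_of: impl Fn(usize, usize) -> Option<i64>,
    processor: impl Fn(&str, &mut [MaybeConstant]),
) -> usize {
    // Describe every input to the processor.
    let mut view: Vec<MaybeConstant> = inputs
        .iter()
        .map(|input| match input {
            NodeInput::Constant(v) => MaybeConstant::CollapsedConstant(*v),
            NodeInput::Reference { node, output } => match constant_of(*node, *output) {
                Some(value) => MaybeConstant::ReferenceConstant { value, must_collapse: false },
                None => MaybeConstant::NonConstant,
            },
        })
        .collect();
    // Only consult the processor when some input references a constant.
    if !view.iter().any(|m| matches!(m, MaybeConstant::ReferenceConstant { .. })) {
        return 0;
    }
    processor(op, &mut view);
    // Replace each flagged reference with an inline constant.
    let mut collapsed = 0;
    for (input, decision) in inputs.iter_mut().zip(view) {
        if let MaybeConstant::ReferenceConstant { value, must_collapse: true } = decision {
            *input = NodeInput::Constant(value);
            collapsed += 1;
        }
    }
    collapsed
}
```

Passing the decision back through a mutable flag keeps the pass itself ISA-agnostic: it only reports what is constant and applies whatever the processor decides.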
+ +## Example + +Before collapse: +``` +Node 0: Inputs → [x] +Node 1: i32.const 5 → [5] +Node 2: i32.add ← [(0,0), (1,0)] → [result] +``` + +If the ISA processor recognizes that `i32.add` with a constant second operand +can become an "add immediate" instruction, it sets `must_collapse = true` for +input 1. After collapse: + +``` +Node 0: Inputs → [x] +Node 1: i32.const 5 → [5] (may now be unused) +Node 2: i32.add ← [(0,0), Constant(5)] → [result] +``` + +Node 1 is now potentially dangling (no references to it). The dangling removal +pass will clean it up later. + +## Recursion Into Blocks + +The pass recurses into block sub-DAGs. For non-loop blocks, it propagates +knowledge of which block inputs are constants, so that constants flowing through +block boundaries can also be collapsed inside the block. + +For loops, constant inputs are **not** propagated, because a loop input might be +constant on the first iteration but different on subsequent iterations (it could +be updated by a break back to the loop header). In practice, optimized WASM +rarely has constant loop inputs anyway. + +## Statistics + +The pass returns the total count of collapsed constants, which is aggregated in +`Statistics::constants_collapsed`. diff --git a/src/loader/passes/CONST_DEDUP.md b/src/loader/passes/CONST_DEDUP.md new file mode 100644 index 0000000..3b701d0 --- /dev/null +++ b/src/loader/passes/CONST_DEDUP.md @@ -0,0 +1,87 @@ +# Constant Deduplication Pass + +**Source:** `dag/const_dedup.rs` + +**Input:** `ConstCollapsedDag` (DAG after constant collapse) +**Output:** `ConstDedupDag` (DAG with deduplicated constants) + +## Purpose + +After the DAG is constructed, the same constant value may be defined by multiple +independent nodes (e.g., two different `i32.const 0` instructions). This pass +deduplicates them: all references to a given constant value are remapped to +point to the first definition of that constant in the current scope. 
+ +This reduces the number of nodes in the DAG and, more importantly, saves +registers in the final output — without deduplication, each constant definition +would occupy its own register. + +## Algorithm + +The pass does a single forward traversal over the nodes, maintaining two maps: + +### `const_to_origin: HashMap>` + +Maps each known constant value to the node that defines it. The `Option` is +`Some(origin)` if the constant is defined at the current depth, or `None` if it +is known from an outer scope but not yet materialized at this depth. + +### `origin_to_const: HashMap` + +The reverse map: for every node that defines a constant, records what constant +value it produces. + +For each node: + +1. **Remap inputs:** If an input references a node that produces a known + constant, and a previous definition of that constant exists, redirect the + input to the earlier definition. + +2. **Record constants:** If the node itself defines a constant, add it to both + maps. If a previous definition already exists, the node is now a duplicate + (it will be cleaned up by the dangling removal pass). + +3. **Recurse into blocks:** For non-loop blocks, the parent's constant + knowledge is inherited. If a constant from the parent scope is needed inside + the block, a new block input is added to thread it through. + +4. **Loops start fresh:** Loop sub-DAGs start with empty maps, because + constants should be redefined inside the loop rather than copied through the + iteration interface (which would add unnecessary loop inputs). + +## Example + +Before deduplication: +``` +Node 0: Inputs → [x] +Node 1: i32.const 0 → [zero_a] +Node 2: i32.add ← [(0,0), (1,0)] → [x_plus_0] +Node 3: i32.const 0 → [zero_b] (duplicate!) 
+Node 4: i32.sub ← [(0,0), (3,0)] → [x_minus_0] +``` + +After deduplication, node 4's input is remapped to node 1: +``` +Node 0: Inputs → [x] +Node 1: i32.const 0 → [zero] +Node 2: i32.add ← [(0,0), (1,0)] → [x_plus_0] +Node 3: i32.const 0 → [zero_b] (now unreferenced) +Node 4: i32.sub ← [(0,0), (1,0)] → [x_minus_0] +``` + +Node 3 is now dangling and will be removed by the dangling removal pass. + +## Cross-Block Deduplication + +When a constant is defined in the parent scope and needed inside a child block, +the pass adds a new input to the block to thread the constant through, rather +than allowing the block to redefine it. This ensures that the constant occupies +a single register even across block boundaries. + +This does not apply to loops, where constants are cheaper to redefine than to +carry as loop inputs. + +## Statistics + +The pass returns the total count of deduplicated constants, which is aggregated +in `Statistics::constants_deduplicated`. diff --git a/src/loader/passes/DAG_CONSTRUCTION.md b/src/loader/passes/DAG_CONSTRUCTION.md new file mode 100644 index 0000000..a3f728a --- /dev/null +++ b/src/loader/passes/DAG_CONSTRUCTION.md @@ -0,0 +1,139 @@ +# DAG Construction Pass + +**Source:** `dag/mod.rs` + +**Input:** `LiftedBlockTree` (block tree with explicit locals data flow) +**Output:** `Dag` (directed acyclic graph of value-producing nodes) + +## Purpose + +This pass eliminates the WASM stack and local variables entirely, replacing them +with a directed acyclic graph where every value has a single explicit origin. +Nodes in the graph are operations (WASM instructions, blocks, loops, breaks), +and edges are values flowing from producers to consumers. After this pass, the +IR is fully register-like — there is no stack, no locals, just values identified +by `(node_index, output_index)` pairs. + +## Core Data Structures + +### Node + +Each node has: +- **`operation`**: What the node does (`Inputs`, `WASMOp`, `BrIfZero`, + `BrTable`, or `Block`). 
+- **`inputs: Vec`**: The values this node consumes. Each input is + either a `Reference(ValueOrigin)` pointing to another node's output, or a + `Constant(WasmValue)` (only after the constant collapse pass). +- **`output_types: Vec`**: The types of values this node produces. + +### ValueOrigin + +A `(node_index, output_index)` pair identifying a specific output of a specific +node. This is the "register name" in the DAG world. + +### Dag + +A `Dag` is simply a `Vec`. Node 0 is always an `Inputs` node whose +outputs are the block's input values. + +## Algorithm + +The pass simulates WASM execution using two structures that track where each +value lives: + +- **Stack** (`Vec`): Mirrors the WASM operand stack, but instead + of holding values, it holds references to the nodes that produced them. +- **Locals** (`Vec`): Maps each local index to either a `ValueOrigin` + (if the local has been set) or `UnusedLocal` (if it has never been written). + +The pass walks the instruction sequence in order. For each instruction: + +1. **Stack/local manipulation** (`local.get`, `local.set`, `local.tee`, + `drop`): Resolved purely by moving references between the stack and locals + arrays. No DAG nodes are created. + +2. **Break instructions** (`br`, `br_if`, `br_if_zero`, `br_table`): Pop the + appropriate values from the stack and collect the required local values (as + determined by the locals data flow pass). These become the break node's + inputs. The break targets are looked up in a block stack to determine what + types are expected. + +3. **Regular WASM operations**: Pop inputs from the stack, create a new node, + push the node's outputs onto the stack. + +4. **Blocks and loops**: Recursively build a sub-DAG. The block's stack and + local inputs (from the lifted block tree) become the inputs to the new + sub-DAG. The block's outputs go back onto the parent's stack and locals. 
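
Steps 1 and 3 of the walk above can be sketched for a tiny instruction subset. This is a toy model, not the code in `dag/mod.rs`: the `Instr` enum, the string-based `op` field, and `build_dag` are assumptions for illustration, every local is treated as a function parameter, and breaks and nested blocks are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ValueOrigin {
    node: usize,
    output: usize,
}

/// Hypothetical instruction subset used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    LocalGet(usize),
    LocalSet(usize),
    I32Const(i32),
    I32Add,
    Drop,
}

#[derive(Debug, PartialEq)]
struct Node {
    op: String,
    inputs: Vec<ValueOrigin>,
}

/// Builds a flat DAG; returns `None` on stack underflow or a bad local index.
fn build_dag(num_params: usize, code: &[Instr]) -> Option<Vec<Node>> {
    // Node 0 is the Inputs node; each parameter is one of its outputs.
    let mut nodes = vec![Node { op: "Inputs".to_string(), inputs: vec![] }];
    let mut locals: Vec<ValueOrigin> = (0..num_params)
        .map(|i| ValueOrigin { node: 0, output: i })
        .collect();
    let mut stack: Vec<ValueOrigin> = Vec::new();

    for instr in code {
        match instr {
            // Stack/local manipulation only moves references; no node is created.
            Instr::LocalGet(i) => stack.push(*locals.get(*i)?),
            Instr::LocalSet(i) => {
                let value = stack.pop()?;
                *locals.get_mut(*i)? = value;
            }
            Instr::Drop => {
                stack.pop()?;
            }
            // Regular operations pop their inputs, add a node, push its output.
            Instr::I32Const(c) => {
                nodes.push(Node { op: format!("i32.const {}", c), inputs: vec![] });
                stack.push(ValueOrigin { node: nodes.len() - 1, output: 0 });
            }
            Instr::I32Add => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                nodes.push(Node { op: "i32.add".to_string(), inputs: vec![lhs, rhs] });
                stack.push(ValueOrigin { node: nodes.len() - 1, output: 0 });
            }
        }
    }
    Some(nodes)
}
```

Feeding it `local.get 0; i32.const 1; i32.add` with one parameter yields three nodes, and the `i32.add` node takes `(0,0)` and `(1,0)` as inputs.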
+ +## Example + +Consider this WASM fragment (inside a function with `$x` as parameter 0): +```wasm +local.get $x ;; push $x +i32.const 1 ;; push 1 +i32.add ;; pop both, push ($x + 1) +``` + +The resulting DAG nodes would be: + +``` +Node 0: Inputs → outputs: [$x] +Node 1: WASMOp(i32.const 1) → outputs: [1] +Node 2: WASMOp(i32.add) ← inputs: [(0,0), (1,0)] + → outputs: [result] +``` + +`local.get` does not create a node — it just pushes `(0, 0)` onto the stack +(referring to the first output of the Inputs node). The `i32.add` node +references both the input parameter and the constant. + +## Block Handling + +Blocks in the DAG are represented as a single node with an embedded sub-DAG: + +``` +Node N: Block { + kind: Block | Loop, + sub_dag: Dag { nodes: [...] } +} +inputs: [stack values..., local values...] +output_types: [stack results..., local results...] +``` + +Inside the sub-DAG, node 0 (`Inputs`) provides the block's input values. Breaks +to the block provide the block's output values. + +### Blocks vs. Loops + +The key difference is what break targets mean: + +- **Block:** A `br 0` targets the block's *outputs*. The break carries the + values that become the block's results. +- **Loop:** A `br 0` targets the loop's *inputs*. The break carries the values + that become the next iteration's inputs. + +This means a block's outputs are determined by its breaks, while a loop's +inputs may be updated by breaks back to it. + +## Unused Locals + +When a local is read before being written (its initial value is used), the pass +materializes a default constant for it (0 for numeric types, `ref.null` for +reference types). This happens at the function level only — inside blocks, +attempting to read an uninitialized local triggers a panic, because the locals +data flow pass should have already ensured it is provided as a block input. 
+ +## Design Notes + +- The pass produces a DAG, not a general graph, because WASM's structured + control flow guarantees that non-loop value dependencies are always acyclic. + Loops create nested sub-DAGs, so even loop back-edges don't introduce cycles + at any single DAG level. + +- Constants at this stage are represented as zero-input `WASMOp` nodes (e.g., + `WASMOp(i32.const 42)`). The constant collapse and dedup passes will later + optimize them. + +- The `BreakArgs` struct on the block stack tracks both the expected stack types + and the expected local indices for each break target, combining the + information from the block's interface type and the lifted locals data flow. diff --git a/src/loader/passes/DANGLING_REMOVAL.md b/src/loader/passes/DANGLING_REMOVAL.md new file mode 100644 index 0000000..1bc0591 --- /dev/null +++ b/src/loader/passes/DANGLING_REMOVAL.md @@ -0,0 +1,103 @@ +# Dangling Removal Pass + +**Source:** `dag/dangling_removal.rs` + +**Input:** `ConstDedupDag` (DAG after constant deduplication) +**Output:** `DanglingOptDag` (DAG with unused nodes and outputs removed) + +## Purpose + +This pass is a dead code elimination step for the DAG. It removes: + +1. **Dangling nodes:** Nodes whose outputs are never used by any other node and + that have no side effects (pure computations whose results are discarded). +2. **Unused block outputs:** Block outputs that are never consumed by the parent + DAG. +3. **Unused block inputs:** Block inputs (for non-loop blocks) that are never + read inside the block. + +This pass is the natural cleanup after constant collapse and dedup, which may +leave constant nodes unreferenced. It also catches dead code patterns in the +original WASM. + +## Algorithm + +The pass operates in two phases, recursing into block sub-DAGs: + +### Phase 1: Bottom-Up Usage Analysis + +Starting from the last node and working backward: + +1. **Recurse into blocks** to clean their sub-DAGs first. +2. 
**Check each node:** Is any of its outputs referenced by a later node? If + not, and the node has no side effects, mark it for removal. +3. **Mark inputs as used:** For every node that is kept, mark all of its inputs' + origins as used. + +### Phase 2: Top-Down Removal and Remapping + +Traverse the nodes forward: + +1. **Remove marked nodes** from the node list. +2. **Remap references:** All `ValueOrigin` references in remaining nodes are + adjusted to account for removed nodes (shifted indices) and removed block + outputs (shifted output indices). +3. **Fix break inputs:** Break instructions targeting blocks that had outputs + removed get their corresponding inputs removed as well. + +## What Counts as Pure + +A node is considered pure (safe to remove if unused) if its operation is one of: + +- Constants (`i32.const`, `i64.const`, `f32.const`, `f64.const`, `v128.const`) +- Arithmetic, bitwise, and comparison operations +- Type conversion operations +- Reference operations (`ref.null`, `ref.is_null`, `ref.func`) +- `select` / `typed_select` +- `global.get` (reading state has no side effects) +- Memory and table reads (`i32.load`, `table.get`, `memory.size`, etc.) + +Everything else is considered to have side effects and is never removed, even if +its outputs are unused. This includes stores, calls, `global.set`, table +mutations, and `unreachable`. + +## Example + +Before removal: +``` +Node 0: Inputs → [x] +Node 1: i32.const 5 → [five] (unused after const collapse) +Node 2: i32.add ← [(0,0), Constant(5)] → [result] +Node 3: i32.const 99 → [ninety_nine] (never used at all) +Node 4: br 0 ← [(2,0)] +``` + +After removal, nodes 1 and 3 are pure and unused: +``` +Node 0: Inputs → [x] +Node 1: i32.add ← [(0,0), Constant(5)] → [result] +Node 2: br 0 ← [(1,0)] +``` + +All references are remapped: what was `(2,0)` becomes `(1,0)`. 
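
The two phases can be sketched for a single flat scope. This is a simplified model, not the code in `dag/dangling_removal.rs`: the `Node` struct, the `pure_op` flag, and `remove_dangling` are hypothetical, usage is tracked per node rather than per output, inline constants are omitted from `inputs`, and block recursion and output pruning are left out.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Node {
    op: &'static str,
    /// True when the node may be dropped if nothing uses it.
    pure_op: bool,
    /// References to earlier nodes as `(node_index, output_index)`.
    inputs: Vec<(usize, usize)>,
}

/// Removes unused pure nodes and remaps the surviving references.
/// Assumes every input refers to an earlier node in the list.
fn remove_dangling(nodes: Vec<Node>) -> Vec<Node> {
    // Phase 1: walk backward, keeping nodes that are used or have side effects.
    let mut used = vec![false; nodes.len()];
    let mut keep = vec![false; nodes.len()];
    for (i, node) in nodes.iter().enumerate().rev() {
        keep[i] = used[i] || !node.pure_op;
        if keep[i] {
            for &(source, _) in &node.inputs {
                used[source] = true;
            }
        }
    }

    // Phase 2: walk forward, dropping marked nodes and shifting indices.
    let mut new_index = vec![usize::MAX; nodes.len()];
    let mut result = Vec::new();
    for (i, mut node) in nodes.into_iter().enumerate() {
        if !keep[i] {
            continue;
        }
        for input in &mut node.inputs {
            // A kept node only references kept nodes, so the entry is set.
            input.0 = new_index[input.0];
        }
        new_index[i] = result.len();
        result.push(node);
    }
    result
}
```

The backward order in phase 1 matters: a node's users all come later in the list, so by the time a node is examined, its usage is fully known. In this sketch the `Inputs` node is modeled with `pure_op: false` so that it is never removed.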
+ +## Block Output Pruning + +When a block has outputs that the parent never reads, those outputs are removed +from the block node's `output_types`, and the corresponding break inputs are +removed from all breaks targeting that block. This cascading effect may make +additional nodes inside the block dangling, which are then caught by the +recursive application of the same pass inside the block. + +## Block Input Pruning + +For non-loop blocks, if an input is never read by any node inside the block, it +is removed from the `Inputs` node's output types and from the parent's block +node's inputs. Loop inputs are not pruned, as it would require adjusting all +back-edge breaks, which is more complex. + +## Statistics + +The pass returns: +- `removed_nodes`: Total pure nodes removed across all scopes. +- `removed_block_outputs`: Total unused block outputs pruned. diff --git a/src/loader/passes/LOCALS_DATA_FLOW.md b/src/loader/passes/LOCALS_DATA_FLOW.md new file mode 100644 index 0000000..c43911d --- /dev/null +++ b/src/loader/passes/LOCALS_DATA_FLOW.md @@ -0,0 +1,122 @@ +# Locals Data Flow Pass + +**Source:** `locals_data_flow.rs` + +**Input:** `BlockTree` (normalized block tree from parsing) +**Output:** `LiftedBlockTree` (block tree with explicit local variable flow) + +## Purpose + +In WASM, local variables are implicit mutable state that can flow freely in and +out of blocks without being declared in the block's type. This pass makes that +flow explicit: for every block, it computes which locals must be provided as +inputs and which are produced as outputs, so that later passes can treat blocks +as pure functions of their inputs. + +After this pass, local variables are no longer "magic" — they are just +additional block inputs and outputs alongside the stack values declared in the +block's interface type. 
+ +## What It Computes + +For each block, the pass fills in three sets: + +### `input_locals: BTreeSet` + +The set of local indices that the block reads (directly or through nested +blocks/breaks) before any internal write. These locals must be provided to the +block as inputs, in addition to its stack parameters. + +### `output_locals: BTreeSet` (blocks only) + +The set of local indices that the block's break instructions write. Since the +block tree pass guarantees all blocks are exited via breaks, these represent the +locals modified by the block that are visible to the parent scope. + +### `carried_locals: BTreeSet` (loops only) + +The set of local indices that are carried across loop iterations. When a break +targets a loop (i.e., continues to the next iteration), any locals it modifies +must be provided as loop inputs on every iteration. + +## Example + +Consider this WASM function: +```wasm +(func (param $x i32) (result i32) + (local $acc i32) + (local.set $acc (i32.const 0)) + (block $exit (result i32) + (loop $loop + ;; acc = acc + x + (local.set $acc (i32.add (local.get $acc) (local.get $x))) + ;; if acc > 100, break with acc + (br_if $exit (i32.gt_s (local.get $acc) (i32.const 100))) + ;; continue loop + (br $loop) + ) + (unreachable) + ) +) +``` + +After lifting, the block and loop annotations would be: +- **$exit block:** `input_locals = {$acc, $x}`, `output_locals = {$acc}` + (break to $exit carries $acc) +- **$loop:** `input_locals = {$acc, $x}`, `carried_locals = {$acc}` + (break back to $loop carries $acc) + +The key insight is that `$x` is read inside the loop but never written, so it +appears as an input at every level. `$acc` is both read and written, so it +appears as both an input and a carried/output local. + +## Algorithm + +The pass works by iterating over each block until a fixed point is reached: + +1. **Push the block onto a control stack.** The control stack tracks what + locals each nesting level expects from breaks targeting it. + +2. 
**Scan the block's elements:** + - `local.get` → Mark the local as an input of the current block. + - `local.set` / `local.tee` → Mark the local as an output of the current + scope (tracked in `local_outputs` on the control stack entry). + - `br` / `br_if` / `br_if_zero` / `br_table` → Process the break target: + all locals output by scopes up to the target depth, plus all carried + locals of intervening loops, are added as break locals for the target. + - Nested blocks → Recurse; the sub-block's `input_locals` become inputs of + the current block, and its `output_locals` become outputs of the current + scope. + +3. **Pop the block from the control stack** and assemble the final + `input_locals`, `output_locals`, and `carried_locals`. + +4. **Repeat until stable.** If any set grew during the scan, the block is + reprocessed. This handles cases where a break target's locals requirements + propagate to inner blocks that didn't know about them yet. + +## The Control Stack + +Each entry in the control stack tracks: + +- `old_break_locals`: The break locals known from the previous iteration (for + convergence checking). +- `new_break_locals`: The break locals discovered during the current iteration. +- `carried_locals`: For loops, the locals that must be carried across + iterations. +- `local_outputs`: The locals written (via `local.set`/`local.tee`) at this + scope level. + +## Design Notes + +- The fixed-point iteration is necessary because a break can target an outer + block, and the locals it requires may depend on locals computed by other + breaks at different nesting levels. Each iteration propagates this information + one level further. + +- The sets are `BTreeSet` for deterministic ordering, which ensures that + locals are always laid out in the same order in the block interface. + +- After this pass, every local variable reference (`local.get`, `local.set`) + still exists in the instruction stream. 
They will be resolved into DAG node + references by the DAG construction pass. diff --git a/src/loader/passes/PIPELINE.md b/src/loader/passes/PIPELINE.md new file mode 100644 index 0000000..92cc0ff --- /dev/null +++ b/src/loader/passes/PIPELINE.md @@ -0,0 +1,139 @@ +# Common Pipeline Overview + +The common pipeline is the shared frontend that both the WOM and RWM backends +consume. It takes raw WebAssembly bytecode and progressively transforms it into +a blockless DAG — a flat, optimized, register-based intermediate representation +that is ready for backend-specific lowering. + +## Stages + +``` +WASM bytecode + │ + ▼ +Unparsed (raw function body bytes) + │ + ▼ +BlockTree block_tree.rs + │ Parses WASM operators into a tree of blocks and + │ loops. Normalizes if-else into block+br_if, + │ converts return into br, removes dead code, and + │ inlines constant globals. + │ + ▼ +LiftedBlockTree locals_data_flow.rs + │ Makes locals data flow explicit: exposes every + │ local read/write as a block input or output, so + │ later passes can treat locals like any other value. + │ + ▼ +PlainDag dag/mod.rs + │ Builds a directed acyclic graph where nodes are + │ operations and edges are values. The WASM stack + │ and locals are fully resolved into node references. + │ + ▼ +ConstCollapsedDag dag/const_collapse.rs + │ (Optional) Collapses constant values into the + │ instructions that use them, if the target ISA + │ supports immediate operands. + │ + ▼ +ConstDedupDag dag/const_dedup.rs + │ Deduplicates identical constant definitions so + │ each unique constant is defined at most once per + │ scope. + │ + ▼ +DanglingOptDag dag/dangling_removal.rs + │ Removes pure nodes whose outputs are never used. + │ Also trims unused block inputs and outputs. + │ + ▼ +BlocklessDag blockless_dag.rs + Flattens non-loop blocks into a single linear + sequence with labels. Only loops retain their + own sub-DAG. Forward-only jumps target labels; + backward jumps target loop headers. 
+```
+
+## What Happens After
+
+The `BlocklessDag` is the handoff point to the backend pipelines:
+
+- **WOM pipeline** (`src/loader/wom/`): Flattens the DAG into write-once
+  register directives using frame allocation. See `wom/` for details.
+
+- **RWM pipeline** (`src/loader/rwm/`): Performs liveness analysis, register
+  allocation, and flattening into read-write register directives. See
+  `rwm/PIPELINE.md` for details.
+
+## Detailed Documentation
+
+Each pass has its own documentation file:
+
+- **[BLOCK_TREE.md](BLOCK_TREE.md)**: Parsing WASM operators into a
+  normalized block tree with structural simplifications.
+
+- **[LOCALS_DATA_FLOW.md](LOCALS_DATA_FLOW.md)**: Lifting locals into
+  explicit block inputs and outputs.
+
+- **[DAG_CONSTRUCTION.md](DAG_CONSTRUCTION.md)**: Building the value DAG from
+  the lifted block tree, resolving the stack and locals.
+
+- **[CONST_COLLAPSE.md](CONST_COLLAPSE.md)**: ISA-driven collapsing of
+  constants into immediate operands.
+
+- **[CONST_DEDUP.md](CONST_DEDUP.md)**: Deduplicating identical constant
+  nodes across scopes.
+
+- **[DANGLING_REMOVAL.md](DANGLING_REMOVAL.md)**: Dead node elimination and
+  unused output pruning.
+
+- **[BLOCKLESS_DAG.md](BLOCKLESS_DAG.md)**: Flattening block structure into
+  labels and converting to the blockless representation.
+
+## Key Design Decisions
+
+### DAG Over SSA
+
+The IR uses a DAG (directed acyclic graph) rather than a traditional SSA form.
+Each node represents an operation, each edge represents a value. Values are
+identified by their origin: `(node_index, output_index)`. This is a natural fit
+because WASM's structured control flow guarantees that non-loop blocks can be
+inlined into the parent, producing a flat sequence of forward-only jumps.
+
+### Locals Are Lifted Early
+
+WASM locals act as implicit mutable state that crosses block boundaries.
By
+lifting them into explicit block inputs and outputs in the `LiftedBlockTree`
+pass, all subsequent passes can treat the IR as purely value-based, with no
+hidden state. This simplifies the DAG construction and all downstream
+optimizations.
+
+### Loops Are Special
+
+Throughout the pipeline, loops receive special treatment:
+
+- **Block tree:** Loops are wrapped in an outer block if they can fall through,
+  ensuring loops are only exited via breaks.
+- **DAG construction:** Loops create nested sub-DAGs with their own input
+  nodes; breaks to a loop target its inputs (next iteration), while breaks to a
+  block target its outputs.
+- **Blockless DAG:** Only loops retain their own sub-DAG. Non-loop blocks are
+  inlined into the parent frame with labels for jump targets.
+
+This distinction reflects a fundamental property: blocks have forward-only
+control flow (can be inlined), while loops have backward edges (need their own
+frame).
+
+### Optimization Order
+
+The three DAG optimizations run in a specific order for good reason:
+
+1. **Constant collapse** runs first because it changes reference inputs into
+   inline constants, potentially making the original constant nodes unused.
+2. **Constant dedup** runs second because collapse may have severed some
+   references, leaving duplicate constants that can now be merged.
+3. **Dangling removal** runs last as a cleanup pass, garbage-collecting any
+   nodes that the previous passes made unreachable.
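
The ordering argument above can be illustrated with a toy model. This is a minimal sketch, not the pipeline's code: `Node`, `Operand`, `collapse`, `dedup`, `remove_dangling`, `optimize`, and `example` are hypothetical names introduced here. It assumes every node has a single output (so a value is just a node index rather than `(node_index, output_index)`), that nodes only reference earlier nodes, and that the ISA accepts an immediate only as the second operand of `Add`.

```rust
use std::collections::HashMap;

/// An operand is either a reference to an earlier node or an inline immediate.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Operand {
    Ref(usize),
    Imm(i32),
}

/// Toy node set (hypothetical): constants and a two-operand add.
#[derive(Clone, Debug, PartialEq)]
enum Node {
    Const(i32),
    Add(Operand, Operand),
}

/// Step 1: replace a constant reference in the immediate-capable slot with `Imm`.
fn collapse(nodes: &mut [Node]) {
    let consts: Vec<Option<i32>> = nodes
        .iter()
        .map(|n| match n {
            Node::Const(v) => Some(*v),
            _ => None,
        })
        .collect();
    for node in nodes.iter_mut() {
        if let Node::Add(_, b) = node {
            if let Operand::Ref(i) = *b {
                if let Some(v) = consts[i] {
                    *b = Operand::Imm(v);
                }
            }
        }
    }
}

/// Step 2: redirect references to the first definition of each constant value.
fn dedup(nodes: &mut [Node]) {
    let mut first: HashMap<i32, usize> = HashMap::new();
    let mut canon: Vec<usize> = (0..nodes.len()).collect();
    for (i, node) in nodes.iter().enumerate() {
        if let Node::Const(v) = node {
            canon[i] = *first.entry(*v).or_insert(i);
        }
    }
    for node in nodes.iter_mut() {
        if let Node::Add(a, b) = node {
            for op in [a, b] {
                if let Operand::Ref(j) = op {
                    *j = canon[*j];
                }
            }
        }
    }
}

/// Step 3: keep only nodes reachable from `root`, renumbering references.
fn remove_dangling(nodes: &[Node], root: usize) -> Vec<Node> {
    let mut live = vec![false; nodes.len()];
    let mut stack = vec![root];
    while let Some(i) = stack.pop() {
        if std::mem::replace(&mut live[i], true) {
            continue; // already visited
        }
        if let Node::Add(a, b) = &nodes[i] {
            for op in [a, b] {
                if let Operand::Ref(j) = op {
                    stack.push(*j);
                }
            }
        }
    }
    let mut new_index = vec![usize::MAX; nodes.len()];
    let mut out = Vec::new();
    for (i, node) in nodes.iter().enumerate() {
        if !live[i] {
            continue;
        }
        new_index[i] = out.len();
        let mut node = node.clone();
        if let Node::Add(a, b) = &mut node {
            for op in [a, b] {
                if let Operand::Ref(j) = op {
                    *j = new_index[*j]; // earlier live nodes are already renumbered
                }
            }
        }
        out.push(node);
    }
    out
}

/// Runs the three passes in the order described above.
fn optimize(mut nodes: Vec<Node>, root: usize) -> Vec<Node> {
    collapse(&mut nodes);
    dedup(&mut nodes);
    remove_dangling(&nodes, root)
}

/// Two identical constants, one constant used only as a second operand.
fn example() -> Vec<Node> {
    vec![
        Node::Const(7),
        Node::Const(7),
        Node::Const(3),
        Node::Add(Operand::Ref(0), Operand::Ref(2)),
        Node::Add(Operand::Ref(1), Operand::Ref(3)),
    ]
}
```

Calling `optimize(example(), 4)` leaves three nodes: collapse strands `Const(3)`, dedup strands the second `Const(7)`, and the final sweep removes both. If dedup is skipped in this toy model, the sweep still keeps both copies of `Const(7)`, which is why the cleanup pass has to come last.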