Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 9 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,10 @@ stability label.
| `source/cloudevents` | CloudEvents codec with structured and binary content modes. | experimental |
| `source/cdc` | Change-data-capture codec: decode Debezium/OpenCDC change events, drive by key. | experimental |
| `source/statemachine` | Bridge: an inbound message drives a transition, ack tied to the durable commit. | experimental |
| `durable` | Durable-execution runtime: record and replay nondeterminism to survive a crash. | experimental |
| `cluster` | Distribution runtime: remote actors, supervision, and live instance migration. | experimental |
| `transport` | gRPC network transport for cluster: remote deliver/spawn and time-travel. | experimental |
| `wasm` | Run state behaviors as WebAssembly: polyglot guards over a JSON ABI via wazero. | experimental |

source also ships composable reliability middleware as its own opt-in modules
(`source/retry`, `source/dlq`, `source/idempotency`, `source/schema`) and an
Expand All @@ -77,8 +81,11 @@ hierarchical, parallel, and final states, history, guard combinators, delayed
transitions, invoked services, an actor model, snapshots, and JSON
(de)serialization, backed by its `analysis`, `evolution`, and `conformance`
packages. `telemetry`, `sink`, and `source` (with all their adapters, codecs, and
middleware) are released and documented; `broker` is planned. Treat every API as
experimental until it reaches v1.
middleware) are released and documented, as are the host-side runtimes over the
kernel: `durable` (durable execution), `cluster` (distribution and live
migration), `transport` (the gRPC network transport for cluster), and `wasm`
(polyglot behaviors). `broker` is planned. Treat every API as experimental until
it reaches v1.

## Roadmap

Expand Down
18 changes: 18 additions & 0 deletions docs/astro.config.mjs
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,24 @@ export default defineConfig({
// external streams and drive statecharts.
items: [{ autogenerate: { directory: 'source' } }],
},
{
label: 'Telemetry',
// The vendor-neutral observability seam the IO modules depend on,
// plus its optional backend adapters.
items: [{ autogenerate: { directory: 'telemetry' } }],
},
{
label: 'Runtimes',
// Host-side runtimes layered additively over the state kernel:
// durable execution, distribution, the gRPC transport, and polyglot
// WASM behaviors. Each leaves the kernel pure and stdlib-only.
items: [
{ label: 'Durable execution', items: [{ autogenerate: { directory: 'durable' } }] },
{ label: 'Cluster', items: [{ autogenerate: { directory: 'cluster' } }] },
{ label: 'Transport', items: [{ autogenerate: { directory: 'transport' } }] },
{ label: 'WASM behaviors', items: [{ autogenerate: { directory: 'wasm' } }] },
],
},
{
label: 'State machine',
items: [
Expand Down
103 changes: 103 additions & 0 deletions docs/src/content/docs/cluster/overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
---
title: What is crucible/cluster
description: A host-side distribution runtime that spreads a state machine and its actors across nodes with remote delivery, supervision, and live instance migration over a pluggable transport.
sidebar:
order: 1
---

<!-- IMAGE-SLOT: cluster-overview-nodes (a sky-squid smith routing molten streams between several connected crucibles on different anvils, one casting being carried to another anvil mid-pour; ember/copper on steel) 16:9 -->

`crucible/cluster` is the host-side **distribution runtime** for the
[`state`](/crucible/start/introduction/) kernel: remote actors, supervision, and
live instance migration. `state` runs a machine and its child-machine actors in
one process; cluster spreads that across nodes. A parent on one node addresses
and drives an actor running on another, failures are supervised with
restart/backoff strategies, and a running instance can be migrated to a different
node, all over a pluggable `Transport`, with the kernel left pure and stdlib-only.

The runtime is **additive** over the kernel. It consumes seams the kernel already
reserves (the opaque `ActorRef` whose `Node` locator names the owning host, the
injectable `ActorSystem`, the `Snapshot`/`Restore` pair, and the typed
`ActorEscalation`/`EscalationHandler`) and needs no kernel change beyond the
additive `ActorRef.Node` locator. The core is itself stdlib-only; transport
dependencies live behind the `Transport` interface, out of the core.

## Remote actors

A `System` wraps a node's local `state.ActorSystem` with a node identity and an
optional `Transport`. Operations on a ref this node owns are delegated locally;
operations on a ref another node owns are routed over the transport.

```go
tr := cluster.NewInMemoryTransport()

nodeA := cluster.NewSystem("node-a", actorSysA, cluster.WithTransport(tr))
nodeB := cluster.NewSystem("node-b", actorSysB, cluster.WithTransport(tr))
tr.Register("node-a", nodeA)
tr.Register("node-b", nodeB)

// Spawn a worker on node-b from node-a, then drive it through the returned ref.
ref, err := nodeA.Spawn(ctx, "node-b", "worker", "w-1", nil)
_, err = nodeA.Deliver(ctx, ref, "start") // routed to node-b
```

The ref is opaque: the holder never learns where the actor runs. A ref this node
owns has an empty `Node`; a remote ref carries the owning node. A `System` with no
`Transport` serves its local actors transparently and reports `ErrNoTransport` for
a remote ref. The in-tree `InMemoryTransport` connects node-scoped systems in one
process; a real network transport (see [`transport`](/crucible/transport/overview/))
implements the same `Transport` interface.

## Supervision

A `Supervisor` turns the kernel's typed `ActorEscalation` into a per-source policy.
Wire it with `ActorSystem.WithEscalationHandler(sup.Handle)`; each failed actor is
routed to a `Decision` by the src it was spawned from:

| Decision | Behavior |
| --- | --- |
| `Escalate` | forward the failure to a sink up the hierarchy (the default) |
| `Stop` | contain the failure at this level |
| `Restart` | re-spawn through a `Respawner` (the `System`), bounded by a per-src budget; on exhaustion, escalate |
| `Backoff` | defer the re-spawn behind an exponentially growing delay; the host applies due restarts via `Tick` |

```go
sup := cluster.NewSupervisor(
cluster.WithRestart("worker", 3), // up to 3 immediate restarts
cluster.WithBackoff("flaky", 5, 100*time.Millisecond, time.Minute, 2.0),
cluster.WithEscalationSink(parentHandler),
)
sup.SetRespawner(node)
actorSys.WithEscalationHandler(sup.Handle)
// ... drive backoff restarts from a timer loop:
for range ticker.C { sup.Tick(ctx) }
```

Backoff reads time through an injected `state.Clock` (`WithClock`, default the
system clock), so it is deterministic under a `state.FakeClock` in tests.

## Live migration

`Capture` snapshots a running instance, its actor tree, and its machine definition
into a wire-shippable `Checkpoint`; `Restore` rebuilds it on another node, resuming
in place. The move is **gated on schema compatibility**. `Restore` diffs the source
and target machine definitions with
[`state/evolution`](/crucible/analysis/evolution/) and refuses a breaking target
with `ErrIncompatibleMigration`, so an instance never resumes against a definition
that would misread its state.

```go
cp, err := cluster.Capture(inst, actorSys, machine) // on the source node
// ... ship cp (it is all JSON) to the target node ...
inst, sys, err := cluster.Restore(ctx, cp, targetMachine, // on the target node
cluster.WithActorBehaviors(palette))
// err is ErrIncompatibleMigration if targetMachine is a breaking change.
```

## Where it fits

cluster is feature-complete on the in-memory transport, which is enough to build
and test a distributed topology in one process. To carry deliveries, spawns, and
time-travel queries between real nodes over the network, pair it with
[`transport`](/crucible/transport/overview/), the gRPC implementation of the same
`Transport` seam.
123 changes: 123 additions & 0 deletions docs/src/content/docs/durable/overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
---
title: What is crucible/durable
description: A host-side durable-execution runtime that records every nondeterministic value a state instance consumes and replays them to rebuild it byte-identically after a crash.
sidebar:
order: 1
---

<!-- IMAGE-SLOT: durable-overview-replay (a sky-squid smith re-pouring a recorded molten stream back into the same mold so the casting comes out identical after the forge was relit; ember/copper on steel) 16:9 -->

`crucible/durable` is the host-side **durable-execution runtime** for the
[`state`](/crucible/start/introduction/) kernel. `state` is a pure statechart
engine: firing an event is a deterministic function of the instance's recorded
inputs. durable makes a running instance survive a process crash by recording
every nondeterministic value the instance consumes (clock readings,
invoked-service results, actor outcomes) and persisting each step before it is
acknowledged. Recovery **replays** those recorded values back through the same
pure transition function, so a recovered instance reaches exactly the
configuration, context, and history of a run that never crashed, without
re-invoking any external source.

The runtime is **additive** over the kernel. It consumes seams the kernel already
reserves (`Snapshot.Journal`, the `EffectEnvelope.EffectID` correlation slot, and
the injectable `Clock` / `ServiceRunner` / `ActorSystem` drivers) and requires no
change to the kernel, which stays pure and stdlib-only. Heavy dependencies
(database drivers, cloud SDKs) never enter this module: persistent backends
implement the `Store` interface out of tree.

## Guarantees

- **Deterministic replay.** A recovered instance is byte-identical to one that
never crashed, because recovery replays the recorded driving events and
nondeterministic results rather than re-executing their sources.
- **Exactly-once effects.** A domain effect is applied exactly once over the
instance's lifetime (the live run plus any number of recoveries), even though
the replay loop is at-least-once. Each effect carries a deterministic
`EffectID` and is deduplicated through the `Store`'s dispatch set.
- **Durability across restart.** Every `Fire` step is write-ahead appended to the
`Store` before it is acknowledged, so a crash after a successful `Fire` never
loses the step. Periodic checkpoints bound the tail that recovery replays.

## The shape of it

Wire a `Runner` around a machine and a `Store`. `Start` creates a fresh durable
instance; `Fire` drives it; `Recover` rebuilds it from the `Store` after a crash.

```go
runner := durable.NewRunner(machine, durable.NewMemStore())

// Start a fresh instance: persists a baseline checkpoint.
h, err := runner.Start(ctx, "order-42", OrderInput{ /* ... */ })

// Drive it. Each Fire write-ahead appends a Record before acknowledging.
_, err = h.Fire(ctx, "submit")

// ...process crashes, comes back up...

// Recover purely from the Store: load the latest checkpoint, replay the tail.
h, err = durable.Recover(ctx, machine, store, "order-42")
_, err = h.Fire(ctx, "confirm") // continues recording from the live tip
```

For a hot path firing many events in sequence, keep the `Handle` from `Start` or
`Recover` and call `Handle.Fire` directly to avoid a `Store` round-trip per step.
For a stateless handler that fires a single event per request, `Runner.Fire`
loads, replays, fires, and re-records in one call.

## The seams

Each source of nondeterminism is isolated behind an injectable driver, recorded
the first time, and replayed verbatim on recovery:

| Seam | Wire with | Recorded as |
| --- | --- | --- |
| **Clock** (timers) | `WithRunnerClock` | `JournalClockRead`, replayed so timers fire at the same recorded instants, wall-clock-independent; armed deadlines survive checkpoint compaction |
| **Invoked services** | `WithServiceRegistry` + `Handle.RunService` | `JournalServiceResult`; the service runs once, then recovery replays its result through the kernel's settle seam |
| **Child-machine actors** | `WithActorPalette` + `Handle.DeliverToActor` | `JournalActorMessage`; the behavior runs once, then recovery re-fires the recorded parent transition |
| **Domain effects** | `WithEffectHandler` | dispatch set, applied exactly once via deterministic `EffectID` dedup |

Use `WithCheckpointEvery(n)` to tune how often a full snapshot is written: a
shorter interval bounds recovery replay, a longer one cuts checkpoint cost.

## Stores

`Store` is the persistence seam. A durable instance is an ordered log of
`Record`s (one per `Fire` step) layered over periodic full-snapshot checkpoints.
Two stdlib-only reference implementations ship in-tree:

- **`MemStore`** (`NewMemStore`): in-memory, thread-safe, not durable across
restarts. For tests, examples, and single-process development. `WithHistory`
retains the full record history, enabling time-travel below.
- **`FileStore`** (`NewFileStore`): on-disk, a directory of per-instance
subdirectories, each an append-only journal, an atomic checkpoint, an
idempotency ledger, and a dispatched-effect log. Each append flushes to stable
storage; each checkpoint uses write-temp-plus-rename for crash-safe atomicity.
Use it for durability across restarts without a database.

Persistent database backends (PostgreSQL, DynamoDB, and the like) implement
`Store` out of tree, so their drivers never burden this module's dependency or
vulnerability surface.

## Time-travel reader

`StateAt` reconstructs an instance's state as of any recorded step, read-only:
restoring the start baseline and replaying recorded events forward to the target
step, running no service, re-instantiating no actor, reading no wall clock, and
dispatching no effect:

```go
view, err := durable.StateAt(ctx, machine, store, "order-42", 3)
// view.Snapshot(), view.Instance(), view.Step(): detached and safe to read
```

Time-travel needs the full record history through the target step. A `Store` opts
in by implementing `HistoryStore` (the in-tree `MemStore` does so under
`WithHistory`); `StateAt` otherwise falls back to the latest checkpoint plus tail.

## A note on serialized payloads

Events, service done-data, actor done-data, and actor messages are recorded as
their JSON form. A parent reducer that type-asserts a non-JSON Go type from
`AssignCtx.Event` observes the JSON-decoded shape on a replayed `onDone`. A
typed-codec option to carry the concrete Go value across the journal boundary is
reserved for a later, additive change.
Loading
Loading