feat(dispatch+spec): liveness, retry, admission control (ADR-0048/0049/0050/0051) #141
Merged
nerdsane merged 1 commit into nerdsane:main from Apr 17, 2026
…9/0050/0051)
Architectural response to the 2026-04-17 Katagami bulk-regenerate incident
and the Railway content-file 500. Four new platform primitives, each with
its own ADR; every ADR-declared behavior is unit-tested and exercised by
a load test.
P1 — Dispatch retry + error taxonomy (ADR-0048)
* ActorError::is_transient / is_permanent classify every variant.
* DispatchError reshaped with Transient { source, attempts }, Permanent
{ source }, Deferred { retry_after_ms }.
* retry::ask_with_backoff wraps every entity ask site (1 in dispatch,
6 in entity_ops) with a bounded retry + deterministic full-jitter
(seeded from sim_now() so DST stays reproducible).
* HTTP bindings map Transient exhaustion and Deferred to 503 + Retry-After.
* Idempotency-Key is threaded through AgentContext → EntityMsg::Action;
the actor consults the shared IdempotencyCache before executing and
caches the successful response so retry races cannot double-execute.
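The retry behaviour described in P1 can be sketched roughly as follows. This is a minimal synchronous model, not the code from the PR: the `ActorError` variants, the SplitMix64 generator, the 10 ms base and 1000 ms cap, the 1-second Retry-After for transient exhaustion, and the 500 mapping for permanent errors are all assumptions made for illustration. Only the `DispatchError` shape, the `is_transient` / `is_permanent` split, the seeded full jitter, and the 503 mapping come from the description above.

```rust
use std::time::Duration;

/// Hypothetical actor errors; the variant names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum ActorError {
    MailboxFull,
    AskTimeout,
    InvalidAction(String),
}

impl ActorError {
    /// Transient errors may succeed if the same ask is retried.
    pub fn is_transient(&self) -> bool {
        matches!(self, ActorError::MailboxFull | ActorError::AskTimeout)
    }
    pub fn is_permanent(&self) -> bool {
        !self.is_transient()
    }
}

/// Shape taken from the commit message.
#[derive(Debug, PartialEq)]
pub enum DispatchError {
    Transient { source: ActorError, attempts: u32 },
    Permanent { source: ActorError },
    Deferred { retry_after_ms: u64 },
}

/// SplitMix64 step (an assumed PRNG): a fixed seed yields a fixed delay sequence.
fn next_u64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Full jitter: a delay drawn from [0, min(cap, base * 2^attempt)].
pub fn full_jitter_ms(seed: &mut u64, attempt: u32, base_ms: u64, cap_ms: u64) -> u64 {
    let ceiling = base_ms.saturating_mul(1u64 << attempt.min(20)).min(cap_ms);
    next_u64(seed) % ceiling.saturating_add(1)
}

/// Bounded retry: permanent errors return at once, transient ones back off.
/// `sleep` is injected so a simulated clock can stand in for a real one.
pub fn ask_with_backoff<T>(
    max_attempts: u32,
    seed: u64,
    mut ask: impl FnMut(u32) -> Result<T, ActorError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, DispatchError> {
    let mut rng = seed;
    let mut attempt = 0;
    loop {
        match ask(attempt) {
            Ok(value) => return Ok(value),
            Err(e) if e.is_permanent() => return Err(DispatchError::Permanent { source: e }),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(DispatchError::Transient { source: e, attempts: attempt });
                }
                // Assumed tuning: 10 ms base, 1000 ms cap.
                sleep(Duration::from_millis(full_jitter_ms(&mut rng, attempt, 10, 1_000)));
            }
        }
    }
}

/// (status, Retry-After seconds). The 500 and the 1-second value are assumptions.
pub fn http_status(err: &DispatchError) -> (u16, Option<u64>) {
    match err {
        DispatchError::Transient { .. } => (503, Some(1)),
        DispatchError::Deferred { retry_after_ms } => {
            (503, Some((*retry_after_ms).saturating_add(999) / 1000))
        }
        DispatchError::Permanent { .. } => (500, None),
    }
}
```

Passing the seed in explicitly is what keeps the schedule reproducible: two runs with the same seed and the same failure pattern sleep for identical durations, which is the property the commit message attributes to seeding from sim_now().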
P2 — First-class state-entry timeouts (ADR-0049)
* New [[state_timeout]] TOML block with state / after_seconds /
on_timeout / max_occurrences / reset_on / params.
* Compiler auto-wires each timeout's `state` into the target action's
`from` list — authors declare coverage once.
* Runtime StateTimeoutTracker + arm_state_timeouts_if_needed hooked
into run_post_dispatch_effects; sequence-based cancellation stops
stale timers from firing after state exit or reset_on.
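One way to picture the sequence-based cancellation in P2 is the sketch below. It is an illustrative model only: the `TimerTicket` type and the `arm` / `cancel` / `should_fire` method names are hypothetical, and the real StateTimeoutTracker may be structured quite differently. The idea it shows is that every arm or cancel bumps a per-entity counter, so a timer that wakes up holding an older number knows it is stale.

```rust
use std::collections::HashMap;

/// Handle carried by a pending timer (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct TimerTicket {
    pub entity_id: String,
    pub state: String,
    pub seq: u64,
}

/// Tracks the latest sequence number per entity.
#[derive(Default)]
pub struct StateTimeoutTracker {
    current: HashMap<String, u64>,
}

impl StateTimeoutTracker {
    /// Arm on state entry, or re-arm on a reset_on action.
    /// Bumping the sequence invalidates every earlier ticket for this entity.
    pub fn arm(&mut self, entity_id: &str, state: &str) -> TimerTicket {
        let seq = self.current.entry(entity_id.to_string()).or_insert(0);
        *seq += 1;
        TimerTicket {
            entity_id: entity_id.to_string(),
            state: state.to_string(),
            seq: *seq,
        }
    }

    /// State exit without a replacement timer: invalidate outstanding tickets.
    pub fn cancel(&mut self, entity_id: &str) {
        if let Some(seq) = self.current.get_mut(entity_id) {
            *seq += 1;
        }
    }

    /// Called when a timer's sleep elapses: fire only if the ticket is still current.
    pub fn should_fire(&self, ticket: &TimerTicket) -> bool {
        self.current.get(&ticket.entity_id) == Some(&ticket.seq)
    }
}
```

With this design a spawned timer never needs to be aborted; it simply checks `should_fire` after sleeping and exits quietly when its ticket has been superseded.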
P3 — Mandatory liveness coverage (ADR-0050)
* Spec compiler rejects any non-terminal state that lacks a
[[state_timeout]] or an allow_indefinite_states entry.
* LivenessEnforcement::{WarnOnly, Enforce}; env-flagged via
TEMPER_LIVENESS_ENFORCE, default warn-only. Parse APIs split into
`parse_automaton` (env-driven) and `parse_automaton_with_liveness`
(explicit mode) so tests do not rely on process-global env state.
* Violation reporter hook fires temper_spec_liveness_violations_total
on every parse; installed once by ServerState::new.
* New verify_specs binary walks *.ioa.toml directories and enforces
the rule as a CI gate.
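The coverage rule in P3 amounts to a set difference, sketched below under stated assumptions. `SpecSummary`, `liveness_violations`, and `check` are hypothetical names, and representing terminal states as an explicit list is a simplification; only the `LivenessEnforcement` variants and the rule itself come from the description above.

```rust
use std::collections::BTreeSet;

/// Modes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LivenessEnforcement {
    WarnOnly,
    Enforce,
}

/// Hypothetical flattened view of a parsed spec.
pub struct SpecSummary<'a> {
    pub states: &'a [&'a str],
    pub terminal_states: &'a [&'a str],
    pub timeout_states: &'a [&'a str],
    pub allow_indefinite_states: &'a [&'a str],
}

/// Non-terminal states with neither a state timeout nor an explicit exemption.
pub fn liveness_violations(spec: &SpecSummary<'_>) -> Vec<String> {
    let covered: BTreeSet<&str> = spec
        .terminal_states
        .iter()
        .chain(spec.timeout_states)
        .chain(spec.allow_indefinite_states)
        .copied()
        .collect();
    spec.states
        .iter()
        .filter(|s| !covered.contains(*s))
        .map(|s| s.to_string())
        .collect()
}

/// WarnOnly returns the violations as warnings; Enforce turns them into an error.
pub fn check(
    spec: &SpecSummary<'_>,
    mode: LivenessEnforcement,
) -> Result<Vec<String>, Vec<String>> {
    let violations = liveness_violations(spec);
    match mode {
        LivenessEnforcement::Enforce if !violations.is_empty() => Err(violations),
        _ => Ok(violations),
    }
}
```

Taking the mode as a parameter mirrors the reason given for `parse_automaton_with_liveness`: a test can exercise both modes without touching process-global environment state.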
P4 — Per-tenant admission control (ADR-0051)
* AdmissionController with per-(tenant, entity, action) Tokio
semaphores; FIFO invariant covered by a dedicated test that
interleaves 100 acquirers and asserts grant order matches arrival
order.
* [admission] spec block: max_concurrent_creates,
max_concurrent_actions, queue_depth, queue_timeout_seconds.
* Enforced before ask in dispatch_tenant_action_core; caps pulled
inline from the spec registry per call — no separate registration
step.
* /_admin/admission/{tenant}/{entity_type} PATCH endpoint for runtime
overrides without redeploy; integration-tested.
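The admission decision in P4 can be modelled, very roughly, as a concurrency cap in front of a bounded FIFO queue. The sketch below is a single-threaded stand-in and does not use Tokio semaphores as the PR does; `AdmissionGate`, `try_admit`, `release`, and the fixed 1000 ms retry hint are hypothetical, and queue_timeout_seconds is left out. In the real controller one such gate would exist per (tenant, entity, action) key.

```rust
use std::collections::VecDeque;

/// Outcome of an admission attempt (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Admission {
    Granted,
    Queued,
    Deferred { retry_after_ms: u64 },
}

/// One gate: `cap` permits plus at most `queue_depth` waiters.
pub struct AdmissionGate {
    cap: usize,
    queue_depth: usize,
    in_flight: usize,
    /// Caller ids in arrival order.
    waiters: VecDeque<u64>,
}

impl AdmissionGate {
    pub fn new(cap: usize, queue_depth: usize) -> Self {
        Self { cap, queue_depth, in_flight: 0, waiters: VecDeque::new() }
    }

    /// Grant if a permit is free and nobody is waiting, queue if there is room,
    /// otherwise defer so the HTTP layer can answer 503 + Retry-After.
    pub fn try_admit(&mut self, caller: u64) -> Admission {
        if self.in_flight < self.cap && self.waiters.is_empty() {
            self.in_flight += 1;
            Admission::Granted
        } else if self.waiters.len() < self.queue_depth {
            self.waiters.push_back(caller);
            Admission::Queued
        } else {
            // Assumed fixed hint; a real controller would likely derive it.
            Admission::Deferred { retry_after_ms: 1_000 }
        }
    }

    /// Release one permit. If anyone is waiting, the permit passes straight to
    /// the oldest waiter (FIFO) and that caller's id is returned.
    pub fn release(&mut self) -> Option<u64> {
        match self.waiters.pop_front() {
            Some(next) => Some(next),
            None => {
                self.in_flight = self.in_flight.saturating_sub(1);
                None
            }
        }
    }
}
```

Refusing to grant while the queue is non-empty, even when a permit is momentarily free, is what keeps grant order equal to arrival order in this model.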
Observability
* 22 new OTel instruments in runtime_metrics (dispatch outcome /
attempts / latency / errors; state_timeout fired / cancelled / reset;
scheduler pending / overdue; spec liveness violations;
admission granted / queued / deferred / wait / permits / depth /
hold; actor mailbox depth / utilization / full-drops / reply latency;
curation_job duration / outcome). record_server_state_metrics samples
mailbox depth + utilization by entity type.
* ActorRef exposes mailbox_depth / mailbox_utilization / mailbox_capacity.
Tests
* 295 temper-server lib tests + 216 temper-spec + 94 temper-runtime +
48 temper-platform = 653 passing, 0 failures.
* Load tests in state::dispatch::state_timeouts::tests:
- 120 concurrent dispatches, cap=5, queue=10s:
120 granted, 0 deferred, 0 errors, 33k/s, p99=2.66ms
- 300 adversarial burst, cap=2, queue=0s:
56 granted, 244 deferred (503 Retry-After), 0 errors
- 1000 sustained, cap=50: 1000 granted, 0 errors, 20k/s
* Integration tests: state_timeout fires + transitions entity end-to-end;
admin override apply + clear round-trips via the live router.
Deferred (explicitly out of scope, task-tracked):
* Durable event-log-backed scheduler (ADR-0049 phase 2). Current impl is
non-durable tokio-spawn + in-memory StateTimeoutTracker — same
durability profile as today's schedule effects (ADR-0012 line 138).
* Per-actor timer heap (plan specified BinaryHeap; shipped with
spawn-per-timer).
* Auto-generated {state}_entered_at / {state}_timeout_seq entity state
vars (shipped with external tracker — correctness-equivalent).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nerdsane pushed a commit to nerdsane/temperpaw that referenced this pull request on Apr 17, 2026
…ion (ADR-0036)
Consumes the Temper primitives landed in nerdsane/temper#141 (ADR-0048 / 0049 / 0050 / 0051) and closes the 2026-04-17 Katagami bulk-regenerate incident at the OpenPaw layer.
Session spec (the incident-class fix)
* allow_indefinite_states = ["WaitingForApproval", "Completed", "Failed", "Cancelled"]; everything else is timer-covered.
* 10 [[state_timeout]] blocks covering Created / Provisioning / PreparingContext / CallingProvider / ApplyingProviderResponse / Executing / Thinking / Steering / Compacting / Recovering.
* `reset_on = ["Heartbeat"]` / `["ProvisionPending"]` / `["CheckpointToolBatch"]` where progress signals exist so long-running operations don't spuriously fail.
* [admission] max_concurrent_creates = 10, max_concurrent_actions {"Configure" = 5}, queue_depth = 100, queue_timeout_seconds = 30. Tuned from the 11-concurrent-Configure incident pattern.
HeartbeatMonitor retirement
* Deleted heartbeat_scan + heartbeat_scheduler WASM crates (superseded by state_timeout runtime).
* Deleted heartbeat_monitor.ioa.toml and heartbeat.cedar.
* Removed HeartbeatMonitor EntityType + bindings from model.csdl.xml.
* Narrowed session.cedar: dropped heartbeat_scan / heartbeat_scheduler from the http_call module allowlist.
* Updated wasm/build.sh and APP.md.
CurationJob spec
* Queued / Ready / Running timeouts; Running has reset_on = ["RecordProgress"] so in-progress jobs aren't killed while they're reporting progress. 30-minute hard ceiling.
Katagami cleanup
* Deleted submit_next_queued_regeneration from finalize_spawned_session; the dequeue hack existed only because platform admission didn't. With ADR-0051 caps on Session, callers can Submit all 11 regenerate_embodiment jobs at once and Temper queues them FIFO.
Fleet-wide liveness migration
* 57 additional specs (across paw-consilium, paw-research, paw-ingest, paw-harness, paw-heal, paw-compute, paw-managed-agents, paw-fs, paw-wiki, paw-pm, paw-foresight, paw-autoreason, katagami-commons, soul, team, etc.) now have explicit `allow_indefinite_states` blocks with a migration-TODO comment so TEMPER_LIVENESS_ENFORCE=true boots cleanly. Each entry is a deferred task to tune with a proper [[state_timeout]] in a follow-up PR.
* verify_specs against os-apps/: 64 specs, 0 failures.
Datadog observability
* openpaw-overview.json gains 5 widget groups: Dispatch Resilience, State Liveness, Admission Control, Actor Runtime, Katagami; 25+ widgets reference the 22 new OTel instruments from temper#141.
* openpaw-monitors.json gains 10 monitors routed to @slack-openpaw-alerts: retry exhausted, permanent actor failures, mailbox saturation, abnormal state_timeout firing, overdue timers on replay, spec liveness violations (critical: must be 0 post-rollout), admission deferred spike, admission p99 wait regression, mailbox near capacity, unexpected post-admission mailbox drops.
ADR-0036 authored on branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nerdsane added a commit to nerdsane/temperpaw that referenced this pull request on Apr 17, 2026
…ion (ADR-0036) (#86)
Consumes the Temper primitives landed in nerdsane/temper#141 (ADR-0048 / 0049 / 0050 / 0051) and closes the 2026-04-17 Katagami bulk-regenerate incident at the OpenPaw layer.
Session spec (the incident-class fix)
* allow_indefinite_states = ["WaitingForApproval", "Completed", "Failed", "Cancelled"]; everything else is timer-covered.
* 10 [[state_timeout]] blocks covering Created / Provisioning / PreparingContext / CallingProvider / ApplyingProviderResponse / Executing / Thinking / Steering / Compacting / Recovering.
* `reset_on = ["Heartbeat"]` / `["ProvisionPending"]` / `["CheckpointToolBatch"]` where progress signals exist so long-running operations don't spuriously fail.
* [admission] max_concurrent_creates = 10, max_concurrent_actions {"Configure" = 5}, queue_depth = 100, queue_timeout_seconds = 30. Tuned from the 11-concurrent-Configure incident pattern.
HeartbeatMonitor retirement
* Deleted heartbeat_scan + heartbeat_scheduler WASM crates (superseded by state_timeout runtime).
* Deleted heartbeat_monitor.ioa.toml and heartbeat.cedar.
* Removed HeartbeatMonitor EntityType + bindings from model.csdl.xml.
* Narrowed session.cedar: dropped heartbeat_scan / heartbeat_scheduler from the http_call module allowlist.
* Updated wasm/build.sh and APP.md.
CurationJob spec
* Queued / Ready / Running timeouts; Running has reset_on = ["RecordProgress"] so in-progress jobs aren't killed while they're reporting progress. 30-minute hard ceiling.
Katagami cleanup
* Deleted submit_next_queued_regeneration from finalize_spawned_session; the dequeue hack existed only because platform admission didn't. With ADR-0051 caps on Session, callers can Submit all 11 regenerate_embodiment jobs at once and Temper queues them FIFO.
Fleet-wide liveness migration
* 57 additional specs (across paw-consilium, paw-research, paw-ingest, paw-harness, paw-heal, paw-compute, paw-managed-agents, paw-fs, paw-wiki, paw-pm, paw-foresight, paw-autoreason, katagami-commons, soul, team, etc.) now have explicit `allow_indefinite_states` blocks with a migration-TODO comment so TEMPER_LIVENESS_ENFORCE=true boots cleanly. Each entry is a deferred task to tune with a proper [[state_timeout]] in a follow-up PR.
* verify_specs against os-apps/: 64 specs, 0 failures.
Datadog observability
* openpaw-overview.json gains 5 widget groups: Dispatch Resilience, State Liveness, Admission Control, Actor Runtime, Katagami; 25+ widgets reference the 22 new OTel instruments from temper#141.
* openpaw-monitors.json gains 10 monitors routed to @slack-openpaw-alerts: retry exhausted, permanent actor failures, mailbox saturation, abnormal state_timeout firing, overdue timers on replay, spec liveness violations (critical: must be 0 post-rollout), admission deferred spike, admission p99 wait regression, mailbox near capacity, unexpected post-admission mailbox drops.
ADR-0036 authored on branch.
Co-authored-by: rita-aga <rita-aga@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nerdsane pushed a commit to nerdsane/temperpaw that referenced this pull request on Apr 18, 2026
…/state_timeout The OpenPaw merge of feat/liveness-retry-admission (#86) didn't refresh the git-pinned Temper SHA in Cargo.lock, so the image built by docker.yml was still compiling against pre-PR-#141 Temper main (5b16a99f) — meaning the incident-class fix (ADR-0048/0049/0050/0051) was *not* actually in the deployed binary despite the OpenPaw spec-side changes being live. This bumps the lockfile pin to 078777d2 (nerdsane/temper#141 merge commit) so the next GHA docker build pulls the retry wrapper, state_timeout runtime, admission controller, and liveness reporter into the image that Railway deploys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Architectural response to the 2026-04-17 Katagami bulk-regenerate incident (8 of 11 sessions hit ask timeout, 2 stuck in Provisioning forever) and the Railway content-file HTTP 500. Four new platform primitives, each with its own ADR, delivered together so the class of bug is structurally impossible going forward.
* P1, dispatch retry + error taxonomy (ADR-0048): `ActorError::is_transient` / `is_permanent`; `DispatchError` reshaped into `Transient { source, attempts }` / `Permanent { source }` / `Deferred { retry_after_ms }`. New `retry::ask_with_backoff` wraps every entity ask site (1 in dispatch, 6 in entity_ops). HTTP 503 + `Retry-After` on transient exhaustion. Idempotency-Key threaded into `EntityMsg::Action`; actor consults `IdempotencyCache` before executing and caches the successful response so retry races cannot double-execute.
* P2, first-class state-entry timeouts (ADR-0049): `[[state_timeout]]` TOML block (state / after_seconds / on_timeout / max_occurrences / reset_on / params). Compiler auto-wires the target action's `from` list. Runtime `StateTimeoutTracker` + `arm_state_timeouts_if_needed` hooked into `run_post_dispatch_effects`; sequence-based cancellation on state exit + reset_on re-arm.
* P3, mandatory liveness coverage (ADR-0050): spec compiler rejects any non-terminal state that lacks a `[[state_timeout]]` or an `allow_indefinite_states` entry. Env-flagged via `TEMPER_LIVENESS_ENFORCE` (default warn-only for rollout). New `verify_specs` binary for CI. Violation reporter hook emits `temper_spec_liveness_violations_total`.
* P4, per-tenant admission control (ADR-0051): `AdmissionController` with per-(tenant, entity, action) Tokio semaphores. Strict FIFO, test-enforced (100 interleaved acquirers, order matches arrival). `[admission]` spec block declares caps; enforced before ask, pulled inline from the spec registry per call, with no separate registration step. `/_admin/admission/{tenant}/{entity_type}` PATCH endpoint for runtime overrides.
Observability
22 new OTel instruments in `runtime_metrics` (dispatch outcome/attempts/latency/errors; state_timeout fired/cancelled/reset; scheduler pending/overdue; spec liveness violations; admission granted/queued/deferred/wait/permits/depth/hold; actor mailbox depth/utilization/full-drops/reply latency; curation_job outcome). `record_server_state_metrics` samples mailbox depth + utilization per entity type.
Test plan
* `state_timeout_fires_and_transitions_entity` proves the runtime scheduler arms → sleeps → fires `on_timeout` → transitions the entity end-to-end
* `admin_override_applies_and_clears_caps` round-trips the admin endpoint via the live router
* `verify_specs` binary run against OpenPaw's os-apps/: all specs pass enforce mode after migration
* `cargo build --all-targets` clean
* `temper serve` boots clean, `PATCH /_admin/admission/...` returns 200, liveness warnings emit to the log
Deferred (task-tracked, explicitly out of scope)
* Durable event-log-backed scheduler (ADR-0049 phase 2). Current impl is non-durable `tokio::spawn` + in-memory tracker, the same durability profile as today's `schedule` effects (ADR-0012 line 138).
* Per-actor timer heap (plan specified `BinaryHeap`; shipped with spawn-per-timer).
* Auto-generated `{state}_entered_at` / `{state}_timeout_seq` entity state vars (shipped with external tracker; correctness-equivalent).
Companion OpenPaw PR
OpenPaw branch `feat/liveness-retry-admission` migrates Session + CurationJob specs to the new primitives, deletes `heartbeat_scan` / `heartbeat_scheduler` infrastructure (superseded by `state_timeout`), and adds the Datadog dashboard/monitor widgets referencing the new metrics. That PR depends on this one.
🤖 Generated with Claude Code