feat(dispatch+spec): liveness, retry, admission control (ADR-0048/0049/0050/0051)#141

Merged
nerdsane merged 1 commit into nerdsane:main from rita-aga:feat/liveness-retry-admission
Apr 17, 2026

@rita-aga
Collaborator

Summary

Architectural response to the 2026-04-17 Katagami bulk-regenerate incident (8 of 11 sessions hit ask timeout, 2 stuck in Provisioning forever) and the Railway content-file HTTP 500. Four new platform primitives, each with its own ADR, delivered together so the class of bug is structurally impossible going forward.

  • ADR-0048: Dispatch retry + error taxonomy. ActorError::is_transient / is_permanent classify every variant; DispatchError is reshaped into Transient { source, attempts } / Permanent { source } / Deferred { retry_after_ms }. New retry::ask_with_backoff wraps every entity ask site (1 in dispatch, 6 in entity_ops). Transient exhaustion and Deferred map to HTTP 503 + Retry-After. Idempotency-Key is threaded into EntityMsg::Action; the actor consults IdempotencyCache before executing and caches the successful response so retry races cannot double-execute.
  • ADR-0049: First-class state-entry timeouts. New [[state_timeout]] TOML block (state / after_seconds / on_timeout / max_occurrences / reset_on / params). The compiler auto-wires each timeout's state into the target action's from list, so authors declare coverage once. Runtime StateTimeoutTracker + arm_state_timeouts_if_needed are hooked into run_post_dispatch_effects; sequence-based cancellation stops stale timers from firing after state exit or a reset_on re-arm.
  • ADR-0050: Mandatory liveness coverage. The spec compiler rejects any non-terminal state that lacks a [[state_timeout]] or an allow_indefinite_states entry. Enforcement is env-flagged via TEMPER_LIVENESS_ENFORCE (default warn-only for rollout). New verify_specs binary for CI. A violation reporter hook emits temper_spec_liveness_violations_total.
  • ADR-0051: Per-tenant admission control. AdmissionController with per-(tenant, entity, action) Tokio semaphores. Strict FIFO, test-enforced (100 interleaved acquirers; grant order matches arrival order). The [admission] spec block declares caps, which are enforced before ask and pulled inline from the spec registry per call, with no separate registration step. PATCH /_admin/admission/{tenant}/{entity_type} endpoint for runtime overrides.
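
For readers who want a concrete picture of the ADR-0048 error taxonomy, here is a minimal Rust sketch. It is an illustration under stated assumptions, not the code in this PR: the `ActorError` variant names, the 1-second default `Retry-After` on transient exhaustion, and the 500 status for `Permanent` are hypothetical. Only the `DispatchError` shape and the 503 + Retry-After mapping come from the description above.

```rust
/// Hypothetical actor-level error; the variant names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum ActorError {
    MailboxFull,
    AskTimeout,
    InvalidAction(String),
}

impl ActorError {
    /// Transient errors are worth retrying (assumed classification).
    pub fn is_transient(&self) -> bool {
        matches!(self, ActorError::MailboxFull | ActorError::AskTimeout)
    }

    /// Every variant is either transient or permanent.
    pub fn is_permanent(&self) -> bool {
        !self.is_transient()
    }
}

/// Shape taken from the PR description.
#[derive(Debug, Clone, PartialEq)]
pub enum DispatchError {
    Transient { source: ActorError, attempts: u32 },
    Permanent { source: ActorError },
    Deferred { retry_after_ms: u64 },
}

/// Maps a dispatch error to (HTTP status, Retry-After in whole seconds).
pub fn http_response(err: &DispatchError) -> (u16, Option<u64>) {
    match err {
        // Retries exhausted: 503, with an assumed 1-second hint.
        DispatchError::Transient { .. } => (503, Some(1)),
        // Admission deferred: 503, rounding the hint up to whole seconds.
        DispatchError::Deferred { retry_after_ms } => {
            (503, Some(((retry_after_ms + 999) / 1000).max(1)))
        }
        // Assumption: permanent failures surface as 500 with no retry hint.
        DispatchError::Permanent { .. } => (500, None),
    }
}
```

Under these assumptions a `Deferred { retry_after_ms: 1500 }` would yield `503` with `Retry-After: 2`, since the header carries whole seconds.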

Observability

22 new OTel instruments in runtime_metrics (dispatch outcome/attempts/latency/errors; state_timeout fired/cancelled/reset; scheduler pending/overdue; spec liveness violations; admission granted/queued/deferred/wait/permits/depth/hold; actor mailbox depth/utilization/full-drops/reply latency; curation_job outcome). record_server_state_metrics samples mailbox depth + utilization per entity type.

Test plan

  • 295 temper-server lib tests, 216 temper-spec, 94 temper-runtime, 48 temper-platform — 653 passing, 0 failures
  • Integration test: state_timeout_fires_and_transitions_entity proves the runtime scheduler arms → sleeps → fires on_timeout → transitions the entity end-to-end
  • Integration test: admin_override_applies_and_clears_caps round-trips the admin endpoint via the live router
  • Load test: 120 concurrent dispatches, cap=5, queue_timeout=10s → 120 granted, 0 deferred, 0 errors, ~33k dispatches/sec, p99=2.66ms
  • Load test: 300 adversarial burst, cap=2, queue_timeout=0s → 56 granted, 244 deferred (503 Retry-After), 0 errors. This is the incident-replay contract firing in a test: mailbox-full storms cannot happen.
  • Load test: 1000 sustained, cap=50 → 1000 granted, 0 errors, ~20k/s
  • verify_specs binary run against OpenPaw's os-apps/ — all specs pass enforce mode after migration
  • Full-workspace cargo build --all-targets clean
  • temper serve boots clean, PATCH /_admin/admission/... returns 200, liveness warnings emit to the log
  • Enforce-mode refuses to boot a non-compliant tenant — contract validated in production-path binary, not just unit tests

Deferred (task-tracked, explicitly out of scope)

  • Durable event-log-backed scheduler (ADR-0049 phase 2). Current impl is non-durable tokio::spawn + in-memory tracker — same durability profile as today's schedule effects (ADR-0012 line 138).
  • Per-actor timer heap (plan specified BinaryHeap; shipped with spawn-per-timer).
  • Auto-generated {state}_entered_at / {state}_timeout_seq entity state vars (shipped with external tracker — correctness-equivalent).

Companion OpenPaw PR

OpenPaw branch feat/liveness-retry-admission migrates Session + CurationJob specs to the new primitives, deletes heartbeat_scan/heartbeat_scheduler infrastructure (superseded by state_timeout), adds the Datadog dashboard/monitor widgets referencing the new metrics. That PR depends on this one.

🤖 Generated with Claude Code

…9/0050/0051)

Architectural response to the 2026-04-17 Katagami bulk-regenerate incident
and the Railway content-file 500. Four new platform primitives, each with
its own ADR; every ADR-declared behavior is unit-tested and exercised by
a load test.

P1 — Dispatch retry + error taxonomy (ADR-0048)
  * ActorError::is_transient / is_permanent classify every variant.
  * DispatchError reshaped with Transient { source, attempts }, Permanent
    { source }, Deferred { retry_after_ms }.
  * retry::ask_with_backoff wraps every entity ask site (1 in dispatch,
    6 in entity_ops) with a bounded retry + deterministic full-jitter
    (seeded from sim_now() so DST stays reproducible).
  * HTTP bindings map Transient exhaustion and Deferred to 503 + Retry-After.
  * Idempotency-Key is threaded through AgentContext → EntityMsg::Action;
    the actor consults the shared IdempotencyCache before executing and
    caches the successful response so retry races cannot double-execute.
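
The bounded retry with deterministic full-jitter described in P1 could look roughly like the sketch below. Everything here is an assumption made for illustration: the real `retry::ask_with_backoff` is async and its signature is not shown in this PR text, the `seed` parameter stands in for a value derived from `sim_now()`, SplitMix64 is a PRNG chosen for the sketch, and the `sleep` callback replaces a real timer so the example stays synchronous.

```rust
/// Hypothetical classification of a single ask failure.
#[derive(Debug, PartialEq)]
pub enum AskError {
    Transient(String),
    Permanent(String),
}

#[derive(Debug, PartialEq)]
pub enum RetryOutcome<T> {
    Ok { value: T, attempts: u32 },
    Exhausted { attempts: u32 },
    Permanent(String),
}

pub struct Policy {
    pub max_attempts: u32,
    pub base_ms: u64,
    pub cap_ms: u64,
}

/// One SplitMix64 step: a tiny deterministic PRNG (sketch choice).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Full jitter: uniform in [0, min(cap, base * 2^attempt)].
pub fn jittered_delay_ms(policy: &Policy, attempt: u32, rng: &mut u64) -> u64 {
    let exp = policy.base_ms.saturating_mul(1u64 << attempt.min(32));
    let ceiling = exp.min(policy.cap_ms);
    splitmix64(rng) % ceiling.saturating_add(1)
}

/// Retries `ask` on transient errors, up to `max_attempts` tries in total.
pub fn ask_with_backoff<T>(
    policy: &Policy,
    seed: u64,
    mut ask: impl FnMut(u32) -> Result<T, AskError>,
    mut sleep: impl FnMut(u64),
) -> RetryOutcome<T> {
    let mut rng = seed;
    for attempt in 0..policy.max_attempts {
        match ask(attempt) {
            Ok(value) => return RetryOutcome::Ok { value, attempts: attempt + 1 },
            // Permanent errors are never retried.
            Err(AskError::Permanent(why)) => return RetryOutcome::Permanent(why),
            // Back off only if another attempt remains.
            Err(AskError::Transient(_)) if attempt + 1 < policy.max_attempts => {
                sleep(jittered_delay_ms(policy, attempt, &mut rng));
            }
            Err(AskError::Transient(_)) => {}
        }
    }
    RetryOutcome::Exhausted { attempts: policy.max_attempts }
}
```

Because the jitter depends only on the seed and the attempt number, two runs with the same seed produce the same delay sequence, which is the property a deterministic simulation needs.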

P2 — First-class state-entry timeouts (ADR-0049)
  * New [[state_timeout]] TOML block with state / after_seconds /
    on_timeout / max_occurrences / reset_on / params.
  * Compiler auto-wires each timeout's `state` into the target action's
    `from` list — authors declare coverage once.
  * Runtime StateTimeoutTracker + arm_state_timeouts_if_needed hooked
    into run_post_dispatch_effects; sequence-based cancellation stops
    stale timers from firing after state exit or reset_on.
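
Sequence-based cancellation, as described in P2, can be sketched as follows. This is a simplified model and not the shipped `StateTimeoutTracker`: the method names (`arm`, `cancel`, `should_fire`) and the per-entity map are assumptions. The idea it illustrates is that each armed timer holds a sequence token, and a timer that wakes up with a stale token does nothing.

```rust
use std::collections::HashMap;

/// Simplified in-memory tracker (hypothetical API).
#[derive(Default)]
pub struct StateTimeoutTracker {
    /// entity id -> (state the timer covers, current sequence token)
    armed: HashMap<String, (String, u64)>,
    next_seq: u64,
}

impl StateTimeoutTracker {
    /// Arms (or re-arms, e.g. on a reset_on action) the timer for an entity
    /// and returns the token the spawned timer task should keep.
    pub fn arm(&mut self, entity: &str, state: &str) -> u64 {
        self.next_seq += 1;
        self.armed
            .insert(entity.to_string(), (state.to_string(), self.next_seq));
        self.next_seq
    }

    /// Called when the entity leaves the state: outstanding tokens go stale.
    pub fn cancel(&mut self, entity: &str) {
        self.armed.remove(entity);
    }

    /// Called by a timer when it wakes: fire only if the token is current
    /// and the entity is still tracked in the same state.
    pub fn should_fire(&self, entity: &str, state: &str, seq: u64) -> bool {
        matches!(
            self.armed.get(entity),
            Some((s, current)) if s.as_str() == state && *current == seq
        )
    }
}
```

With this shape no timer ever needs to be aborted: re-arming bumps the sequence, so the older timer's check fails when it eventually wakes.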

P3 — Mandatory liveness coverage (ADR-0050)
  * Spec compiler rejects any non-terminal state that lacks a
    [[state_timeout]] or an allow_indefinite_states entry.
  * LivenessEnforcement::{WarnOnly, Enforce}; env-flagged via
    TEMPER_LIVENESS_ENFORCE, default warn-only. Parse APIs split into
    `parse_automaton` (env-driven) and `parse_automaton_with_liveness`
    (explicit mode) so tests do not rely on process-global env state.
  * Violation reporter hook fires temper_spec_liveness_violations_total
    on every parse; installed once by ServerState::new.
  * New verify_specs binary walks *.ioa.toml directories and enforces
    the rule as a CI gate.
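
The liveness rule in P3 reduces to a set check, sketched below. The `Spec` struct and the `uncovered_states` / `check_liveness` function names are hypothetical and exist only for this illustration; the real compiler works on parsed `*.ioa.toml` specs. Only the rule itself and the `LivenessEnforcement::{WarnOnly, Enforce}` split come from the commit message.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LivenessEnforcement {
    WarnOnly,
    Enforce,
}

/// Hypothetical, flattened view of a spec.
pub struct Spec<'a> {
    pub states: &'a [&'a str],
    pub terminal_states: &'a [&'a str],
    /// States named by some [[state_timeout]] block.
    pub timeout_states: &'a [&'a str],
    pub allow_indefinite_states: &'a [&'a str],
}

/// Non-terminal states with neither a timeout nor an explicit allowance.
pub fn uncovered_states(spec: &Spec<'_>) -> Vec<String> {
    let covered: HashSet<&str> = spec
        .timeout_states
        .iter()
        .chain(spec.allow_indefinite_states.iter())
        .copied()
        .collect();
    spec.states
        .iter()
        .copied()
        .filter(|s| !spec.terminal_states.contains(s) && !covered.contains(s))
        .map(str::to_string)
        .collect()
}

/// Ok(warnings) in warn-only mode; Err(violations) when enforcing.
pub fn check_liveness(
    spec: &Spec<'_>,
    mode: LivenessEnforcement,
) -> Result<Vec<String>, Vec<String>> {
    let violations = uncovered_states(spec);
    match mode {
        LivenessEnforcement::Enforce if !violations.is_empty() => Err(violations),
        _ => Ok(violations),
    }
}
```

Passing the mode explicitly, rather than reading the environment inside the check, mirrors the split between `parse_automaton` and `parse_automaton_with_liveness` described above.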

P4 — Per-tenant admission control (ADR-0051)
  * AdmissionController with per-(tenant, entity, action) Tokio
    semaphores; FIFO invariant covered by a dedicated test that
    interleaves 100 acquirers and asserts grant order matches arrival
    order.
  * [admission] spec block: max_concurrent_creates,
    max_concurrent_actions, queue_depth, queue_timeout_seconds.
  * Enforced before ask in dispatch_tenant_action_core; caps pulled
    inline from the spec registry per call — no separate registration
    step.
  * /_admin/admission/{tenant}/{entity_type} PATCH endpoint for runtime
    overrides without redeploy; integration-tested.
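
The admission behavior in P4 (cap, bounded queue, FIFO hand-off, defer when the queue is full) can be modeled without async code. The sketch below is a synchronous approximation, not the Tokio-semaphore implementation in this PR: the `request` / `release` API, the ticket numbers, and the fixed 1000 ms retry hint are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// (tenant, entity type, action)
pub type Key = (String, String, String);

#[derive(Debug, PartialEq)]
pub enum Admission {
    Granted,
    Queued,
    Deferred { retry_after_ms: u64 },
}

pub struct Slot {
    cap: usize,
    in_flight: usize,
    queue_depth: usize,
    waiting: VecDeque<u64>,
}

#[derive(Default)]
pub struct AdmissionController {
    slots: HashMap<Key, Slot>,
}

impl AdmissionController {
    /// Declares caps for a key (stand-in for reading the [admission] block).
    pub fn configure(&mut self, key: Key, cap: usize, queue_depth: usize) {
        let slot = Slot { cap, in_flight: 0, queue_depth, waiting: VecDeque::new() };
        self.slots.insert(key, slot);
    }

    /// Asks for a permit on behalf of `ticket`.
    pub fn request(&mut self, key: &Key, ticket: u64) -> Admission {
        // Assumption: keys without declared caps are admitted immediately.
        let Some(slot) = self.slots.get_mut(key) else {
            return Admission::Granted;
        };
        if slot.in_flight < slot.cap && slot.waiting.is_empty() {
            slot.in_flight += 1;
            Admission::Granted
        } else if slot.waiting.len() < slot.queue_depth {
            slot.waiting.push_back(ticket); // strict arrival order
            Admission::Queued
        } else {
            Admission::Deferred { retry_after_ms: 1_000 }
        }
    }

    /// Releases a permit and returns the ticket that inherits it, if any.
    pub fn release(&mut self, key: &Key) -> Option<u64> {
        let slot = self.slots.get_mut(key)?;
        match slot.waiting.pop_front() {
            // The permit passes straight to the oldest waiter.
            Some(next) => Some(next),
            None => {
                slot.in_flight = slot.in_flight.saturating_sub(1);
                None
            }
        }
    }
}
```

Handing the permit directly to the queue head on release, instead of freeing it and letting callers race, is what keeps grant order equal to arrival order in this model.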

Observability
  * 22 new OTel instruments in runtime_metrics (dispatch outcome /
    attempts / latency / errors; state_timeout fired / cancelled / reset;
    scheduler pending / overdue; spec liveness violations;
    admission granted / queued / deferred / wait / permits / depth /
    hold; actor mailbox depth / utilization / full-drops / reply latency;
    curation_job duration / outcome). record_server_state_metrics samples
    mailbox depth + utilization by entity type.
  * ActorRef exposes mailbox_depth / mailbox_utilization / mailbox_capacity.

Tests
  * 295 temper-server lib tests + 216 temper-spec + 94 temper-runtime +
    48 temper-platform = 653 passing, 0 failures.
  * Load tests in state::dispatch::state_timeouts::tests:
      - 120 concurrent dispatches, cap=5, queue=10s:
          120 granted, 0 deferred, 0 errors, 33k/s, p99=2.66ms
      - 300 adversarial burst, cap=2, queue=0s:
          56 granted, 244 deferred (503 Retry-After), 0 errors
      - 1000 sustained, cap=50: 1000 granted, 0 errors, 20k/s
  * Integration tests: state_timeout fires + transitions entity end-to-end;
    admin override apply + clear round-trips via the live router.

Deferred (explicitly out of scope, task-tracked):
  * Durable event-log-backed scheduler (ADR-0049 phase 2). Current impl is
    non-durable tokio-spawn + in-memory StateTimeoutTracker — same
    durability profile as today's schedule effects (ADR-0012 line 138).
  * Per-actor timer heap (plan specified BinaryHeap; shipped with
    spawn-per-timer).
  * Auto-generated {state}_entered_at / {state}_timeout_seq entity state
    vars (shipped with external tracker — correctness-equivalent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nerdsane nerdsane merged commit 078777d into nerdsane:main Apr 17, 2026
3 of 5 checks passed
nerdsane pushed a commit to nerdsane/temperpaw that referenced this pull request Apr 17, 2026
…ion (ADR-0036)

Consumes the Temper primitives landed in nerdsane/temper#141 (ADR-0048 /
0049 / 0050 / 0051) and closes the 2026-04-17 Katagami bulk-regenerate
incident at the OpenPaw layer.

Session spec (the incident-class fix)
  * allow_indefinite_states = ["WaitingForApproval", "Completed",
    "Failed", "Cancelled"] — everything else is timer-covered.
  * 10 [[state_timeout]] blocks covering Created / Provisioning /
    PreparingContext / CallingProvider / ApplyingProviderResponse /
    Executing / Thinking / Steering / Compacting / Recovering.
  * `reset_on = ["Heartbeat"]` / `["ProvisionPending"]` /
    `["CheckpointToolBatch"]` where progress signals exist so long-running
    operations don't spuriously fail.
  * [admission] max_concurrent_creates = 10, max_concurrent_actions
    {"Configure" = 5}, queue_depth = 100, queue_timeout_seconds = 30.
    Tuned from the 11-concurrent-Configure incident pattern.

HeartbeatMonitor retirement
  * Deleted heartbeat_scan + heartbeat_scheduler WASM crates (superseded
    by state_timeout runtime).
  * Deleted heartbeat_monitor.ioa.toml and heartbeat.cedar.
  * Removed HeartbeatMonitor EntityType + bindings from model.csdl.xml.
  * Narrowed session.cedar — dropped heartbeat_scan / heartbeat_scheduler
    from the http_call module allowlist.
  * Updated wasm/build.sh and APP.md.

CurationJob spec
  * Queued / Ready / Running timeouts; Running has
    reset_on = ["RecordProgress"] so in-progress jobs aren't killed while
    they're reporting progress. 30-minute hard ceiling.

Katagami cleanup
  * Deleted submit_next_queued_regeneration from finalize_spawned_session
    — the dequeue hack existed only because platform admission didn't.
    With ADR-0051 caps on Session, callers can Submit all 11
    regenerate_embodiment jobs at once and Temper queues them FIFO.

Fleet-wide liveness migration
  * 57 additional specs (across paw-consilium, paw-research, paw-ingest,
    paw-harness, paw-heal, paw-compute, paw-managed-agents, paw-fs,
    paw-wiki, paw-pm, paw-foresight, paw-autoreason, katagami-commons,
    soul, team, etc.) now have explicit `allow_indefinite_states` blocks
    with a migration-TODO comment so TEMPER_LIVENESS_ENFORCE=true boots
    cleanly. Each entry is a deferred task to tune with a proper
    [[state_timeout]] in a follow-up PR.
  * verify_specs against os-apps/: 64 specs, 0 failures.

Datadog observability
  * openpaw-overview.json gains 5 widget groups: Dispatch Resilience,
    State Liveness, Admission Control, Actor Runtime, Katagami — 25+
    widgets referencing the 22 new OTel instruments from temper#141.
  * openpaw-monitors.json gains 10 monitors routed to
    @slack-openpaw-alerts: retry exhausted, permanent actor failures,
    mailbox saturation, abnormal state_timeout firing, overdue timers on
    replay, spec liveness violations (critical: must be 0 post-rollout),
    admission deferred spike, admission p99 wait regression, mailbox
    near capacity, unexpected post-admission mailbox drops.

ADR-0036 authored on branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nerdsane added a commit to nerdsane/temperpaw that referenced this pull request Apr 17, 2026
…ion (ADR-0036) (#86)

Co-authored-by: rita-aga <rita-aga@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nerdsane pushed a commit to nerdsane/temperpaw that referenced this pull request Apr 18, 2026
…/state_timeout

The OpenPaw merge of feat/liveness-retry-admission (#86) didn't refresh the
git-pinned Temper SHA in Cargo.lock, so the image built by docker.yml was
still compiling against pre-PR-#141 Temper main (5b16a99f) — meaning the
incident-class fix (ADR-0048/0049/0050/0051) was *not* actually in the
deployed binary despite the OpenPaw spec-side changes being live.

This bumps the lockfile pin to 078777d2 (nerdsane/temper#141 merge commit)
so the next GHA docker build pulls the retry wrapper, state_timeout
runtime, admission controller, and liveness reporter into the image
that Railway deploys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>