
fix: preflight managed dolt publication before startup store work #1763

Open
azanar wants to merge 13 commits into gastownhall:main from azanar:main-fix-startup-dolt-preflight


@azanar (Contributor) commented May 6, 2026

Summary

This adds a managed-Dolt publication preflight before early startup/reconcile paths touch bead-store-backed controller state.

The failure mode: early startup code reached store-dependent controller/runtime paths before the managed Dolt runtime state had been published. During restart this could surface as transient startup errors and missing-port behavior.

What changed

  • Preflight managed Dolt publication during CityRuntime startup.
  • Re-run the same preflight on tick/control-dispatch paths when the managed runtime port is still unpublished.
  • Keep the preflight gated to managed-Dolt / bd-store-contract cities only.
  • Refactor the test seam to use per-runtime injected hooks instead of package-global mutation, so the tests do not interfere with unrelated cmd/gc package tests.
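
The ordering described in the list above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this branch: `City`, `Runtime`, `ensurePublished`, the `publish` callback, and port 3307 are hypothetical names and values chosen for the example, and returning the preflight error from `Start` is an assumption about failure handling.

```go
package main

import (
	"errors"
	"fmt"
)

// City is a hypothetical stand-in for city configuration.
type City struct {
	ManagedDolt bool // true only for managed-Dolt / bd-store-contract cities
}

// Runtime is a hypothetical stand-in for the city runtime.
type Runtime struct {
	city          City
	publishedPort int                 // 0 means the managed runtime port is still unpublished
	publish       func() (int, error) // publishes managed Dolt runtime state and returns the port
}

// ensurePublished is the preflight. It is a no-op for cities that are not
// managed-Dolt and for runtimes whose port has already been published.
func (r *Runtime) ensurePublished() error {
	if !r.city.ManagedDolt || r.publishedPort != 0 {
		return nil
	}
	port, err := r.publish()
	if err != nil {
		return fmt.Errorf("managed dolt preflight: %w", err)
	}
	r.publishedPort = port
	return nil
}

// Start runs the preflight before any store-backed startup work.
func (r *Runtime) Start(storeWork func() error) error {
	if err := r.ensurePublished(); err != nil {
		return err
	}
	return storeWork()
}

// Tick re-runs the same preflight, which only does work while the port is unpublished.
func (r *Runtime) Tick(storeWork func() error) error {
	if err := r.ensurePublished(); err != nil {
		return err
	}
	return storeWork()
}

func main() {
	attempts := 0
	r := &Runtime{
		city: City{ManagedDolt: true},
		publish: func() (int, error) {
			attempts++
			if attempts == 1 {
				return 0, errors.New("dolt not ready") // simulate a slow restart
			}
			return 3307, nil
		},
	}
	store := func() error {
		fmt.Println("store work with port", r.publishedPort)
		return nil
	}
	fmt.Println("start:", r.Start(store)) // preflight fails, store work is skipped
	fmt.Println("tick:", r.Tick(store))   // preflight retried, then store work runs
}
```

In this sketch store work never runs while the port is unpublished, and a city that is not managed-Dolt skips the publish call entirely.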

Why

Some startup paths assumed managed Dolt publication had already happened. When that assumption was false, controller/store work could race ahead of runtime publication.

This change closes that ordering gap.

Tests

Targeted tests passed:

  • go test ./cmd/gc -run 'Test(NewCityRuntimePreflightsManagedDoltPublicationBeforeStartupStoreWork|CityRuntimeEnsureManagedDoltPublishedForTick|EnsureBeadsProvider_execDoesNotReclassifyProviderAfterStart|InitBeadsForDir_execPassesCanonicalDoltDatabase|CityRuntimeTick_LogsWispGCPurgeCountWithNonFatalError)' -count=1

Validation notes

During PR prep I found a deterministic cmd/gc regression in the original test seam for this change. That regression is fixed in this branch by removing package-global test hook mutation and using per-runtime injection instead.
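
For reviewers unfamiliar with the pattern, here is a minimal sketch of per-runtime hook injection as opposed to mutating a package-level variable. The names `hooks`, `newRuntime`, and `withPublish` are hypothetical and do not come from this branch; the point is only that an override set on one runtime cannot leak into another runtime built in the same test binary.

```go
package main

import "fmt"

// hooks is a hypothetical per-runtime test seam. Each runtime owns its own copy,
// so a test that overrides a hook cannot affect runtimes built by other tests.
type hooks struct {
	publishManagedDolt func() error
}

// defaultHooks returns the production behavior (a placeholder no-op in this sketch).
func defaultHooks() hooks {
	return hooks{publishManagedDolt: func() error { return nil }}
}

type runtime struct {
	name  string
	hooks hooks
}

// newRuntime applies optional overrides to a fresh copy of the defaults.
func newRuntime(name string, opts ...func(*hooks)) *runtime {
	h := defaultHooks()
	for _, opt := range opts {
		opt(&h)
	}
	return &runtime{name: name, hooks: h}
}

// withPublish overrides the publication hook for a single runtime only.
func withPublish(fn func() error) func(*hooks) {
	return func(h *hooks) { h.publishManagedDolt = fn }
}

func main() {
	var order []string
	stubbed := newRuntime("stubbed", withPublish(func() error {
		order = append(order, "publish")
		return nil
	}))
	plain := newRuntime("plain") // keeps the defaults; unaffected by the override

	_ = stubbed.hooks.publishManagedDolt()
	_ = plain.hooks.publishManagedDolt()
	fmt.Println(order) // prints [publish]
}
```

With a package-global hook, the equivalent override would stay visible to every test in the package until it was restored, which is the kind of cross-test interference described above.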

I would not claim a fully green repo-wide suite here. In this environment, baseline main and upstream/main are already noisy/red in broader cmd/gc coverage, so the confidence signal for this PR is the targeted regression coverage and the specific startup-ordering fix.



azanar and others added 13 commits April 23, 2026 10:29
* fix: add label fallback to polecat work query

The polecat work query only checked metadata-based routing
(gc.routed_to). Manual dispatch via `bd update --add-label pool:<pool>`
sets a label instead, causing work to never be found.

Now checks labels as a fallback after metadata.

witness: salvage uncommitted work from orphaned polecat (sa-ml2, recovery #14)

* test: update work query expectations for pool label fallback

The maintenance pack qualifies the dog agent as "maintenance.dog" but
all dispatch (deacon, witnesses, formulas) uses the short name "dog".
EffectiveScaleCheck and EffectiveWorkQuery both derive their routing
target from QualifiedName, so the controller saw zero demand and never
spawned dogs.

Add explicit scale_check and work_query overrides using "dog" as the
routing key so the controller and work query both see the correct demand.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Without work_query, gc hook always exits 1 (no work found) even when
pool beads are ready. Polecats get the pool notification but cannot
claim work.

Root cause: gc hook sets BEADS_DIR to the rig's beads store, so pool
lookups hit rig beads (which have no pool beads). Fix uses BEADS_DIR
override to point at city-level beads for pool routing. GC_AGENT is
set to the template name by gc hook, so pool:$GC_AGENT resolves to
the correct pool label (e.g. pool:sazabi/polecat).

Pattern matches the existing dog agent work_query.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The polecat work query only checked metadata-based routing
(gc.routed_to). Manual dispatch via `bd update --add-label pool:<pool>`
sets a label instead, causing work to never be found.

Now checks labels as a fallback after metadata.

witness: salvage uncommitted work from orphaned polecat (sa-ml2, recovery #14)

Add [[named_session]] for polecat (scope=rig, mode=on_demand) so each rig
auto-spawns a polecat when work is routed to it. Mirrors refinery pattern.
Fixes permanent problem of city-scoped polecats not seeing rig-local work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When --on creates a molecule, the work bead was routed but the molecule
itself never received gc.routed_to metadata. This caused polecat's work
query to find zero molecules, leaving step beads unclaimed.

Route the molecule root after routing the work bead, ensuring both have
gc.routed_to=<target> so workers can discover and claim the work.

Fixes issue where gc sling <target> <bead> --on <formula> created a
molecule that was invisible to workers.

@azanar requested a review from julianknutsen as a code owner May 6, 2026 21:26
@github-actions (bot) added the status/needs-triage (Inbox: we haven't looked at it yet) label May 6, 2026
@randy-release-manager (bot) added the kind/bug (Broken behavior) and priority/p1 (High: core workflow broken) labels and removed the status/needs-triage label May 6, 2026
