fix(nodes): retry daemon status during boot race to kill 10s flicker by Nic-dorman · Pull Request #100 · WithAutonomi/ant-ui

Nic-dorman · 2026-05-12T11:05:12Z

Summary

The "Cannot connect to node daemon" message flashes for ~10 seconds on every cold launch. Root cause: a boot race between the daemon's port-file write and its HTTP-server bind.

Cold-launch timing today

| t (s) | Step |
|---|---|
| 0 | app boot, `nodesStore.init()` runs |
| ~0.5 | `ensure_daemon_running` spawns daemon, waits for port file |
| ~1 | port file written → returns URL, but the daemon's HTTP server hasn't bound the port yet |
| ~1 | first `fetchDaemonStatus` → connection refused → throws |
| ~1 | outer catch: `daemonConnected=false`, `initializing=false`, `scheduleReconnect()` (10 s timer) |
| 1–11 | UI: "Cannot connect to node daemon" |
| ~11 | reconnect timer fires, retry succeeds, UI flips to nodes |
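
The timeline above can be read as the following control flow. This is a rough sketch reconstructed from the table, not the store's actual code: the `NodesState` and `Deps` shapes, the camelCase names, and the `reconnectScheduled` flag (standing in for `scheduleReconnect()`) are all assumptions made for illustration.

```typescript
// Hypothetical store shape; field names follow the timeline above.
interface NodesState {
  initializing: boolean;
  daemonConnected: boolean;
  reconnectScheduled: boolean; // stands in for scheduleReconnect() (assumption)
}

// Hypothetical dependencies, injected so the sketch is self-contained.
interface Deps {
  ensureDaemonRunning: () => Promise<string>; // resolves once the port file exists
  fetchDaemonStatus: (url: string) => Promise<unknown>; // rejects while the port is unbound
}

// Pre-fix shape of init(): a single status fetch, so one refused connection
// drops straight into the disconnected path and waits on the 10 s timer.
async function initWithoutRetry(state: NodesState, deps: Deps): Promise<void> {
  state.initializing = true;
  try {
    const url = await deps.ensureDaemonRunning();
    await deps.fetchDaemonStatus(url);
    state.daemonConnected = true;
  } catch {
    state.daemonConnected = false;
    state.reconnectScheduled = true;
  } finally {
    state.initializing = false;
  }
}
```

If `fetchDaemonStatus` rejects once, this sketch ends with `daemonConnected=false` and `initializing=false`, which is the state that shows "Cannot connect to node daemon" until the reconnect timer fires.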

Fix

New `fetchDaemonStatusWithRetry({ attempts, delayMs })` helper retries up to ~5 s (20 × 250 ms) inside `init()`. `initializing` stays true through the retries, so the user sees "Starting node daemon..." until the daemon actually serves or the retry budget is exhausted.

The disconnected path and the existing `scheduleReconnect` 10 s loop are untouched — those still kick in for genuine outages.
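
A minimal sketch of what such a retry helper could look like follows. Only the helper's name and its `{ attempts, delayMs }` options come from this PR; the `DaemonStatus` shape, the `fetchStatus` callback parameter, and the injectable `wait` function are assumptions added so the example is self-contained and easy to test.

```typescript
// Hypothetical status shape; the real daemon response is not shown in this PR.
interface DaemonStatus {
  running: boolean;
}

interface RetryOptions {
  attempts: number; // total tries, including the first
  delayMs: number; // pause between consecutive tries
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `fetchStatus` until it resolves or the attempt budget is spent.
// The last error is rethrown so the caller's existing catch path
// (disconnected state plus reconnect timer) still handles genuine outages.
async function fetchDaemonStatusWithRetry(
  fetchStatus: () => Promise<DaemonStatus>, // assumption: passed in rather than imported
  { attempts, delayMs }: RetryOptions,
  wait: (ms: number) => Promise<void> = sleep, // assumption: injectable for tests
): Promise<DaemonStatus> {
  let lastError: unknown = new Error("attempts must be at least 1");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fetchStatus();
    } catch (err) {
      lastError = err;
      // Only sleep if another attempt will follow.
      if (attempt < attempts) {
        await wait(delayMs);
      }
    }
  }
  throw lastError;
}
```

With `{ attempts: 20, delayMs: 250 }`, this sketch sleeps at most 19 times, about 4.75 s plus the time spent in the fetches themselves, which is in line with the "~5 s" budget described above.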

Test plan

- Unit: new test mocks `status` to fail once then succeed, asserts no transient `daemonConnected=false` escapes `init()` and store ends up connected with 2 status calls.
- Manual cold launch: kill any `ant.exe` / `ant-gui.exe`, delete `%APPDATA%\ant\daemon.{pid,port}`, run `npm run tauri:dev`. Expect "Starting node daemon..." for ~1 s, then nodes list. No "Cannot connect" flash.
- Genuine outage: rename the bundled `ant.exe` to break `ensure_daemon_running`. Launch. Expect "Cannot connect to node daemon" + retry-every-10s behavior unchanged.
- Restart-daemon flow: click "Restart Daemon" in settings. Expect "Restarting node daemon..." (existing); not affected by this PR.
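
The unit case in the plan above could be approximated as follows. This is a framework-free sketch, not the PR's actual test: `MiniNodesStore`, `failOnceThenSucceed`, and `runBootRaceCheck` are hypothetical stand-ins, and an injected zero-delay `wait` replaces the fake timers that the real test uses.

```typescript
type StatusFn = () => Promise<{ running: boolean }>;
type WaitFn = (ms: number) => Promise<void>;

// Hypothetical minimal store; the real nodesStore is not shown in this PR.
class MiniNodesStore {
  initializing = false;
  daemonConnected: boolean | null = null; // null = not yet determined
  connectedHistory: boolean[] = []; // every value ever written
  private status: StatusFn;
  private wait: WaitFn;

  constructor(status: StatusFn, wait: WaitFn) {
    this.status = status;
    this.wait = wait;
  }

  private setConnected(value: boolean): void {
    this.daemonConnected = value;
    this.connectedHistory.push(value);
  }

  // Retries the status call inside init(), keeping `initializing` true throughout.
  async init(attempts = 20, delayMs = 250): Promise<void> {
    this.initializing = true;
    try {
      for (let attempt = 1; ; attempt++) {
        try {
          await this.status();
          break;
        } catch (err) {
          if (attempt >= attempts) throw err;
          await this.wait(delayMs);
        }
      }
      this.setConnected(true);
    } catch {
      this.setConnected(false);
    } finally {
      this.initializing = false;
    }
  }
}

// Hand-rolled mock: rejects on the first call, resolves afterwards.
function failOnceThenSucceed(): { fn: StatusFn; calls: () => number } {
  let count = 0;
  return {
    fn: async () => {
      count += 1;
      if (count === 1) throw new Error("connection refused");
      return { running: true };
    },
    calls: () => count,
  };
}

async function runBootRaceCheck() {
  const mock = failOnceThenSucceed();
  const store = new MiniNodesStore(mock.fn, async () => {}); // zero-delay wait
  await store.init();
  return {
    history: store.connectedHistory,
    calls: mock.calls(),
    connected: store.daemonConnected,
    initializing: store.initializing,
  };
}
```

Recording every write to `daemonConnected` in `connectedHistory` is one way to check that no transient `false` escaped `init()`: a passing run leaves exactly one entry, `true`, alongside two status calls.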

🤖 Generated with Claude Code

Cold-launch flow before this fix:

  t=0      app boot, nodesStore.init() runs
  t=~0.5s  ensure_daemon_running spawns daemon, polls for port file
  t=~1s    port file appears, ensure_daemon_running returns URL (daemon's HTTP server may not have bound the port yet)
  t=~1s    first fetchDaemonStatus call → connection refused → throws
  t=~1s    outer catch: daemonConnected=false, initializing=false, scheduleReconnect (10s timer)
  t=1-11s  UI shows "Cannot connect to node daemon" for ~10s
  t=~11s   reconnect timer fires, retry succeeds, UI flips to nodes

The port file is written before the HTTP server binds, so a single status fetch immediately after `ensure_daemon_running` reliably races the bind on cold boot.

New behavior: `fetchDaemonStatusWithRetry` retries up to ~5s (20 × 250ms) inside init() with `initializing=true` throughout. The user sees "Starting node daemon..." until the daemon actually serves or the budget exhausts. The disconnected path and scheduleReconnect loop are unchanged for genuine outages.

Test added: mocks a first-attempt failure + second-attempt success under fake timers. Asserts no transient `daemonConnected=false` escapes init() and the store ends up connected with two status calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Nic-dorman wants to merge 1 commit into main from fix/daemon-connect-boot-flicker
