Skip to content

fix(nodes): retry daemon status during boot race to kill 10s flicker#100

Open
Nic-dorman wants to merge 1 commit into
mainfrom
fix/daemon-connect-boot-flicker
Open

fix(nodes): retry daemon status during boot race to kill 10s flicker#100
Nic-dorman wants to merge 1 commit into
mainfrom
fix/daemon-connect-boot-flicker

Conversation

@Nic-dorman
Copy link
Copy Markdown
Contributor

Summary

The "Cannot connect to node daemon" message flashes for ~10 seconds on every cold launch. Root cause: a boot race between the daemon's port-file write and its HTTP-server bind.

Cold-launch timing today

t (s) Step
0 app boot, `nodesStore.init()` runs
~0.5 `ensure_daemon_running` spawns daemon, waits for port file
~1 port file written → returns URL, but the daemon's HTTP server hasn't bound the port yet
~1 first `fetchDaemonStatus` → connection refused → throws
~1 outer catch: `daemonConnected=false`, `initializing=false`, `scheduleReconnect()` (10 s timer)
1–11 UI: "Cannot connect to node daemon"
~11 reconnect timer fires, retry succeeds, UI flips to nodes

Fix

New `fetchDaemonStatusWithRetry({ attempts, delayMs })` helper retries up to ~5 s (20 × 250 ms) inside `init()`. `initializing` stays true through the retries, so the user sees "Starting node daemon..." until the daemon actually serves or the budget exhausts.

The disconnected path and the existing `scheduleReconnect` 10 s loop are untouched — those still kick in for genuine outages.

Test plan

  • Unit: new test mocks `status` to fail once then succeed, asserts no transient `daemonConnected=false` escapes init and store ends up connected with 2 status calls
  • Manual cold launch: kill any `ant.exe` / `ant-gui.exe`, delete `%APPDATA%\ant\daemon.{pid,port}`, run `npm run tauri:dev`. Expect "Starting node daemon..." for ~1 s, then nodes list. No "Cannot connect" flash.
  • Genuine outage: rename the bundled `ant.exe` to break `ensure_daemon_running`. Launch. Expect "Cannot connect to node daemon" + retry-every-10s behavior unchanged.
  • Restart-daemon flow: click "Restart Daemon" in settings. Expect "Restarting node daemon..." (existing) — not affected by this PR.

🤖 Generated with Claude Code

Cold-launch flow before this fix:

  t=0     app boot, nodesStore.init() runs
  t=~0.5s ensure_daemon_running spawns daemon, polls for port file
  t=~1s   port file appears, ensure_daemon_running returns URL
          (daemon's HTTP server may not have bound the port yet)
  t=~1s   first fetchDaemonStatus call → connection refused → throws
  t=~1s   outer catch: daemonConnected=false, initializing=false,
          scheduleReconnect (10s timer)
  t=1-11s UI shows "Cannot connect to node daemon" for ~10s
  t=~11s  reconnect timer fires, retry succeeds, UI flips to nodes

The port file is written before the HTTP server binds, so a single
status fetch immediately after `ensure_daemon_running` reliably races
the bind on cold boot.

New behavior: `fetchDaemonStatusWithRetry` retries up to ~5s (20 ×
250ms) inside init() with `initializing=true` throughout. The user
sees "Starting node daemon..." until the daemon actually serves or
the budget exhausts. The disconnected path and scheduleReconnect
loop are unchanged for genuine outages.

Test added: mocks a first-attempt failure + second-attempt success
under fake timers. Asserts no transient `daemonConnected=false`
escapes init() and the store ends up connected with two status calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant