chromium-launcher: instrument restart phases to localise slow boots #230
hiroTamada wants to merge 1 commit into
Conversation
Production telemetry (downstream consumer) shows that some chromium restarts take ~6 s and exceed callers' CDP-readiness budget, while others complete in ~1 s on the same fleet. Restarts are clearly bimodal, but we don't yet know which phase eats the time on slow ones. This change adds two structured events that, joined by VM/host identity downstream, let us attribute each restart to a specific phase and VM state:

* `chromium_launcher_phase_complete` (slog JSON to stderr, emitted right before `syscall.Exec`): pkill / pgrep wait / waitForPort timings, ESTABLISHED / TIME_WAIT / LISTEN counts on the devtools port from `/proc/net/tcp`, plus VM uptime, 1m load, and memory from `/proc`.
* `chromium_restart_complete` (api log line from `restartChromiumAndWait`): decomposes wall clock into `supervisorctl_ms` and `post_supervisorctl_ms` so we can tell whether slow restarts are spent in supervisord SIGKILL / spawn vs chromium-internal init.

`killExistingChromium` and `waitForPort` are refactored to return small phase-timing structs; behavior is unchanged. `/proc` parsers are fail-soft and have unit tests with real-format fixtures.

Made-with: Cursor
Firetiger deploy monitoring skipped. This PR didn't match the auto-monitor filter configured on your GitHub connection.
Reason: This PR modifies chromium-launcher instrumentation and server diagnostics, but does not change kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the filter. To monitor this PR anyway, reply with …
rgarcia left a comment:
lgtm as long as you've given it some e2e test reps
```go
		return fmt.Errorf("supervisorctl restart failed: %w", r.err)
	}
	supervisorctlMs = r.elapsedMs
	supervisorctlDoneAt = time.Now()
```
supervisorctlDoneAt is set when this goroutine receives from resultCh, not when the launcher goroutine finished supervisorctl. Under scheduling delay — i.e. the slow-restart case this PR exists to debug — post_supervisorctl_ms will be biased low and won't sum to total_ms. Record time.Now() inside the goroutine and pass it through restartResult.
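For concreteness, a minimal sketch of the suggested fix, assuming `restartResult` is ours to extend; the `doneAt` field and the `restartAsync` helper here are hypothetical stand-ins, not the actual diff:

```go
package main

import "time"

type restartResult struct {
	elapsedMs int64
	doneAt    time.Time // captured inside the goroutine, not at receive time
	err       error
}

// restartAsync runs the (stand-in) restart call and timestamps completion
// inside the goroutine, so scheduling delay between goroutines can't skew
// post_supervisorctl_ms.
func restartAsync(run func() error) <-chan restartResult {
	resultCh := make(chan restartResult, 1)
	go func() {
		start := time.Now()
		err := run() // e.g. the real supervisorctl restart call
		resultCh <- restartResult{
			elapsedMs: time.Since(start).Milliseconds(),
			doneAt:    time.Now(), // the moment supervisorctl actually finished
			err:       err,
		}
	}()
	return resultCh
}

func main() {
	r := <-restartAsync(func() error { return nil })
	supervisorctlMs := r.elapsedMs
	supervisorctlDoneAt := r.doneAt // instead of time.Now() at receive
	_, _ = supervisorctlMs, supervisorctlDoneAt
}
```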
```go
	// Drop resultCh from select so we keep waiting for updates without re-firing.
	resultCh = nil
case url := <-updates:
	postSupervisorMs := int64(-1)
```
When updates wins the race vs. resultCh, both supervisorctl_ms and post_supervisorctl_ms log as -1. Downstream aggregates (medians, p99s) need to filter >= 0 or they'll be poisoned by sentinels. Worth a comment here and a note wherever the query lives.
```go
case <-timeout.C:
	log.Info("devtools not ready in time", "operation", operation, "elapsed", time.Since(start).String())
	return fmt.Errorf("devtools not ready in time")
for {
```
The new for-select + channel-nilling pattern is the highest-behavior-risk diff in the PR and has no test. The /proc parsers got fixtures; the actual control flow did not. Could you add a test covering: (a) supervisorctl succeeds then updates arrives, (b) updates arrives before supervisorctl returns, (c) supervisorctl errors? The PR description says "no behavior changes" but the loop shape did change — would be good to lock in.
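For reference, a self-contained reconstruction of the loop shape under discussion; `resultCh`, `updates`, and the messages match the hunks above, but `waitForRestart` and its signature are assumptions. The demo in `main` exercises case (a); swapping the send order or sending an error covers (b) and (c):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func waitForRestart(resultCh <-chan error, updates <-chan string, deadline time.Duration) (string, error) {
	timeout := time.NewTimer(deadline)
	defer timeout.Stop()
	for {
		select {
		case err := <-resultCh:
			if err != nil {
				return "", fmt.Errorf("supervisorctl restart failed: %w", err) // case (c)
			}
			// Drop resultCh from the select so we keep waiting for updates
			// without re-firing; receives on a nil channel block forever.
			resultCh = nil
		case url := <-updates:
			// Case (b): updates can win the race before supervisorctl returns.
			return url, nil
		case <-timeout.C:
			return "", errors.New("devtools not ready in time")
		}
	}
}

func main() {
	resultCh := make(chan error, 1)
	updates := make(chan string, 1)
	resultCh <- nil // case (a): supervisorctl succeeds first...
	go func() {
		time.Sleep(10 * time.Millisecond)
		updates <- "ws://127.0.0.1:9222/devtools" // ...then the new DevTools URL arrives
	}()
	url, err := waitForRestart(resultCh, updates, time.Second)
	fmt.Println(url, err)
}
```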
Why
When chromium is restarted (e.g. after a flag/policy change or a profile load), callers need DevTools to come back up within a tight budget. In our downstream production deployment we see that some restarts take ~6 s and exceed that budget — causing user-visible failures — while others complete in ~1 s on the very same fleet. Restarts are clearly bimodal, but right now we cannot tell which phase eats the time on the slow ones.
The plausible suspects are:

1. `supervisorctl restart chromium` itself (SIGKILL + spawn) being slow on a degraded host.
2. chromium-launcher's `killExistingChromium` (`pkill -9` + `pgrep` wait) or `waitForPort`. We disable `SO_REUSEADDR`, so any lingering FDs / TIME_WAIT entries on the devtools port force us to wait for kernel cleanup, up to the 5 s `waitForPort` deadline.
3. Chromium's own init after `syscall.Exec` (cache warmup, profile work) running slowly post-suspend / post-standby-restore.

This PR adds the instrumentation we need to tell those apart.
What
Two structured events. Both are correlatable downstream by host / VM identity, so they can be joined with each other and with caller-side CDP-readiness timings to get an end-to-end phase breakdown.
1. `chromium_launcher_phase_complete` (in chromium-launcher)

Emitted as slog JSON to stderr right before `syscall.Exec`; a hedged sketch of the emission follows this list. Captures:

* Phase timings: `pkill_elapsed_ms`, `pgrep_wait_elapsed_ms`, `pgrep_attempts`, `pgrep_timed_out`, `waitport_elapsed_ms`, `waitport_attempts`, `waitport_timed_out`, `waitport_last_err`, `launcher_total_ms`.
* Socket states on the devtools port from `/proc/net/tcp`: `ss_estab_devtools`, `ss_timewait_devtools`, `ss_listen_devtools`; directly tests the lingering-FD / TIME_WAIT hypothesis.
* VM state: `vm_uptime_ms`, `load_avg_1m`, `mem_available_kb`, `mem_cached_kb`, `mem_dirty_kb`; directly tests the post-standby-restore / host-pressure hypothesis.
* `instance_name` (read from `INSTANCE_NAME` env var if set) for downstream attribution.

`killExistingChromium` and `waitForPort` are refactored to return small phase-timing structs (`killStats`, `waitportStats`); behavior is unchanged.
2. `chromium_restart_complete` (in `server/cmd/api/api/chromium.go`)

Emitted from `restartChromiumAndWait` after a successful restart. Decomposes the wall clock into `supervisorctl_ms` and `post_supervisorctl_ms`.
This separates "supervisord was slow to SIGKILL+respawn" from "chromium itself was slow to initialise."
Combined with the launcher event we get: `supervisorctl_ms` → `launcher_total_ms` (with sub-phases) → chromium internal init → DevTools URL written.
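To make the launcher event concrete, here is a hedged sketch of what the single pre-exec emission could look like with `log/slog`; the event name and field names come from the list above, while the values and wiring are illustrative assumptions, not the actual diff:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON to stderr, matching the "slog JSON to stderr" description above.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))

	// Illustrative values only: a slow restart here would show
	// waitport_elapsed_ms near its 5 s deadline with nonzero
	// ss_timewait_devtools, pointing at the TIME_WAIT hypothesis.
	logger.Info("chromium_launcher_phase_complete",
		"pkill_elapsed_ms", 12,
		"pgrep_wait_elapsed_ms", 340,
		"pgrep_attempts", 4,
		"pgrep_timed_out", false,
		"waitport_elapsed_ms", 4980,
		"waitport_attempts", 50,
		"waitport_timed_out", true,
		"waitport_last_err", "connect: connection refused",
		"launcher_total_ms", 5350,
		"ss_estab_devtools", 0,
		"ss_timewait_devtools", 3,
		"ss_listen_devtools", 0,
		"vm_uptime_ms", 123456,
		"load_avg_1m", 2.4,
		"mem_available_kb", 512000,
		"instance_name", os.Getenv("INSTANCE_NAME"),
	)
}
```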
`/proc` parsers
New helpers `readVMUptimeMs`, `readLoadAvg1m`, `parseMemInfo`, `countSocketStates` are all fail-soft (return zero on missing files / parse errors) and have unit tests in `main_test.go` with realistic `/proc/net/tcp` and `/proc/meminfo` fixtures (using testify `assert` / `require`).
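As an illustration of the fail-soft contract, a minimal sketch of a `/proc/net/tcp` state counter in the spirit of `countSocketStates`; the name and fail-soft behavior come from this PR, but the body and signature are assumptions, not the actual diff:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// TCP state codes as they appear in /proc/net/tcp (include/net/tcp_states.h).
const (
	tcpEstablished = 0x01
	tcpTimeWait    = 0x06
	tcpListen      = 0x0A
)

// countSocketStates counts ESTABLISHED / TIME_WAIT / LISTEN rows whose local
// port matches devtoolsPort. Fail-soft: any open or parse error yields zero
// counts rather than an error.
func countSocketStates(path string, devtoolsPort int) (estab, timeWait, listen int) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, 0 // e.g. non-Linux or missing file: not fatal
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 {
			continue // skip malformed rows
		}
		// fields[1] is "hexIP:hexPort"; fields[3] is the hex state code.
		hostPort := strings.Split(fields[1], ":")
		if len(hostPort) != 2 {
			continue
		}
		port, err := strconv.ParseInt(hostPort[1], 16, 32)
		if err != nil || int(port) != devtoolsPort {
			continue
		}
		state, err := strconv.ParseInt(fields[3], 16, 32)
		if err != nil {
			continue
		}
		switch state {
		case tcpEstablished:
			estab++
		case tcpTimeWait:
			timeWait++
		case tcpListen:
			listen++
		}
	}
	return estab, timeWait, listen
}

func main() {
	e, tw, l := countSocketStates("/proc/net/tcp", 9222)
	fmt.Println("estab:", e, "time_wait:", tw, "listen:", l)
}
```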
Test plan
Notes
cc @
Made with Cursor
Note
Medium Risk
Touches the Chromium restart/wait control flow and launcher startup path; while intended to be observability-only, mistakes could change restart timing or logging behavior and affect readiness detection.
Overview
Adds structured phase instrumentation for Chromium restarts so slow boots can be attributed to specific steps.
* `restartChromiumAndWait` now measures `supervisorctl restart` duration and the delay until a new DevTools URL appears, emitting `chromium_restart_complete` (and `chromium_restart_timeout` with partial timing on timeout).
* chromium-launcher now records timing for `pkill`/`pgrep` and `waitForPort`, plus fail-soft VM/socket telemetry from `/proc` (uptime, load, meminfo, `/proc/net/tcp` state counts), emitting a single JSON `chromium_launcher_phase_complete` event immediately before `exec`ing Chromium; new unit tests cover the `/proc` parsers.

Reviewed by Cursor Bugbot for commit 67244b5.