chromium-launcher: instrument restart phases to localise slow boots #230
hiroTamada wants to merge 1 commit into
Conversation
Production telemetry (downstream consumer) shows that some chromium restarts take ~6 s and exceed callers' CDP-readiness budget, while others complete in ~1 s on the same fleet. Restarts are clearly bimodal, but we don't yet know which phase eats the time on slow ones. This change adds two structured events that, joined by VM/host identity downstream, let us attribute each restart to a specific phase and VM state:

* `chromium_launcher_phase_complete` (slog JSON to stderr, emitted right before `syscall.Exec`): pkill / pgrep wait / waitForPort timings, ESTABLISHED / TIME_WAIT / LISTEN counts on the devtools port from `/proc/net/tcp`, plus VM uptime, 1m load, and memory from `/proc`.
* `chromium_restart_complete` (api log line from `restartChromiumAndWait`): decomposes wall clock into `supervisorctl_ms` and `post_supervisorctl_ms` so we can tell whether slow restarts are spent in supervisord SIGKILL / spawn vs chromium-internal init.

`killExistingChromium` and `waitForPort` are refactored to return small phase-timing structs; behavior is unchanged. `/proc` parsers are fail-soft and have unit tests with real-format fixtures.

Made-with: Cursor
Firetiger deploy monitoring skipped. This PR didn't match the auto-monitor filter configured on your GitHub connection.
Reason: This PR modifies chromium-launcher instrumentation and server diagnostics, but does not change kernel API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) as specified in the filter. To monitor this PR anyway, reply with …
rgarcia left a comment:
lgtm as long as you've given it some e2e test reps
```go
		return fmt.Errorf("supervisorctl restart failed: %w", r.err)
	}
	supervisorctlMs = r.elapsedMs
	supervisorctlDoneAt = time.Now()
```
supervisorctlDoneAt is set when this goroutine receives from resultCh, not when the launcher goroutine finished supervisorctl. Under scheduling delay — i.e. the slow-restart case this PR exists to debug — post_supervisorctl_ms will be biased low and won't sum to total_ms. Record time.Now() inside the goroutine and pass it through restartResult.
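For concreteness, a minimal sketch of the suggested fix, assuming `restartResult` is ours to extend; the `doneAt` field and the `restartAsync` helper here are hypothetical stand-ins, not the actual diff:

```go
package main

import "time"

type restartResult struct {
	elapsedMs int64
	doneAt    time.Time // captured inside the goroutine, not at receive time
	err       error
}

// restartAsync runs the (stand-in) restart call and timestamps completion
// inside the goroutine, so scheduling delay between goroutines can't skew
// post_supervisorctl_ms.
func restartAsync(run func() error) <-chan restartResult {
	resultCh := make(chan restartResult, 1)
	go func() {
		start := time.Now()
		err := run() // e.g. the real supervisorctl restart call
		resultCh <- restartResult{
			elapsedMs: time.Since(start).Milliseconds(),
			doneAt:    time.Now(), // the moment supervisorctl actually finished
			err:       err,
		}
	}()
	return resultCh
}

func main() {
	r := <-restartAsync(func() error { return nil })
	supervisorctlMs := r.elapsedMs
	supervisorctlDoneAt := r.doneAt // instead of time.Now() at receive
	_, _ = supervisorctlMs, supervisorctlDoneAt
}
```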
```go
	// Drop resultCh from select so we keep waiting for updates without re-firing.
	resultCh = nil
case url := <-updates:
	postSupervisorMs := int64(-1)
```
When updates wins the race vs. resultCh, both supervisorctl_ms and post_supervisorctl_ms log as -1. Downstream aggregates (medians, p99s) need to filter >= 0 or they'll be poisoned by sentinels. Worth a comment here and a note wherever the query lives.
```go
case <-timeout.C:
	log.Info("devtools not ready in time", "operation", operation, "elapsed", time.Since(start).String())
	return fmt.Errorf("devtools not ready in time")
for {
```
The new for-select + channel-nilling pattern is the highest-behavior-risk diff in the PR and has no test. The /proc parsers got fixtures; the actual control flow did not. Could you add a test covering: (a) supervisorctl succeeds then updates arrives, (b) updates arrives before supervisorctl returns, (c) supervisorctl errors? The PR description says "no behavior changes" but the loop shape did change — would be good to lock in.
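For reference, a self-contained reconstruction of the loop shape under discussion; `resultCh`, `updates`, and the messages match the hunks above, but `waitForRestart` and its signature are assumptions. The demo in `main` exercises case (a); swapping the send order or sending an error covers (b) and (c):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func waitForRestart(resultCh <-chan error, updates <-chan string, deadline time.Duration) (string, error) {
	timeout := time.NewTimer(deadline)
	defer timeout.Stop()
	for {
		select {
		case err := <-resultCh:
			if err != nil {
				return "", fmt.Errorf("supervisorctl restart failed: %w", err) // case (c)
			}
			// Drop resultCh from the select so we keep waiting for updates
			// without re-firing; receives on a nil channel block forever.
			resultCh = nil
		case url := <-updates:
			// Case (b): updates can win the race before supervisorctl returns.
			return url, nil
		case <-timeout.C:
			return "", errors.New("devtools not ready in time")
		}
	}
}

func main() {
	resultCh := make(chan error, 1)
	updates := make(chan string, 1)
	resultCh <- nil // case (a): supervisorctl succeeds first...
	go func() {
		time.Sleep(10 * time.Millisecond)
		updates <- "ws://127.0.0.1:9222/devtools" // ...then the new DevTools URL arrives
	}()
	url, err := waitForRestart(resultCh, updates, time.Second)
	fmt.Println(url, err)
}
```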
Why
When chromium is restarted (e.g. after a flag/policy change or a profile load), callers need DevTools to come back up within a tight budget. In our downstream production deployment we see that some restarts take ~6 s and exceed that budget — causing user-visible failures — while others complete in ~1 s on the very same fleet. Restarts are clearly bimodal, but right now we cannot tell which phase eats the time on the slow ones.
The plausible suspects are:

1. `supervisorctl restart chromium` itself (SIGKILL + spawn) being slow on a degraded host.
2. chromium-launcher's `killExistingChromium` (`pkill -9` + `pgrep` wait) or `waitForPort`. We disable `SO_REUSEADDR`, so any lingering FDs / TIME_WAIT entries on the devtools port force us to wait for kernel cleanup, up to the 5 s `waitForPort` deadline.
3. Chromium's own init after `syscall.Exec` (cache warmup, profile work) running slowly post-suspend / post-standby-restore.

This PR adds the instrumentation we need to tell those apart.
What
Two structured events. Both are correlatable downstream by host / VM identity, so they can be joined with each other and with caller-side CDP-readiness timings to get an end-to-end phase breakdown.
1. `chromium_launcher_phase_complete` (in chromium-launcher)

Emitted as slog JSON to stderr right before `syscall.Exec`; a hedged sketch of the emission follows this list. Captures:

* Phase timings: `pkill_elapsed_ms`, `pgrep_wait_elapsed_ms`, `pgrep_attempts`, `pgrep_timed_out`, `waitport_elapsed_ms`, `waitport_attempts`, `waitport_timed_out`, `waitport_last_err`, `launcher_total_ms`.
* Socket states on the devtools port from `/proc/net/tcp`: `ss_estab_devtools`, `ss_timewait_devtools`, `ss_listen_devtools`; directly tests the lingering-FD / TIME_WAIT hypothesis.
* VM state: `vm_uptime_ms`, `load_avg_1m`, `mem_available_kb`, `mem_cached_kb`, `mem_dirty_kb`; directly tests the post-standby-restore / host-pressure hypothesis.
* `instance_name` (read from `INSTANCE_NAME` env var if set) for downstream attribution.

`killExistingChromium` and `waitForPort` are refactored to return small phase-timing structs (`killStats`, `waitportStats`); behavior is unchanged.
2. `chromium_restart_complete` (in `server/cmd/api/api/chromium.go`)

Emitted from `restartChromiumAndWait` after a successful restart. Decomposes the wall clock into `supervisorctl_ms` and `post_supervisorctl_ms`.
This separates "supervisord was slow to SIGKILL+respawn" from "chromium itself was slow to initialise."
Combined with the launcher event we get: `supervisorctl_ms` → `launcher_total_ms` (with sub-phases) → chromium internal init → DevTools URL written.
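To make the launcher event concrete, here is a hedged sketch of what the single pre-exec emission could look like with `log/slog`; the event name and field names come from the list above, while the values and wiring are illustrative assumptions, not the actual diff:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON to stderr, matching the "slog JSON to stderr" description above.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))

	// Illustrative values only: a slow restart here would show
	// waitport_elapsed_ms near its 5 s deadline with nonzero
	// ss_timewait_devtools, pointing at the TIME_WAIT hypothesis.
	logger.Info("chromium_launcher_phase_complete",
		"pkill_elapsed_ms", 12,
		"pgrep_wait_elapsed_ms", 340,
		"pgrep_attempts", 4,
		"pgrep_timed_out", false,
		"waitport_elapsed_ms", 4980,
		"waitport_attempts", 50,
		"waitport_timed_out", true,
		"waitport_last_err", "connect: connection refused",
		"launcher_total_ms", 5350,
		"ss_estab_devtools", 0,
		"ss_timewait_devtools", 3,
		"ss_listen_devtools", 0,
		"vm_uptime_ms", 123456,
		"load_avg_1m", 2.4,
		"mem_available_kb", 512000,
		"instance_name", os.Getenv("INSTANCE_NAME"),
	)
}
```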
`/proc` parsers
New helpers `readVMUptimeMs`, `readLoadAvg1m`, `parseMemInfo`, `countSocketStates` are all fail-soft (return zero on missing files / parse errors) and have unit tests in `main_test.go` with realistic `/proc/net/tcp` and `/proc/meminfo` fixtures (using testify `assert` / `require`).
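As an illustration of the fail-soft contract, a minimal sketch of a `/proc/net/tcp` state counter in the spirit of `countSocketStates`; the name and fail-soft behavior come from this PR, but the body and signature are assumptions, not the actual diff:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// TCP state codes as they appear in /proc/net/tcp (include/net/tcp_states.h).
const (
	tcpEstablished = 0x01
	tcpTimeWait    = 0x06
	tcpListen      = 0x0A
)

// countSocketStates counts ESTABLISHED / TIME_WAIT / LISTEN rows whose local
// port matches devtoolsPort. Fail-soft: any open or parse error yields zero
// counts rather than an error.
func countSocketStates(path string, devtoolsPort int) (estab, timeWait, listen int) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, 0 // e.g. non-Linux or missing file: not fatal
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 {
			continue // skip malformed rows
		}
		// fields[1] is "hexIP:hexPort"; fields[3] is the hex state code.
		hostPort := strings.Split(fields[1], ":")
		if len(hostPort) != 2 {
			continue
		}
		port, err := strconv.ParseInt(hostPort[1], 16, 32)
		if err != nil || int(port) != devtoolsPort {
			continue
		}
		state, err := strconv.ParseInt(fields[3], 16, 32)
		if err != nil {
			continue
		}
		switch state {
		case tcpEstablished:
			estab++
		case tcpTimeWait:
			timeWait++
		case tcpListen:
			listen++
		}
	}
	return estab, timeWait, listen
}

func main() {
	e, tw, l := countSocketStates("/proc/net/tcp", 9222)
	fmt.Println("estab:", e, "time_wait:", tw, "listen:", l)
}
```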
Test plan
Notes
cc @
Made with Cursor
Note
Medium Risk
Touches the Chromium restart/wait control flow and launcher startup path; while intended to be observability-only, mistakes could change restart timing or logging behavior and affect readiness detection.
Overview
Adds structured phase instrumentation for Chromium restarts so slow boots can be attributed to specific steps.
* `restartChromiumAndWait` now measures `supervisorctl restart` duration and the delay until a new DevTools URL appears, emitting `chromium_restart_complete` (and `chromium_restart_timeout` with partial timing on timeout).
* chromium-launcher now records timing for `pkill`/`pgrep` and `waitForPort`, plus fail-soft VM/socket telemetry from `/proc` (uptime, load, meminfo, `/proc/net/tcp` state counts), emitting a single JSON `chromium_launcher_phase_complete` event immediately before `exec`ing Chromium; new unit tests cover the `/proc` parsers.

Reviewed by Cursor Bugbot for commit 67244b5.