fix: surface bundle startup failures to workspace log, SSE, and health (#7)#23

Open
mgoldsborough wants to merge 1 commit into main from fix/issue-7-bundle-start-failed
Conversation

@mgoldsborough
Contributor

Summary

  • When `startBundleSource` threw inside `startWorkspaceBundles`, the error hit `process.stderr` and was dropped. The failed bundle vanished from workspace JSONL logs, from SSE clients, and from `/v1/health` — operators had to tail container logs to know a bundle was down.
  • Root cause: the catch block had no reference to the `EventSink` and no way to inform `HealthMonitor` (the failed bundle never became an `McpSource`, which is `HealthMonitor`'s only input).

Changes

  • New `bundle.start_failed` `EngineEventType`.
  • `startWorkspaceBundles` takes the `EventSink`, emits `bundle.start_failed` on catch, and returns a `BundleStartFailure[]`.
  • `Runtime.start` stores failures and exposes `getStartFailures()`; the API server constructs `HealthMonitor` with those records so `/v1/health` shows the bundle as `dead`.
  • Added `bundle.start_failed` to `WORKSPACE_EVENTS` (JSONL log) and to the SSE forwarding allow-list.
  • `HealthMonitor.getStatus()` merges live records with start-failure records; when a source with the same name later comes up, the live record hides the earlier failure so operators don't see stale dead entries.

Startup-continues-on-failure behavior is preserved — the workspace registry is still created, platform tools still work, and other bundles in the same workspace still start.

Test plan

  • New `workspace-runtime` tests: `bundle.start_failed` is emitted on catch and returned in `startFailures`; `registries` still contains the workspace; no event is emitted when everything succeeds.
  • New `health-monitor` tests: `startFailures` are reported as `dead` in `getStatus()`; they merge with live source records; a startFailure is suppressed when a live source with the same name exists.
  • `workspace-log-sink` test updated: `bundle.start_failed` is in the workspace events set.
  • `bun test test/unit/` — 1730 pass, 0 fail.
  • `bun run check` / `bun run lint` — clean.

Closes #7

fix: surface bundle startup failures to workspace log, SSE, and health (#7)

Previously, when startBundleSource threw inside startWorkspaceBundles,
the error was written to process.stderr and silently dropped — the
bundle simply vanished from the registry, from workspace JSONL logs,
from SSE clients, and from /v1/health. Operators had to tail
container logs to know a bundle was down; users saw "App X is not
available" with no context.

Root cause: the catch block had no reference to the EventSink and
no way to inform HealthMonitor (the failed bundle never became an
McpSource, which is HealthMonitor's only input).

Fix:

- Add a bundle.start_failed EngineEventType.
- startWorkspaceBundles now accepts the EventSink and, on catch,
  emits bundle.start_failed and returns a BundleStartFailure[] that
  the caller can forward to HealthMonitor.
- Runtime.start stores the failure list and exposes it via
  getStartFailures(); the API server constructs HealthMonitor with
  those records so /v1/health shows the bundle as `dead`.
- bundle.start_failed is added to WORKSPACE_EVENTS (JSONL log) and
  to the SSE forwarding list.
- HealthMonitor.getStatus() merges live records with start-failure
  records, suppressing a failure record when a source with the same
  name later came up (so a successful retry hides the earlier
  attempt).
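The `Runtime.start` side of this could look roughly as follows; this is a hedged sketch, with the starter passed in as a parameter to keep it self-contained (the real code calls `startWorkspaceBundles` directly):

```typescript
// Hypothetical sketch: start() keeps the failure list returned by the
// bundle-start step and exposes it via getStartFailures(), which the API
// server can pass to HealthMonitor when constructing /v1/health.
type BundleStartFailure = { bundle: string; error: string };

class Runtime {
  private startFailures: BundleStartFailure[] = [];

  async start(
    startBundles: () => Promise<BundleStartFailure[]>,
  ): Promise<void> {
    this.startFailures = await startBundles();
  }

  getStartFailures(): BundleStartFailure[] {
    return this.startFailures;
  }
}
```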

Startup-continues-on-failure behavior is preserved — the workspace
registry is still created, platform tools still work, and other
bundles still start.

Development

Successfully merging this pull request may close these issues.

Bundle startup failures not logged to workspace or surfaced to UI
