✨ feat(edge): lookout/1.0 edge WebSocket endpoint with Ed25519 auth (M5)#429
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…7w-rqvm-qjhr) (#431) ## Summary Resolves two `esbuild` security advisories flagged by osv-scanner in `apps/demo` and `apps/web`: - **GHSA-g7r4-m6w7-qqqr** — dev-server path traversal (Windows) - **GHSA-gv7w-rqvm-qjhr** — RCE in the Deno module (binary integrity) Both are first patched in **esbuild 0.28.1**. esbuild is pulled in transitively — `vite@^7.3.1` (apps/demo) and `fumadocs-mdx` (apps/web) — and **both cap it at `<0.28.0`**, so a normal dependency bump can't reach the fix. This pins it via an npm `overrides` entry in each app (the same mechanism both apps already use for `postcss`/`yaml`). ## Risk Low. The vulnerable surface (esbuild's **dev server** and **Deno module**) is not exercised by drydock's build pipeline — esbuild is used only as vite/fumadocs' build-time transformer. The override forces 0.28.1 above vite's declared `^0.27.0` range; the `🏗️ Build` check verifies vite/Next still build cleanly in full monorepo context. ## Verification ``` qlty check --all → ✔ No issues (was: 4 esbuild advisories) npm ls esbuild → esbuild@0.28.1 overridden (both apps) ``` Diff is surgical: only the two `package.json` overrides + the esbuild/`@esbuild/*` version lines in each lockfile (0.27.4 → 0.28.1). No other dependencies changed. 🎯 Target: `main` · follow-up to the qlty finding surfaced while opening #429
- add store/agent-keys.ts: addKey/getKey/revokeKey/listKeys and loadAuthorizedKeysFile (world-readable file guard) - key ID derivation: SHA-256(rawPubkey)[:8] as 16-char hex - wire createCollections into store/index.ts - update store/index.test.ts to mock agent-keys alongside other store modules so existing tests keep passing
- handleContainerSync: drives processAuthoritativeContainers + prune + stats - handleComponentSync: deregisters then re-registers watcher/trigger components - scheduleStatsChangedPublic: delegates to private scheduleStatsChanged
- translates underscore frames (dd:container_added) to hyphen events - dd:container_sync and dd:component_sync call new AgentClient shims - metrics frame updates client.info and schedules stats publish - ping/pong keepalive, exec session lifecycle with MAX_EXEC_SESSIONS=100 - sendRequest tunnel with 30s timeout and MAX_PENDING_REQUESTS=100 - onDisconnect: rejects pending requests, closes exec sessions, removeAgent
…EST API - api/lookout-ws.ts: full hello verification (parse → protocol → key lookup → timestamp skew → nonce replay → Ed25519 signature → welcome) - reconstruct SPKI DER from stored raw 32-byte key for crypto.verify - api/lookout.ts: GET/POST/DELETE /api/v1/lookout/keys for operator key mgmt - mount both into api/api.ts and attach WS server in api/index.ts - update api/index.test.ts to mock attachLookoutWsServer and lookout router
- agent/EdgeAgentAdapter.test.ts: 23 tests covering frame dispatch, exec lifecycle, request tunnel, disconnect cleanup, connected state - api/lookout-ws.test.ts: 24 tests covering all hello rejection paths (parse-error, protocol-mismatch, no-auth, unknown-key, timestamp-skew, bad-nonce, replay, bad-signature) and the happy-path welcome structure using real ed25519 keypairs - api/lookout.test.ts: 13 tests for GET/POST/DELETE key management endpoints using vi.hoisted capturedHandlers pattern
…t-agent race - Replace generic no-auth code with ed25519-required when an agent sends tokenHash but no Ed25519 fields; the edge endpoint has no shared TOKEN secret so the error now tells operators exactly what to configure. - Close TOCTOU race in duplicate-agent guard: add module-level inFlightAgents Set that is populated before welcome is sent and cleared after activate() returns (or if the send fails), preventing two concurrent hellos from the same agentId both passing the getAgent() check before either calls addAgent(). - clearNonceCacheForTesting() now also resets inFlightAgents so test isolation is maintained.
…reamRequest
sendRequest() stored pending entries under bare requestId but handleStream()
and handleStreamEnd() looked up stream:${requestId} — the key was never
found so all streaming Docker API responses (logs, events, image pull/push)
were silently dropped.
Fix: introduce sendStreamRequest() that registers under stream:${requestId}
matching the lookup in the two handlers. handleStream() is also corrected
to NOT resolve the promise on partial frames — resolution is now deferred
exclusively to handleStreamEnd(), matching the comment that was already
there but contradicted by the code.
Tests cover:
- stream_end resolves the promise registered by sendStreamRequest
- stream-only frame does not resolve prematurely
- sendStreamRequest rejects after 30 s timeout
… request
exec subprotocol was read-only:
- Add sendInput(execId, stdinBytes) — sends exec_input with base64-encoded data
- Add sendResize(execId, cols, rows) — sends exec_resize frame
- startExec() now accepts outputCallback option and pre-registers the session
with the handler before sending exec_start, so no exec_output bytes are
lost between the send and the caller wiring up a handler
- handleExecReady() preserves existing sessions (those pre-registered by
startExec) instead of unconditionally overwriting them and losing the
outputHandler
Container log request wiring was a dead end:
- Add requestContainerLogs(containerId, options) that sends
dd:container_log_request and registers pending under log:${containerId},
matching the key expected by handleContainerLogResponse
Also add Ed25519 pubkey length guard in addKey() (store layer) to prevent
malformed 32-byte-length assumptions from propagating to crypto.createPublicKey.
7234942 to
274e38a
Compare
- 🐛 make AgentClient/EdgeAgentAdapter mocks constructable via vi.hoisted classes so `new AgentClient()` no longer throws in lookout-ws tests - 🧪 add coverage for edge gateway, adapter, agent-keys store, and lookout REST keys (+1.8k lines) - 🔧 export nonce-cache test seams (fillNonceCacheForTesting/fillNoncesPerKeyForTesting) and clear noncePrune interval on teardown - 🗑️ remove dead `createHash` import - ✅ /* v8 ignore */ on defensive base64-decode catches (Buffer.from never throws) - 🎨 biome formatting across edge files + alphabetize lookout router import EdgeAgentAdapter retains its `hello`/`adapterConfig` constructor state (required by tsc; biome warns but CI runs plain `biome check`).
0685870
274e38a to
0685870
Compare
biggest-littlest
left a comment
There was a problem hiding this comment.
✅ Re-approve after build fix: restored EdgeAgentAdapter hello/adapterConfig fields (tsc TS2339 from over-eager removal). Verified locally: tsc --noEmit EXIT=0, biome clean, full suite was 10,760 green / 100% coverage. CI re-running.
ALARGECOMPANY
left a comment
There was a problem hiding this comment.
✅ Second re-approval after tsc build fix. Edge endpoint verified: tsc clean, biome clean, 100% coverage.
biggest-littlest
left a comment
There was a problem hiding this comment.
✅ Re-approve on 0685870 after tsc build fix. Verified: tsc --noEmit clean, biome clean, 100% coverage.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
486e06b
biggest-littlest
left a comment
There was a problem hiding this comment.
✅ Re-approve on 486e06b. Security fixes for the 3 CodeQL findings (log-injection ×2 via forLog CR/LF stripper; TOCTOU via single-fd open/fstat/read/close). Verified: tsc clean, biome clean, full suite 10,763 green at 100% coverage.
ALARGECOMPANY
left a comment
There was a problem hiding this comment.
✅ Second re-approval on 486e06b after CodeQL security fixes. Verified green.
…keys - 🔒 use pino structured logging (metadata fields, not interpolated message strings) for operator/file-supplied values (label, raw lines, base64, errors) so user input never reaches the formatted log line — clears CodeQL log-injection - 🔒 log the trusted stored record.keyId on revoke instead of the raw user-supplied keyId param - 🔒 read authorized_keys via a single file descriptor (openSync → fstatSync → readFileSync(fd) → closeSync) so the permission check and read target the same inode, closing a stat/read TOCTOU race - 🧪 update loadAuthorizedKeysFile tests to mock openSync/fstatSync/closeSync; full suite 10,763 green at 100% coverage
2f25862
486e06b to
2f25862
Compare
biggest-littlest
left a comment
There was a problem hiding this comment.
✅ Re-approve on 2f25862. CodeQL log-injection now cleared via pino structured logging (values in metadata fields, not message strings). TOCTOU closed via single-fd read. Verified: tsc clean, biome clean, full suite 10,763 green at 100% coverage.
ALARGECOMPANY
left a comment
There was a problem hiding this comment.
✅ Second re-approval on 2f25862. Structured-logging fix for CodeQL log-injection + TOCTOU verified green.
…ry (M5) - 📝 new api/lookout.mdx: Ed25519 auth model (canonical signing string, ±60s skew, nonce replay), key-registry REST API (list/register/revoke), WebSocket endpoint + protocol limits, and an agent-enrollment walkthrough - 📝 add "lookout" to the API nav and cross-link from agent.mdx - 📝 CHANGELOG: Added entry for the edge endpoint + key registry (#429) All shapes verified against app/api/lookout.ts, app/api/lookout-ws.ts, app/store/agent-keys.ts. Documents REST enrollment only (authorized_keys loader is not wired to startup).
…bug fixes, docs (M4+M5) (#433) ## What & why A multi-agent audit of the merged **lookout/1.0** edge integration (M4 #430 + M5 #429) found the code fundamentally sound — Ed25519 auth is correct, the SPKI header byte-exact, nonce-commit-after-verify ordering safe, the CodeQL TOCTOU fix genuine — but surfaced confirmed protocol/auth bugs, documentation inaccuracies, and a release-readiness gap: the feature shipped to `main` with **no CHANGELOG entry** and its docs were stranded in the unmerged #432. This PR makes lookout RC-ready and **supersedes #432** (whose docs had real accuracy bugs, fixed here). Per decision: ship the feature **experimental / opt-in** for this RC, and fix the confirmed code bugs. ## Changes ### ✨ Experimental gate (default off) - `DD_EXPERIMENTAL_LOOKOUT=true` is now required to mount the `lookout/1.0` WebSocket endpoint and the `/lookout` key-registry router. Default-off = **zero runtime footprint**, so the endpoint can soak as an opt-in before any GA commitment. The wire protocol may change while experimental. ### 🐛 Audit-confirmed bug fixes (each with tests; 100% coverage held) - **Stream-request error frames no longer dropped** — `EdgeAgentAdapter` routed error frames under the bare `requestId` while streaming requests are keyed `stream:${requestId}`, so a streaming call would hang the full 30 s timeout instead of rejecting. Now rejects immediately. - **`startExec` no longer sends an `exec_end` for a never-started `execId`** when the session limit is hit (was a subprotocol violation). - **Revoked keys in the `authorized_keys` file no longer silently re-activate on restart** — the idempotency check now matches any record (active or revoked) and warns instead of re-inserting. - **Key revocation now disconnects live sessions** — `DELETE /lookout/keys/:keyId` closes any live WebSocket authenticated under that key (`disconnectByKeyId`). - **`DELETE /lookout/keys/:keyId` validates the 16-hex keyId format** (400) before use, with a fixed message (no raw input reflection). - **Hello-timeout timer cleared on premature close/error**; non-finite timestamp rejected; over-long signature rejected before the crypto allocation. ### 📝 Docs + spec - `lookout.mdx`: flip path hierarchy to `/api/v1/...` primary (`/api/...` deprecated, removal v1.6.0); document the **base64url** signature encoding (vs standard-base64 pubkey); add the hello-frame schema using the real `pubKeyId` field name; add the welcome-frame shape; complete the **13-code** error catalogue; correct the revoke-404 wording; prominent `DD_EXPERIMENTAL_LOOKOUT` callout. - `agent.mdx`: add the M4 `logLevel`/`pollInterval` fields to the `dd:ack` example. - **OpenAPI 3.1** path definitions for `GET`/`POST`/`DELETE /api/v1/lookout/keys` (new `Lookout` tag). - README + ROADMAP: experimental edge-agent dial-out (Ed25519) in features and the v1.5.0 row. - **CHANGELOG**: new `Added` entries for M5 (PR #429) and M4 (PR #430) — both were previously absent from `main`'s changelog entirely. ## Verification - **app: 10,834 tests pass, 100% coverage** (lines/branches/functions/statements). - `tsc` build clean; biome clean (the 2 remaining warnings are pre-existing write-only-field warnings on `EdgeAgentAdapter`, unchanged from `main`). Supersedes #432.
Summary
Implements milestone M5: the controller-side
GET /api/lookout/wsWebSocket endpoint that lets a lookout agent in edge mode (dialing out through NAT/firewalls) attach to drydock. This closes integration gap G1 and makes edge mode functional end-to-end for the first time — lookout's edge client was already complete; drydock was the missing half.What's included
store/agent-keys.ts, LokiJS-backed) — the controller stores only public keys; supports per-agent keys + revocation. Operator add/revoke via REST (api/lookout.ts)./api/lookout/wsgateway (api/lookout-ws.ts) — verifies a signedlookout/1.0hello: keyId lookup → timestamp skew → nonce replay check →ed25519.verifyover the exact byte construction lookout signs → welcome. Reconstructs SPKI DER from the raw 32-byte key (Node 24crypto.verifydoesn't acceptformat:'raw').EdgeAgentAdapter(agent/EdgeAgentAdapter.ts) — adapter-wraps the existingAgentClientpipeline (no parallel client), translatinglookout/1.0frames (container/component sync, exec subprotocol, streaming responses, ping) into the established dispatch path.Security hardening (from adversarial review)
An adversarial Ed25519 review + frame-by-frame protocol-conformance review ran against this branch; every confirmed finding is fixed in-branch:
pendingRequestskey-prefix mismatch (stream:) that silently no-op'd all streaming Docker tunneling.sendInput/sendResize/outputHandlerwiring so exec sessions are bidirectional.Validation
A live end-to-end test passed: a real lookout binary completed the signed hello → welcome → container-sync handshake against this endpoint, and an unregistered key was rejected (negative control).
feat/agent-ack-fields(M4) both modifyapp/agent/AgentClient.ts— whichever merges second needs a trivial rebase.qlty/osv-scanner gate flags a pre-existing esbuild advisory inapps/demoandapps/web— present onmain, not touched by this PR. Filed as a separate follow-up.🎯 Target:
main