Skip to content

✨ feat(edge): lookout/1.0 edge WebSocket endpoint with Ed25519 auth (M5)#429

Merged
scttbnsn merged 10 commits into
mainfrom
feat/lookout-edge
Jun 13, 2026
Merged

✨ feat(edge): lookout/1.0 edge WebSocket endpoint with Ed25519 auth (M5)#429
scttbnsn merged 10 commits into
mainfrom
feat/lookout-edge

Conversation

@scttbnsn

Copy link
Copy Markdown
Contributor

Summary

Implements milestone M5: the controller-side GET /api/lookout/ws WebSocket endpoint that lets a lookout agent in edge mode (dialing out through NAT/firewalls) attach to drydock. This closes integration gap G1 and makes edge mode functional end-to-end for the first time — lookout's edge client was already complete; drydock was the missing half.

What's included

  • Ed25519 public-key registry (store/agent-keys.ts, LokiJS-backed) — the controller stores only public keys; supports per-agent keys + revocation. Operator add/revoke via REST (api/lookout.ts).
  • /api/lookout/ws gateway (api/lookout-ws.ts) — verifies a signed lookout/1.0 hello: keyId lookup → timestamp skew → nonce replay check → ed25519.verify over the exact byte construction lookout signs → welcome. Reconstructs SPKI DER from the raw 32-byte key (Node 24 crypto.verify doesn't accept format:'raw').
  • EdgeAgentAdapter (agent/EdgeAgentAdapter.ts) — adapter-wraps the existing AgentClient pipeline (no parallel client), translating lookout/1.0 frames (container/component sync, exec subprotocol, streaming responses, ping) into the established dispatch path.

Security hardening (from adversarial review)

An adversarial Ed25519 review + frame-by-frame protocol-conformance review ran against this branch; every confirmed finding is fixed in-branch:

  • Nonce-cache poisoning (high): nonces are now committed only after signature verification — an unauthenticated caller can no longer exhaust the replay pool.
  • Streaming responses dropped (high): fixed a pendingRequests key-prefix mismatch (stream: ) that silently no-op'd all streaming Docker tunneling.
  • Exec subprotocol incomplete (high): added sendInput / sendResize / outputHandler wiring so exec sessions are bidirectional.
  • Plus: per-key nonce admission limit, agentId type/length/charset validation, generic (non-enumerable) auth errors, 32-byte key guard, TOCTOU fix on the duplicate-agent guard.

Validation

364 test files · 10,681 tests pass · tsc build clean · biome clean

A live end-to-end test passed: a real lookout binary completed the signed hello → welcome → container-sync handshake against this endpoint, and an unregistered key was rejected (negative control).

⚠️ Merge notes

  • Sequencing: this branch and #feat/agent-ack-fields (M4) both modify app/agent/AgentClient.ts — whichever merges second needs a trivial rebase.
  • Unrelated CI: the qlty/osv-scanner gate flags a pre-existing esbuild advisory in apps/demo and apps/web — present on main, not touched by this PR. Filed as a separate follow-up.

🎯 Target: main

@vercel

vercel Bot commented Jun 13, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
drydock-website Error Error Jun 13, 2026 5:31am
drydockdemo-website Ready Ready Preview, Comment Jun 13, 2026 5:31am

Comment thread app/store/agent-keys.ts Fixed
Comment thread app/store/agent-keys.ts Fixed
Comment thread app/store/agent-keys.ts Fixed
Comment thread app/agent/EdgeAgentAdapter.test.ts Fixed
Comment thread app/agent/EdgeAgentAdapter.test.ts Fixed
Comment thread app/api/lookout-ws.test.ts Fixed
Comment thread app/api/lookout-ws.test.ts Fixed
Comment thread app/api/lookout-ws.test.ts Fixed
Comment thread app/api/lookout-ws.ts Fixed
Comment thread app/store/agent-keys.test.ts Fixed
scttbnsn added a commit that referenced this pull request Jun 13, 2026
…7w-rqvm-qjhr) (#431)

## Summary

Resolves two `esbuild` security advisories flagged by osv-scanner in
`apps/demo` and `apps/web`:

- **GHSA-g7r4-m6w7-qqqr** — dev-server path traversal (Windows)
- **GHSA-gv7w-rqvm-qjhr** — RCE in the Deno module (binary integrity)

Both are first patched in **esbuild 0.28.1**. esbuild is pulled in
transitively — `vite@^7.3.1` (apps/demo) and `fumadocs-mdx` (apps/web) —
and **both cap it at `<0.28.0`**, so a normal dependency bump can't
reach the fix. This pins it via an npm `overrides` entry in each app
(the same mechanism both apps already use for `postcss`/`yaml`).

## Risk

Low. The vulnerable surface (esbuild's **dev server** and **Deno
module**) is not exercised by drydock's build pipeline — esbuild is used
only as vite/fumadocs' build-time transformer. The override forces
0.28.1 above vite's declared `^0.27.0` range; the `🏗️ Build` check
verifies vite/Next still build cleanly in full monorepo context.

## Verification

```
qlty check --all  →  ✔ No issues   (was: 4 esbuild advisories)
npm ls esbuild    →  esbuild@0.28.1 overridden   (both apps)
```

Diff is surgical: only the two `package.json` overrides + the
esbuild/`@esbuild/*` version lines in each lockfile (0.27.4 → 0.28.1).
No other dependencies changed.

🎯 Target: `main` · follow-up to the qlty finding surfaced while opening
#429
scttbnsn added 8 commits June 13, 2026 00:17
- add store/agent-keys.ts: addKey/getKey/revokeKey/listKeys and
  loadAuthorizedKeysFile (world-readable file guard)
- key ID derivation: SHA-256(rawPubkey)[:8] as 16-char hex
- wire createCollections into store/index.ts
- update store/index.test.ts to mock agent-keys alongside other
  store modules so existing tests keep passing
- handleContainerSync: drives processAuthoritativeContainers + prune + stats
- handleComponentSync: deregisters then re-registers watcher/trigger components
- scheduleStatsChangedPublic: delegates to private scheduleStatsChanged
- translates underscore frames (dd:container_added) to hyphen events
- dd:container_sync and dd:component_sync call new AgentClient shims
- metrics frame updates client.info and schedules stats publish
- ping/pong keepalive, exec session lifecycle with MAX_EXEC_SESSIONS=100
- sendRequest tunnel with 30s timeout and MAX_PENDING_REQUESTS=100
- onDisconnect: rejects pending requests, closes exec sessions, removeAgent
…EST API

- api/lookout-ws.ts: full hello verification (parse → protocol → key lookup
  → timestamp skew → nonce replay → Ed25519 signature → welcome)
- reconstruct SPKI DER from stored raw 32-byte key for crypto.verify
- api/lookout.ts: GET/POST/DELETE /api/v1/lookout/keys for operator key mgmt
- mount both into api/api.ts and attach WS server in api/index.ts
- update api/index.test.ts to mock attachLookoutWsServer and lookout router
- agent/EdgeAgentAdapter.test.ts: 23 tests covering frame dispatch,
  exec lifecycle, request tunnel, disconnect cleanup, connected state
- api/lookout-ws.test.ts: 24 tests covering all hello rejection paths
  (parse-error, protocol-mismatch, no-auth, unknown-key, timestamp-skew,
  bad-nonce, replay, bad-signature) and the happy-path welcome structure
  using real ed25519 keypairs
- api/lookout.test.ts: 13 tests for GET/POST/DELETE key management endpoints
  using vi.hoisted capturedHandlers pattern
…t-agent race

- Replace generic no-auth code with ed25519-required when an agent sends
  tokenHash but no Ed25519 fields; the edge endpoint has no shared TOKEN
  secret so the error now tells operators exactly what to configure.
- Close TOCTOU race in duplicate-agent guard: add module-level inFlightAgents
  Set that is populated before welcome is sent and cleared after activate()
  returns (or if the send fails), preventing two concurrent hellos from the
  same agentId both passing the getAgent() check before either calls addAgent().
- clearNonceCacheForTesting() now also resets inFlightAgents so test isolation
  is maintained.
…reamRequest

sendRequest() stored pending entries under bare requestId but handleStream()
and handleStreamEnd() looked up stream:${requestId} — the key was never
found so all streaming Docker API responses (logs, events, image pull/push)
were silently dropped.

Fix: introduce sendStreamRequest() that registers under stream:${requestId}
matching the lookup in the two handlers.  handleStream() is also corrected
to NOT resolve the promise on partial frames — resolution is now deferred
exclusively to handleStreamEnd(), matching the comment that was already
there but contradicted by the code.

Tests cover:
- stream_end resolves the promise registered by sendStreamRequest
- stream-only frame does not resolve prematurely
- sendStreamRequest rejects after 30 s timeout
… request

exec subprotocol was read-only:
- Add sendInput(execId, stdinBytes) — sends exec_input with base64-encoded data
- Add sendResize(execId, cols, rows) — sends exec_resize frame
- startExec() now accepts outputCallback option and pre-registers the session
  with the handler before sending exec_start, so no exec_output bytes are
  lost between the send and the caller wiring up a handler
- handleExecReady() preserves existing sessions (those pre-registered by
  startExec) instead of unconditionally overwriting them and losing the
  outputHandler

Container log request wiring was a dead end:
- Add requestContainerLogs(containerId, options) that sends
  dd:container_log_request and registers pending under log:${containerId},
  matching the key expected by handleContainerLogResponse

Also add Ed25519 pubkey length guard in addKey() (store layer) to prevent
malformed 32-byte-length assumptions from propagating to crypto.createPublicKey.

@biggest-littlest biggest-littlest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Edge endpoint verified: full suite 364 files / 10,760 tests green, biome clean, 100% coverage restored after constructable-mock fix. Rebased clean onto main (#430/#431). Approving (post-last-push).

ALARGECOMPANY
ALARGECOMPANY previously approved these changes Jun 13, 2026

@ALARGECOMPANY ALARGECOMPANY left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Second approval. Edge endpoint (lookout/1.0 + Ed25519) verified green: 10,760 tests pass, biome clean, 100% coverage. Clean rebase onto main (#430/#431 merged).

- 🐛 make AgentClient/EdgeAgentAdapter mocks constructable via vi.hoisted classes so `new AgentClient()` no longer throws in lookout-ws tests
- 🧪 add coverage for edge gateway, adapter, agent-keys store, and lookout REST keys (+1.8k lines)
- 🔧 export nonce-cache test seams (fillNonceCacheForTesting/fillNoncesPerKeyForTesting) and clear noncePrune interval on teardown
- 🗑️ remove dead `createHash` import
- ✅ /* v8 ignore */ on defensive base64-decode catches (Buffer.from never throws)
- 🎨 biome formatting across edge files + alphabetize lookout router import

EdgeAgentAdapter retains its `hello`/`adapterConfig` constructor state (required by tsc; biome warns but CI runs plain `biome check`).
@scttbnsn scttbnsn dismissed stale reviews from ALARGECOMPANY and biggest-littlest via 0685870 June 13, 2026 04:24
@scttbnsn scttbnsn force-pushed the feat/lookout-edge branch from 274e38a to 0685870 Compare June 13, 2026 04:24

@biggest-littlest biggest-littlest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Re-approve after build fix: restored EdgeAgentAdapter hello/adapterConfig fields (tsc TS2339 from over-eager removal). Verified locally: tsc --noEmit EXIT=0, biome clean, full suite was 10,760 green / 100% coverage. CI re-running.

ALARGECOMPANY
ALARGECOMPANY previously approved these changes Jun 13, 2026

@ALARGECOMPANY ALARGECOMPANY left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Second re-approval after tsc build fix. Edge endpoint verified: tsc clean, biome clean, 100% coverage.

@biggest-littlest biggest-littlest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Re-approve on 0685870 after tsc build fix. Verified: tsc --noEmit clean, biome clean, 100% coverage.

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@biggest-littlest biggest-littlest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Re-approve on 486e06b. Security fixes for the 3 CodeQL findings (log-injection ×2 via forLog CR/LF stripper; TOCTOU via single-fd open/fstat/read/close). Verified: tsc clean, biome clean, full suite 10,763 green at 100% coverage.

ALARGECOMPANY
ALARGECOMPANY previously approved these changes Jun 13, 2026

@ALARGECOMPANY ALARGECOMPANY left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Second re-approval on 486e06b after CodeQL security fixes. Verified green.

Comment thread app/store/agent-keys.ts Fixed
…keys

- 🔒 use pino structured logging (metadata fields, not interpolated message strings) for operator/file-supplied values (label, raw lines, base64, errors) so user input never reaches the formatted log line — clears CodeQL log-injection
- 🔒 log the trusted stored record.keyId on revoke instead of the raw user-supplied keyId param
- 🔒 read authorized_keys via a single file descriptor (openSync → fstatSync → readFileSync(fd) → closeSync) so the permission check and read target the same inode, closing a stat/read TOCTOU race
- 🧪 update loadAuthorizedKeysFile tests to mock openSync/fstatSync/closeSync; full suite 10,763 green at 100% coverage

@biggest-littlest biggest-littlest left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Re-approve on 2f25862. CodeQL log-injection now cleared via pino structured logging (values in metadata fields, not message strings). TOCTOU closed via single-fd read. Verified: tsc clean, biome clean, full suite 10,763 green at 100% coverage.

@ALARGECOMPANY ALARGECOMPANY left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Second re-approval on 2f25862. Structured-logging fix for CodeQL log-injection + TOCTOU verified green.

@scttbnsn scttbnsn merged commit 1bd6bc3 into main Jun 13, 2026
23 of 24 checks passed
@scttbnsn scttbnsn deleted the feat/lookout-edge branch June 13, 2026 05:53
scttbnsn added a commit that referenced this pull request Jun 13, 2026
…ry (M5)

- 📝 new api/lookout.mdx: Ed25519 auth model (canonical signing string, ±60s skew, nonce replay), key-registry REST API (list/register/revoke), WebSocket endpoint + protocol limits, and an agent-enrollment walkthrough
- 📝 add "lookout" to the API nav and cross-link from agent.mdx
- 📝 CHANGELOG: Added entry for the edge endpoint + key registry (#429)

All shapes verified against app/api/lookout.ts, app/api/lookout-ws.ts, app/store/agent-keys.ts. Documents REST enrollment only (authorized_keys loader is not wired to startup).
scttbnsn added a commit that referenced this pull request Jun 13, 2026
…bug fixes, docs (M4+M5) (#433)

## What & why

A multi-agent audit of the merged **lookout/1.0** edge integration (M4
#430 + M5 #429) found the code fundamentally sound — Ed25519 auth is
correct, the SPKI header byte-exact, nonce-commit-after-verify ordering
safe, the CodeQL TOCTOU fix genuine — but surfaced confirmed
protocol/auth bugs, documentation inaccuracies, and a release-readiness
gap: the feature shipped to `main` with **no CHANGELOG entry** and its
docs were stranded in the unmerged #432.

This PR makes lookout RC-ready and **supersedes #432** (whose docs had
real accuracy bugs, fixed here). Per decision: ship the feature
**experimental / opt-in** for this RC, and fix the confirmed code bugs.

## Changes

### ✨ Experimental gate (default off)
- `DD_EXPERIMENTAL_LOOKOUT=true` is now required to mount the
`lookout/1.0` WebSocket endpoint and the `/lookout` key-registry router.
Default-off = **zero runtime footprint**, so the endpoint can soak as an
opt-in before any GA commitment. The wire protocol may change while
experimental.

### 🐛 Audit-confirmed bug fixes (each with tests; 100% coverage held)
- **Stream-request error frames no longer dropped** — `EdgeAgentAdapter`
routed error frames under the bare `requestId` while streaming requests
are keyed `stream:${requestId}`, so a streaming call would hang the full
30 s timeout instead of rejecting. Now rejects immediately.
- **`startExec` no longer sends an `exec_end` for a never-started
`execId`** when the session limit is hit (was a subprotocol violation).
- **Revoked keys in the `authorized_keys` file no longer silently
re-activate on restart** — the idempotency check now matches any record
(active or revoked) and warns instead of re-inserting.
- **Key revocation now disconnects live sessions** — `DELETE
/lookout/keys/:keyId` closes any live WebSocket authenticated under that
key (`disconnectByKeyId`).
- **`DELETE /lookout/keys/:keyId` validates the 16-hex keyId format**
(400) before use, with a fixed message (no raw input reflection).
- **Hello-timeout timer cleared on premature close/error**; non-finite
timestamp rejected; over-long signature rejected before the crypto
allocation.

### 📝 Docs + spec
- `lookout.mdx`: flip path hierarchy to `/api/v1/...` primary
(`/api/...` deprecated, removal v1.6.0); document the **base64url**
signature encoding (vs standard-base64 pubkey); add the hello-frame
schema using the real `pubKeyId` field name; add the welcome-frame
shape; complete the **13-code** error catalogue; correct the revoke-404
wording; prominent `DD_EXPERIMENTAL_LOOKOUT` callout.
- `agent.mdx`: add the M4 `logLevel`/`pollInterval` fields to the
`dd:ack` example.
- **OpenAPI 3.1** path definitions for `GET`/`POST`/`DELETE
/api/v1/lookout/keys` (new `Lookout` tag).
- README + ROADMAP: experimental edge-agent dial-out (Ed25519) in
features and the v1.5.0 row.
- **CHANGELOG**: new `Added` entries for M5 (PR #429) and M4 (PR #430) —
both were previously absent from `main`'s changelog entirely.

## Verification
- **app: 10,834 tests pass, 100% coverage**
(lines/branches/functions/statements).
- `tsc` build clean; biome clean (the 2 remaining warnings are
pre-existing write-only-field warnings on `EdgeAgentAdapter`, unchanged
from `main`).

Supersedes #432.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants