test(multi-pod): scaffold framework for multi-pod scenarios #3391
Merged
Three mesh pods (ports 13001/3) sharing one Postgres + NATS via Docker Compose. The framework gives tests:

- Pod-pinned HTTP/SSE client (lib/client.ts): pick which pod to hit; no LB, no sticky sessions, no guesswork.
- Per-pod control (lib/pod.ts): SIGKILL, restart, log inspection for failure-injection scenarios (pod-death recovery, etc.).
- Session bootstrap (lib/setup.ts): sign up, create org, mint API key in one call. Returns auth artifacts that work against any pod because Better Auth state lives in shared Postgres.
- Polling helper (lib/poll-until.ts): the only sane way to write distributed-systems assertions, i.e. "this becomes true within N seconds".

The cluster-smoke scenario proves the foundation actually works: all three pods respond to /health/live and /health/ready independently, and a session created on pod-1 is recognized by pod-2 and pod-3. If any of those failed, every higher-level scenario would be untrustworthy.

Migrations run once (separate `migrate` service) so the three mesh pods don't race each other; pods skip migration in their entrypoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
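To make the "this becomes true within N seconds" idea concrete, here is a minimal sketch of what a polling helper of this kind could look like. It is not the contents of lib/poll-until.ts; the `pollUntil` name, the `PollOptions` shape, and the 100 ms default interval are assumptions made for illustration.

```typescript
// Hypothetical sketch: the real lib/poll-until.ts may use a different signature.
export interface PollOptions {
  timeoutMs: number; // total budget for the condition to become true
  intervalMs?: number; // pause between attempts; assumed default of 100 ms
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Re-runs `check` until it returns true or the time budget is spent.
// Errors thrown by `check` count as "not yet" and are retried.
export async function pollUntil(
  check: () => boolean | Promise<boolean>,
  { timeoutMs, intervalMs = 100 }: PollOptions,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;
  for (;;) {
    try {
      if (await check()) return;
    } catch (err) {
      lastError = err; // e.g. connection refused while a pod restarts
    }
    if (Date.now() >= deadline) {
      const detail =
        lastError === undefined ? "" : ` (last error: ${String(lastError)})`;
      throw new Error(`condition not met within ${timeoutMs} ms${detail}`);
    }
    await sleep(intervalMs);
  }
}
```

Treating thrown errors as "not yet" is a deliberate choice in this sketch: during failure injection a pod may refuse connections for a while, and the assertion should only fail once the whole window has elapsed.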
2 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="tests/multi-pod/docker-compose.yml">
<violation number="1" location="tests/multi-pod/docker-compose.yml:18">
P1: Check the NATS monitoring endpoint instead of `--help`; this healthcheck can pass without the server actually being ready.</violation>
<violation number="2" location="tests/multi-pod/docker-compose.yml:48">
P1: Pass `--skip-migrations` here; the pods still run the CLI migration phase on startup, so the separate migrate service doesn't actually eliminate concurrent migrations.</violation>
</file>
NATS healthcheck switched from `nats-server --help` (passes even when the server isn't serving) to `wget http://localhost:8222/healthz` against the monitoring port we already enable with `-m 8222`. Requires the `-alpine` variant of the nats image so the container has a shell.

Mesh pods now pass `--skip-migrations` to `bun run src/cli.ts`. The CLI runs the Kysely + Better Auth migration step on boot by default; without the flag, all three pods raced on the migration tables independently of the separate `migrate` service. The `migrate` service is now the sole source of truth for schema state, pods read-only at boot.

DBOS still does its own (idempotent) system-schema setup on launch; those messages are expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
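The same readiness signal can be checked from test code. The sketch below assumes the monitoring port 8222 is reachable from wherever the tests run (the Compose file may not publish it to the host); `isNatsHealthy` and `HttpGet` are hypothetical names, and the HTTP getter is injected so the snippet does not depend on a particular `fetch` typing.

```typescript
// Minimal HTTP getter type (hypothetical), so any client can be plugged in.
export type HttpGet = (url: string) => Promise<{ status: number }>;

// Assumption: the NATS monitoring port (-m 8222) is reachable at monitorUrl.
// Returns true only when /healthz answers with HTTP 200.
export async function isNatsHealthy(
  get: HttpGet,
  monitorUrl = "http://localhost:8222",
): Promise<boolean> {
  try {
    const res = await get(`${monitorUrl}/healthz`);
    return res.status === 200;
  } catch {
    return false; // connection refused: the server is not listening yet
  }
}
```

On a runtime with a global `fetch`, the getter could be as simple as `(url) => fetch(url)`. Unlike `nats-server --help`, this check can only pass once the server process is actually answering requests.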
Two reviewer-flagged issues, plus a third that surfaced once the first was fixed.

1. **restart: unless-stopped → "no"**. Docker restart policies are bypassed only by `docker stop`, not `docker kill`. With `unless-stopped`, a SIGKILL'd pod was auto-restarted within seconds, which would defeat any pod-death scenario.
2. **cluster.up() now matches run.sh**. Drops `--wait` and relies on `waitReady()` instead, because `--wait` mis-classifies the one-shot `migrate` service as a failure.
3. **mesh-2 and mesh-3 wait for mesh-1 healthy**. Removing the restart policy exposed a real DBOS multi-pod boot race: even with `--skip-migrations` on mesh's CLI, DBOS still runs its own system-schema migrations on launch, and three parallel boots race on the `dbos.dbos_migrations` PK. The old restart loop was silently masking this by retrying until tables existed. Fix: serialize first boot via a depends_on chain. Once schemas exist, single-pod restarts mid-test are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
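The PR serializes first boot in Docker Compose with a depends_on chain, not in TypeScript. Purely as an illustration of the same ordering rule (start the next pod only after the previous one reports healthy), here is a sketch with hypothetical names (`BootStep`, `bootSerially`) and injected start and probe functions.

```typescript
// Hypothetical description of one pod's boot: how to start it and how to probe it.
export interface BootStep {
  name: string;
  start: () => Promise<void>; // e.g. launch the container (assumption)
  isHealthy: () => Promise<boolean>; // e.g. probe /health/live (assumption)
}

// Starts pods one at a time; the next pod is not started until the previous
// one is healthy, so only one pod at a time runs first-boot schema setup.
export async function bootSerially(
  steps: BootStep[],
  timeoutMs = 60000,
  intervalMs = 250,
): Promise<string[]> {
  const order: string[] = [];
  for (const step of steps) {
    await step.start();
    const deadline = Date.now() + timeoutMs;
    while (!(await step.isHealthy())) {
      if (Date.now() >= deadline) {
        throw new Error(`${step.name} not healthy within ${timeoutMs} ms`);
      }
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
    order.push(step.name);
  }
  return order;
}
```

The point of the ordering is that only the first pod ever sees an empty schema; later pods boot against tables that already exist, which is why mid-test restarts of a single pod are unaffected.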
The /loop scheduling system writes a `.claude/scheduled_tasks.lock` sentinel during interactive sessions; without this rule it shows up as untracked and can be accidentally staged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two LLM-free scenarios that exercise the shared-Postgres contract along distinct code paths:

- **session-rehoming**: sign out on pod-2 with a cookie minted on pod-1, verify the cookie is rejected on every pod within a 5s window. Guards against any future per-pod session cache that would let a signed-out user keep hitting a different pod.
- **api-key-cross-pod**: mint an API key on pod-1, call the MCP `COLLECTION_THREADS_LIST` tool on every pod with the same Bearer key. Validates the Bearer → API-key-table lookup that the decopilot endpoints (POST /messages, GET /attach) will rely on once we add the cross-pod /attach scenario.

CI runs on push to main + workflow_dispatch (matching the resilience workflow). The mesh and infra logs are dumped on failure so a CI regression is debuggable without re-running locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
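The session-rehoming assertion ("rejected on every pod within a 5s window") can be sketched as follows. This is not the scenario's actual code: `SessionProbe`, `podsStillAccepting`, and `expectRejectedEverywhere` are hypothetical names, and the probe that decides whether a pod still accepts the cookie is injected by the caller.

```typescript
// Hypothetical probe: resolves true while the given pod still accepts the cookie.
export type SessionProbe = (podUrl: string) => Promise<boolean>;

// Probes every pod in parallel and returns the pods that still accept the session.
export async function podsStillAccepting(
  pods: string[],
  accepts: SessionProbe,
): Promise<string[]> {
  const results = await Promise.all(
    pods.map(async (pod) => ({ pod, ok: await accepts(pod) })),
  );
  return results.filter((r) => r.ok).map((r) => r.pod);
}

// Passes once no pod accepts the session; fails if any pod still does
// when the window (5 s by default, as in the scenario description) has elapsed.
export async function expectRejectedEverywhere(
  pods: string[],
  accepts: SessionProbe,
  windowMs = 5000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + windowMs;
  for (;;) {
    const stale = await podsStillAccepting(pods, accepts);
    if (stale.length === 0) return;
    if (Date.now() >= deadline) {
      throw new Error(`cookie still accepted on: ${stale.join(", ")}`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Naming the pods that still accept the cookie in the failure message is what would make a per-pod session cache easy to spot: the stale pod shows up directly in the test output.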
Without this the workflow only ran after merge, which defeats the point of catching regressions before they land. Path filter keeps UI/docs/plugin-only PRs free of the ~7-10 min cluster-boot cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What is this contribution about?
Studio currently has no way to test multi-pod behavior. Bugs that only surface across pods — cross-pod `/attach`, DBOS workflow replay after pod death, NATS partition tolerance, session rehoming — can't be reproduced with `bun run dev` (single process) and aren't covered by the existing single-pod resilience suite. This PR adds a Docker-Compose-based framework that runs three mesh pods against shared Postgres + NATS, plus pod-pinned HTTP/SSE clients, per-pod `kill`/`restart` controls, a session-bootstrap helper, and a polling helper for distributed assertions.
A `cluster-smoke` scenario proves the foundation: each pod responds to `/health/live` and `/health/ready` independently, and a session created on pod-1 is recognized by pod-2 and pod-3. If any of those failed, every higher-level scenario would be untrustworthy. Migrations run once via a separate `migrate` service so the three mesh pods don't race; pods skip migration in their entrypoint. No scenarios that need an LLM yet — those will land in a follow-up PR after we agree on a mock-AI-provider strategy.
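As a rough illustration of what "pod-pinned" means, the sketch below builds requests against one explicitly chosen pod and attaches the same auth artifacts regardless of which pod is targeted. It is not the API of lib/client.ts: `pinTo`, `PodName`, and `PinnedRequest` are hypothetical, and the base URL of each pod is supplied by the caller rather than assumed.

```typescript
export type PodName = "pod-1" | "pod-2" | "pod-3";

// Hypothetical request description; a real client would go on to send it.
export interface PinnedRequest {
  url: string;
  headers: Record<string, string>;
}

// Returns a request builder bound to one pod. There is no load balancer in
// between: the caller decides which pod receives each call.
export function pinTo(
  baseUrls: Record<PodName, string>,
  pod: PodName,
  auth?: { cookie?: string; apiKey?: string },
): (path: string) => PinnedRequest {
  const base = baseUrls[pod].replace(/\/+$/, ""); // drop trailing slashes
  return (path: string): PinnedRequest => {
    const headers: Record<string, string> = {};
    if (auth?.cookie) headers["cookie"] = auth.cookie;
    if (auth?.apiKey) headers["authorization"] = `Bearer ${auth.apiKey}`;
    return { url: `${base}${path}`, headers };
  };
}
```

A smoke check in this style would create the session through a builder pinned to pod-1, then reuse the same cookie with builders pinned to pod-2 and pod-3 and expect each of them to recognize it.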
How to Test
For iteration: `docker compose -f tests/multi-pod/docker-compose.yml up -d --build` once, then `bun test tests/multi-pod/scenarios/` repeatedly; the `registerTestHooks()` in each scenario waits for `/health/live` so it's race-free.
Migration Notes
None. Self-contained under `tests/multi-pod/`; doesn't change any application code.
Review Checklist
Summary by cubic
Adds a Docker Compose multi-pod test framework (3 mesh pods on shared Postgres + NATS) with smoke and cross-pod auth scenarios, pod-pinned clients, failure injection, distributed assertions, and CI to run on `main` and path-filtered PRs that touch mesh server code. Orchestration is hardened with a real NATS `/healthz`, a one-shot `migrate`, serialized first boot, and no auto-restart to make pod-death tests reliable.

New Features

- […] `migrate`; mesh pods run with `--skip-migrations`. Runner: `tests/multi-pod/run.sh`; CI: `.github/workflows/multi-pod.yml` (runs on push to `main`, on PRs that modify server-side code via path filter, and `workflow_dispatch`; dumps logs on failure).
- […] (`kill`/`stop`/`start`) and logs for failure injection.
- […] `cluster-smoke`, `api-key-cross-pod` (Bearer path), and `session-rehoming` (cookie invalidation propagates).

Bug Fixes

- […] `/healthz` on `nats:2.10.22-alpine` to avoid false positives.
- […] (`mesh-2` waits on `mesh-1`, `mesh-3` on `mesh-2`), disabled Docker auto-restart (`restart: "no"`), and dropped `docker compose --wait` in favor of `/health/live` polling.
- […] `.claude/*.lock` to prevent accidental commits of local runtime locks.

Written for commit b04ff27.