test(multi-pod): scaffold framework for multi-pod scenarios #3391

Merged
viktormarinho merged 6 commits into main from viktormarinho/multi-pod-tests on May 17, 2026

@viktormarinho viktormarinho commented May 17, 2026

What is this contribution about?

Studio currently has no way to test multi-pod behavior. Bugs that only surface across pods — cross-pod `/attach`, DBOS workflow replay after pod death, NATS partition tolerance, session rehoming — can't be reproduced with `bun run dev` (single process) and aren't covered by the existing single-pod resilience suite. This PR adds a Docker-Compose-based framework that runs three mesh pods against shared Postgres + NATS, plus pod-pinned HTTP/SSE clients, per-pod `kill`/`restart` controls, a session-bootstrap helper, and a polling helper for distributed assertions.

A `cluster-smoke` scenario proves the foundation: each pod responds to `/health/live` and `/health/ready` independently, and a session created on pod-1 is recognized by pod-2 and pod-3. If any of those failed, every higher-level scenario would be untrustworthy. Migrations run once via a separate `migrate` service so the three mesh pods don't race; pods skip migration in their entrypoint. No scenarios that need an LLM yet — those will land in a follow-up PR after we agree on a mock-AI-provider strategy.
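
As a rough illustration of the per-pod health assertion described above, here is a minimal sketch. The names `Probe`, `PodHealth`, and `checkPods` are hypothetical and are not taken from the PR's `lib/` code; the probe is injected so the example does not depend on a running cluster.

```typescript
// Hypothetical sketch: these names are illustrative assumptions, not the PR's lib/ API.
type Probe = (url: string) => Promise<{ ok: boolean }>;

interface PodHealth {
  pod: string;
  live: boolean;
  ready: boolean;
}

// Query /health/live and /health/ready on each pod directly (no load balancer).
async function checkPods(baseUrls: string[], probe: Probe): Promise<PodHealth[]> {
  return Promise.all(
    baseUrls.map(async (pod) => {
      const hit = async (path: string): Promise<boolean> => {
        try {
          return (await probe(`${pod}${path}`)).ok;
        } catch {
          return false; // a refused connection counts as unhealthy
        }
      };
      const [live, ready] = await Promise.all([hit("/health/live"), hit("/health/ready")]);
      return { pod, live, ready };
    }),
  );
}
```

Against a real cluster the probe could simply be `(url) => fetch(url)`, with the base URLs set to whatever host ports the Compose file maps for each pod.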

How to Test

  1. Start Docker Desktop.
  2. `./tests/multi-pod/run.sh` — builds the studio image (~5 min cold, cached after), brings up postgres + nats + 3 mesh pods, runs the smoke scenario, tears everything down.
  3. Expected: `3 pass / 0 fail`. Cluster tears down on exit even if a test fails.

For iteration: `docker compose -f tests/multi-pod/docker-compose.yml up -d --build` once, then `bun test tests/multi-pod/scenarios/` repeatedly; the `registerTestHooks()` in each scenario waits for `/health/live` so it's race-free.

Migration Notes

None. Self-contained under `tests/multi-pod/`; doesn't change any application code.

Review Checklist

  • PR title is clear and descriptive
  • Changes are tested and working (3/3 smoke assertions pass locally)
  • Documentation is updated (if needed) — none yet; comments in each file explain the structure
  • No breaking changes

Summary by cubic

Adds a Docker Compose multi-pod test framework (3 mesh pods on shared Postgres + NATS) with smoke and cross-pod auth scenarios, pod-pinned clients, failure injection, distributed assertions, and CI to run on main and path-filtered PRs that touch mesh server code. Orchestration is hardened with a real NATS /healthz, a one-shot migrate, serialized first boot, and no auto-restart to make pod-death tests reliable.

  • New Features

    • 3-pod cluster via Docker Compose with a one-shot migrate; mesh pods run with --skip-migrations. Runner: tests/multi-pod/run.sh; CI: .github/workflows/multi-pod.yml (runs on push to main, on PRs that modify server-side code via path filter, and workflow_dispatch; dumps logs on failure).
    • Pod-pinned HTTP/SSE client with per-request auth; per-pod controls (kill/stop/start) and logs for failure injection.
    • Session bootstrap (sign up → org → API key) on shared Postgres; scenarios: cluster-smoke, api-key-cross-pod (Bearer path), and session-rehoming (cookie invalidation propagates).
  • Bug Fixes

    • NATS healthcheck uses monitoring /healthz on nats:2.10.22-alpine to avoid false positives.
    • Startup hardened: serialized first boot (mesh-2 waits on mesh-1, mesh-3 on mesh-2), disabled Docker auto-restart (restart: "no"), and dropped docker compose --wait in favor of /health/live polling.
    • Ignore .claude/*.lock to prevent accidental commits of local runtime locks.

Written for commit b04ff27. Summary will update on new commits. Review in cubic

Three mesh pods (ports 13001/3) sharing one Postgres + NATS via Docker
Compose. The framework gives tests:

- Pod-pinned HTTP/SSE client (lib/client.ts) — pick which pod to hit;
  no LB, no sticky sessions, no guesswork.
- Per-pod control (lib/pod.ts) — SIGKILL, restart, log inspection for
  failure-injection scenarios (pod-death recovery, etc.).
- Session bootstrap (lib/setup.ts) — sign up, create org, mint API key
  in one call. Returns auth artifacts that work against any pod because
  Better Auth state lives in shared Postgres.
- Polling helper (lib/poll-until.ts) — the only sane way to write
  distributed-systems assertions: "this becomes true within N seconds".
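
A polling assertion of the "this becomes true within N seconds" kind can be sketched as follows. This is not the contents of `lib/poll-until.ts`; the function name, option names, and defaults are assumptions made for illustration.

```typescript
// Hypothetical sketch of a polling helper; signature and defaults are assumed.
interface PollOptions {
  timeoutMs?: number;
  intervalMs?: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Resolves as soon as `condition` returns true; rejects once the deadline passes.
async function pollUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 100 }: PollOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await sleep(intervalMs);
  }
}
```

The condition is always evaluated at least once, so a state that is already true passes immediately even with a zero timeout.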

The cluster-smoke scenario proves the foundation actually works: all
three pods respond to /health/live and /health/ready independently, and
a session created on pod-1 is recognized by pod-2 and pod-3. If any of
those failed, every higher-level scenario would be untrustworthy.

Migrations run once (separate `migrate` service) so the three mesh pods
don't race each other; pods skip migration in their entrypoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 10 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/multi-pod/docker-compose.yml">

<violation number="1" location="tests/multi-pod/docker-compose.yml:18">
P1: Check the NATS monitoring endpoint instead of `--help`; this healthcheck can pass without the server actually being ready.</violation>

<violation number="2" location="tests/multi-pod/docker-compose.yml:48">
P1: Pass `--skip-migrations` here; the pods still run the CLI migration phase on startup, so the separate migrate service doesn't actually eliminate concurrent migrations.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

viktormarinho and others added 5 commits May 17, 2026 15:03
NATS healthcheck switched from `nats-server --help` (passes even when
the server isn't serving) to `wget http://localhost:8222/healthz`
against the monitoring port we already enable with `-m 8222`. Requires
the `-alpine` variant of the nats image so the container has a shell.

Mesh pods now pass `--skip-migrations` to `bun run src/cli.ts`. The CLI
runs the Kysely + Better Auth migration step on boot by default; without
the flag, all three pods raced on the migration tables independently of
the separate `migrate` service. The `migrate` service is now the sole
source of truth for schema state; pods treat the schema as read-only at boot.

DBOS still does its own (idempotent) system-schema setup on launch;
those messages are expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two reviewer-flagged issues, plus a third that surfaced once the first
was fixed.

1. **restart: unless-stopped → "no"**. Docker restart policies are
   bypassed only by `docker stop`, not `docker kill`. With
   `unless-stopped`, a SIGKILL'd pod was auto-restarted within seconds,
   which would defeat any pod-death scenario.

2. **cluster.up() now matches run.sh**. Drops `--wait` and relies on
   `waitReady()` instead, because `--wait` mis-classifies the one-shot
   `migrate` service as a failure.

3. **mesh-2 and mesh-3 wait for mesh-1 healthy**. Removing the restart
   policy exposed a real DBOS multi-pod boot race: even with
   `--skip-migrations` on mesh's CLI, DBOS still runs its own
   system-schema migrations on launch, and three parallel boots race on the
   `dbos.dbos_migrations` PK. The old restart loop was silently masking
   this by retrying until tables existed. Fix: serialize first boot via
   a depends_on chain. Once schemas exist, single-pod restarts mid-test
   are unaffected.
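
The PR enforces this ordering with `depends_on` in the Compose file. Purely to illustrate what serialized first boot means, the same ordering can be written as a short TypeScript sketch; `bootSerially` and its two callbacks are hypothetical names and not part of the framework.

```typescript
// Hypothetical sketch: expresses the boot ordering in code for illustration only.
// The PR itself relies on a depends_on chain in docker-compose.yml.
async function bootSerially(
  pods: string[],
  start: (pod: string) => Promise<void>,
  waitHealthy: (pod: string) => Promise<void>,
): Promise<void> {
  for (const pod of pods) {
    await start(pod); // launch this pod only after the previous one is healthy
    await waitHealthy(pod); // its schema setup has finished before the next pod boots
  }
}
```

Because each pod finishes its system-schema setup before the next one starts, no two first boots can collide on the same migration row.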

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /loop scheduling system writes a `.claude/scheduled_tasks.lock`
sentinel during interactive sessions; without this rule it shows up as
untracked and can be accidentally staged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two LLM-free scenarios that exercise the shared-Postgres contract along
distinct code paths:

- **session-rehoming**: sign out on pod-2 with a cookie minted on pod-1,
  verify the cookie is rejected on every pod within a 5s window. Guards
  against any future per-pod session cache that would let a signed-out
  user keep hitting a different pod.

- **api-key-cross-pod**: mint an API key on pod-1, call the MCP
  `COLLECTION_THREADS_LIST` tool on every pod with the same Bearer key.
  Validates the Bearer → API-key-table lookup that the decopilot
  endpoints (POST /messages, GET /attach) will rely on once we add the
  cross-pod /attach scenario.
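
The session-rehoming check ("rejected on every pod within a 5s window") could look roughly like the sketch below. The `AcceptsCookie` callback and `podsStillAccepting` are hypothetical; how a scenario actually asks a pod whether a cookie is valid is not shown in this PR text, so that part is injected.

```typescript
// Hypothetical sketch: the callback reports whether a pod still accepts the cookie.
type AcceptsCookie = (podUrl: string, cookie: string) => Promise<boolean>;

// Polls every pod until each has rejected the cookie or the window expires.
// Returns the pods that still accept it; an empty array means success.
async function podsStillAccepting(
  pods: string[],
  cookie: string,
  accepts: AcceptsCookie,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<string[]> {
  const deadline = Date.now() + timeoutMs;
  let pending = [...pods];
  for (;;) {
    const results = await Promise.all(pending.map((pod) => accepts(pod, cookie)));
    // Keep only pods that still accept; pods that rejected are not re-checked.
    pending = pending.filter((_, i) => results[i]);
    if (pending.length === 0 || Date.now() >= deadline) return pending;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Returning the offending pods rather than a boolean makes a failure message name the pod that kept honoring the signed-out cookie.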

CI runs on push to main + workflow_dispatch (matching the resilience
workflow). The mesh and infra logs are dumped on failure so a CI
regression is debuggable without re-running locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this the workflow only ran after merge, which defeats the point
of catching regressions before they land. Path filter keeps UI/docs/
plugin-only PRs free of the ~7-10 min cluster-boot cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@viktormarinho viktormarinho merged commit e539de9 into main May 17, 2026
12 checks passed
@viktormarinho viktormarinho deleted the viktormarinho/multi-pod-tests branch May 17, 2026 18:33