test(multi-pod): scaffold framework for multi-pod scenarios by viktormarinho · Pull Request #3391 · decocms/studio

viktormarinho · 2026-05-17T17:54:47Z

What is this contribution about?

Studio currently has no way to test multi-pod behavior. Bugs that only surface across pods — cross-pod `/attach`, DBOS workflow replay after pod death, NATS partition tolerance, session rehoming — can't be reproduced with `bun run dev` (single process) and aren't covered by the existing single-pod resilience suite. This PR adds a Docker-Compose-based framework that runs three mesh pods against shared Postgres + NATS, plus pod-pinned HTTP/SSE clients, per-pod `kill`/`restart` controls, a session-bootstrap helper, and a polling helper for distributed assertions.

A `cluster-smoke` scenario proves the foundation: each pod responds to `/health/live` and `/health/ready` independently, and a session created on pod-1 is recognized by pod-2 and pod-3. If any of those failed, every higher-level scenario would be untrustworthy. Migrations run once via a separate `migrate` service so the three mesh pods don't race; pods skip migration in their entrypoint. No scenarios that need an LLM yet — those will land in a follow-up PR after we agree on a mock-AI-provider strategy.
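A minimal sketch of how a pod-pinned client and the per-pod health check could look. The names `podClient`, `PinnedFetch`, and `allPodsHealthy` are hypothetical and not taken from the PR; the fetch function is injected so the sketch runs without a cluster.

```typescript
// Hypothetical shapes for illustration; lib/client.ts in the PR may differ.
type Auth = { bearer?: string; cookie?: string };
type PinnedFetch = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number }>;

// Build a client bound to exactly one pod's base URL (no load balancer in between).
function podClient(baseUrl: string, fetchImpl: PinnedFetch) {
  const root = baseUrl.replace(/\/+$/, "");
  return {
    get(path: string, auth: Auth = {}) {
      // Auth is supplied per request, so one client can act as several users.
      const headers: Record<string, string> = {};
      if (auth.bearer) headers["authorization"] = `Bearer ${auth.bearer}`;
      if (auth.cookie) headers["cookie"] = auth.cookie;
      return fetchImpl(`${root}${path}`, { headers });
    },
  };
}

// Resolve to true only if every pod answers both health endpoints successfully.
async function allPodsHealthy(
  baseUrls: string[],
  fetchImpl: PinnedFetch,
): Promise<boolean> {
  const checks: Promise<{ ok: boolean; status: number }>[] = [];
  for (const base of baseUrls) {
    const pod = podClient(base, fetchImpl);
    checks.push(pod.get("/health/live"), pod.get("/health/ready"));
  }
  const results = await Promise.all(checks);
  return results.every((res) => res.ok);
}
```

In a real scenario the injected function would be the runtime's own `fetch`, and the base URLs would be the three pods' published ports.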

How to Test

1. Start Docker Desktop.
2. Run `./tests/multi-pod/run.sh`. It builds the studio image (~5 min cold, cached after), brings up postgres + nats + 3 mesh pods, runs the smoke scenario, and tears everything down.
3. Expected: `3 pass / 0 fail`. The cluster tears down on exit even if a test fails.

For iteration: `docker compose -f tests/multi-pod/docker-compose.yml up -d --build` once, then `bun test tests/multi-pod/scenarios/` repeatedly; the `registerTestHooks()` in each scenario waits for `/health/live` so it's race-free.

Migration Notes

None. Self-contained under `tests/multi-pod/`; doesn't change any application code.

Review Checklist

PR title is clear and descriptive
Changes are tested and working (3/3 smoke assertions pass locally)
Documentation is updated (if needed) — none yet; comments in each file explain the structure
No breaking changes

Summary by cubic

Adds a Docker Compose multi-pod test framework (3 mesh pods on shared Postgres + NATS) with smoke and cross-pod auth scenarios, pod-pinned clients, failure injection, distributed assertions, and CI to run on main and path-filtered PRs that touch mesh server code. Orchestration is hardened with a real NATS /healthz, a one-shot migrate, serialized first boot, and no auto-restart to make pod-death tests reliable.

New Features
- 3-pod cluster via Docker Compose with a one-shot migrate; mesh pods run with --skip-migrations. Runner: tests/multi-pod/run.sh; CI: .github/workflows/multi-pod.yml (runs on push to main, on PRs that modify server-side code via path filter, and workflow_dispatch; dumps logs on failure).
- Pod-pinned HTTP/SSE client with per-request auth; per-pod controls (kill/stop/start) and logs for failure injection.
- Session bootstrap (sign up → org → API key) on shared Postgres; scenarios: cluster-smoke, api-key-cross-pod (Bearer path), and session-rehoming (cookie invalidation propagates).

Bug Fixes
- NATS healthcheck uses monitoring /healthz on nats:2.10.22-alpine to avoid false positives.
- Startup hardened: serialized first boot (mesh-2 waits on mesh-1, mesh-3 on mesh-2), disabled Docker auto-restart (restart: "no"), and dropped docker compose --wait in favor of /health/live polling.
- Ignore .claude/*.lock to prevent accidental commits of local runtime locks.

Written for commit b04ff27. Summary will update on new commits.

Three mesh pods (ports 13001/3) sharing one Postgres + NATS via Docker Compose. The framework gives tests:

- Pod-pinned HTTP/SSE client (lib/client.ts): pick which pod to hit; no LB, no sticky sessions, no guesswork.
- Per-pod control (lib/pod.ts): SIGKILL, restart, log inspection for failure-injection scenarios (pod-death recovery, etc.).
- Session bootstrap (lib/setup.ts): sign up, create org, mint API key in one call. Returns auth artifacts that work against any pod because Better Auth state lives in shared Postgres.
- Polling helper (lib/poll-until.ts): the only sane way to write distributed-systems assertions of the form "this becomes true within N seconds".

The cluster-smoke scenario proves the foundation actually works: all three pods respond to /health/live and /health/ready independently, and a session created on pod-1 is recognized by pod-2 and pod-3. If any of those failed, every higher-level scenario would be untrustworthy.

Migrations run once (separate `migrate` service) so the three mesh pods don't race each other; pods skip migration in their entrypoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
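The polling helper mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not the contents of lib/poll-until.ts: the name `pollUntil`, the `PollOptions` shape, and the default interval are all hypothetical.

```typescript
// Hypothetical sketch of a polling helper; the real lib/poll-until.ts may differ.
interface PollOptions {
  timeoutMs: number; // give up after this long
  intervalMs?: number; // delay between attempts (assumed default: 100 ms)
}

// Re-run `probe` until it resolves to a truthy value, then return that value.
// A probe that throws or returns a falsy value is treated as "not yet".
async function pollUntil<T>(
  probe: () => Promise<T | undefined | null | false>,
  options: PollOptions,
): Promise<T> {
  const intervalMs = options.intervalMs === undefined ? 100 : options.intervalMs;
  const deadline = Date.now() + options.timeoutMs;
  let lastError: unknown = undefined;
  for (;;) {
    try {
      const value = await probe();
      if (value) return value;
    } catch (err) {
      lastError = err;
    }
    if (Date.now() >= deadline) {
      const detail = lastError === undefined ? "" : ` (last error: ${String(lastError)})`;
      throw new Error(`condition not met within ${options.timeoutMs} ms${detail}`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A scenario would wrap its cross-pod check in the probe, for example "pod-2 now sees the session created on pod-1", and let the helper absorb propagation delay instead of sleeping for a fixed time.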

cubic-dev-ai

2 issues found across 10 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/multi-pod/docker-compose.yml">

<violation number="1" location="tests/multi-pod/docker-compose.yml:18">
P1: Check the NATS monitoring endpoint instead of `--help`; this healthcheck can pass without the server actually being ready.</violation>

<violation number="2" location="tests/multi-pod/docker-compose.yml:48">
P1: Pass `--skip-migrations` here; the pods still run the CLI migration phase on startup, so the separate migrate service doesn't actually eliminate concurrent migrations.</violation>
</file>


NATS healthcheck switched from `nats-server --help` (passes even when the server isn't serving) to `wget http://localhost:8222/healthz` against the monitoring port we already enable with `-m 8222`. Requires the `-alpine` variant of the nats image so the container has a shell.

Mesh pods now pass `--skip-migrations` to `bun run src/cli.ts`. The CLI runs the Kysely + Better Auth migration step on boot by default; without the flag, all three pods raced on the migration tables independently of the separate `migrate` service. The `migrate` service is now the sole source of truth for schema state, and pods are read-only at boot. DBOS still does its own (idempotent) system-schema setup on launch; those messages are expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two reviewer-flagged issues, plus a third that surfaced once the first was fixed.

1. **restart: unless-stopped → "no"**. Docker restart policies are bypassed only by `docker stop`, not `docker kill`. With `unless-stopped`, a SIGKILL'd pod was auto-restarted within seconds, which would defeat any pod-death scenario.
2. **cluster.up() now matches run.sh**. Drops `--wait` and relies on `waitReady()` instead, because `--wait` mis-classifies the one-shot `migrate` service as a failure.
3. **mesh-2 and mesh-3 wait for mesh-1 healthy**. Removing the restart policy exposed a real DBOS multi-pod boot race: even with `--skip-migrations` on mesh's CLI, DBOS still runs its own system-schema migrations on launch, and three parallel boots race on the `dbos.dbos_migrations` PK. The old restart loop was silently masking this by retrying until tables existed. Fix: serialize first boot via a depends_on chain. Once schemas exist, single-pod restarts mid-test are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
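As a rough illustration of replacing `docker compose --wait` with explicit `/health/live` polling, a readiness gate could look like the sketch below. Only the name `waitReady()` and the `/health/live` path come from the PR; the signature, the `HealthFetch` type, and the retry interval are assumptions.

```typescript
// Hypothetical sketch; the PR's waitReady() may have a different signature.
type HealthFetch = (url: string) => Promise<{ ok: boolean }>;

// Resolve once every pod answers /health/live successfully; reject if any pod
// is still not live when the shared deadline passes.
async function waitReady(
  baseUrls: string[],
  timeoutMs: number,
  fetchImpl: HealthFetch,
  intervalMs = 250,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  await Promise.all(
    baseUrls.map(async (base) => {
      for (;;) {
        try {
          const res = await fetchImpl(`${base}/health/live`);
          if (res.ok) return;
        } catch (err) {
          // Connection errors are expected while the pod is still booting.
        }
        if (Date.now() >= deadline) {
          throw new Error(`${base} not live within ${timeoutMs} ms`);
        }
        await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
      }
    }),
  );
}
```

Because the gate only looks at the mesh pods, a one-shot service that exits after finishing its work does not count as a failure, which is the problem the commit describes with `--wait`.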

The /loop scheduling system writes a `.claude/scheduled_tasks.lock` sentinel during interactive sessions; without this rule it shows up as untracked and can be accidentally staged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two LLM-free scenarios that exercise the shared-Postgres contract along distinct code paths:

- **session-rehoming**: sign out on pod-2 with a cookie minted on pod-1, verify the cookie is rejected on every pod within a 5s window. Guards against any future per-pod session cache that would let a signed-out user keep hitting a different pod.
- **api-key-cross-pod**: mint an API key on pod-1, call the MCP `COLLECTION_THREADS_LIST` tool on every pod with the same Bearer key. Validates the Bearer → API-key-table lookup that the decopilot endpoints (POST /messages, GET /attach) will rely on once we add the cross-pod /attach scenario.

CI runs on push to main + workflow_dispatch (matching the resilience workflow). The mesh and infra logs are dumped on failure so a CI regression is debuggable without re-running locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
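The session-rehoming assertion ("rejected on every pod within a 5s window") might be expressed along these lines. Everything here is a hypothetical sketch: `expectRejectedEverywhere` and `StatusProbe` are invented names, and treating HTTP 401 as "rejected" is an assumption about the endpoint, not something the PR states.

```typescript
// Hypothetical: statusOn(pod) sends a request pinned to `pod` carrying the
// signed-out cookie and resolves to the HTTP status code of the response.
type StatusProbe = (pod: string) => Promise<number>;

// Pass only when all pods reject the cookie in the same polling round, so a
// single pod with a stale session cache keeps the assertion failing.
async function expectRejectedEverywhere(
  pods: string[],
  statusOn: StatusProbe,
  windowMs = 5000,
  intervalMs = 100,
  isRejected: (status: number) => boolean = (s) => s === 401, // assumed status
): Promise<void> {
  const deadline = Date.now() + windowMs;
  for (;;) {
    const statuses = await Promise.all(pods.map((pod) => statusOn(pod)));
    const stillAccepted = pods.filter((_, i) => !isRejected(statuses[i]));
    if (stillAccepted.length === 0) return;
    if (Date.now() >= deadline) {
      throw new Error(`cookie still accepted on: ${stillAccepted.join(", ")}`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Requiring agreement within a single round is a deliberate choice in this sketch: it avoids passing when pods flip between accepting and rejecting across rounds.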

Without this the workflow only ran after merge, which defeats the point of catching regressions before they land. Path filter keeps UI/docs/plugin-only PRs free of the ~7-10 min cluster-boot cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 17, 2026


viktormarinho and others added 5 commits May 17, 2026 15:03

viktormarinho merged commit e539de9 into main May 17, 2026
12 checks passed

viktormarinho deleted the viktormarinho/multi-pod-tests branch May 17, 2026 18:33

viktormarinho mentioned this pull request May 17, 2026

test(multi-pod): cross-pod /attach + mock-ai provider scaffolding #3392

Merged

4 tasks

viktormarinho merged 6 commits into main from viktormarinho/multi-pod-tests
