feat(auth): per-sandbox authentication to gateway by TaylorMutch · Pull Request #1404 · NVIDIA/OpenShell

TaylorMutch · 2026-05-15T16:16:25Z

Summary

Adds per-sandbox supervisor authentication for gateway RPCs and closes the cross-sandbox access gap tracked in #1354. Sandbox supervisors now authenticate as a specific Principal::Sandbox; gateway handlers then enforce that the authenticated sandbox matches the sandbox named in each sandbox-scoped request.

The design has two first-class bootstrap patterns:

Docker, Podman, and VM sandboxes receive gateway-minted JWT bootstrap material through driver-managed supervisor secret files.
Kubernetes sandboxes exchange projected ServiceAccount identity for the same kind of gateway-minted JWT.

After bootstrap, both patterns converge on the same steady-state behavior: the supervisor presents Authorization: Bearer <gateway-jwt>, refreshes that credential in memory, and is authorized only for its own sandbox.

Related Issue

Closes #1354

Changes

Introduces Authenticator/Principal routing for gateway gRPC authentication.
Adds gateway-minted sandbox JWT signing, validation, revocation, and refresh support.
Adds Docker/Podman/VM bootstrap plumbing that delivers supervisor-only JWT files without exposing tokens through public APIs or user entrypoint environments.
Adds Kubernetes ServiceAccount token bootstrap validation for IssueSandboxToken.
Updates the supervisor gRPC client to acquire a bearer credential at startup and inject it on every gateway call.
Enforces per-handler sandbox ID equality for sandbox-scoped RPCs.
Adds sandbox debug-rpc helpers for end-to-end authentication testing.
Mounts sandbox JWT keys in Helm deployments even when local TLS is disabled.
Updates docs and debugging guidance for the new per-sandbox identity model.

Implementation Details

Problem Context

Before this PR, sandbox-class handlers trusted a sandbox_id or sandbox name supplied in the request body. The shared mTLS client certificate only proved that the caller had a gateway client certificate; it did not prove that the caller was sandbox A rather than sandbox B. Any holder of that shared credential could therefore ask for another sandbox's policy, drafts, provider environment, or related sandbox-private state.

This PR moves the identity decision into the gateway authentication layer. The router authenticates the caller, inserts a Principal into request extensions, and handlers compare that principal to the requested sandbox before serving sandbox-private data.

The detailed implementation plan is captured in architecture/plans/sandbox-service-accounts-implementation.md.

Shared Gateway Auth Model

The gateway now uses a pluggable authenticator chain. Each authenticator can produce a Principal, decline so the next authenticator can try, or reject the request outright so that the chain fails closed.
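As a rough illustration of the three outcomes, the chain could be modeled as below. This is a minimal sketch, not the gateway's code: the `AuthOutcome` enum, the trait signature, the `PrefixAuthenticator` type, and `run_chain` are all hypothetical names introduced here, and real authenticators would inspect gRPC metadata rather than a bare string.

```rust
/// Caller identity; variant names follow the PR text, fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    User { subject: String },
    Sandbox { sandbox_id: String },
    Anonymous,
}

/// The three outcomes an authenticator can return (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum AuthOutcome {
    Authenticated(Principal),
    Decline,
    Reject(String),
}

pub trait Authenticator {
    fn authenticate(&self, bearer: Option<&str>) -> AuthOutcome;
}

/// Hypothetical authenticator that claims tokens starting with `prefix`.
pub struct PrefixAuthenticator {
    pub prefix: &'static str,
    pub is_sandbox: bool,
}

impl Authenticator for PrefixAuthenticator {
    fn authenticate(&self, bearer: Option<&str>) -> AuthOutcome {
        // No header or a foreign token format: let the next authenticator try.
        let Some(id) = bearer.and_then(|t| t.strip_prefix(self.prefix)) else {
            return AuthOutcome::Decline;
        };
        // The token is ours but malformed: stop the chain.
        if id.is_empty() {
            return AuthOutcome::Reject("empty identity".to_string());
        }
        let principal = if self.is_sandbox {
            Principal::Sandbox { sandbox_id: id.to_string() }
        } else {
            Principal::User { subject: id.to_string() }
        };
        AuthOutcome::Authenticated(principal)
    }
}

/// Walks the chain in order; an exhausted chain is an error (fail closed).
pub fn run_chain(
    chain: &[Box<dyn Authenticator>],
    bearer: Option<&str>,
) -> Result<Principal, String> {
    for authenticator in chain {
        match authenticator.authenticate(bearer) {
            AuthOutcome::Authenticated(principal) => return Ok(principal),
            AuthOutcome::Decline => continue,
            AuthOutcome::Reject(reason) => return Err(reason),
        }
    }
    Err("no authenticator accepted the request".to_string())
}
```

The point of the separate `Decline` outcome is that a Bearer header meant for one authenticator can pass untouched to the next, while a token that an authenticator recognizes but cannot validate ends the request.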

The steady-state sandbox credential is a gateway-minted Ed25519 JWT. Validation checks issuer, audience, key ID, expiry, algorithm, and revocation state. The JWT includes sandbox identity and a jti so refresh and delete can invalidate previous tokens.
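The claim checks listed above can be sketched as a pure function over already-decoded claims. This is an illustration under stated assumptions: the struct layout, the error enum, and the function name are hypothetical, signature verification is deliberately left out, and the expected issuer and audience values a caller passes in are placeholders rather than the gateway's real configuration.

```rust
use std::collections::HashSet;

/// Decoded header and payload fields (hypothetical layout).
pub struct Claims {
    pub alg: String,
    pub kid: String,
    pub iss: String,
    pub aud: String,
    pub exp_secs: u64,
    pub sandbox_id: String,
    pub jti: String,
}

/// Values the gateway expects to see (placeholders supplied by the caller).
pub struct Expected<'a> {
    pub alg: &'a str,
    pub kid: &'a str,
    pub iss: &'a str,
    pub aud: &'a str,
}

#[derive(Debug, PartialEq)]
pub enum Invalid {
    Algorithm,
    KeyId,
    Issuer,
    Audience,
    Expired,
    Revoked,
}

/// Checks algorithm, key ID, issuer, audience, expiry, and revocation state.
/// Signature verification is assumed to have happened before this call.
pub fn validate_claims(
    c: &Claims,
    want: &Expected<'_>,
    now_secs: u64,
    revoked_jtis: &HashSet<String>,
) -> Result<(), Invalid> {
    if c.alg != want.alg {
        return Err(Invalid::Algorithm);
    }
    if c.kid != want.kid {
        return Err(Invalid::KeyId);
    }
    if c.iss != want.iss {
        return Err(Invalid::Issuer);
    }
    if c.aud != want.aud {
        return Err(Invalid::Audience);
    }
    if c.exp_secs <= now_secs {
        return Err(Invalid::Expired);
    }
    if revoked_jtis.contains(&c.jti) {
        return Err(Invalid::Revoked);
    }
    Ok(())
}
```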

This JWT is supervisor identity material:

It is not returned in CreateSandboxResponse.
It is not stored in public sandbox metadata.
It is not logged.
It is kept out of ordinary user entrypoint environments.

Docker, Podman, And VM Bootstrap

Docker, Podman, and VM deployments do not have a platform identity service equivalent to Kubernetes projected ServiceAccount tokens. For those drivers, the gateway uses a push-based bootstrap pattern.

At sandbox creation time, the gateway mints a sandbox JWT for the new sandbox and passes it to the in-process driver boundary as secret material. The driver writes that token to a supervisor-only file and starts the sandbox with OPENSHELL_SANDBOX_TOKEN_FILE pointing at that file. The supervisor reads the file once at startup and then keeps the active token in memory.

This mirrors the existing file-based secret delivery pattern used by local drivers while avoiding the unsafe parts of the old model:

The raw token does not cross the public gRPC API.
The token is not placed in the user command environment.
The token is scoped to one sandbox ID.
Refresh rotates the in-memory bearer token without rewriting the bootstrap file.
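To make the file-based handoff concrete, here is a minimal sketch of both sides on a Unix host. The function names, the `0o600` mode, and the refusal to overwrite an existing file are assumptions made for illustration; the PR text only states that the file is supervisor-only and is read once at startup.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Driver side (hypothetical helper): create the token file so that only its
/// owner can read or write it.
pub fn write_token_file(path: &Path, token: &str) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // refuse to overwrite an existing file
        .mode(0o600) // assumed mode; applied when the file is created
        .open(path)?;
    file.write_all(token.as_bytes())
}

/// Supervisor side (hypothetical helper): read the bootstrap token once.
/// The caller keeps the returned value in memory and never rewrites the file.
pub fn read_token_file(path: &Path) -> io::Result<String> {
    Ok(fs::read_to_string(path)?.trim().to_string())
}
```

A supervisor would call `read_token_file` with the path taken from OPENSHELL_SANDBOX_TOKEN_FILE and hand the result to its in-memory token slot.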

Podman follows the same path as Docker. The VM path uses the same concept with the token embedded into the guest secret material at sandbox start, then refreshed in memory after the supervisor is running.

This path is the primary singleplayer/local-driver design, not a fallback from Kubernetes.

Kubernetes Bootstrap

Kubernetes uses a pull-based bootstrap pattern because kubelet can provide a short-lived, audience-bound ServiceAccount token to the sandbox pod.

The sandbox pod gets a projected ServiceAccount token mounted at a supervisor-only path. On startup, the supervisor presents that token to IssueSandboxToken. The gateway validates the ServiceAccount token, extracts pod identity claims, fetches the pod, and reads the gateway-owned openshell.io/sandbox-id annotation to derive the sandbox identity. If the checks pass, the gateway returns the same kind of gateway-minted sandbox JWT used by the Docker/Podman/VM path.
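The last two steps of that exchange, matching the token's pod identity against the live pod and reading the annotation, could look roughly like this. The types and the function name are hypothetical stand-ins for whatever the gateway's Kubernetes client returns; only the annotation key comes from the PR text, and the UID comparison follows the commit notes for the validator.

```rust
use std::collections::HashMap;

/// Annotation key named in the PR text.
pub const SANDBOX_ID_ANNOTATION: &str = "openshell.io/sandbox-id";

/// Pod identity extracted from a validated ServiceAccount token (hypothetical type).
pub struct PodClaims {
    pub pod_name: String,
    pub pod_uid: String,
}

/// The subset of a fetched pod that the lookup needs (hypothetical type).
pub struct LivePod {
    pub name: String,
    pub uid: String,
    pub annotations: HashMap<String, String>,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    PodMismatch,
    MissingAnnotation,
}

/// Derives the sandbox ID only when the live pod is the one the token names.
pub fn resolve_sandbox_id(claims: &PodClaims, pod: &LivePod) -> Result<String, ResolveError> {
    // A recreated pod can reuse a name but not a UID, so both must match.
    if pod.name != claims.pod_name || pod.uid != claims.pod_uid {
        return Err(ResolveError::PodMismatch);
    }
    pod.annotations
        .get(SANDBOX_ID_ANNOTATION)
        .cloned()
        .ok_or(ResolveError::MissingAnnotation)
}
```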

This avoids creating one Kubernetes Secret per sandbox. The gateway RBAC is intentionally narrow: validate token reviews and read pods in the sandbox namespace. It does not need to patch sandbox pods, and operators should avoid granting extra pod mutation permissions to the gateway identity.

Supervisor Credential Resolution

The supervisor resolves credentials in a driver-neutral order:

1. OPENSHELL_SANDBOX_TOKEN for tests.
2. OPENSHELL_SANDBOX_TOKEN_FILE for Docker, Podman, and VM.
3. OPENSHELL_K8S_SA_TOKEN_FILE for Kubernetes bootstrap through IssueSandboxToken.

Once resolved, every path produces a gateway-minted JWT in the same token slot. A gRPC interceptor injects it as Authorization: Bearer on all gateway calls. Refresh updates that shared slot, so existing clients do not need to be rebuilt when the token rotates.
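The resolution order can be sketched as a small pure function. The environment variable names come from the PR; the `CredentialSource` enum, the function name, the injected lookup closure, and the choice to treat empty values as unset are assumptions made so the example is testable on its own.

```rust
use std::path::PathBuf;

/// Where the supervisor found its bootstrap credential (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum CredentialSource {
    InlineToken(String),
    TokenFile(PathBuf),
    K8sSaTokenFile(PathBuf),
}

/// Checks the three variables in priority order and returns the first one set.
/// `get_env` is injected so the order can be tested without touching the
/// process environment.
pub fn resolve_credential_source(
    get_env: impl Fn(&str) -> Option<String>,
) -> Option<CredentialSource> {
    // Assumption: an empty value is treated the same as an unset variable.
    let non_empty = |key: &str| get_env(key).filter(|value| !value.is_empty());

    if let Some(token) = non_empty("OPENSHELL_SANDBOX_TOKEN") {
        return Some(CredentialSource::InlineToken(token));
    }
    if let Some(path) = non_empty("OPENSHELL_SANDBOX_TOKEN_FILE") {
        return Some(CredentialSource::TokenFile(PathBuf::from(path)));
    }
    non_empty("OPENSHELL_K8S_SA_TOKEN_FILE")
        .map(|path| CredentialSource::K8sSaTokenFile(PathBuf::from(path)))
}
```

In a real process the lookup would be `resolve_credential_source(|key| std::env::var(key).ok())`; the Kubernetes variant still needs the IssueSandboxToken exchange before it yields a gateway JWT.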

Handler Authorization

Authentication alone is not enough; handlers still need to authorize access to the requested sandbox.

Direct sandbox_id handlers compare the authenticated Principal::Sandbox.sandbox_id to the requested ID. Name-keyed handlers resolve the sandbox name to the canonical ID and then compare. Streaming log push authorizes on the first frame, where the sandbox identity is declared.

User principals continue through the normal RBAC path. Sandbox principals are limited to their own sandbox. Anonymous principals are rejected for sandbox-scoped paths.
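A minimal sketch of the scope check follows. The PR names an `ensure_sandbox_scope` guard, but the signature, the error enum, the name-to-ID map, and the specific error chosen for anonymous callers are assumptions here, not the gateway's actual code.

```rust
use std::collections::HashMap;

/// Caller identity; variant names follow the PR text, fields are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum Principal {
    User { subject: String },
    Sandbox { sandbox_id: String },
    Anonymous,
}

#[derive(Debug, PartialEq)]
pub enum GuardError {
    Unauthenticated,
    PermissionDenied,
    NotFound,
}

/// ID-keyed check: a sandbox principal may only touch its own sandbox.
pub fn ensure_sandbox_scope(principal: &Principal, requested_id: &str) -> Result<(), GuardError> {
    match principal {
        // Users were already gated by RBAC, so the sandbox guard lets them through.
        Principal::User { .. } => Ok(()),
        Principal::Sandbox { sandbox_id } if sandbox_id == requested_id => Ok(()),
        Principal::Sandbox { .. } => Err(GuardError::PermissionDenied),
        Principal::Anonymous => Err(GuardError::Unauthenticated),
    }
}

/// Name-keyed check: resolve the name to the canonical ID first, then compare.
pub fn ensure_sandbox_scope_by_name(
    principal: &Principal,
    name: &str,
    ids_by_name: &HashMap<String, String>,
) -> Result<(), GuardError> {
    let id = ids_by_name.get(name).ok_or(GuardError::NotFound)?;
    ensure_sandbox_scope(principal, id)
}
```

Comparing canonical IDs rather than names means a renamed or duplicated sandbox name cannot be used to reach another sandbox's state.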

Refresh And Revocation

RefreshSandboxToken lets a supervisor rotate its in-memory gateway JWT before expiry. The gateway mints a replacement for the same sandbox principal and revokes the previous jti. Sandbox deletion also revokes the most recent token so replayed credentials are rejected.
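The rotation can be sketched as "mint first, then revoke", backed by a revocation set that forgets entries once the revoked token would have expired anyway. The `Token` struct, the map-based `RevocationSet`, and the `refresh_token` function are hypothetical shapes chosen for illustration; the ordering follows the commit notes for the refresh handler.

```rust
use std::collections::HashMap;

/// The fields of a sandbox token that rotation cares about (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub sandbox_id: String,
    pub jti: String,
    pub exp_secs: u64,
}

/// In-memory set of revoked token IDs, keyed by jti with the token's expiry.
#[derive(Default)]
pub struct RevocationSet {
    revoked_until: HashMap<String, u64>,
}

impl RevocationSet {
    pub fn revoke(&mut self, jti: &str, exp_secs: u64) {
        self.revoked_until.insert(jti.to_string(), exp_secs);
    }

    pub fn is_revoked(&self, jti: &str) -> bool {
        self.revoked_until.contains_key(jti)
    }

    /// Drops entries whose token has expired; expiry validation rejects those anyway.
    pub fn prune(&mut self, now_secs: u64) {
        self.revoked_until.retain(|_, exp_secs| *exp_secs > now_secs);
    }
}

/// Mints the replacement before revoking the old jti, so a failed mint leaves
/// the caller's current token usable.
pub fn refresh_token(
    old: &Token,
    revocations: &mut RevocationSet,
    mint: impl FnOnce(&str) -> Result<Token, String>,
) -> Result<Token, String> {
    let replacement = mint(&old.sandbox_id)?;
    revocations.revoke(&old.jti, old.exp_secs);
    Ok(replacement)
}
```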

Kubernetes supervisors can recover from restart by repeating the ServiceAccount bootstrap exchange. Docker, Podman, and VM supervisors use their file token as bootstrap material and then rely on in-memory refresh for steady state.

Signing Key Persistence

The gateway JWT signing key is persisted through the existing local and Helm PKI paths. Helm mounts the JWT key material into the gateway even when local TLS is disabled, because per-sandbox authentication does not depend on whether TLS is enabled.

Design Decisions For Reviewers

Two bootstrap patterns, one steady-state credential. Docker/Podman/VM push a supervisor-only bootstrap token file; Kubernetes pulls a token through ServiceAccount exchange. Both become the same gateway JWT.
No per-sandbox Kubernetes Secret objects. Kubernetes uses projected tokens and IssueSandboxToken.
No raw token in public APIs. Tokens stay out of protos, CreateSandboxResponse, sandbox metadata, ordinary user environments, and logs.
mTLS is not sandbox identity. mTLS can still protect transport, but sandbox authorization is based on an authenticated sandbox principal.
Handler checks are explicit. The router authenticates; handlers authorize sandbox scope because they know which request field identifies the target sandbox.
Revocation is jti-based. Refresh and delete invalidate previous tokens without changing the stable sandbox ID.

Reviewer Focus Areas

Docker/Podman/VM token file handling: supervisor-only placement, no leakage into entrypoint environment, and correct cleanup behavior.
Kubernetes bootstrap validation: ServiceAccount audience, pod lookup, annotation handling, and RBAC scope.
Handler coverage: every sandbox-private RPC should either call the sandbox-scope guard or have a documented reason not to.
Streaming RPC behavior: PushSandboxLogs authorizes on the first frame.
Signing key persistence: local and Helm deployments must preserve the JWT key across gateway restarts; multi-replica gateways must share the same key material.
Refresh/revocation edge cases: old jti rejection after refresh and sandbox delete.

Testing

mise run pre-commit passes
Unit tests added/updated for authenticator chain behavior, sandbox JWT validation, revocation, handler guards, token acquisition, refresh timing, and driver env/secret plumbing
E2E tests added/updated for sandbox identity and cross-sandbox denial
Helm dev smoke test with sandbox list, sandbox create, and sandbox delete against a local k3d deployment

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-05-15T16:16:29Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Replaces the hard-coded sandbox-method / dual-auth / Bearer branches in AuthGrpcRouter with a pluggable Authenticator chain that produces a Principal::{User, Sandbox, Anonymous}. The principal is inserted into request extensions for handler consumption. PR-1 keeps the legacy metadata marker for sandbox principals so existing handlers that read x-openshell-auth-source continue to work; the marker is removed in the PR-3 wire break. The OidcAuthenticator wraps the existing JwksCache::validate_token for User principals, and the LegacySandboxMarkerAuthenticator preserves the pre-refactor path-based behavior pending the gateway-minted JWT flow in PR 2/3. Part of the per-sandbox identity series that closes #1354.

Adds the gateway-side infrastructure for per-sandbox identity tokens (the PR-2 step of the series resolving #1354):

- New Ed25519 keypair generated by `certgen` alongside the existing PKI. Local mode writes `<dir>/jwt/{signing.pem,public.pem,kid}`; K8s mode creates an Opaque `<release>-jwt-keys` Secret.
- `SandboxJwtIssuer` mints tokens with EdDSA-signed claims (SPIFFE-shaped `sub`, denormalised `sandbox_id`, 24h default TTL, `jti` for revocation).
- `SandboxJwtAuthenticator` validates tokens through the Authenticator chain and yields `Principal::Sandbox(BootstrapJwt {..})`. Tokens with a different `kid` fall through so non-matching Bearer headers reach the OIDC authenticator unchanged.
- `K8sServiceAccountAuthenticator` is path-scoped to `IssueSandboxToken`; consumes a projected SA token and produces a `K8sServiceAccount` sandbox principal that the new `IssueSandboxToken` handler exchanges for a fresh gateway JWT.
- In-memory `RevocationSet` with TTL pruning, ready for the PR-3 delete-side hook and PR-5 refresh.
- Helm chart mounts the JWT secret on the gateway pod and wires `[openshell.gateway.gateway_jwt]` into the rendered TOML.

PR 2 is additive: no driver yet writes a sandbox token, no supervisor yet presents a Bearer JWT. PR 3 wires the consumer ends and removes the legacy path-based sandbox marker.

Switches every sandbox-to-gateway gRPC call from "path-based mTLS-only trust" to "Authorization: Bearer <gateway-minted-JWT>" presented by the sandbox supervisor. Closes the trust-boundary half of issue #1354; the per-handler sandbox_id equality check follows in PR 4.

Sandbox side:

- crates/openshell-sandbox/src/grpc_client.rs gains an AuthInterceptor that injects the Bearer header on every outbound RPC. The token is resolved at startup from one of three sources, in order:
  1. OPENSHELL_SANDBOX_TOKEN (env, test harnesses)
  2. OPENSHELL_SANDBOX_TOKEN_FILE (Docker/Podman/VM drivers)
  3. OPENSHELL_K8S_SA_TOKEN_FILE (K8s driver; projected SA token exchanged for a gateway JWT via IssueSandboxToken)

Gateway side:

- handle_create_sandbox mints a gateway JWT and passes it through the compute layer to DriverSandboxSpec.sandbox_token. K8s sandboxes ignore the field; Docker and Podman drivers inject it as OPENSHELL_SANDBOX_TOKEN in the container env.
- Removes the path-based SANDBOX_METHODS / DUAL_AUTH_METHODS branches and the x-openshell-auth-source metadata marker. The AuthGrpcRouter chain is now uniform: K8s SA -> SandboxJwt -> OIDC, all extension-based.
- Removes LegacySandboxMarkerAuthenticator and the SandboxIdentitySource::LegacyMarker variant. Handlers read Principal::Sandbox directly from request extensions.

Kubernetes driver:

- Sandbox pods gain a projected ServiceAccount token volume mounted at /var/run/secrets/openshell/token (audience openshell-gateway, 1h TTL, kubelet auto-rotates).
- Each pod is annotated with openshell.io/sandbox-id; the gateway resolves the SA token claim's pod uid back to a sandbox id via this annotation.
- Helm Role grants the gateway pods:get in the sandbox namespace. No ClusterRoleBinding to system:auth-delegator: the gateway validates SA tokens against the apiserver's anonymous JWKS endpoint instead of via TokenReview, so no cluster-scoped privilege is required.

The full JWKS verifier + pod-annotation lookup lands in the follow-up that brings the K8s helm-dev demo end-to-end; PR 3 exercises the wire break with Docker/Podman as the working drivers.

ProcessHandle::spawn_impl previously inherited the supervisor's full environment when starting the sandbox entrypoint, then drop_privileges() demoted the child to the sandbox user. The combination meant a later process running as the sandbox user (e.g. an SSH-spawned shell) could read /proc/<entrypoint_pid>/environ and recover the gateway-minted JWT. Explicitly env_remove the three sandbox-token env vars before exec so the entrypoint child carries none of the supervisor's identity material. SSH session shells already use env_clear() in apply_child_env, so this plugs the only remaining inheritance path. Related to #1354 (per-sandbox identity series, PR 3 follow-up).
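The fix described above maps onto `std::process::Command::env_remove`. The sketch below is illustrative only: the helper name is hypothetical, and the list of three variables is an assumption based on the credential-resolution variables named elsewhere in this PR.

```rust
use std::process::Command;

/// Variables that carry supervisor identity material. Assumed to be the three
/// credential-resolution variables named in this PR.
pub const TOKEN_ENV_VARS: [&str; 3] = [
    "OPENSHELL_SANDBOX_TOKEN",
    "OPENSHELL_SANDBOX_TOKEN_FILE",
    "OPENSHELL_K8S_SA_TOKEN_FILE",
];

/// Builds a command for the entrypoint (hypothetical helper) that inherits the
/// parent environment except for the token variables.
pub fn scrubbed_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    for var in TOKEN_ENV_VARS {
        // Explicit removal: the child never sees the variable, so it cannot
        // show up in /proc/<pid>/environ after exec.
        cmd.env_remove(var);
    }
    cmd
}
```

Removing named variables keeps the rest of the inherited environment intact, which is a narrower change than switching the entrypoint to `env_clear()`.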

Adds the IDOR guard that closes the second half of the per-sandbox identity series. Every sandbox-class handler now verifies that the calling Principal::Sandbox.sandbox_id matches the canonical UUID the request body operates on. User principals bypass the check because RBAC was their gate at the router layer; anonymous callers are rejected outright.

New module crates/openshell-server/src/auth/guard.rs exposes ensure_sandbox_scope / enforce_sandbox_scope. Applied at the top of:

- handle_get_sandbox_config (id-keyed)
- handle_get_sandbox_provider_environment (id-keyed)
- handle_report_policy_status (id-keyed)
- handle_push_sandbox_logs (id-keyed, first frame only; the principal is stable across the stream)
- handle_submit_policy_analysis (name-keyed: resolve to id, then check)
- handle_get_draft_policy (name-keyed)
- handle_update_config (dual-auth: enforce only when Principal::Sandbox; CLI / TUI user paths are unaffected)
- handle_get_inference_bundle (no sandbox_id in body; accept any authenticated principal, reject anonymous)

Existing policy.rs tests are updated to wrap their requests with a test-helper user principal so the new guard treats them as CLI calls; six new tests cover the cross-sandbox-denied / same-sandbox-allowed / user-bypasses-guard matrix.

Adds the rotation half of the per-sandbox identity series. Sandboxes holding a valid gateway-minted JWT can swap it for a fresh one without disruption; the old jti is revoked server-side before the new token is handed back, so a leaked token is unusable as soon as the rotation completes.

Server side:

- proto/openshell.proto gains RefreshSandboxToken plus empty request / token+expires_at_ms response messages.
- handle_refresh_sandbox_token requires Principal::Sandbox with a BootstrapJwt source (K8s-SA principals are routed to IssueSandboxToken for bootstrap; user principals are rejected). The handler mints the replacement token first, then adds the old jti to the in-memory RevocationSet, so a failed mint never strands the sandbox.

Sandbox side:

- AuthInterceptor now reads its Bearer header from a process-wide Arc<RwLock<AsciiMetadataValue>> slot, so a single in-place token rotation is visible to every cached client (CachedOpenShellClient, the supervisor session channel, log push, etc.).
- connect_channel spawns a background refresh loop once per process that sleeps for ~80% of the token's remaining lifetime (clamped to 60s-12h, plus small deterministic jitter) and calls RefreshSandboxToken, updating the token slot on success.
- New parse_jwt_exp_ms helper decodes the JWT payload without signature verification, because the token's origin is already trusted via the acquisition flow.

Tests:

- 4 server-side handler tests (round-trip, user-principal rejected, K8s-SA-principal rejected, missing-issuer returns Unavailable)
- 3 sandbox-side helper tests (parse-exp, 80%-of-TTL delay, 60s floor)

All existing OpenShell test impls gain a refresh_sandbox_token stub.
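The refresh timing rule (about 80% of the remaining lifetime, clamped to between 60 seconds and 12 hours) can be written as a small pure function. This is a sketch with a hypothetical function name and constants; it leaves out the deterministic jitter mentioned above and uses integer arithmetic to avoid floating-point rounding.

```rust
use std::time::Duration;

/// Lower and upper bounds for the sleep before the next refresh.
pub const MIN_REFRESH_DELAY: Duration = Duration::from_secs(60);
pub const MAX_REFRESH_DELAY: Duration = Duration::from_secs(12 * 60 * 60);

/// Returns how long to sleep before refreshing: 4/5 of the remaining token
/// lifetime, clamped to the bounds above. Jitter is intentionally omitted.
pub fn refresh_delay(remaining: Duration) -> Duration {
    let target = remaining.saturating_mul(4) / 5;
    target.clamp(MIN_REFRESH_DELAY, MAX_REFRESH_DELAY)
}
```

With the 24h default TTL, 80% would be 19.2 hours, so the 12-hour cap is what actually determines the first refresh in that configuration.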

The projected SA token kubelet writes to each sandbox pod was previously a hardcoded 3600s literal in the driver. Operators in tighter audit regimes want to dial it lower; very large clusters may want it slightly higher to absorb token-refresh churn. Wires `sa_token_ttl_secs` through three layers:

- KubernetesComputeConfig gains the field (default 3600). The driver clamps to [600, 86400] via `effective_sa_token_ttl_secs()`: 600s is kubelet's enforced minimum, 24h is the cap (the token is consumed within seconds of pod start, so longer is almost always a misconfiguration).
- The openshell-driver-kubernetes binary exposes `--sa-token-ttl-secs` / `OPENSHELL_K8S_SA_TOKEN_TTL_SECS`.
- `[openshell.gateway].sa_token_ttl_secs` in the gateway TOML inherits into `[openshell.drivers.kubernetes]`, mirroring the `enable_user_namespaces` plumbing.
- Helm: `server.sandboxJwt.k8sSaTokenTtlSecs` (default 3600) renders into the K8s driver block of the gateway config.
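The clamping rule is simple enough to show in a few lines. The commit names `effective_sa_token_ttl_secs()` as a driver method; the free function below is a stand-in with an assumed signature and integer type, using the default and bounds stated above.

```rust
/// Default and bounds as stated in the commit message.
pub const DEFAULT_SA_TOKEN_TTL_SECS: u64 = 3600;
pub const MIN_SA_TOKEN_TTL_SECS: u64 = 600;
pub const MAX_SA_TOKEN_TTL_SECS: u64 = 86_400;

/// Stand-in for the driver method: applies the default when nothing is
/// configured, then clamps the value into the allowed range.
pub fn effective_sa_token_ttl_secs(configured: Option<u64>) -> u64 {
    configured
        .unwrap_or(DEFAULT_SA_TOKEN_TTL_SECS)
        .clamp(MIN_SA_TOKEN_TTL_SECS, MAX_SA_TOKEN_TTL_SECS)
}
```

Clamping rather than rejecting out-of-range values means a misconfigured TTL degrades to the nearest bound instead of blocking sandbox creation; whether the real driver also logs a warning in that case is not stated in the source.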

Replaces the LiveK8sResolver stub with a working validator. Sandbox pods present their projected ServiceAccount token via Authorization: Bearer on IssueSandboxToken; the gateway:

1. Decodes the JWT header and looks up the signing key.
2. On miss, fetches the apiserver's /.well-known/openid-configuration discovery doc + /openid/v1/jwks via kube::Client and caches the keys.
3. Validates the token's signature (RS256), issuer, audience (openshell-gateway), and expiry.
4. Reads `kubernetes.io.pod.{name,uid}` from the claims and GETs the pod in the gateway's sandbox namespace.
5. Verifies the live pod's UID matches the token's UID (defense against replayed tokens from recreated pods with the same name) and reads the openshell.io/sandbox-id annotation to derive the sandbox UUID.

The gateway needs no system:auth-delegator ClusterRoleBinding: JWKS validation is local, so the only K8s permission it consumes is the namespace Role's `pods: get` grant. Discovery + JWKS reads ride the gateway's existing kube::Client auth (system:service-account-issuer-discovery is bound to system:authenticated in every supported K8s distro).

ServerState gains an in-cluster detection path in run_server: when KUBERNETES_SERVICE_HOST is set AND a sandbox JWT issuer is configured, construct the resolver and wire it as state.k8s_sa_authenticator. The existing K8sServiceAccountAuthenticator (path-scoped to IssueSandboxToken) becomes functional.

Tests: JWKS path parsing covers absolute URL, relative path, query string, and garbage rejection. End-to-end validation against a real apiserver is exercised in the helm-dev demo.

Three regressions / inefficiencies surfaced while bringing the per-sandbox identity series up end-to-end in the local helm cluster:

1. CLI returned Unauthenticated against a no-OIDC dev gateway. PR 3 removed the pre-refactor "no OIDC = pass through" behavior; with only sandbox-side authenticators in the chain, plain user CLI calls hit Unauthenticated. Add a PermissiveUserAuthenticator that installs as a final fallback when no OIDC is configured but sandbox JWT signing IS; it produces a synthetic dev-anonymous user principal so the rest of the handler chain treats CLI calls as User and bypasses the IDOR guard. Production OIDC deployments are unaffected: when OIDC is configured the fallback is not installed and missing-Bearer still 401s.
2. Sandbox supervisor re-ran the K8s SA bootstrap exchange on every connect_channel() call. With multiple subsystems each building their own channels, IssueSandboxToken was firing every few seconds even though TOKEN_SLOT already had a fresh token. Change connect_channel to reuse TOKEN_SLOT when populated; only run acquire_sandbox_token on the first call per process. The refresh loop keeps the slot fresh thereafter.
3. K8s SA authenticator looked up sandbox pods in the gateway's own namespace (POD_NAMESPACE) instead of the K8s driver's configured sandbox namespace. Source from kubernetes_config_from_file() so the resolver targets the same namespace the driver creates pods in.

Verified end-to-end against the helm-dev cluster:

- Two sandboxes get distinct gateway JWTs with their own sandbox UUIDs.
- Cross-sandbox GetSandboxConfig is rejected with PermissionDenied and the auth::guard audit log fires with both principal and requested IDs.
- RefreshSandboxToken mints a new JWT and revokes the old jti; the old token is then rejected with Unauthenticated: revoked token.

…testing

Adds a small subcommand to the supervisor binary that issues one-shot sandbox-class RPCs against the gateway using the supervisor's existing token-acquisition pipeline. Designed to be invoked via docker exec or kubectl exec into a running sandbox to verify the per-sandbox identity flow end-to-end without writing a custom test binary inside the sandbox image.

Subcommands:

- get-sandbox-config --sandbox-id <UUID>: call GetSandboxConfig
- refresh: call RefreshSandboxToken
- show-token: print raw gateway JWT bytes
- show-principal: pretty-print decoded JWT claims

Verification flow this enables (Docker path):

    docker exec sandbox-a openshell-sandbox debug-rpc show-principal
    docker exec sandbox-a openshell-sandbox debug-rpc \
        get-sandbox-config --sandbox-id <sandbox-b-uuid>
    # → exit code 7 + "PermissionDenied: cross-sandbox access denied"

K8s path: same RPCs, kubectl exec instead.

show-token and show-principal intentionally don't trigger the K8s SA bootstrap exchange. They only read an already-cached token, so inspection doesn't burn a fresh JWT mint per call.

github-actions · 2026-05-15T22:08:03Z

Label test:e2e applied for f4daea6. Open Branch E2E Checks, find the run for commit f4daea6, and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

TaylorMutch force-pushed the tmutch/gateway-config-impl branch 2 times, most recently from 381784e to 9bc2e11 on May 15, 2026 19:17

Base automatically changed from tmutch/gateway-config-impl to main May 15, 2026 19:43

TaylorMutch added 11 commits May 15, 2026 13:41

fix(helm): mount sandbox JWT keys without TLS

f4daea6

TaylorMutch force-pushed the tmutch/per-supervisor-authn branch from 834b56e to f4daea6 on May 15, 2026 20:41

TaylorMutch changed the title ~~feat: per-sandbox authentication~~ feat: per-sandbox authentication to gateway May 15, 2026

TaylorMutch changed the title ~~feat: per-sandbox authentication to gateway~~ feat(auth): per-sandbox authentication to gateway May 15, 2026

TaylorMutch added the test:e2e label (Requires end-to-end coverage) May 15, 2026

TaylorMutch marked this pull request as ready for review May 15, 2026 22:07

TaylorMutch requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 22:07

TaylorMutch mentioned this pull request May 15, 2026 in feat(auth): add SPIFFE supervisor authentication #1414 (Draft, 7 tasks)

test(e2e): configure sandbox JWT keys in harnesses

4e8ce7d

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

TaylorMutch wants to merge 12 commits into main from tmutch/per-supervisor-authn
