Skip to content

refactor!(auth): drop SSH handshake secret#1274

Merged
TaylorMutch merged 3 commits into
mainfrom
tmutch/ssh-handshake-helm-fix
May 14, 2026
Merged

refactor!(auth): drop SSH handshake secret#1274
TaylorMutch merged 3 commits into
mainfrom
tmutch/ssh-handshake-helm-fix

Conversation

@TaylorMutch
Copy link
Copy Markdown
Collaborator

@TaylorMutch TaylorMutch commented May 8, 2026

Summary

Removes the misnamed `OPENSHELL_SSH_HANDSHAKE_SECRET` (`x-sandbox-secret`) mechanism. It did not authenticate SSH — SSH flows over the `RelayStream` gRPC RPC and is gated by mTLS plus the supervisor's Unix-socket permissions — it only gated a small set of sandbox→gateway control-plane RPCs. Production deployments already enforce mTLS on that channel, so the shared secret was redundant.

Related Issue

Refs OS-174.

Changes

  • Server auth: deleted `SANDBOX_SECRET_METHODS` interceptor, `validate_sandbox_secret`, and `mark_sandbox_secret_authenticated` from `auth/oidc.rs`. New `is_sandbox_caller` / `mark_sandbox_caller` plus `AUTH_SOURCE_SANDBOX` marker. `AuthGrpcRouter` now marks sandbox-class methods as sandbox-caller and routes dual-auth methods on Bearer-present vs Bearer-absent. `validate_sandbox_secret_update` → `validate_sandbox_caller_update` in `grpc/policy.rs`.
  • Server CLI/startup: removed `--ssh-handshake-secret` and `--ssh-handshake-skew-secs` (gateway + drivers). Removed the empty-secret startup gate.
  • Sandbox client: dropped `SandboxSecretInterceptor` and the env scrub. `OpenShellClient`/`InferenceClient` now use bare `Channel`. mTLS via `connect_channel` is unchanged.
  • Drivers: K8s, Podman, and VM no longer inject the secret or skew env vars. Podman driver no longer creates/deletes a per-sandbox Podman secret. Tests rewritten to assert absence.
  • Core config: removed `ssh_handshake_secret`, `ssh_handshake_skew_secs`, `DEFAULT_SSH_HANDSHAKE_SKEW_SECS`, and the builder methods.
  • Helm chart: deleted `ssh-handshake-secret-hook.yaml`; removed `server.sshHandshakeSecretName`, the `sshHandshake:` block, and the `OPENSHELL_SSH_HANDSHAKE_SECRET` env entry from the StatefulSet.
  • Docs: removed handshake refs from gateway man pages, RPM `CONFIGURATION.md` + `init-gateway-env.sh`, podman README, and `debug-openshell-cluster` skill. Updated `docs/reference/gateway-auth.mdx` to describe the mTLS-based sandbox-caller model.

Breaking changes

  • `--ssh-handshake-secret` / `OPENSHELL_SSH_HANDSHAKE_SECRET` and `--ssh-handshake-skew-secs` / `OPENSHELL_SSH_HANDSHAKE_SKEW_SECS` are removed from the gateway, sandbox, and all driver binaries.
  • The Helm chart no longer manages the `openshell-ssh-handshake` Secret. After upgrade, operators may delete the orphan.
  • Deployments using `--disable-gateway-auth` must enforce caller authentication at the fronting proxy, since the gateway no longer validates a per-request secret on sandbox-class methods.

Testing

  • `mise run pre-commit` passes
  • `mise run test` passes (workspace unit tests)
  • `mise run ci` passes (lint + checks + tests)
  • `mise run e2e` — could not run locally (mise tool-install hit GitHub API rate limits during cross-compile setup; unrelated to this change). Needs CI / fresh env.

Checklist

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@TaylorMutch TaylorMutch force-pushed the tmutch/ssh-handshake-helm-fix branch from 0f5b57c to 72be22f Compare May 8, 2026 23:13
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

@TaylorMutch TaylorMutch changed the title refactor!(auth): drop SSH handshake secret in favor of mTLS refactor!(auth): drop SSH handshake secret May 14, 2026
The OPENSHELL_SSH_HANDSHAKE_SECRET / x-sandbox-secret mechanism was
misnamed: it does not authenticate SSH (which flows over the
RelayStream gRPC RPC and is gated by mTLS plus supervisor Unix-socket
permissions). It only gated a small set of sandbox-to-gateway
control-plane RPCs, and production deployments already enforce mTLS
on that channel — so the shared secret was redundant.

Replace the secret check with an mTLS-presence marker. Sandbox-class
methods (ReportPolicyStatus, PushSandboxLogs,
GetSandboxProviderEnvironment, SubmitPolicyAnalysis, GetSandboxConfig,
GetInferenceBundle) accept callers without a Bearer token; the gRPC
mTLS handshake is the trust boundary. Dual-auth methods treat
Bearer-present as full-scope CLI access and Bearer-absent as
sandbox-restricted scope via validate_sandbox_caller_update.

Drops the secret from all drivers (K8s, Podman, VM), the sandbox gRPC
interceptor, the Helm chart (values + pre-install hook + StatefulSet
env), the RPM bootstrap script, the man pages, and the
debug-openshell-cluster skill. Also removes the never-read
ssh_handshake_skew_secs flag and config field.

BREAKING CHANGE: --ssh-handshake-secret / OPENSHELL_SSH_HANDSHAKE_SECRET
and --ssh-handshake-skew-secs / OPENSHELL_SSH_HANDSHAKE_SKEW_SECS are
removed from the gateway, sandbox, and all driver binaries. The
openshell-ssh-handshake K8s Secret is no longer managed by the chart;
operators may delete the orphan. Deployments using
--disable-gateway-auth must enforce caller authentication at the
fronting proxy, since the gateway no longer validates a per-request
secret on sandbox-class methods.

Refs OS-174.
@TaylorMutch TaylorMutch force-pushed the tmutch/ssh-handshake-helm-fix branch from 72be22f to 1df4404 Compare May 14, 2026 16:56
@TaylorMutch TaylorMutch marked this pull request as ready for review May 14, 2026 16:57
@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 14, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for 1df4404. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

Sweep across docs, e2e scripts, the Podman driver README/NETWORKING
notes, the RPM/Helm/setup guides, the gateway man page, and RFC 0003
to remove instructions and examples that still referenced
OPENSHELL_SSH_HANDSHAKE_SECRET / --ssh-handshake-secret /
ssh_handshake_skew_secs. The mechanism is gone; nothing should still
suggest setting it. Negative-assertion regression tests are kept so
the env var cannot silently be re-introduced.
The supporting code, env vars, CLI flags, and config plumbing are
gone — these tests asserted absence of strings that no longer have
any path to being set. Remove the guards from the Docker, Podman,
Kubernetes, and VM driver test modules.
@drew
Copy link
Copy Markdown
Collaborator

drew commented May 14, 2026

Update VM e2e after removing the CLI flag — crates/openshell-server/src/cli.rs:135-141 Removing the --ssh-handshake-secret option here makes the gateway reject the still-checked-in e2e/rust/e2e-vm.sh command used by mise run e2e:vm, because that script still passes --ssh-handshake-secret "${SSH_HANDSHAKE_SECRET}".

Copy link
Copy Markdown
Collaborator

@drew drew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just the one comment around updating e2e vm tests

@mrunalp
Copy link
Copy Markdown
Collaborator

mrunalp commented May 14, 2026

Do we want to add a warning log for when someone passes --disable-gateway-auth? for
Deployments using --disable-gateway-auth must enforce caller authentication at the fronting proxy, since the gateway no longer validates a per-request secret on sandbox-class methods.

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

Do we want to add a warning log for when someone passes --disable-gateway-auth? for

Deployments using --disable-gateway-auth must enforce caller authentication at the fronting proxy, since the gateway no longer validates a per-request secret on sandbox-class methods.

We can; this isn't a new problem though, we have raised #1354 which captures the problem overall.

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

Do we want to add a warning log for when someone passes --disable-gateway-auth? for
Deployments using --disable-gateway-auth must enforce caller authentication at the fronting proxy, since the gateway no longer validates a per-request secret on sandbox-class methods.

We can; this isn't a new problem though, we have raised #1354 which captures the problem overall.

Going to work on this as a follow up.

@TaylorMutch TaylorMutch merged commit 7a0c444 into main May 14, 2026
36 of 38 checks passed
@TaylorMutch TaylorMutch deleted the tmutch/ssh-handshake-helm-fix branch May 14, 2026 20:14
TaylorMutch pushed a commit that referenced this pull request May 15, 2026
)

* chore: remove SSH handshake secret residuals and fix agent memory

Removes three artifacts left behind by the #1274 removal of
OPENSHELL_SSH_HANDSHAKE_SECRET, then corrects a sweep of stale and
inaccurate notes in .claude/agent-memory/arch-doc-writer/MEMORY.md
that were discovered during the audit.

Artifact removal (Refs OS-174):
- openshell.spec: stale comment claiming init-gateway-env.sh generates
  an SSH handshake secret
- e2e/with-podman-gateway.sh: dead podman secret rm for
  openshell-handshake-<id>, which is never created since #1274

Agent memory corrections:
- ssh_tunnel.rs no longer exists; replaced by ssh_sessions.rs
- Object types list was missing service_endpoint and provider_profile
- Pre-exec chain now includes harden_child_process() and uses the
  linux::prepare()/enforce() two-phase pattern on Linux
- CLI SSH function list had nonexistent sandbox_rsync; corrected to
  actual exported functions
- ExecSandbox is in grpc/sandbox.rs (not grpc.rs) and operates over
  a supervisor relay DuplexStream, not a direct TCP connection
- resolve_ssh_gateway() moved to openshell-core/src/forward.rs
- SSH transport note rewrote: NSSH1 is an OCSF-only label (not a live
  protocol preface); actual path is ForwardTcp -> DuplexStream ->
  RelayStream -> Unix socket; access gated by CreateSshSession token;
  TLS follows endpoint scheme (https:// = mTLS, http:// = plaintext;
  Podman driver does not yet inject mTLS client materials)
- CLI flag note was self-contradictory; corrected to --gateway-endpoint
  with resolution priority chain

* fix(vm): collapse nested if blocks to satisfy clippy::collapsible_if

Three nested if blocks in connect_local_container_engine() were
flagged by clippy after #1370. Collapse to single if-let chains
using && as suggested.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants