120 changes: 120 additions & 0 deletions .agents/skills/test-release-canary/SKILL.md
@@ -0,0 +1,120 @@
---
name: test-release-canary
description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary.
---

# Test Release Canary

The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests the artifacts a `Release Dev` run just published. It is the last automated checkpoint before tagging a public release: if the canary is red, the published `dev` artifacts do not install on a stock environment.

## What the canary verifies

| Job | Runner | Verifies |
|---|---|---|
| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. |
| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. |
| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. |
| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. |

`install.sh` defaults to the *latest tagged* release — the canary is therefore checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images.

## Trigger paths

The workflow has two triggers:

```yaml
on:
workflow_dispatch:
workflow_run:
workflows: ["Release Dev"]
types: [completed]
```

- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary.
- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition.

When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL.
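
In the workflow this is a single expression on the `install.sh` URL (excerpted from `release-canary.yml`):

```yaml
run: |
  curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/${{ github.event.workflow_run.head_sha || github.sha }}/install.sh | sh
```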

## Manual dispatch

Run the canary as-is on the current branch:

```shell
gh workflow run release-canary.yml --ref "$(git branch --show-current)"
```

Watch the run that starts:

```shell
sleep 5 # let GitHub register the dispatch
gh run list --workflow release-canary.yml --limit 1
gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
```

View only failed jobs after completion:

```shell
gh run view <run-id> --log-failed
```
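
The dispatch-and-watch loop above can be bundled into a small helper; a sketch (the function name is ours, and `--exit-status` makes `gh run watch` propagate the run's conclusion as the exit code):

```shell
# Sketch: dispatch the canary on a ref and block until the run finishes.
canary_dispatch_watch() {
  local ref="${1:-$(git branch --show-current)}"
  gh workflow run release-canary.yml --ref "$ref"
  sleep 5  # let GitHub register the dispatch
  local run_id
  run_id="$(gh run list --workflow release-canary.yml --limit 1 \
    --json databaseId --jq '.[0].databaseId')"
  gh run watch "$run_id" --exit-status
}
```

Call it as `canary_dispatch_watch` for the current branch, or `canary_dispatch_watch my-branch` for another.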

## Iterating on the canary itself

When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary — you're validating that the canary still works against known-good artifacts.

Note `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads are from the latest public tag.
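
The URL resolution can be mimicked locally; a minimal sketch (the function name is ours, the URL shape comes from the workflow):

```shell
# Sketch: reproduce the canary's install.sh URL resolution.
# head_sha is the triggering Release Dev SHA; it is empty on manual
# dispatch, in which case the branch tip (github.sha) is used instead.
install_sh_url() {
  local head_sha="$1" fallback_sha="$2"
  printf 'https://raw.githubusercontent.com/NVIDIA/OpenShell/%s/install.sh\n' \
    "${head_sha:-$fallback_sha}"
}
```

For a manual dispatch, `install_sh_url "" "$(git rev-parse HEAD)"` prints the URL the canary will fetch.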

## Testing artifacts from a specific SHA

`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`):

- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` — floating, overwritten on every main push.
- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.<sha>` — immutable, `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:<sha>` and `ghcr.io/nvidia/openshell/supervisor:<sha>`.

To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs.
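
Deriving the pinned chart version is a string concatenation; a sketch (whether the `<sha>` suffix is the full or abbreviated commit SHA is an assumption — confirm against `.github/actions/release-helm-oci/action.yml` before relying on it):

```shell
# Sketch: derive the immutable chart version for the current commit.
# Assumes the suffix is the full commit SHA.
pinned_chart_version() {
  echo "0.0.0-dev.$(git rev-parse HEAD)"
}

# Example: inspect the pinned chart's metadata without installing it
# helm show chart oci://ghcr.io/nvidia/openshell/helm-chart --version "$(pinned_chart_version)"
```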

## Local kind reproduction

The `kubernetes` job can be reproduced on any machine with Docker plus `kubectl` and `helm` (as provided by `mise install`):

```shell
kind create cluster --name release-canary-local

helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
  --version 0.0.0-dev \
  --namespace openshell --create-namespace \
  --set server.disableTls=true \
  --set server.disableGatewayAuth=true \
  --set pkiInitJob.enabled=false \
  --wait --timeout 5m

kubectl wait --namespace openshell \
  --for=condition=Ready pod \
  --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \
  --timeout=300s

kubectl port-forward --namespace openshell svc/openshell 8080:8080 &
openshell gateway add http://127.0.0.1:8080 --local --name kind
openshell status
```

Swap `0.0.0-dev` for `0.0.0-dev.<sha>` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`.

Loopback registration defaults the gateway name to `openshell` when `--name` is omitted, which collides with the `install.sh`-installed local gateway. Always pass `--name kind` (or another distinct name) when registering alongside a local install.
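
Teardown can be wrapped in one helper; a sketch (the `pkill` pattern assumes the exact port-forward command shown above):

```shell
# Sketch: tear down the local kind reproduction.
cleanup_local_canary() {
  # Stop the backgrounded port-forward, if it is still running.
  pkill -f "kubectl port-forward --namespace openshell svc/openshell" 2>/dev/null || true
  kind delete cluster --name release-canary-local
}
```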

## Diagnosing failures

| Symptom | Likely cause | Where to look |
|---|---|---|
| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. |
| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. |
| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min — usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. |
| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. |
| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. |

The `kubernetes` job's diagnostics step (which runs only on failure, via `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), the port-forward log, the gateway list, and the CLI version. Read it top-to-bottom — most failures reveal themselves by the manifest or pod logs.
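
A grep over a saved log (`gh run view <run-id> --log-failed > canary.log`) surfaces the usual signatures; a sketch (the signature list is our guess, not exhaustive):

```shell
# Sketch: scan a saved canary log for common failure signatures.
triage_canary_log() {
  grep -nE 'CrashLoopBackOff|ImagePullBackOff|connection refused|checksum mismatch' "$1" \
    || echo "no known signature found; read the diagnostics dump top-to-bottom"
}
```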

## Related

- `helm-dev-environment` skill — local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts).
- `watch-github-actions` skill — generic `gh run` workflow monitoring.
- `debug-openshell-cluster` skill — runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump.
93 changes: 93 additions & 0 deletions .github/workflows/release-canary.yml
@@ -71,3 +71,96 @@ jobs:
      run: |
        curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/${{ github.event.workflow_run.head_sha || github.sha }}/install.sh | sh
        openshell status

  kubernetes:
    name: Kubernetes Helm (kind)
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    timeout-minutes: 20
    env:
      KIND_CLUSTER_NAME: release-canary-${{ github.run_id }}
      RELEASE_NAME: openshell
      RELEASE_NAMESPACE: openshell
      KIND_GATEWAY_NAME: kind
    steps:
      - name: Install Helm
        uses: azure/setup-helm@v4

      - name: Create kind cluster
        uses: helm/kind-action@v1
        with:
          cluster_name: ${{ env.KIND_CLUSTER_NAME }}
          wait: 120s

      - name: Install OpenShell Helm chart from GHCR OCI
        run: |
          set -euo pipefail
          helm install "$RELEASE_NAME" oci://ghcr.io/nvidia/openshell/helm-chart \
            --version 0.0.0-dev \
            --namespace "$RELEASE_NAMESPACE" --create-namespace \
            --set server.disableTls=true \
            --set server.disableGatewayAuth=true \
            --set pkiInitJob.enabled=false \
            --wait --timeout 5m

      - name: Verify gateway pod is Ready
        run: |
          set -euo pipefail
          kubectl wait --namespace "$RELEASE_NAMESPACE" \
            --for=condition=Ready pod \
            --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=${RELEASE_NAME}" \
            --timeout=300s

      - name: Port-forward gateway service
        run: |
          set -euo pipefail
          nohup kubectl port-forward --namespace "$RELEASE_NAMESPACE" \
            "svc/${RELEASE_NAME}" 8080:8080 \
            > port-forward.log 2>&1 &
          echo $! > port-forward.pid
          for _ in $(seq 1 30); do
            if (echo > /dev/tcp/127.0.0.1/8080) >/dev/null 2>&1; then
              echo "port-forward is reachable"
              exit 0
            fi
            sleep 1
          done
          echo "port-forward did not become reachable" >&2
          cat port-forward.log >&2
          exit 1

      - name: Install OpenShell CLI
        run: |
          set -euo pipefail
          mkdir -p "${HOME}/.config/openshell"
          printf 'OPENSHELL_DRIVERS=docker\n' > "${HOME}/.config/openshell/gateway.env"
          curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/${{ github.event.workflow_run.head_sha || github.sha }}/install.sh | sh

      - name: Register kind gateway and check status
        run: |
          set -euo pipefail
          openshell gateway add http://127.0.0.1:8080 --local --name "$KIND_GATEWAY_NAME"
          openshell status

      - name: Diagnostics on failure
        if: failure()
        run: |
          set +e
          echo "--- helm status ---"
          helm status "$RELEASE_NAME" --namespace "$RELEASE_NAMESPACE"
          echo "--- helm get manifest ---"
          helm get manifest "$RELEASE_NAME" --namespace "$RELEASE_NAMESPACE"
          echo "--- get all ---"
          kubectl get all --namespace "$RELEASE_NAMESPACE"
          echo "--- describe pods ---"
          kubectl describe pods --namespace "$RELEASE_NAMESPACE"
          echo "--- pod logs ---"
          kubectl logs --namespace "$RELEASE_NAMESPACE" \
            --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=${RELEASE_NAME}" \
            --tail=200 --all-containers --prefix
          echo "--- port-forward log ---"
          cat port-forward.log 2>/dev/null
          echo "--- openshell gateway list ---"
          openshell gateway list 2>/dev/null
          echo "--- openshell version ---"
          openshell --version 2>/dev/null
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -41,7 +41,10 @@ These pipelines connect skills into end-to-end workflows. Individual skill files
| `crates/openshell-tui/` | Terminal UI | Ratatui-based dashboard for monitoring |
| `crates/openshell-driver-kubernetes/` | Kubernetes compute driver | In-process `ComputeDriver` backend for K8s sandbox pods |
| `crates/openshell-driver-docker/` | Docker compute driver | In-process `ComputeDriver` backend for local Docker sandbox containers |
| `crates/openshell-driver-podman/` | Podman compute driver | In-process `ComputeDriver` backend for rootless Podman sandbox containers (single-host / dev) |
| `crates/openshell-driver-vm/` | VM compute driver | Standalone libkrun-backed `ComputeDriver` subprocess (embeds its own rootfs + runtime) |
| `crates/openshell-prover/` | Policy verifier | Formal policy verification (Z3-backed) for sandbox policy correctness |
| `crates/openshell-vfio/` | VFIO GPU passthrough | VFIO GPU passthrough lifecycle for VM-driver sandboxes |
| `python/openshell/` | Python SDK | Python bindings and CLI packaging |
| `proto/` | Protobuf definitions | gRPC service contracts |
| `deploy/` | Docker, Helm, K8s | Dockerfiles, Helm chart, manifests |
10 changes: 10 additions & 0 deletions CI.md
@@ -111,3 +111,13 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma
| `.github/workflows/e2e-gate.yml` | Posts the required `E2E Gate` check on the PR. Re-evaluates after the gated workflow completes. |
| `.github/workflows/e2e-gate-check.yml` | Reusable gate logic shared by E2E and GPU E2E. |
| `.github/workflows/e2e-label-help.yml` | When a `test:e2e*` label is applied, posts a PR comment telling the maintainer the next manual step (re-run an existing workflow run, or `/ok to test <SHA>` to refresh the mirror). |

## Release workflows

These workflows run after merge to publish dev/tagged artifacts and verify them. They are not PR-gated.

| File | Role |
|---|---|
| `.github/workflows/release-dev.yml` | Publishes the rolling `dev` build on every push to `main`. Builds gateway/supervisor images and binaries, packages, wheels, and pushes the Helm chart as `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` (plus an immutable `0.0.0-dev.<sha>` pin). Also dispatchable manually. |
| `.github/workflows/release-tag.yml` | Publishes a tagged public release. |
| `.github/workflows/release-canary.yml` | Smoke-tests published artifacts on `macos`, `ubuntu`, `fedora`, and `kubernetes` (kind + Helm) runners. Triggers automatically when `Release Dev` succeeds, and via `workflow_dispatch` on any branch (`gh workflow run release-canary.yml --ref <branch>`). The `kubernetes` job pins to `0.0.0-dev` artifacts; the other jobs install the latest tagged release via `install.sh`. See the `test-release-canary` skill for the manual-dispatch playbook and local kind reproduction. |
3 changes: 3 additions & 0 deletions CONTRIBUTING.md
@@ -68,13 +68,16 @@ Skills live in `.agents/skills/`. Your agent's harness can discover and load the
| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
| Getting Started | `debug-openshell-cluster` | Diagnose gateway deployment and health issues |
| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
| Getting Started | `helm-dev-environment` | Start, configure, and tear down the local k3d + Skaffold + Helm dev environment |
| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
| Contributing | `create-github-issue` | Create well-structured GitHub issues |
| Contributing | `create-github-pr` | Create pull requests with proper conventions |
| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
| Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
| Reviewing | `fix-security-issue` | Implement a fix for a reviewed security issue once `state:agent-ready` is applied |
| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
| Reviewing | `test-release-canary` | Dispatch and iterate on the Release Canary workflow that smoke-tests published artifacts |
| Triage | `triage-issue` | Assess, classify, and route community-filed issues |
| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
| Platform | `tui-development` | Development guide for the ratatui-based terminal UI |