diff --git a/.agents/skills/test-release-canary/SKILL.md b/.agents/skills/test-release-canary/SKILL.md new file mode 100644 index 000000000..314e3f188 --- /dev/null +++ b/.agents/skills/test-release-canary/SKILL.md @@ -0,0 +1,120 @@ +--- +name: test-release-canary +description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary. +--- + +# Test Release Canary + +The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests the artifacts a `Release Dev` run just published. It is the last automated checkpoint before tagging a public release: if the canary is red, the published `dev` artifacts do not install on a stock environment. + +## What the canary verifies + +| Job | Runner | Verifies | +|---|---|---| +| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. | +| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. | +| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. 
| +| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. | + +`install.sh` defaults to the *latest tagged* release — the canary is therefore checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images. + +## Trigger paths + +The workflow has two triggers: + +```yaml +on: + workflow_dispatch: + workflow_run: + workflows: ["Release Dev"] + types: [completed] +``` + +- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary. +- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition. + +When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL. 
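The `head_sha || github.sha` fallback behaves like shell default expansion; a minimal sketch using `${var:-default}` in its place (both SHA values below are placeholders, not real commits):

```shell
# Stand-in for the workflow expression
# `github.event.workflow_run.head_sha || github.sha`.
head_sha=""            # empty on a manual workflow_dispatch
github_sha="abc1234"   # placeholder for the branch-tip SHA
url="https://raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha:-$github_sha}/install.sh"
echo "$url"   # → …/abc1234/install.sh
```

So a `workflow_run`-triggered canary always fetches `install.sh` from the exact commit Release Dev built, while a manual dispatch fetches it from whatever the chosen ref currently points at.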
+ +## Manual dispatch + +Run the canary as-is on the current branch: + +```shell +gh workflow run release-canary.yml --ref "$(git branch --show-current)" +``` + +Watch the run that starts: + +```shell +sleep 5 # let GitHub register the dispatch +gh run list --workflow release-canary.yml --limit 1 +gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')" +``` + +View only failed jobs after completion: + +```shell +gh run view --log-failed +``` + +## Iterating on the canary itself + +When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary — you're validating that the canary still works against known-good artifacts. + +Note `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads are from the latest public tag. + +## Testing artifacts from a specific SHA + +`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`): + +- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` — floating, overwritten on every main push. +- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.` — immutable, `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:` and `:supervisor:`. + +To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs. 
+ +## Local kind reproduction + +The `kubernetes` job can be reproduced on any machine with Docker and `mise install`-provided `kubectl` + `helm`: + +```shell +kind create cluster --name release-canary-local + +helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \ + --version 0.0.0-dev \ + --namespace openshell --create-namespace \ + --set server.disableTls=true \ + --set server.disableGatewayAuth=true \ + --set pkiInitJob.enabled=false \ + --wait --timeout 5m + +kubectl wait --namespace openshell \ + --for=condition=Ready pod \ + --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \ + --timeout=300s + +kubectl port-forward --namespace openshell svc/openshell 8080:8080 & +openshell gateway add http://127.0.0.1:8080 --local --name kind +openshell status +``` + +Swap `0.0.0-dev` for `0.0.0-dev.` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`. + +Loopback registration auto-derives the gateway name to `openshell` if `--name` is omitted, which collides with the `install.sh`-installed local gateway — always pass `--name kind` (or another distinct name) when registering in addition to a local install. + +## Diagnosing failures + +| Symptom | Likely cause | Where to look | +|---|---|---| +| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. | +| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. | +| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min — usually image pull failure or readiness probe failing. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. 
| +| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. | +| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. | + +The `kubernetes` job's diagnostics step (only runs `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top-to-bottom — most failures fall out by the manifest or pod logs. + +## Related + +- `helm-dev-environment` skill — local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts). +- `watch-github-actions` skill — generic `gh run` workflow monitoring. +- `debug-openshell-cluster` skill — runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump. 
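One extra check that pairs with the failure table above: when the `kubernetes` job dies in `ImagePullBackOff`, confirm the `:dev` images are actually published. A sketch — it assumes the supervisor image sits alongside the gateway image in GHCR (not confirmed here) and that `docker login ghcr.io` has already been done:

```shell
# Existence check for the :dev images the dev chart pulls.
# `docker manifest inspect` queries the registry without pulling layers.
for img in gateway supervisor; do
  docker manifest inspect "ghcr.io/nvidia/openshell/${img}:dev" >/dev/null 2>&1 \
    && echo "${img}:dev present" \
    || echo "${img}:dev missing or not accessible" >&2
done
```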
diff --git a/.github/workflows/release-canary.yml b/.github/workflows/release-canary.yml index 61f8a8a1e..abb227cb7 100644 --- a/.github/workflows/release-canary.yml +++ b/.github/workflows/release-canary.yml @@ -71,3 +71,96 @@ jobs: run: | curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/${{ github.event.workflow_run.head_sha || github.sha }}/install.sh | sh openshell status + + kubernetes: + name: Kubernetes Helm (kind) + if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }} + runs-on: ubuntu-latest + timeout-minutes: 20 + env: + KIND_CLUSTER_NAME: release-canary-${{ github.run_id }} + RELEASE_NAME: openshell + RELEASE_NAMESPACE: openshell + KIND_GATEWAY_NAME: kind + steps: + - name: Install Helm + uses: azure/setup-helm@v4 + + - name: Create kind cluster + uses: helm/kind-action@v1 + with: + cluster_name: ${{ env.KIND_CLUSTER_NAME }} + wait: 120s + + - name: Install OpenShell Helm chart from GHCR OCI + run: | + set -euo pipefail + helm install "$RELEASE_NAME" oci://ghcr.io/nvidia/openshell/helm-chart \ + --version 0.0.0-dev \ + --namespace "$RELEASE_NAMESPACE" --create-namespace \ + --set server.disableTls=true \ + --set server.disableGatewayAuth=true \ + --set pkiInitJob.enabled=false \ + --wait --timeout 5m + + - name: Verify gateway pod is Ready + run: | + set -euo pipefail + kubectl wait --namespace "$RELEASE_NAMESPACE" \ + --for=condition=Ready pod \ + --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=${RELEASE_NAME}" \ + --timeout=300s + + - name: Port-forward gateway service + run: | + set -euo pipefail + nohup kubectl port-forward --namespace "$RELEASE_NAMESPACE" \ + "svc/${RELEASE_NAME}" 8080:8080 \ + > port-forward.log 2>&1 & + echo $! 
> port-forward.pid + for _ in $(seq 1 30); do + if (echo > /dev/tcp/127.0.0.1/8080) >/dev/null 2>&1; then + echo "port-forward is reachable" + exit 0 + fi + sleep 1 + done + echo "port-forward did not become reachable" >&2 + cat port-forward.log >&2 + exit 1 + + - name: Install OpenShell CLI + run: | + set -euo pipefail + mkdir -p "${HOME}/.config/openshell" + printf 'OPENSHELL_DRIVERS=docker\n' > "${HOME}/.config/openshell/gateway.env" + curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/${{ github.event.workflow_run.head_sha || github.sha }}/install.sh | sh + + - name: Register kind gateway and check status + run: | + set -euo pipefail + openshell gateway add http://127.0.0.1:8080 --local --name "$KIND_GATEWAY_NAME" + openshell status + + - name: Diagnostics on failure + if: failure() + run: | + set +e + echo "--- helm status ---" + helm status "$RELEASE_NAME" --namespace "$RELEASE_NAMESPACE" + echo "--- helm get manifest ---" + helm get manifest "$RELEASE_NAME" --namespace "$RELEASE_NAMESPACE" + echo "--- get all ---" + kubectl get all --namespace "$RELEASE_NAMESPACE" + echo "--- describe pods ---" + kubectl describe pods --namespace "$RELEASE_NAMESPACE" + echo "--- pod logs ---" + kubectl logs --namespace "$RELEASE_NAMESPACE" \ + --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=${RELEASE_NAME}" \ + --tail=200 --all-containers --prefix + echo "--- port-forward log ---" + cat port-forward.log 2>/dev/null + echo "--- openshell gateway list ---" + openshell gateway list 2>/dev/null + echo "--- openshell version ---" + openshell --version 2>/dev/null diff --git a/AGENTS.md b/AGENTS.md index 2395b8176..1d7279816 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -41,7 +41,10 @@ These pipelines connect skills into end-to-end workflows. 
Individual skill files | `crates/openshell-tui/` | Terminal UI | Ratatui-based dashboard for monitoring | | `crates/openshell-driver-kubernetes/` | Kubernetes compute driver | In-process `ComputeDriver` backend for K8s sandbox pods | | `crates/openshell-driver-docker/` | Docker compute driver | In-process `ComputeDriver` backend for local Docker sandbox containers | +| `crates/openshell-driver-podman/` | Podman compute driver | In-process `ComputeDriver` backend for rootless Podman sandbox containers (single-host / dev) | | `crates/openshell-driver-vm/` | VM compute driver | Standalone libkrun-backed `ComputeDriver` subprocess (embeds its own rootfs + runtime) | +| `crates/openshell-prover/` | Policy verifier | Formal policy verification (Z3-backed) for sandbox policy correctness | +| `crates/openshell-vfio/` | VFIO GPU passthrough | VFIO GPU passthrough lifecycle for VM-driver sandboxes | | `python/openshell/` | Python SDK | Python bindings and CLI packaging | | `proto/` | Protobuf definitions | gRPC service contracts | | `deploy/` | Docker, Helm, K8s | Dockerfiles, Helm chart, manifests | diff --git a/CI.md b/CI.md index 57e6627ed..01d72a330 100644 --- a/CI.md +++ b/CI.md @@ -111,3 +111,13 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma | `.github/workflows/e2e-gate.yml` | Posts the required `E2E Gate` check on the PR. Re-evaluates after the gated workflow completes. | | `.github/workflows/e2e-gate-check.yml` | Reusable gate logic shared by E2E and GPU E2E. | | `.github/workflows/e2e-label-help.yml` | When a `test:e2e*` label is applied, posts a PR comment telling the maintainer the next manual step (re-run an existing workflow run, or `/ok to test ` to refresh the mirror). | + +## Release workflows + +These workflows run after merge to publish dev/tagged artifacts and verify them. They are not PR-gated. 
+ +| File | Role | +|---|---| +| `.github/workflows/release-dev.yml` | Publishes the rolling `dev` build on every push to `main`. Builds gateway/supervisor images and binaries, packages, wheels, and pushes the Helm chart as `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` (plus an immutable `0.0.0-dev.` pin). Also dispatchable manually. | +| `.github/workflows/release-tag.yml` | Publishes a tagged public release. | +| `.github/workflows/release-canary.yml` | Smoke-tests published artifacts on `macos`, `ubuntu`, `fedora`, and `kubernetes` (kind + Helm) runners. Triggers automatically when `Release Dev` succeeds, and via `workflow_dispatch` on any branch (`gh workflow run release-canary.yml --ref `). The `kubernetes` job pins to `0.0.0-dev` artifacts; the other jobs install the latest tagged release via `install.sh`. See the `test-release-canary` skill for the manual-dispatch playbook and local kind reproduction. | diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index aa3d1b0a9..e693b76eb 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -68,13 +68,16 @@ Skills live in `.agents/skills/`. 
Your agent's harness can discover and load the | Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows | | Getting Started | `debug-openshell-cluster` | Diagnose gateway deployment and health issues | | Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues | +| Getting Started | `helm-dev-environment` | Start, configure, and tear down the local k3d + Skaffold + Helm dev environment | | Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue | | Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) | | Contributing | `create-github-issue` | Create well-structured GitHub issues | | Contributing | `create-github-pr` | Create pull requests with proper conventions | | Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions | | Reviewing | `review-security-issue` | Assess security issues for severity and remediation | +| Reviewing | `fix-security-issue` | Implement a fix for a reviewed security issue once `state:agent-ready` is applied | | Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs | +| Reviewing | `test-release-canary` | Dispatch and iterate on the Release Canary workflow that smoke-tests published artifacts | | Triage | `triage-issue` | Assess, classify, and route community-filed issues | | Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs | | Platform | `tui-development` | Development guide for the ratatui-based terminal UI |