feat: add GPU count support for Kubernetes sandboxes #1338

### Problem Statement

OpenShell can express generic GPU intent with `openshell sandbox create --gpu`, but users cannot request a specific GPU count through the public sandbox API.

For Kubernetes-backed gateways, generic GPU intent maps to a limit of exactly one `nvidia.com/gpu`. This blocks workloads that need multiple GPUs, which would want something like:

```sh
openshell sandbox create --gpu-count 4 -- claude
```

Users can work around this only by injecting Kubernetes-specific resource settings through sandbox templates. That makes a common scheduling requirement driver-specific and bypasses OpenShell's typed sandbox spec layer.

### Proposed Design


Add first-class GPU count support across the public sandbox spec, compute-driver spec, CLI, server mapping, and Kubernetes driver.

Public API:

- Add `gpu_count` to `SandboxSpec`.
- Default to `0`, which means unspecified (use the driver default).
- Use values `> 0` to request that many GPUs.
- Preserve existing `gpu: true` behavior.

Compute driver API:

- Add `gpu_count` to `DriverSandboxSpec`.
- Copy `SandboxSpec.gpu_count` into `DriverSandboxSpec.gpu_count` in the server public-to-driver mapping.
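
To make the mapping concrete, here is a minimal sketch in Python. The dataclasses are hypothetical stand-ins for the proto messages. Only the `gpu` and `gpu_count` field names come from this proposal; `to_driver_spec` is an illustrative name, not an existing OpenShell function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxSpec:
    """Hypothetical stand-in for the public sandbox spec."""
    gpu: bool = False
    gpu_count: int = 0  # 0 means unspecified; > 0 requests that many GPUs


@dataclass(frozen=True)
class DriverSandboxSpec:
    """Hypothetical stand-in for the compute-driver spec."""
    gpu: bool = False
    gpu_count: int = 0


def to_driver_spec(public: SandboxSpec) -> DriverSandboxSpec:
    """Copy the GPU fields from the public spec into the driver spec unchanged."""
    return DriverSandboxSpec(gpu=public.gpu, gpu_count=public.gpu_count)
```

Because the field defaults to `0`, a spec built by a client that never sets `gpu_count` maps to a driver spec with `gpu_count == 0`, which is the compatibility behavior this proposal relies on.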

CLI:

- Add `openshell sandbox create --gpu-count COUNT`.
- Reject `--gpu-count 0`.
- Treat `--gpu-count N` as GPU intent, equivalent to setting `gpu: true`.
- Reject combining `--gpu-count` with `--gpu-device`, because count-based scheduling and device-specific selection are different allocation modes.
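
The CLI rules above can be sketched with Python's `argparse`. This illustrates the validation logic only and is not the actual OpenShell CLI code. The helper names (`positive_int`, `parse_gpu_args`), the error wording, and the assumption that `--gpu-device` takes a single value are all hypothetical.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type that accepts only integers greater than zero."""
    try:
        count = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid GPU count: {value!r}")
    if count <= 0:
        raise argparse.ArgumentTypeError("--gpu-count must be greater than 0")
    return count


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="openshell sandbox create")
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--gpu-count", type=positive_int, default=None, metavar="COUNT")
    # Assumption: --gpu-device takes one value; the real flag shape may differ.
    parser.add_argument("--gpu-device", default=None)
    return parser


def parse_gpu_args(argv: list[str]) -> dict:
    """Return the GPU fields that would be sent in the sandbox spec."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.gpu_count is not None and args.gpu_device is not None:
        # parser.error prints the message and exits with status 2.
        parser.error("--gpu-count cannot be combined with --gpu-device")
    return {
        # A count implies GPU intent, same as passing --gpu.
        "gpu": args.gpu or args.gpu_count is not None,
        "gpu_count": args.gpu_count or 0,
    }
```

In this sketch, `parse_gpu_args(["--gpu-count", "4"])` yields `gpu: True` and `gpu_count: 4`, while `--gpu-count 0` and the `--gpu-count` plus `--gpu-device` combination both exit with an error.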

Kubernetes driver:

- If `gpu_count > 0`, set the sandbox container resource limit:

```yaml
resources:
  limits:
    nvidia.com/gpu: "<count>"
```

- If `gpu_count == 0` and `gpu == true`, preserve current behavior by requesting one GPU.
- Preserve existing CPU, memory, custom resource, and typed-resource overlay behavior.
- Require clusters to expose allocatable `nvidia.com/gpu` resources through the NVIDIA device plugin or equivalent.
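
The limit selection described above can be sketched as a small pure function. `render_limits` and its `existing` parameter are hypothetical names, not the driver's actual rendering code. The sketch also assumes the typed GPU fields override any `nvidia.com/gpu` value already present in the limits, a precedence this proposal does not specify.

```python
from typing import Optional

GPU_RESOURCE = "nvidia.com/gpu"


def render_limits(gpu: bool, gpu_count: int,
                  existing: Optional[dict] = None) -> dict:
    """Return container resource limits with the GPU limit applied.

    `existing` holds limits that are already set (CPU, memory, custom
    resources) and is left untouched apart from the GPU key.
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must not be negative")
    limits = dict(existing or {})  # copy so the caller's dict is not mutated
    if gpu_count > 0:
        # Explicit count: request exactly that many GPUs.
        limits[GPU_RESOURCE] = str(gpu_count)
    elif gpu:
        # Generic GPU intent without a count: keep the one-GPU behavior.
        limits[GPU_RESOURCE] = "1"
    return limits
```

With `gpu=True, gpu_count=4` the function sets `nvidia.com/gpu` to `"4"`; with `gpu=True, gpu_count=0` it sets `"1"`; with neither set, the existing limits pass through unchanged.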

Compatibility:

- Existing clients omit `gpu_count`, so it defaults to `0`.
- Existing `--gpu` behavior remains unchanged.
- Docker, Podman, and VM drivers can safely receive the new field and ignore it unless they later add explicit count support.

Acceptance criteria:

- `openshell sandbox create --gpu-count 4 -- claude` sends `SandboxSpec { gpu: true, gpu_count: 4 }`.
- `--gpu-count 0` is rejected with a clear error.
- `--gpu-count` cannot be combined with `--gpu-device`.
- Server mapping copies public `gpu_count` into the driver spec.
- Kubernetes pod rendering emits `limits["nvidia.com/gpu"] == "4"` for `gpu_count: 4`.
- Existing `--gpu` still emits `limits["nvidia.com/gpu"] == "1"`.
- Docs explain `--gpu-count`, Kubernetes `nvidia.com/gpu` scheduling, and the `--gpu-device` conflict.

### Alternatives Considered

- Continue injecting `nvidia.com/gpu` through raw template resources.
    - This works only for users who know the Kubernetes resource model and bypasses OpenShell's typed sandbox API.
- Overload `--gpu` with an optional value.
    - This is ambiguous and risks breaking existing boolean flag behavior.
- Reuse `--gpu-device` for counts.
    - Device-specific selection and count-based scheduling are separate allocation modes, so combining them would make driver behavior unclear.

### Agent Investigation

- Inspected the existing proto contracts, CLI sandbox-create path, server compute mapping, and Kubernetes driver rendering path.
- Found that OpenShell already has a public-to-driver sandbox spec mapping layer, so GPU count belongs in typed specs rather than template resource passthrough.
- Found that the existing Kubernetes GPU behavior maps generic `gpu: true` to a limit of one `nvidia.com/gpu`.
- Identified docs that need updates: sandbox management docs, Kubernetes setup prerequisites, Kubernetes driver README, and compute runtime architecture docs.

### Checklist

- [x] I've reviewed existing issues and the architecture docs
- [x] This is a design proposal, not a "please build this" request
