feat: add GPU count support for Kubernetes sandboxes #1338

@ryana

Problem Statement

OpenShell can express generic GPU intent with openshell sandbox create --gpu, but users cannot request a specific GPU count through the public sandbox API.

For Kubernetes-backed gateways, generic GPU intent maps to a limit of exactly one nvidia.com/gpu. This blocks workloads that need multiple GPUs, for example:

openshell sandbox create --gpu-count 4 -- claude

Users can work around this only by injecting Kubernetes-specific resource settings through sandbox templates. That makes a common scheduling requirement driver-specific and bypasses OpenShell's typed sandbox spec layer.

Proposed Design

Add first-class GPU count support across the public sandbox spec, compute-driver spec, CLI, server mapping, and Kubernetes driver.

Public API:

  • Add gpu_count to SandboxSpec.
  • Treat 0, the default value, as unspecified.
  • Use values >0 to request that many GPUs.
  • Preserve existing gpu: true behavior.

Compute driver API:

  • Add gpu_count to DriverSandboxSpec.
  • Copy SandboxSpec.gpu_count into DriverSandboxSpec.gpu_count in the server public-to-driver mapping.
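
The two spec changes and the mapping between them could look roughly like the Python sketch below. The dataclass layout, the `to_driver_spec` name, and the choice of Python are assumptions made for illustration only; the real contracts live in OpenShell's proto files, which this issue does not quote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxSpec:
    """Hypothetical stand-in for the public sandbox spec."""
    gpu: bool = False
    gpu_count: int = 0  # 0 means unspecified; clients that omit the field get this


@dataclass(frozen=True)
class DriverSandboxSpec:
    """Hypothetical stand-in for the compute-driver spec."""
    gpu: bool = False
    gpu_count: int = 0


def to_driver_spec(public: SandboxSpec) -> DriverSandboxSpec:
    """Copy the GPU fields from the public spec to the driver spec unchanged."""
    return DriverSandboxSpec(gpu=public.gpu, gpu_count=public.gpu_count)
```

Keeping the mapping a plain copy leaves interpretation of `gpu_count` to each driver, which matches the proposal that drivers without count support may ignore the field.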

CLI:

  • Add openshell sandbox create --gpu-count COUNT.
  • Reject --gpu-count 0.
  • Treat --gpu-count N as GPU intent, equivalent to setting gpu: true.
  • Reject combining --gpu-count with --gpu-device, because count-based scheduling and device-specific selection are different allocation modes.
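
The CLI rules above can be sketched with Python's `argparse`. This is not OpenShell's actual CLI code: the helper names are hypothetical, `--gpu-device` is assumed to take a single value, and only the GPU flags are modeled (the trailing `-- claude` command is left out).

```python
import argparse


def positive_int(value: str) -> int:
    """Parse COUNT and reject anything below 1, including 0."""
    try:
        count = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid GPU count: {value!r}")
    if count < 1:
        raise argparse.ArgumentTypeError("--gpu-count must be at least 1")
    return count


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="openshell sandbox create")
    parser.add_argument("--gpu", action="store_true")
    parser.add_argument("--gpu-count", type=positive_int, default=None, metavar="COUNT")
    # Assumption: --gpu-device takes one value; its real syntax is not given here.
    parser.add_argument("--gpu-device", default=None)
    return parser


def resolve_gpu_flags(argv):
    """Return the GPU fields the CLI would place in the sandbox spec."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.gpu_count is not None and args.gpu_device is not None:
        # Count-based scheduling and device selection are separate modes.
        parser.error("--gpu-count cannot be combined with --gpu-device")
    gpu_count = args.gpu_count or 0
    # --gpu-count N implies GPU intent, equivalent to gpu: true.
    return {"gpu": args.gpu or gpu_count > 0, "gpu_count": gpu_count}
```

Under these assumptions, `resolve_gpu_flags(["--gpu-count", "4"])` yields `gpu` set to true with `gpu_count` 4, while `--gpu-count 0` and the `--gpu-count` plus `--gpu-device` combination both exit with a usage error.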

Kubernetes driver:

  • If gpu_count > 0, set the sandbox container resource limit:
      resources:
        limits:
          nvidia.com/gpu: "<count>"
  • If gpu_count == 0 and gpu == true, preserve current behavior by requesting one GPU.
  • Preserve existing CPU, memory, custom resource, and typed-resource overlay behavior.
  • Require clusters to expose allocatable nvidia.com/gpu resources through the NVIDIA device plugin or equivalent.
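
The limit selection described above can be sketched as a small pure function. The function name and signature are hypothetical, and the issue does not say what happens when an overlay already sets nvidia.com/gpu; the sketch lets the typed value win, purely as an assumption.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def render_gpu_limits(gpu: bool, gpu_count: int, limits=None) -> dict:
    """Return a copy of the container limits with the GPU limit applied."""
    if gpu_count < 0:
        raise ValueError("gpu_count must not be negative")
    rendered = dict(limits or {})  # keep CPU, memory, and custom resources
    if gpu_count > 0:
        # Explicit count: request exactly that many GPUs.
        rendered[GPU_RESOURCE] = str(gpu_count)
    elif gpu:
        # Generic GPU intent without a count: keep the one-GPU behavior.
        rendered[GPU_RESOURCE] = "1"
    return rendered
```

With these rules, `render_gpu_limits(True, 4)` produces a limit of `"4"`, `render_gpu_limits(True, 0)` produces `"1"`, and a spec with no GPU intent leaves the limits untouched.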

Compatibility:

  • Existing clients omit gpu_count, so it defaults to 0.
  • Existing --gpu behavior remains unchanged.
  • Docker, Podman, and VM drivers can safely receive the new field and ignore it unless they later add explicit count support.

Acceptance criteria:

  • openshell sandbox create --gpu-count 4 -- claude sends SandboxSpec { gpu: true, gpu_count: 4 }.
  • --gpu-count 0 is rejected with a clear error.
  • --gpu-count cannot be combined with --gpu-device.
  • Server mapping copies public gpu_count into the driver spec.
  • Kubernetes pod rendering emits limits["nvidia.com/gpu"] == "4" for gpu_count: 4.
  • Existing --gpu still emits limits["nvidia.com/gpu"] == "1".
  • Docs explain --gpu-count, Kubernetes nvidia.com/gpu scheduling, and the --gpu-device conflict.

Alternatives Considered

  • Continue injecting nvidia.com/gpu through raw template resources.
    • This works only for users who know the Kubernetes resource model and bypasses OpenShell's typed sandbox API.
  • Overload --gpu with an optional value.
    • This is ambiguous and risks breaking existing boolean flag behavior.
  • Reuse --gpu-device for counts.
    • Device-specific selection and count-based scheduling are separate allocation modes, so combining them would make driver behavior unclear.

Agent Investigation

  • Inspected the existing proto contracts, CLI sandbox-create path, server compute mapping, and Kubernetes driver rendering path.
  • Found that OpenShell already has a public-to-driver sandbox spec mapping layer, so GPU count belongs in typed specs rather than template resource passthrough.
  • Found existing Kubernetes GPU behavior maps generic gpu: true to one nvidia.com/gpu limit.
  • Identified docs that need updates: sandbox management docs, Kubernetes setup prerequisites, Kubernetes driver README, and compute runtime architecture docs.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
