EphemeralRunnerSet controller does not clean up runners stuck in init container failure #4452

@Okabe-Junya

Controller Version

0.13.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy an `AutoscalingRunnerSet` with init containers (e.g., a `fix-docker-sock` init container that runs `ln -sf /var/run/user/1000/docker.sock /var/run/docker.sock`)
2. Have a node where the container runtime (containerd) experiences transient failures (e.g., `failed to create containerd task: failed to create shim task: context canceled`)
3. The init container fails with `Init:StartError`, causing the Pod phase to transition to `Failed`
4. Observe that the `EphemeralRunner` resource remains in `Pending` status indefinitely
5. The failed pods accumulate and consume runner pool capacity

Describe the bug

When an `EphemeralRunner` pod fails during init container execution (e.g., `Init:StartError` caused by a containerd runtime failure), the EphemeralRunner controller does not handle the failure: the `EphemeralRunner` resource stays in `Pending` and the failed pod is not cleaned up. The failed pods accumulate and consume pool capacity, degrading the runner pool.

Describe the expected behavior

I think:

  • EphemeralRunner controller should inspect `pod.Status.InitContainerStatuses` in addition to the main container statuses, so that init container failures are detected and reported.
  • EphemeralRunnerSet controller should clean up `Failed` EphemeralRunners during normal reconciliation (not only during deletion), similar to how it cleans up `Succeeded` runners.
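
To make the first point concrete, here is a minimal sketch of the kind of check I have in mind. It is not the controller's actual code: the structs are simplified stand-ins that only mirror the field names of the `k8s.io/api/core/v1` types, `firstInitFailure` and `failureMessage` are hypothetical helpers, and the exit code and reason in `main` are illustrative values.

```go
package main

import "fmt"

// Simplified stand-ins for k8s.io/api/core/v1 types (assumption: only the
// fields needed for this sketch are reproduced here).
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	Phase                 string
	InitContainerStatuses []ContainerStatus
	ContainerStatuses     []ContainerStatus
}

// firstInitFailure (hypothetical helper) returns the first init container
// that terminated with a non-zero exit code, or nil if there is none.
func firstInitFailure(status PodStatus) *ContainerStatus {
	for i := range status.InitContainerStatuses {
		cs := &status.InitContainerStatuses[i]
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return cs
		}
	}
	return nil
}

// failureMessage (hypothetical helper) formats a message that could be
// recorded on the EphemeralRunner status. cs must have a Terminated state.
func failureMessage(cs *ContainerStatus) string {
	t := cs.State.Terminated
	return fmt.Sprintf("init container %q failed: reason=%s exitCode=%d",
		cs.Name, t.Reason, t.ExitCode)
}

func main() {
	// Illustrative status for a pod whose init container could not start.
	status := PodStatus{
		Phase: "Failed",
		InitContainerStatuses: []ContainerStatus{{
			Name: "fix-docker-sock",
			State: ContainerState{Terminated: &ContainerStateTerminated{
				ExitCode: 128, // illustrative value
				Reason:   "StartError",
				Message:  "failed to create containerd task: failed to create shim task: context canceled",
			}},
		}},
	}
	if cs := firstInitFailure(status); cs != nil {
		fmt.Println(failureMessage(cs))
	}
}
```

In the real controller the same loop would presumably run over the actual `corev1.PodStatus`, alongside the existing check of the main container statuses.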

When a pod fails due to init container errors, the `EphemeralRunner` status should transition to `Failed` (not remain `Pending`), so that:

  • Operators can observe the actual state via `kubectl get ephemeralrunners`
  • External cleanup mechanisms (e.g., CronJobs) can target failed runners
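
As a rough illustration of why the `Failed` status matters for cleanup, the sketch below splits a list of runners into those to keep and those that are finished and could be removed. The `EphemeralRunner` struct is a simplified, hypothetical stand-in for the real custom resource, the phase strings are assumed to mirror pod phases, and `partitionForCleanup` is a made-up name, not an existing controller function.

```go
package main

import "fmt"

// EphemeralRunner is a simplified, hypothetical stand-in for the custom
// resource; Phase is assumed to hold pod-phase-like strings.
type EphemeralRunner struct {
	Name  string
	Phase string
}

// partitionForCleanup (hypothetical helper) separates runners that are
// finished (Succeeded or Failed) from those that should be kept.
func partitionForCleanup(runners []EphemeralRunner) (keep, cleanup []EphemeralRunner) {
	for _, r := range runners {
		switch r.Phase {
		case "Succeeded", "Failed":
			cleanup = append(cleanup, r)
		default:
			// Pending and Running runners still count toward pool capacity.
			keep = append(keep, r)
		}
	}
	return keep, cleanup
}

func main() {
	runners := []EphemeralRunner{
		{Name: "runner-a", Phase: "Running"},
		{Name: "runner-b", Phase: "Failed"},
		{Name: "runner-c", Phase: "Pending"},
		{Name: "runner-d", Phase: "Succeeded"},
	}
	keep, cleanup := partitionForCleanup(runners)
	fmt.Println("keep:", len(keep), "cleanup:", len(cleanup))
}
```

A runner that stays `Pending` after its init container fails lands in the keep bucket here, which is the capacity problem described above.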

Additional Context

N/A

Controller Logs

N/A

Runner Pod Logs

N/A

Metadata

Labels

  • bug (Something isn't working)
  • gha-runner-scale-set (Related to the gha-runner-scale-set mode)
  • needs triage (Requires review from the maintainers)
