Checks
Controller Version
0.13.0
Deployment Method
Helm
Checks
To Reproduce
1. Deploy an `AutoscalingRunnerSet` with init containers (e.g., a `fix-docker-sock` init container that runs `ln -sf /var/run/user/1000/docker.sock /var/run/docker.sock`)
2. Have a node where the container runtime (containerd) experiences transient failures (e.g., `failed to create containerd task: failed to create shim task: context canceled`)
3. The init container fails with `Init:StartError`, causing the Pod phase to transition to `Failed`
4. Observe that the `EphemeralRunner` resource remains in `Pending` status indefinitely
5. The failed pods accumulate and consume runner pool capacity
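For concreteness, a minimal init-container spec of the kind described in step 1 (the container name and symlink command are from the steps above; the image is an assumption, any image with a shell works):

```yaml
initContainers:
  - name: fix-docker-sock
    image: busybox  # assumed image for illustration
    command:
      - sh
      - -c
      - ln -sf /var/run/user/1000/docker.sock /var/run/docker.sock
```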
Describe the bug
When an `EphemeralRunner` pod fails during init container execution (e.g., `Init:StartError` due to a containerd runtime failure), the EphemeralRunner controller does not handle the failure: the failed pods remain, consume pool capacity, and degrade the runner pool.
Describe the expected behavior
I think:
1. The `EphemeralRunner` controller should inspect `pod.Status.InitContainerStatuses` in addition to the main container statuses, so that init container failures are properly detected and reported.
2. The `EphemeralRunnerSet` controller should clean up `Failed` EphemeralRunners during normal reconciliation (not only during deletion), similar to how it cleans up `Succeeded` runners.
3. When a pod fails due to an init container error, the `EphemeralRunner` status should transition to `Failed` (not remain `Pending`), so that:
   - operators can observe the actual state via `kubectl get ephemeralrunners`
   - external cleanup mechanisms (e.g., CronJobs) can target failed runners
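A rough sketch of the missing check from point 1. This is not the controller's actual code; the types below are simplified stand-ins mirroring the shape of `k8s.io/api/core/v1` container statuses, so the snippet is self-contained:

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 Pod status types (assumed shapes,
// not the real k8s.io/api/core/v1 structs).
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

// initContainerFailed reports whether any init container terminated with a
// nonzero exit code -- the check the controller could run over
// pod.Status.InitContainerStatuses alongside its existing inspection of
// main container statuses.
func initContainerFailed(statuses []ContainerStatus) (string, bool) {
	for _, s := range statuses {
		if t := s.State.Terminated; t != nil && t.ExitCode != 0 {
			return s.Name, true
		}
	}
	return "", false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "fix-docker-sock", State: ContainerState{
			Terminated: &ContainerStateTerminated{ExitCode: 128, Reason: "StartError"},
		}},
	}
	if name, failed := initContainerFailed(statuses); failed {
		fmt.Printf("init container %q failed; EphemeralRunner should transition to Failed\n", name)
	}
}
```

If such a check trips, the controller could mark the `EphemeralRunner` as `Failed`, which would let the `EphemeralRunnerSet` reconciliation from point 2 reap it.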
Additional Context
N/A
Controller Logs
Runner Pod Logs