Bug Report
Description
The operator-controller enters a tight continuous reconcile loop when managing a ClusterExtension with a large bundle (e.g., knative-operator), consuming ~1 full CPU core indefinitely even when the bundle is fully installed and no changes are needed.
Environment
- operator-controller version: v1.5.1 (also verified against the main branch / v1.7.0 source)
- Kubernetes: k3s v1.34.1
- ClusterExtension: knative-operator v1.18.0 (large bundle with many CRDs, ClusterRoles, Deployments)
Steps to Reproduce
- Install operator-controller
- Create a ClusterExtension for a large bundle like knative-operator:

apiVersion: olm.operatorframework.io/v1
kind: ClusterExtension
metadata:
  name: knative-operator
spec:
  namespace: operators
  serviceAccount:
    name: knative-olm
  source:
    catalog:
      packageName: knative-operator
      version: "1.18.0"
    sourceType: Catalog

- Wait for the bundle to be fully installed (status shows Installed: True, Progressing: True, reason: Succeeded)
- Observe CPU usage of the operator-controller pod
Expected Behavior
Once the bundle is fully installed and no changes are pending, the controller should be mostly idle, only reconciling when watched resources change externally.
Actual Behavior
The controller reconciles continuously every ~1 second in a tight loop:
"reconcile starting" → "handling finalizers" → "getting installed bundle" → "resolving bundle" →
"unpacking resolved bundle" → "applying bundle contents" → "watching managed objects" →
"reconcile ending" → immediately "reconcile starting" again
The pod consumes ~960m CPU continuously. Logs show no errors — every reconcile succeeds.
Root Cause Analysis
The ApplyBundle step calls a.Apply() on every reconcile, which performs a server-side apply of all managed objects and re-establishes watches. For large bundles like knative-operator (which manages dozens of CRDs, ClusterRoles, Deployments, etc.), the act of applying/watching these objects generates watch events that immediately re-trigger the next reconcile.
Key observations from the source code (clusterextension_reconcile_steps.go on main):
- No short-circuit in ApplyBundle: there is no check to skip Apply() when the bundle is unchanged and already fully applied
- No requeue delay: ReconcileSteps.Reconcile() returns ctrl.Result{}, never ctrl.Result{RequeueAfter: ...}, so nothing spaces out reconciles and each watch event triggers the next one immediately
- UnpackBundle has a cache optimization (the bundleUnchanged check), but ApplyBundle has no equivalent
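
To make the suspected feedback loop concrete, here is a minimal simulation. It is not operator-controller code: `simulateReconciles` and its parameters are hypothetical, and it assumes, as hypothesized above, that every apply produces at least one watch event that enqueues another reconcile.

```go
package main

// simulateReconciles models a work queue in which every apply emits one watch
// event that enqueues one more reconcile. It returns how many reconciles ran
// before the queue drained, capped at maxReconciles.
func simulateReconciles(skipWhenUnchanged bool, maxReconciles int) int {
	pending := 1     // initial trigger, e.g. the ClusterExtension was created
	applied := false // whether the bundle has already been applied once
	runs := 0
	for pending > 0 && runs < maxReconciles {
		pending--
		runs++
		if skipWhenUnchanged && applied {
			// Nothing to do, so no managed object is touched and
			// no new watch event is produced.
			continue
		}
		applied = true
		pending++ // the apply touches managed objects; their watch fires
	}
	return runs
}
```

Under these assumptions, `simulateReconciles(false, 1000)` runs until it hits the cap of 1000, while `simulateReconciles(true, 1000)` drains the queue after 2 reconciles.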
Suggested Fix
One or more of:
- Skip Apply() when the bundle is unchanged and fully installed: if rolloutSucceeded was already true in a previous reconcile and the resolved bundle hasn't changed, skip the apply step entirely
- Add a RequeueAfter delay after a successful reconcile with no changes, to break the tight loop (e.g., ctrl.Result{RequeueAfter: 5 * time.Minute})
- Deduplicate watch events: avoid re-establishing watches on every reconcile if the set of managed objects hasn't changed
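
A rough sketch of how the first two suggestions could combine is shown below. It uses only the standard library, so `result` is a stand-in for `ctrl.Result`. The names `appliedState`, `applyFunc`, `applyBundle`, and `defaultResync` are hypothetical and do not come from the operator-controller source. The sketch also assumes the resolved bundle can be identified by a digest string.

```go
package main

import "time"

// defaultResync is the interval suggested in the report; tune as needed.
const defaultResync = 5 * time.Minute

// result is a stand-in for controller-runtime's ctrl.Result so that this
// sketch compiles with the standard library alone.
type result struct {
	RequeueAfter time.Duration
}

// appliedState is a hypothetical record of the last apply for one
// ClusterExtension.
type appliedState struct {
	bundleDigest     string
	rolloutSucceeded bool
	appliedAt        time.Time
}

// applyFunc is a placeholder for the real server-side apply. It reports
// whether the rollout has completed.
type applyFunc func(bundleDigest string) (rolloutSucceeded bool, err error)

// applyBundle skips the apply when the same bundle was already rolled out
// less than resync ago; otherwise it applies and records the outcome.
func applyBundle(apply applyFunc, last *appliedState, resolvedDigest string,
	now time.Time, resync time.Duration) (result, error) {
	age := now.Sub(last.appliedAt)
	unchanged := last.rolloutSucceeded && last.bundleDigest == resolvedDigest
	if unchanged && age < resync {
		// Short-circuit: come back only when the resync interval is due.
		return result{RequeueAfter: resync - age}, nil
	}
	ok, err := apply(resolvedDigest)
	if err != nil {
		return result{}, err
	}
	last.bundleDigest = resolvedDigest
	last.rolloutSucceeded = ok
	last.appliedAt = now
	// If ok is false, the next reconcile applies again because the
	// short-circuit requires rolloutSucceeded.
	return result{RequeueAfter: resync}, nil
}
```

The periodic re-apply is a deliberate choice in this sketch: skipping Apply() forever would also stop the controller from correcting drift in managed objects, so the apply is forced once the resync interval has elapsed.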
Impact
- ~1 CPU core wasted per ClusterExtension with a large bundle
- Increased API server load from continuous server-side applies
- Unnecessary network traffic and etcd writes