
feat: NodeUpdate plans for image rollout orchestration #94

Merged
bdchatham merged 7 commits into main from feat/node-update-plans
Apr 16, 2026

Conversation

@bdchatham
Collaborator

@bdchatham bdchatham commented Apr 16, 2026

Summary

When a Running node's spec.image differs from status.currentImage, the planner detects the drift and builds a NodeUpdate plan to roll out the update. This is the feature that the plan-centric refactor was building toward.

NodeUpdate plan

apply-statefulset → apply-service → observe-image → mark-ready
  • observe-image is a new async controller-side task that polls the StatefulSet rollout and stamps status.currentImage when complete
  • mark-ready re-initializes the sidecar on the new pod

Condition lifecycle

  • Plan creation: NodeUpdateInProgress=True, Reason=UpdateStarted
  • Plan completion: NodeUpdateInProgress=False, Reason=UpdateComplete
  • Plan failure: NodeUpdateInProgress=False, Reason=UpdateFailed → immediate retry (drift persists)

The SeiNodeDeployment controller can watch NodeUpdateInProgress to coordinate fleet-wide rollouts.
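The lifecycle above can be sketched as a single set-condition helper that both the plan builder and the terminal-plan handler call. This is a minimal illustration with simplified stand-in types, not the controller's actual code; `Condition` here is a stripped-down stand-in for `metav1.Condition`, and the helper name mirrors the `setNodeUpdateCondition` mentioned later in this PR.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status bool // true == "True", false == "False"
	Reason string
}

// Node is an illustrative stand-in for the custom resource's status.
type Node struct {
	Conditions []Condition
}

// setNodeUpdateCondition upserts the NodeUpdateInProgress condition.
// Plan creation calls it with (true, "UpdateStarted"); plan completion
// and failure call it with (false, "UpdateComplete"/"UpdateFailed").
func setNodeUpdateCondition(n *Node, inProgress bool, reason string) {
	c := Condition{Type: "NodeUpdateInProgress", Status: inProgress, Reason: reason}
	for i := range n.Conditions {
		if n.Conditions[i].Type == c.Type {
			n.Conditions[i] = c
			return
		}
	}
	n.Conditions = append(n.Conditions, c)
}

func main() {
	var n Node
	setNodeUpdateCondition(&n, true, "UpdateStarted")   // plan creation
	setNodeUpdateCondition(&n, false, "UpdateComplete") // plan completion
	fmt.Println(n.Conditions)
}
```

A fleet controller watching this condition only needs to check `Status` and `Reason` to decide whether a node is mid-rollout.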

Drift-gated plans

  • No drift → no plan → steady state (no unnecessary API patches)
  • Image drift → NodeUpdate plan
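The drift gate can be sketched as follows. Types and field names are simplified stand-ins for the real spec/status structs (drift is `spec.image != status.currentImage`); the function name mirrors the `buildRunningPlan` referenced in the test plan below.

```go
package main

import "fmt"

// Node is an illustrative stand-in: SpecImage mirrors spec.image,
// CurrentImage mirrors status.currentImage.
type Node struct {
	SpecImage    string
	CurrentImage string
}

type Plan struct {
	Tasks []string
}

// buildRunningPlan returns nil in steady state (no drift, so no plan and
// no unnecessary API patches) and a NodeUpdate plan when the images differ.
func buildRunningPlan(n Node) *Plan {
	if n.SpecImage == n.CurrentImage {
		return nil // no drift -> no plan -> steady state
	}
	return &Plan{Tasks: []string{
		"apply-statefulset",
		"apply-service",
		"observe-image",
		"mark-ready",
	}}
}

func main() {
	steady := Node{SpecImage: "sei:v2", CurrentImage: "sei:v2"}
	drifted := Node{SpecImage: "sei:v3", CurrentImage: "sei:v2"}
	fmt.Println(buildRunningPlan(steady) == nil) // true: steady state
	fmt.Println(buildRunningPlan(drifted).Tasks)
}
```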

Reconciler simplification

  • observeCurrentImage method removed — replaced by observe-image task
  • Running-phase post-plan block removed — fully plan-driven

Test plan

  • make build — compiles cleanly
  • make test — all tests pass (planner coverage 51.4%, task 33.7%)
  • make lint — zero issues
  • New tests: observe-image task (4 cases), NodeUpdate planner (15+ cases)
  • Review: observe-image task polls correctly
  • Review: condition lifecycle (start/complete/fail/retry)
  • Review: drift detection in buildRunningPlan

🤖 Generated with Claude Code

bdchatham and others added 2 commits April 16, 2026 10:33
When a Running node's spec.image differs from status.currentImage, the
planner detects the drift and builds a NodeUpdate plan to roll out the
update.

NodeUpdate plan tasks:
  1. apply-statefulset — pushes the new image to the StatefulSet via SSA
  2. apply-service — keeps the headless Service in sync
  3. observe-image — polls StatefulSet rollout until complete, stamps
     status.currentImage in memory
  4. mark-ready — re-initializes the sidecar on the new pod

Planner condition management:
  - ResolvePlan sets NodeUpdateInProgress=True when building the plan
  - On plan completion: clears with Reason=UpdateComplete
  - On plan failure: clears with Reason=UpdateFailed, then immediately
    retries (builds a new plan since drift persists)
  - handleTerminalPlan clears completed/failed plans before building
    the next one

Drift-gated plans:
  - No drift (spec.image == status.currentImage) → no plan, steady state
  - Image drift → NodeUpdate plan

Reconciler simplification:
  - Removed observeCurrentImage method — replaced by observe-image task
  - Running-phase post-plan block removed — fully plan-driven
  - Deleted 7 observeCurrentImage unit tests (coverage moved to task tests)

New tests:
  - observe-image task: rollout complete, rolling update, stale generation,
    StatefulSet not found
  - NodeUpdate planner: drift detection, plan composition, condition
    lifecycle (start, complete, fail/retry), terminal plan handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fecycle

Fixes from adversarial review by Tide specialists:

CRITICAL: clearCompletedConvergencePlan was nilling NodeUpdate plans,
preventing handleTerminalPlan from ever observing the completed plan
and clearing NodeUpdateInProgress. Now renamed to clearCompletedPlanIfSafe
and checks isNodeUpdatePlan — NodeUpdate plans stay Complete for one
reconcile so the planner can observe them and clear conditions.

IMPORTANT: Document observe-image's Status() mutation of currentImage as
a sanctioned exception to the "tasks own resources" rule, explaining why
the mutation must happen at observation time.

MINOR: Fix stale "convergence plans" reference in isNodeUpdatePlan comment.
Update test to verify NodeUpdate plans stay Complete (not nilled).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread internal/task/observe_image.go
bdchatham and others added 2 commits April 16, 2026 11:07
…lans

Remove clearCompletedPlanIfSafe (formerly clearCompletedConvergencePlan)
from the executor entirely. The executor now only marks plans Complete or
Failed — it never nils them. The planner's handleTerminalPlan handles
ALL plan cleanup: clearing conditions and nilling terminal plans.

This is a cleaner separation: the executor drives tasks to completion,
the planner decides what to do with the result. No type-specific checks
(isNodeUpdatePlan) in the executor. No plan lifecycle management outside
the planner.

Also removes the now-unused currentPhase helper from the executor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address PR feedback: Execute() now polls the StatefulSet and stamps
currentImage when the rollout completes. Status() just returns the
cached result via DefaultStatus(). This keeps Execute as the action
method and Status as the query method.

When the rollout is not yet complete, Execute returns nil (no error).
The executor will re-invoke on the next reconcile since the task
remains Pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
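The Execute-as-action / Status-as-query split described in that commit can be sketched like this. The rollout poll is faked with a counter; the real task inspects the StatefulSet's replica counts, and all names here are illustrative rather than the actual task interface.

```go
package main

import "fmt"

// observeImageTask is an illustrative stand-in for the observe-image task.
type observeImageTask struct {
	targetImage  string
	currentImage string
	done         bool
}

// rolloutComplete fakes polling the StatefulSet rollout; the real check
// compares updated/ready replica counts against the spec.
func rolloutComplete(attempt int) bool { return attempt >= 2 }

// Execute is the action method: it polls the rollout and stamps
// currentImage once complete. When the rollout is not yet complete it
// returns nil without marking the task done, so the task stays Pending
// and the executor re-invokes it on the next reconcile.
func (t *observeImageTask) Execute(attempt int) error {
	if !rolloutComplete(attempt) {
		return nil // not complete yet; remain Pending
	}
	t.currentImage = t.targetImage
	t.done = true
	return nil
}

// Status is the query method: it only returns the cached result.
func (t *observeImageTask) Status() bool { return t.done }

func main() {
	t := &observeImageTask{targetImage: "sei:v3"}
	for attempt := 0; !t.Status(); attempt++ {
		t.Execute(attempt)
	}
	fmt.Println(t.currentImage)
}
```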
Comment thread internal/planner/planner.go Outdated
Comment thread internal/planner/planner.go
Comment thread internal/planner/planner.go Outdated
Move condition setting from post-processing inference to direct mutation
at the point of decision:

- buildNodeUpdatePlan sets NodeUpdateInProgress=True directly on the
  node when it builds the plan — no separate applyPlanStartConditions
  step that scans tasks to infer the plan type
- handleTerminalPlan sets NodeUpdateInProgress=False via the shared
  setNodeUpdateCondition helper — no clearNodeUpdateCondition that
  checked whether the condition was already set

Deleted: isNodeUpdatePlan, applyPlanStartConditions, clearNodeUpdateCondition.
These inferred plan type from task contents and had special-case logic for
condition state. The new approach is simpler: the code that creates the
scenario sets the condition. The code that observes the outcome clears it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread internal/planner/planner.go Outdated
bdchatham and others added 2 commits April 16, 2026 12:26
handleTerminalPlan was setting NodeUpdateInProgress=False for ALL
terminal plans, including init plans that never had the condition set.
This would add a spurious False condition after init completion.

Now checks hasNodeUpdateCondition before clearing — only clears the
condition if it was previously set to True by buildNodeUpdatePlan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
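The guard from that fix can be sketched as below: clearing only happens when the condition was previously set to True, so init plans never gain a spurious False condition. Types are simplified stand-ins and the function bodies are illustrative, not the controller's actual code.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// hasNodeUpdateCondition reports whether NodeUpdateInProgress was
// previously set to True (by buildNodeUpdatePlan).
func hasNodeUpdateCondition(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "NodeUpdateInProgress" && c.Status {
			return true
		}
	}
	return false
}

// handleTerminalPlan clears the condition only if it was set: terminal
// init plans pass through untouched.
func handleTerminalPlan(conds []Condition, reason string) []Condition {
	if !hasNodeUpdateCondition(conds) {
		return conds // e.g. an init plan: nothing to clear
	}
	for i := range conds {
		if conds[i].Type == "NodeUpdateInProgress" {
			conds[i].Status = false
			conds[i].Reason = reason
		}
	}
	return conds
}

func main() {
	fmt.Println(len(handleTerminalPlan(nil, "UpdateComplete"))) // 0: no spurious condition
	update := []Condition{{Type: "NodeUpdateInProgress", Status: true, Reason: "UpdateStarted"}}
	fmt.Println(handleTerminalPlan(update, "UpdateComplete")[0].Reason)
}
```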
Move the mark-ready sidecar task type constant to the task package
(re-exported from sidecar) so buildNodeUpdatePlan references all task
types via task.TaskType* consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/node-update-plans branch from a9ebeae to 64ad35b Compare April 16, 2026 19:30
@bdchatham bdchatham merged commit ccc7684 into main Apr 16, 2026
2 checks passed
