feat: NodeUpdate plans for image rollout orchestration#94
Merged
Conversation
When a Running node's spec.image differs from status.currentImage, the
planner detects the drift and builds a NodeUpdate plan to roll out the
update.
NodeUpdate plan tasks:
1. apply-statefulset — pushes the new image to the StatefulSet via SSA
2. apply-service — keeps the headless Service in sync
3. observe-image — polls StatefulSet rollout until complete, stamps
status.currentImage in-memory
4. mark-ready — re-initializes the sidecar on the new pod
Planner condition management:
- ResolvePlan sets NodeUpdateInProgress=True when building the plan
- On plan completion: clears with Reason=UpdateComplete
- On plan failure: clears with Reason=UpdateFailed, then immediately
retries (builds a new plan since drift persists)
- handleTerminalPlan clears completed/failed plans before building
the next one
Drift-gated plans:
- No drift (spec.image == status.currentImage) → no plan, steady state
- Image drift → NodeUpdate plan
Reconciler simplification:
- Removed observeCurrentImage method — replaced by observe-image task
- Running-phase post-plan block removed — fully plan-driven
- Deleted 7 observeCurrentImage unit tests (coverage moved to task tests)
New tests:
- observe-image task: rollout complete, rolling update, stale generation,
StatefulSet not found
- NodeUpdate planner: drift detection, plan composition, condition
lifecycle (start, complete, fail/retry), terminal plan handling
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fecycle Fixes from adversarial review by Tide specialists: CRITICAL: clearCompletedConvergencePlan was nilling NodeUpdate plans, preventing handleTerminalPlan from ever observing the completed plan and clearing NodeUpdateInProgress. Now renamed to clearCompletedPlanIfSafe and checks isNodeUpdatePlan — NodeUpdate plans stay Complete for one reconcile so the planner can observe them and clear conditions. IMPORTANT: Document observe-image's Status() mutation of currentImage as a sanctioned exception to the "tasks own resources" rule, explaining why the mutation must happen at observation time. MINOR: Fix stale "convergence plans" reference in isNodeUpdatePlan comment. Update test to verify NodeUpdate plans stay Complete (not nilled). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham
commented
Apr 16, 2026
…lans Remove clearCompletedPlanIfSafe (formerly clearCompletedConvergencePlan) from the executor entirely. The executor now only marks plans Complete or Failed — it never nils them. The planner's handleTerminalPlan handles ALL plan cleanup: clearing conditions and nilling terminal plans. This is a cleaner separation: the executor drives tasks to completion, the planner decides what to do with the result. No type-specific checks (isNodeUpdatePlan) in the executor. No plan lifecycle management outside the planner. Also removes the now-unused currentPhase helper from the executor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address PR feedback: Execute() now polls the StatefulSet and stamps currentImage when the rollout completes. Status() just returns the cached result via DefaultStatus(). This keeps Execute as the action method and Status as the query method. When the rollout is not yet complete, Execute returns nil (no error). The executor will re-invoke on the next reconcile since the task remains Pending. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham
commented
Apr 16, 2026
bdchatham
commented
Apr 16, 2026
bdchatham
commented
Apr 16, 2026
Move condition setting from post-processing inference to direct mutation at the point of decision: - buildNodeUpdatePlan sets NodeUpdateInProgress=True directly on the node when it builds the plan — no separate applyPlanStartConditions step that scans tasks to infer the plan type - handleTerminalPlan sets NodeUpdateInProgress=False via the shared setNodeUpdateCondition helper — no clearNodeUpdateCondition that checked whether the condition was already set Deleted: isNodeUpdatePlan, applyPlanStartConditions, clearNodeUpdateCondition. These inferred plan type from task contents and had special-case logic for condition state. The new approach is simpler: the code that creates the scenario sets the condition. The code that observes the outcome clears it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham
commented
Apr 16, 2026
handleTerminalPlan was setting NodeUpdateInProgress=False for ALL terminal plans, including init plans that never had the condition set. This would add a spurious False condition after init completion. Now checks hasNodeUpdateCondition before clearing — only clears the condition if it was previously set to True by buildNodeUpdatePlan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the mark-ready sidecar task type constant to the task package (re-exported from sidecar) so buildNodeUpdatePlan references all task types via task.TaskType* consistently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a9ebeae to
64ad35b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a Running node's
spec.imagediffers fromstatus.currentImage, the planner detects the drift and builds a NodeUpdate plan to roll out the update. This is the feature that the plan-centric refactor was building toward.NodeUpdate plan
observe-imageis a new async controller-side task that polls the StatefulSet rollout and stampsstatus.currentImagewhen completemark-readyre-initializes the sidecar on the new podCondition lifecycle
NodeUpdateInProgress=True, Reason=UpdateStartedNodeUpdateInProgress=False, Reason=UpdateCompleteNodeUpdateInProgress=False, Reason=UpdateFailed→ immediate retry (drift persists)The
SeiNodeDeploymentcontroller can watchNodeUpdateInProgressto coordinate fleet-wide rollouts.Drift-gated plans
Reconciler simplification
observeCurrentImagemethod removed — replaced byobserve-imagetaskTest plan
make build— compiles cleanlymake test— all tests pass (planner coverage 51.4%, task 33.7%)make lint— zero issues🤖 Generated with Claude Code