
feat: NodeUpdate plans for image rollout orchestration #94

Merged
bdchatham merged 7 commits into main from feat/node-update-plans
Apr 16, 2026

Conversation

@bdchatham
Collaborator

@bdchatham bdchatham commented Apr 16, 2026

Summary

When a Running node's spec.image differs from status.currentImage, the planner detects the drift and builds a NodeUpdate plan to roll out the update. This is the feature that the plan-centric refactor was building toward.

NodeUpdate plan

apply-statefulset → apply-service → observe-image → mark-ready
  • observe-image is a new async controller-side task that polls the StatefulSet rollout and stamps status.currentImage when complete
  • mark-ready re-initializes the sidecar on the new pod

Condition lifecycle

  • Plan creation: NodeUpdateInProgress=True, Reason=UpdateStarted
  • Plan completion: NodeUpdateInProgress=False, Reason=UpdateComplete
  • Plan failure: NodeUpdateInProgress=False, Reason=UpdateFailed → immediate retry (drift persists)

The SeiNodeDeployment controller can watch NodeUpdateInProgress to coordinate fleet-wide rollouts.
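The lifecycle above can be sketched as a single set-condition helper that both the plan builder and the terminal-plan handler call. This is a minimal illustration with simplified stand-in types, not the controller's actual code; `Condition` here is a stripped-down stand-in for `metav1.Condition`, and the helper name mirrors the `setNodeUpdateCondition` mentioned later in this PR.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status bool // true == "True", false == "False"
	Reason string
}

// Node is an illustrative stand-in for the custom resource's status.
type Node struct {
	Conditions []Condition
}

// setNodeUpdateCondition upserts the NodeUpdateInProgress condition.
// Plan creation calls it with (true, "UpdateStarted"); plan completion
// and failure call it with (false, "UpdateComplete"/"UpdateFailed").
func setNodeUpdateCondition(n *Node, inProgress bool, reason string) {
	c := Condition{Type: "NodeUpdateInProgress", Status: inProgress, Reason: reason}
	for i := range n.Conditions {
		if n.Conditions[i].Type == c.Type {
			n.Conditions[i] = c
			return
		}
	}
	n.Conditions = append(n.Conditions, c)
}

func main() {
	var n Node
	setNodeUpdateCondition(&n, true, "UpdateStarted")   // plan creation
	setNodeUpdateCondition(&n, false, "UpdateComplete") // plan completion
	fmt.Println(n.Conditions)
}
```

A fleet controller watching this condition only needs to check `Status` and `Reason` to decide whether a node is mid-rollout.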

Drift-gated plans

  • No drift → no plan → steady state (no unnecessary API patches)
  • Image drift → NodeUpdate plan
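The drift gate can be sketched as follows. Types and field names are simplified stand-ins for the real spec/status structs (drift is `spec.image != status.currentImage`); the function name mirrors the `buildRunningPlan` referenced in the test plan below.

```go
package main

import "fmt"

// Node is an illustrative stand-in: SpecImage mirrors spec.image,
// CurrentImage mirrors status.currentImage.
type Node struct {
	SpecImage    string
	CurrentImage string
}

type Plan struct {
	Tasks []string
}

// buildRunningPlan returns nil in steady state (no drift, so no plan and
// no unnecessary API patches) and a NodeUpdate plan when the images differ.
func buildRunningPlan(n Node) *Plan {
	if n.SpecImage == n.CurrentImage {
		return nil // no drift -> no plan -> steady state
	}
	return &Plan{Tasks: []string{
		"apply-statefulset",
		"apply-service",
		"observe-image",
		"mark-ready",
	}}
}

func main() {
	steady := Node{SpecImage: "sei:v2", CurrentImage: "sei:v2"}
	drifted := Node{SpecImage: "sei:v3", CurrentImage: "sei:v2"}
	fmt.Println(buildRunningPlan(steady) == nil) // true: steady state
	fmt.Println(buildRunningPlan(drifted).Tasks)
}
```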

Reconciler simplification

  • observeCurrentImage method removed — replaced by observe-image task
  • Running-phase post-plan block removed — fully plan-driven

Test plan

  • make build — compiles cleanly
  • make test — all tests pass (planner coverage 51.4%, task 33.7%)
  • make lint — zero issues
  • New tests: observe-image task (4 cases), NodeUpdate planner (15+ cases)
  • Review: observe-image task polls correctly
  • Review: condition lifecycle (start/complete/fail/retry)
  • Review: drift detection in buildRunningPlan

🤖 Generated with Claude Code

bdchatham and others added 2 commits April 16, 2026 10:33
When a Running node's spec.image differs from status.currentImage, the
planner detects the drift and builds a NodeUpdate plan to roll out the
update.

NodeUpdate plan tasks:
  1. apply-statefulset — pushes the new image to the StatefulSet via SSA
  2. apply-service — keeps the headless Service in sync
  3. observe-image — polls StatefulSet rollout until complete, stamps
     status.currentImage in memory
  4. mark-ready — re-initializes the sidecar on the new pod

Planner condition management:
  - ResolvePlan sets NodeUpdateInProgress=True when building the plan
  - On plan completion: clears with Reason=UpdateComplete
  - On plan failure: clears with Reason=UpdateFailed, then immediately
    retries (builds a new plan since drift persists)
  - handleTerminalPlan clears completed/failed plans before building
    the next one

Drift-gated plans:
  - No drift (spec.image == status.currentImage) → no plan, steady state
  - Image drift → NodeUpdate plan

Reconciler simplification:
  - Removed observeCurrentImage method — replaced by observe-image task
  - Running-phase post-plan block removed — fully plan-driven
  - Deleted 7 observeCurrentImage unit tests (coverage moved to task tests)

New tests:
  - observe-image task: rollout complete, rolling update, stale generation,
    StatefulSet not found
  - NodeUpdate planner: drift detection, plan composition, condition
    lifecycle (start, complete, fail/retry), terminal plan handling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fecycle

Fixes from adversarial review by Tide specialists:

CRITICAL: clearCompletedConvergencePlan was nilling NodeUpdate plans,
preventing handleTerminalPlan from ever observing the completed plan
and clearing NodeUpdateInProgress. Now renamed to clearCompletedPlanIfSafe
and checks isNodeUpdatePlan — NodeUpdate plans stay Complete for one
reconcile so the planner can observe them and clear conditions.

IMPORTANT: Document observe-image's Status() mutation of currentImage as
a sanctioned exception to the "tasks own resources" rule, explaining why
the mutation must happen at observation time.

MINOR: Fix stale "convergence plans" reference in isNodeUpdatePlan comment.
Update test to verify NodeUpdate plans stay Complete (not nilled).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread internal/task/observe_image.go
bdchatham and others added 2 commits April 16, 2026 11:07
…lans

Remove clearCompletedPlanIfSafe (formerly clearCompletedConvergencePlan)
from the executor entirely. The executor now only marks plans Complete or
Failed — it never nils them. The planner's handleTerminalPlan handles
ALL plan cleanup: clearing conditions and nilling terminal plans.

This is a cleaner separation: the executor drives tasks to completion,
the planner decides what to do with the result. No type-specific checks
(isNodeUpdatePlan) in the executor. No plan lifecycle management outside
the planner.

Also removes the now-unused currentPhase helper from the executor.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address PR feedback: Execute() now polls the StatefulSet and stamps
currentImage when the rollout completes. Status() just returns the
cached result via DefaultStatus(). This keeps Execute as the action
method and Status as the query method.

When the rollout is not yet complete, Execute returns nil (no error).
The executor will re-invoke on the next reconcile since the task
remains Pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
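The Execute-as-action / Status-as-query split described in that commit can be sketched like this. The rollout poll is faked with a counter; the real task inspects the StatefulSet's replica counts, and all names here are illustrative rather than the actual task interface.

```go
package main

import "fmt"

// observeImageTask is an illustrative stand-in for the observe-image task.
type observeImageTask struct {
	targetImage  string
	currentImage string
	done         bool
}

// rolloutComplete fakes polling the StatefulSet rollout; the real check
// compares updated/ready replica counts against the spec.
func rolloutComplete(attempt int) bool { return attempt >= 2 }

// Execute is the action method: it polls the rollout and stamps
// currentImage once complete. When the rollout is not yet complete it
// returns nil without marking the task done, so the task stays Pending
// and the executor re-invokes it on the next reconcile.
func (t *observeImageTask) Execute(attempt int) error {
	if !rolloutComplete(attempt) {
		return nil // not complete yet; remain Pending
	}
	t.currentImage = t.targetImage
	t.done = true
	return nil
}

// Status is the query method: it only returns the cached result.
func (t *observeImageTask) Status() bool { return t.done }

func main() {
	t := &observeImageTask{targetImage: "sei:v3"}
	for attempt := 0; !t.Status(); attempt++ {
		t.Execute(attempt)
	}
	fmt.Println(t.currentImage)
}
```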
Comment thread internal/planner/planner.go Outdated
Comment thread internal/planner/planner.go
Comment thread internal/planner/planner.go Outdated
Move condition setting from post-processing inference to direct mutation
at the point of decision:

- buildNodeUpdatePlan sets NodeUpdateInProgress=True directly on the
  node when it builds the plan — no separate applyPlanStartConditions
  step that scans tasks to infer the plan type
- handleTerminalPlan sets NodeUpdateInProgress=False via the shared
  setNodeUpdateCondition helper — no clearNodeUpdateCondition that
  checked whether the condition was already set

Deleted: isNodeUpdatePlan, applyPlanStartConditions, clearNodeUpdateCondition.
These inferred plan type from task contents and had special-case logic for
condition state. The new approach is simpler: the code that creates the
scenario sets the condition. The code that observes the outcome clears it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread internal/planner/planner.go Outdated
bdchatham and others added 2 commits April 16, 2026 12:26
handleTerminalPlan was setting NodeUpdateInProgress=False for ALL
terminal plans, including init plans that never had the condition set.
This would add a spurious False condition after init completion.

Now checks hasNodeUpdateCondition before clearing — only clears the
condition if it was previously set to True by buildNodeUpdatePlan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
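The guard from that fix can be sketched as below: clearing only happens when the condition was previously set to True, so init plans never gain a spurious False condition. Types are simplified stand-ins and the function bodies are illustrative, not the controller's actual code.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// hasNodeUpdateCondition reports whether NodeUpdateInProgress was
// previously set to True (by buildNodeUpdatePlan).
func hasNodeUpdateCondition(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "NodeUpdateInProgress" && c.Status {
			return true
		}
	}
	return false
}

// handleTerminalPlan clears the condition only if it was set: terminal
// init plans pass through untouched.
func handleTerminalPlan(conds []Condition, reason string) []Condition {
	if !hasNodeUpdateCondition(conds) {
		return conds // e.g. an init plan: nothing to clear
	}
	for i := range conds {
		if conds[i].Type == "NodeUpdateInProgress" {
			conds[i].Status = false
			conds[i].Reason = reason
		}
	}
	return conds
}

func main() {
	fmt.Println(len(handleTerminalPlan(nil, "UpdateComplete"))) // 0: no spurious condition
	update := []Condition{{Type: "NodeUpdateInProgress", Status: true, Reason: "UpdateStarted"}}
	fmt.Println(handleTerminalPlan(update, "UpdateComplete")[0].Reason)
}
```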
Move the mark-ready sidecar task type constant to the task package
(re-exported from sidecar) so buildNodeUpdatePlan references all task
types via task.TaskType* consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham force-pushed the feat/node-update-plans branch from a9ebeae to 64ad35b Compare April 16, 2026 19:30
@bdchatham bdchatham merged commit ccc7684 into main Apr 16, 2026
2 checks passed
