Skip to content

OADP-7541: fix non-deterministic matchExpressions ordering causing node-agent restarts#2234

Open
kaovilai wants to merge 4 commits into
openshift:oadp-devfrom
kaovilai:OADP-7541-fix-nodeagent-matchexpressions-ordering
Open

OADP-7541: fix non-deterministic matchExpressions ordering causing node-agent restarts#2234
kaovilai wants to merge 4 commits into
openshift:oadp-devfrom
kaovilai:OADP-7541-fix-nodeagent-matchexpressions-ordering

Conversation

@kaovilai

@kaovilai kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

  • When a DPA configures nodeAgent.podConfig.nodeSelector with multiple labels, kube.ToSystemAffinity() iterates Go maps in random order to build matchExpressions. Kubernetes treats arrays as ordered, so each order flip registers as a spec change, triggering DaemonSet rollouts every reconciliation (~30s). Customer observed generation 13858 on node-agent while velero deployment stayed at 2.
  • Adds common.SortNodeSelectorTerms() helper that sorts MatchExpressions by key within each NodeSelectorTerm
  • Applied to both node-agent DaemonSet and Velero Deployment code paths
  • Unit tests run 100 iterations with multi-label selectors to verify deterministic ordering

Test plan

  • New unit tests pass: TestNodeAgentAffinityMatchExpressionsDeterministicOrder (3 sub-cases × 100 iterations)
  • New unit tests pass: TestVeleroAffinityMatchExpressionsDeterministicOrder (100 iterations)
  • Tests run with -count=5 for additional confidence
  • All files compile cleanly
  • E2E: deploy with multi-label nodeSelector, verify node-agent generation stays stable

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Node affinity match expressions for NodeAgent and Velero are now deterministically ordered, preventing non-deterministic pod/deployment spec diffs and ensuring consistent scheduling.
  • Tests

    • Added unit tests that repeatedly validate deterministic ordering of node affinity match expressions across multi-label scenarios to guard against nondeterminism.

…de-agent restarts

When a DPA configures nodeAgent.podConfig.nodeSelector with multiple
labels, kube.ToSystemAffinity() iterates Go maps in random order to
build matchExpressions. Kubernetes treats arrays as ordered, so each
order flip registers as a spec change, triggering DaemonSet rollouts
every reconciliation (~30s). Customer observed generation 13858 on
node-agent while velero deployment stayed at 2.

Sort matchExpressions by key after building them from LoadAffinityConfig
to ensure deterministic ordering. Applied to both node-agent DaemonSet
and Velero Deployment code paths.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 8, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 8, 2026

Copy link
Copy Markdown

@kaovilai: This pull request references OADP-7541 which is a valid jira issue.

Details

In response to this:

Summary

  • When a DPA configures nodeAgent.podConfig.nodeSelector with multiple labels, kube.ToSystemAffinity() iterates Go maps in random order to build matchExpressions. Kubernetes treats arrays as ordered, so each order flip registers as a spec change, triggering DaemonSet rollouts every reconciliation (~30s). Customer observed generation 13858 on node-agent while velero deployment stayed at 2.
  • Adds common.SortNodeSelectorTerms() helper that sorts MatchExpressions by key within each NodeSelectorTerm
  • Applied to both node-agent DaemonSet and Velero Deployment code paths
  • Unit tests run 100 iterations with multi-label selectors to verify deterministic ordering

Test plan

  • New unit tests pass: TestNodeAgentAffinityMatchExpressionsDeterministicOrder (3 sub-cases × 100 iterations)
  • New unit tests pass: TestVeleroAffinityMatchExpressionsDeterministicOrder (100 iterations)
  • Tests run with -count=5 for additional confidence
  • All files compile cleanly
  • E2E: deploy with multi-label nodeSelector, verify node-agent generation stays stable

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d17cc83-e233-409c-a684-1b7d35aeebb0

📥 Commits

Reviewing files that changed from the base of the PR and between 7c88cb6 and 516587c.

📒 Files selected for processing (1)
  • pkg/common/common.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/common/common.go

Walkthrough

Sort NodeSelectorTerm MatchExpressions/MatchFields by requirement key via new common.SortNodeSelectorTerms; apply the sorter in NodeAgent and Velero affinity assembly and add tests asserting deterministic ordering.

Changes

Deterministic node affinity ordering

Layer / File(s) Summary
Sorting utility for node selector terms
pkg/common/common.go
Add cmp and slices imports and implement SortNodeSelectorTerms to sort each corev1.NodeSelectorTerm's MatchExpressions and MatchFields by requirement Key.
NodeAgent DaemonSet affinity determinism
internal/controller/nodeagent.go, internal/controller/nodeagent_test.go
Call common.SortNodeSelectorTerms(terms) in ReconcileNodeAgentDaemonset before applying to DaemonSet affinity; add TestNodeAgentAffinityMatchExpressionsDeterministicOrder to verify sorted ordering across multiple label scenarios run 100 times each.
Velero Deployment affinity determinism
internal/controller/velero.go, internal/controller/velero_test.go
Call common.SortNodeSelectorTerms(terms) in customizeVeleroDeployment before applying to pod affinity; import testify/require, adjust expected MatchExpressions ordering in an existing complex-case test, and add TestVeleroAffinityMatchExpressionsDeterministicOrder to assert deterministic ordering across iterations.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing non-deterministic matchExpressions ordering that caused node-agent restarts, directly matching the changeset's core objective.
Description check ✅ Passed The PR description provides comprehensive context on the problem, the solution, and testing approach, covering both required template sections with specific technical details.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are static and deterministic with no dynamic information (UUIDs, timestamps, generated suffixes, pod/node names, or IPs). Test titles clearly describe what they validate.
Test Structure And Quality ✅ Passed Custom check requires review of Ginkgo test code, but PR adds standard Go tests using testing.T pattern, not Ginkgo (no Describe/It/BeforeEach/AfterEach blocks). Check not applicable.
Microshift Test Compatibility ✅ Passed The added tests are standard Go unit tests (using testing.T), not Ginkgo e2e tests. The check applies only to Ginkgo e2e tests; therefore it is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR adds standard Go unit tests (not Ginkgo e2e tests). The check applies only to Ginkgo tests (It(), Describe(), Context(), When()), so it does not apply here.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds deterministic sorting to user-provided LoadAffinityConfig. No new scheduling constraints, hardcoded node selectors, or topology assumptions introduced.
Ote Binary Stdout Contract ✅ Passed PR adds deterministic sorting logic and tests with no process-level stdout writes. All changes are contained within regular functions and test code that doesn't violate OTE Binary Stdout Contract.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only standard Go unit tests (func Test*), not Ginkgo e2e tests. Tests operate on in-memory Kubernetes objects with no IPv4 assumptions or external connectivity requirements.
No-Weak-Crypto ✅ Passed PR adds sorting of Kubernetes NodeSelectorTerms using standard library functions; no weak crypto patterns, custom crypto implementations, or insecure secret comparisons detected.
Container-Privileges ✅ Passed PR introduces only NodeSelectorTerms sorting logic for determinism; no privileged, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation settings are added.
No-Sensitive-Data-In-Logs ✅ Passed No new logging statements that expose sensitive data were added. The PR adds only sorting function calls with no logging or debug output.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from mrnold and sseago June 8, 2026 20:54
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaovilai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 8, 2026
@kaovilai

kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

/cherry-pick oadp-1.6

@openshift-cherrypick-robot

Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.6 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaovilai

kaovilai commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

/cherry-pick oadp-1.5

@openshift-cherrypick-robot

Copy link
Copy Markdown
Contributor

@kaovilai: once the present PR merges, I will cherry-pick it on top of oadp-1.5 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick oadp-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kaovilai and others added 2 commits June 9, 2026 12:10
Sort expected matchExpressions alphabetically by key to match the
deterministic ordering introduced in the previous commit.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai

kaovilai commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

/retest-required

Note

Responses generated with Claude

5.0-e2e-test-aws failed due to infrastructure flake — test pod received SIGTERM 1s after startup (exit code 130), no tests executed. All other E2E jobs passed.

terms = append(terms, a.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms...)
}
}
common.SortNodeSelectorTerms(terms)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered filing an upstream Velero issue (or PR) to sort there? That would fix it for all consumers, not just OADP. This workaround is fine to merge now, but an upstream fix would let us eventually drop it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

Comment thread pkg/common/common.go
// Go map iteration randomness in upstream kube.ToSystemAffinity
// produces non-deterministic matchExpressions order, which Kubernetes
// treats as a spec change and triggers unnecessary DaemonSet rollouts.
func SortNodeSelectorTerms(terms []corev1.NodeSelectorTerm) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortNodeSelectorTerms only sorts MatchExpressions but the name suggests it handles the full NodeSelectorTerm. Two options:

  1. Also sort MatchFields for completeness (same pattern, one extra loop).
  2. Rename to something like SortMatchExpressions to match what it actually does.

I'd lean toward option 1 since it's trivial and future-proofs us if MatchFields ever comes into play.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added MatchFields sorting too in 516587c. Extracted the comparator to avoid duplication.

Note

Responses generated with Claude

Address review feedback: sort both MatchExpressions and MatchFields
by key for completeness and future-proofing.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown

@kaovilai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.23-e2e-test-aws 516587c link false /test 4.23-e2e-test-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants