Skip to content

feat(CC-0090): Add NetworkPolicy for operator pod egress restriction#268

Open
berendt wants to merge 10 commits intomainfrom
feature/CC-0090
Open

feat(CC-0090): Add NetworkPolicy for operator pod egress restriction#268
berendt wants to merge 10 commits intomainfrom
feature/CC-0090

Conversation

@berendt
Copy link
Copy Markdown
Contributor

@berendt berendt commented Apr 23, 2026

GitHub Issue #127: Add NetworkPolicy for operator pod egress restriction

Category: security | Scope: Small

Description: The NetworkPolicy sub-reconciler creates policies for Keystone API pods but not for the operator Deployment itself. Add a NetworkPolicy that restricts the operator pod's egress to only the Kubernetes API server and the namespaces it needs to communicate with.

Motivation: The operator pod currently has unrestricted network egress. If compromised, it could be used to reach arbitrary destinations. Restricting egress to only necessary endpoints reduces the blast radius.

Affected Areas (chart-level, opt-in via networkPolicy.enabled=true — CC-0090):

  • operators/keystone/helm/keystone-operator/Chart.yaml — version bump 0.2.0 → 0.3.0 + changelog annotation
  • operators/keystone/helm/keystone-operator/templates/networkpolicy.yaml — new NetworkPolicy template (REQ-001 — REQ-008)
  • operators/keystone/helm/keystone-operator/values.yaml — new top-level networkPolicy key with kind defaults commented out (REQ-005)
  • operators/keystone/helm/keystone-operator/values.schema.json — CIDR/port validation and enabled=truekubeApiServer.{cidrs,ports} required fail-closed clause (REQ-005, REQ-006)
  • operators/keystone/helm/keystone-operator/tests/networkpolicy_test.yaml — helm-unittest coverage for all render branches
  • operators/keystone/helm/keystone-operator/tests/schema_validation_test.yaml — schema-level validation cases for the new key
  • tests/e2e/keystone-operator/network-policy-egress/ — chainsaw E2E installing a second Helm release with networkPolicy.enabled=true (REQ-010, REQ-011); note: kindnet does not enforce NetworkPolicy, so this covers chart rendering + install-under-policy only, not packet-level enforcement
  • tests/e2e-chaos/operator-pod-crash/chainsaw-test.yaml — adjusted to coexist with the new restricted release
  • docs/reference/keystone-operator-networkpolicy.md, docs/how-to/enable-keystone-operator-networkpolicy.md, docs/reference/keystone-reconciler.md, docs/.vitepress/config.ts — reference + how-to docs and nav wiring (REQ-012, REQ-013)

Note: the per-CR reconciler operators/keystone/internal/controller/reconcile_networkpolicy.go is not modified — this PR hardens the operator pod itself via a chart-level NetworkPolicy, which is orthogonal to the per-Keystone-CR NetworkPolicies that reconciler continues to manage.

Acceptance Criteria:

  • NetworkPolicy restricts operator pod egress to kube-apiserver CIDR
  • DNS egress to kube-dns is permitted
  • Operator functions correctly with the restricted policy
  • E2E tests pass with the NetworkPolicy applied (chart-render + install-under-policy scope; see caveat above)

Source: #127

berendt added a commit that referenced this pull request Apr 23, 2026
sourcery-ai[bot]

This comment was marked as off-topic.

@berendt
Copy link
Copy Markdown
Contributor Author

berendt commented Apr 23, 2026

Reviewed by planwerk-review a716ce1 with Claude Code

WARNING (2)

W-001: Chart.yaml artifacthub.io/changes not updated for CC-0090 and chart version not bumped

File: operators/keystone/helm/keystone-operator/Chart.yaml:7-13Fix: AUTO-FIX — Confidence: verified — Pattern: Helm Chart Structure Conventions

Problem: CC-0090 introduces a significant new chart feature (opt-in NetworkPolicy for operator pod hardening) and a new top-level values key (networkPolicy). The artifacthub.io/changes annotation is not updated, so Artifact Hub consumers will not see the new feature advertised. The chart version (0.2.0) is also not bumped despite the chart surface growing with a new top-level key.

Code:

version: 0.2.0
appVersion: "0.1.0"
# CC-0069, REQ-007: Helm validates values against values.schema.json automatically.
annotations:
  artifacthub.io/prerelease: "true"
  artifacthub.io/changes: |
    - kind: added
      description: JSON Schema validation for chart values (CC-0069)

Action Required: Add a new changelog entry for CC-0090 to artifacthub.io/changes and bump the chart version from 0.2.0 to 0.3.0 since the chart surface grew (new top-level key).

Suggested Fix:

version: 0.3.0
appVersion: "0.1.0"
# CC-0069, REQ-007: Helm validates values against values.schema.json automatically.
annotations:
  artifacthub.io/prerelease: "true"
  artifacthub.io/changes: |
    - kind: added
      description: JSON Schema validation for chart values (CC-0069)
    - kind: added
      description: Opt-in NetworkPolicy for operator pod egress/ingress restriction (CC-0090)

W-002: E2E test cannot exercise NetworkPolicy enforcement on kindnet CI cluster

File: tests/e2e/keystone-operator/network-policy-egress/chainsaw-test.yaml:30-99Fix: ASK — Confidence: verified — Pattern: Detect no-op retry/wait loops; Verify benchmarks measure what acceptance criteria claim

Problem: The CI kind cluster uses default kindnet, which does not enforce NetworkPolicy. The E2E test only asserts that (a) the NetworkPolicy resource is created and (b) a Keystone CR reaches Ready=True. On kindnet these pass vacuously with respect to enforcement — the policy is rendered but has no effect, so cidrs: [10.96.0.1/32] is never actually tested for correctness. Moreover, 10.96.0.1/32 is the in-cluster Service VIP; on Calico/Cilium the packet is DNAT'd before policy evaluation, meaning this CIDR would be wrong on most enforcing CNIs. Additionally, the two-operator leader-election topology (primary installed by hack/ci-deploy-operator.sh, secondary by the test) competes for the same LeaderElectionID across different namespaces, so there is no guarantee the restricted operator is the one reconciling the CR in Step 4.

Code:

networkPolicy:
  enabled: true
  kubeApiServer:
    cidrs:
      - 10.96.0.1/32   # Service VIP, not actual endpoint
    ports:
      - 6443

Action Required: Choose between: (A) add an explicit assertion in Step 4 that the keystone-operator-netpol pod is the one that transitioned the CR to Ready (e.g. scrape operator logs for the CR UID); (B) accept the limitation and add an inline comment at the top of chainsaw-test.yaml stating the test validates chart rendering/Helm install, not NetworkPolicy enforcement (requires Calico/Cilium); or (C) file a follow-up ticket to provision a Calico-enabled kind cluster in a dedicated CI job for enforcement tests.

Suggested Fix:

Add an explicit comment at the top of chainsaw-test.yaml clarifying that this test validates chart rendering and Helm install under kindnet (non-enforcing CNI), not actual NetworkPolicy enforcement; optionally add a Step 4 assertion that the restricted operator pod is the one reconciling the CR, and open a follow-up ticket for a Calico-enabled kind CI job.

INFO (2)

I-001: PR description 'Affected Areas' references unmodified reconciler file

File: PR descriptionFix: ASK — Confidence: verified

Problem: The PR description lists operators/keystone/internal/controller/reconcile_networkpolicy.go under 'Affected Areas', but the PR does not modify that file. The author chose a chart-level approach (Helm template) instead of extending the per-CR reconciler (CC-0039). This is a reasonable and explicitly justified choice (documented in the reference doc), but the PR description misrepresents the final scope.

Code:

**Affected Areas:** `operators/keystone/internal/controller/reconcile_networkpolicy.go`

Action Required: Update the PR description's 'Affected Areas' section to reflect the actual chart-level scope: operators/keystone/helm/keystone-operator/templates/networkpolicy.yaml, tests/, values.{yaml,schema.json}, and tests/e2e/keystone-operator/network-policy-egress/.

Suggested Fix:

Rewrite the 'Affected Areas' section to list: operators/keystone/helm/keystone-operator/templates/networkpolicy.yaml, operators/keystone/helm/keystone-operator/values.yaml, operators/keystone/helm/keystone-operator/values.schema.json, tests/e2e/keystone-operator/network-policy-egress/, plus related unit/chainsaw test files.

I-002: IPv6 CIDR schema pattern is structurally lax

File: operators/keystone/helm/keystone-operator/values.schema.json:21-24Fix: ASK — Confidence: verified

Problem: The IPv4 half of the CIDR regex is rigorous (octets 0-255, prefix 0-32), but the IPv6 half is [0-9a-fA-F:]+, which accepts any non-empty run of hex digits and colons. Malformed forms like : alone or 1:2:3:4:5:6:7:8:9/32 (too many groups) pass. The schema comment frames this as 'surface obvious typos' rather than full validation, so it is documented intent — but the stated goal of catching issues at helm render time rather than opaque kubectl admission errors is only partially achieved for IPv6.

Code:

"cidr": {
  "description": "IPv4 or IPv6 CIDR block (e.g., '10.96.0.1/32', '2001:db8::/32'). IPv4 octets are bounded 0-255 to surface obvious typos (e.g. '999.0.0.1/32') at helm render time rather than as opaque kubectl apply admission errors (CC-0090, REQ-005).",
  "type": "string",
  "pattern": "^(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/([0-9]|[12][0-9]|3[0-2])|[0-9a-fA-F:]+/([0-9]|[1-9][0-9]|1[01][0-9]|12[0-8]))$"
}

Action Required: Consider tightening the IPv6 half of the pattern in a future change to reject malformed forms (too many groups, lone colon, etc.). Not a defect; noted for future tightening.

Suggested Fix:

In a future iteration, replace the IPv6 half of the pattern with a stricter alternation that enforces valid group counts (e.g. a standard RFC-4291-compatible IPv6 regex), or document that IPv6 is intentionally loosely validated and rely on kubectl admission for final checks.

Summary

High-quality PR with a well-reasoned chart-level design (instead of per-CR reconciler), thorough schema and helm-unittest coverage with comprehensive fail-closed defense-in-depth guards, and exemplary documentation (reference doc, how-to, cross-references, sidebar updates). The most substantive gap is that the Chainsaw E2E test cannot exercise actual NetworkPolicy enforcement on kindnet, making it closer to a smoke test — this matches CC-0039 precedent and is acknowledged in the how-to, but the test itself would benefit from explicit comment or a CNI follow-up. Minor cleanups remain: the Chart.yaml artifacthub.io/changes annotation and chart version bump were missed, and the PR description's 'Affected Areas' names a reconciler file that wasn't modified.

Severity Count
BLOCKING 0
CRITICAL 0
WARNING 2
INFO 2

Important

Recommendation: Merge after applying the auto-fix for Chart.yaml (changelog annotation + version bump) and updating the PR description. The E2E-enforcement gap and IPv6 schema laxity are acceptable as-is given documented intent and precedent, but should be tracked as follow-ups.

berendt added a commit that referenced this pull request Apr 23, 2026
berendt added a commit that referenced this pull request Apr 24, 2026
berendt added 10 commits April 24, 2026 20:38
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Introduce the configuration surface for the operator-pod
NetworkPolicy (Level 1). Template rendering, fail-closed
guards, unit tests, E2E tests, and docs arrive in later
levels.

Why: opt-in NetworkPolicy hardening requires a schema-
validated values contract before the template can be safely
authored. Validating CIDRs, ports, and boolean types at
helm-install time turns malformed input into a clear error
instead of a rejected Kubernetes manifest later.

Tasks completed at Level 1:
- 1.1 values.yaml: add networkPolicy block with
  enabled=false default, empty kubeApiServer.cidrs/ports,
  DNS defaults (kube-system + k8s-app=kube-dns, enabled),
  empty allowMetricsFrom, empty webhookClients.cidrs
  override. Kind defaults documented as comments, not
  applied (REQ-001, REQ-005).
- 1.2 values.schema.json: add cidr and stringMap
  definitions; add networkPolicy object with strict
  additionalProperties=false, boolean enabled, CIDR-pattern
  validated arrays, integer ports bounded to [1, 65535],
  stringMap-typed selector maps, and NetworkPolicyPeer array
  for allowMetricsFrom. Restructure the top-level
  conditional into allOf so the existing rbac/webhook
  constraint is preserved alongside a new fail-closed
  conditional that requires non-empty kubeApiServer.cidrs
  and kubeApiServer.ports whenever networkPolicy.enabled is
  true (REQ-005, REQ-006).

Defaults preserve the opt-in invariant: with no overrides,
values.yaml validates cleanly and no NetworkPolicy will
render. Setting enabled=true without populating both CIDRs
and ports is rejected at schema time — defense-in-depth
alongside the template-level {{ fail }} guard planned for
Level 2.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Render a default-deny NetworkPolicy for the keystone-operator pod when
networkPolicy.enabled=true, opening only the explicit rules required for
the operator to reconcile Keystone CRs. Default-off preserves upgrade
compatibility for clusters whose kube-apiserver CIDR or DNS layout does
not match the defaults.

Rules rendered (CC-0090, Level 2):
- Egress to the configured kube-apiserver CIDRs/ports as a single rule
  (N CIDRs x M ports via NetworkPolicy cartesian semantics, REQ-004)
- Egress UDP+TCP/53 to the configurable DNS peer, gated by
  networkPolicy.dns.enabled (REQ-003)
- Ingress TCP/9443 for the admission webhook from webhookClients.cidrs,
  falling back to kubeApiServer.cidrs when unset (REQ-007)
- Ingress on metrics.port from networkPolicy.allowMetricsFrom peers,
  rendered verbatim; skipped when empty (REQ-008)

Fail-closed guards (REQ-006) mirror the defensive check at
internal/controller/reconcile_networkpolicy.go:61-63: when enabled=true
with empty kubeApiServer.cidrs or empty kubeApiServer.ports, the
template aborts via {{ fail }} rather than render an unsafe rule. The
JSON schema already enforces minItems: 1 under allOf, but schema
validation can be bypassed (--skip-schema-validation) so the template
must defend itself.

Health probes on 8081 are intentionally not covered: kubelet calls
probes from the host network namespace and standard CNIs (Calico,
Cilium, Antrea) do not evaluate NetworkPolicy for that traffic
(REQ-009). A template comment documents this decision.

Level 2 tasks:
- 2.1 Scaffold + metadata/labels + podSelector + policyTypes
- 2.2 Fail-closed guards on empty cidrs/ports
- 2.3 DNS egress rule
- 2.4 kube-apiserver egress rule (single combined rule)
- 2.5 Webhook ingress rule with webhookClients fallback
- 2.6 Metrics ingress rule
- 2.7 Health-probe documentation comment

Verified with helm template against real helm binary: default-off
emits no NetworkPolicy; enabled+guards trigger on empty input;
enabled+valid input produces the expected YAML across DNS on/off,
webhook on/off, single- and multi-CIDR/port, and custom webhookClients
scenarios. helm lint is clean.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Author tests/networkpolicy_test.yaml covering the rendered NetworkPolicy
shape, fail-closed guards, and absence of the health-probe ingress rule,
and extend tests/schema_validation_test.yaml with networkPolicy schema
constraints.

Level 3 tasks:
- 3.1 default-off count=0, enabled-on count=1 + labels, policyTypes
  equality (REQ-001, REQ-002)
- 3.2 DNS rule: default selectors, overrides, dns.enabled=false drops
  the rule (REQ-003)
- 3.3 kube-apiserver rule: single CIDR+port, multiple CIDRs, multiple
  ports — one rule combining the lists (REQ-004)
- 3.4 Webhook ingress (enabled/disabled, custom webhookClients) and
  metrics ingress (empty, one peer, multiple peers, configurable port)
  (REQ-007, REQ-008)
- 3.5 Negative paths: empty cidrs/ports trigger failedTemplate; no
  ingress rule for port 8081 is ever rendered (REQ-006, REQ-009)
- 3.6 Schema: invalid CIDR, CIDR with /33 prefix, port>65535, port<1,
  non-boolean enabled, unknown property, and empty cidrs/ports under
  the enabled=true conditional are all rejected; a fully configured
  networkPolicy is accepted (REQ-005)

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Add a Chainsaw E2E test under tests/e2e/keystone-operator/ that
installs the keystone-operator Helm chart with
networkPolicy.enabled=true (kind CIDR 10.96.0.1/32:6443), asserts
the rendered NetworkPolicy is present in the release namespace, and
verifies a minimal Keystone CR reconciles to Ready=True under the
restricted egress policy. A catch block dumps current+previous
operator logs, the NetworkPolicy YAML, CR status, and namespace
events to make CI egress regressions debuggable without a local
reproduction (REQ-010, REQ-011).

The new tests/e2e/<operator>-operator/ directory is wired into the
existing e2e-operator job by extending the chainsaw invocation to
include it when present, keeping chart-level tests separate from
the per-CR tests under tests/e2e/<operator>/. matrix.operator is
piped through an OPERATOR env var to satisfy the workflow-injection
guardrail. Operators without a chart-level test directory are
unaffected.

Completes Level 4 tasks 4.1, 4.2, and 4.3 of CC-0090.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Level 5 — documentation for the chart-level NetworkPolicy that
hardens the keystone-operator pod (CC-0090).

- 5.1 Add docs/reference/keystone-operator-networkpolicy.md with
  overview, default-off rationale, egress/ingress rule tables
  (direction/protocol/ports/peer/values-key/default/gated-by),
  kind values snippet, fail-closed guard explanation, and link
  to values.schema.json (REQ-012).
- 5.2 Add docs/how-to/enable-keystone-operator-networkpolicy.md
  with prerequisites (CNI enforcement), numbered enablement
  steps (gather apiserver CIDR, update values, helm upgrade),
  verification (describe networkpolicy, reconcile check,
  chainsaw E2E), and troubleshooting for reconcile timeouts,
  webhook TLS, DNS, metrics scraping, and fail-closed render
  errors (REQ-013).
- 5.3 Wire the new reference and how-to into the VitePress
  sidebar and add a top-of-doc back-link from the Keystone
  Reconciler reference that distinguishes CC-0039 (per-CR
  NetworkPolicy) from CC-0090 (operator-pod NetworkPolicy).
  Decision comment notes that docs/index.md is a hero page
  without a reference list, so the sidebar is the functional
  equivalent of "the reference section" (REQ-012).

VitePress build passes; all internal anchors resolve.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Addresses W-001 (Chart.yaml version + changelog), W-002 (chainsaw scope comment), and I-002 (schema docstring).

- operators/keystone/helm/keystone-operator/Chart.yaml: Bumped chart version from 0.2.0 to 0.3.0 to reflect the new top-level `networkPolicy` key, and added an `artifacthub.io/changes` entry advertising the opt-in NetworkPolicy feature to Artifact Hub consumers (W-001).
- operators/keystone/helm/keystone-operator/values.schema.json: Expanded the `cidr` definition description to explicitly document that the IPv6 half of the regex is intentionally loose — it does not enforce RFC 4291 group counts or `::` rules — and that final IPv6 validation is delegated to kubectl/apiserver admission, so readers understand the schema's stated typo-catching goal is only partially achieved for IPv6 (I-002).
- tests/e2e/keystone-operator/network-policy-egress/chainsaw-test.yaml: Added a prominent "SCOPE CAVEAT" comment block at the top of the test clarifying that kindnet silently ignores NetworkPolicy, so this test validates only chart rendering and operator reconciliation under an unenforced policy rather than actual packet-level egress enforcement; also called out the Service-VIP DNAT issue on Calico/Cilium and the leader-election ambiguity introduced by running a second operator release alongside the primary one, and noted a Calico-enabled kind CI job as the follow-up that would make this test assert on blocked egress. Updated a downstream comment to reference the "(unenforced) egress policy" for consistency (W-002).

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
@berendt berendt requested a review from gndrmnn April 24, 2026 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant