Skip to content

feat(CC-0093): Parameterize OpenBao RemoteKey per Keystone CR#280

Open
berendt wants to merge 1 commit intomainfrom
feature/CC-0093
Open

feat(CC-0093): Parameterize OpenBao RemoteKey per Keystone CR#280
berendt wants to merge 1 commit intomainfrom
feature/CC-0093

Conversation

@berendt
Copy link
Copy Markdown
Contributor

@berendt berendt commented Apr 24, 2026

GitHub Issue #279: Parameterize OpenBao RemoteKey per Keystone CR to eliminate cross-CR KV path collisions

Category: bug | Scope: Medium

Description: The Fernet and credential PushSecret builders hardcode the OpenBao RemoteKey to the same two literal paths for every Keystone CR in a cluster:

  • operators/keystone/internal/controller/reconcile_fernet.go:423RemoteKey: "openstack/keystone/fernet-keys"
  • operators/keystone/internal/controller/reconcile_credential.go:429RemoteKey: "openstack/keystone/credential-keys"

Every PushSecret CR — {keystone.Name}-fernet-keys-backup / {keystone.Name}-credential-keys-backup — is itself name-scoped, and the local Secret it selects is name-scoped too (reconcile_fernet.go:417, reconcile_credential.go:423). Only the remote KV-v2 path is flat-namespaced across CRs. Two Keystone CRs in the same Kubernetes namespace therefore share exactly two backing OpenBao objects — whichever CR's ESO push lands last wins, and either CR's deletion can race with the other CR's next push.

In production this is dormant (the conventional deployment runs a single Keystone per namespace), but it is load-bearing in the E2E suite, where tests/e2e/chainsaw-config.yaml:40 sets parallel: 4 and every tests/e2e/keystone/*/chainsaw-test.yaml file lives in the same openstack namespace. Step 6 of tests/e2e/keystone/deletion-cleanup/chainsaw-test.yaml:255-256 asserts that both KV-v2 paths are not-found after the keystone-cleanup CR is deleted — an assertion that cannot hold whenever a concurrent test's CR is mid-reconcile, because that CR's PushSecret is writing back to the shared path the cleanup test just purged.

The operator's own OpenBao policy (deploy/openbao/policies/push-keystone-keys.hcl:14-20) already rejects a wildcard over the Keystone subtree — for the orthogonal reason of protecting the read-only openstack/keystone/db credentials consumed by Keystone — so the flat two-path scheme is not even a policy simplification we are preserving.

Motivation: Reproduced on feature/CC-0088 in CI run 24903037791/job 72927386373 at 2026-04-24T17:53:33Z. The keystone-deletion-cleanup step-6 retry loop polled for 30 s and observed both KV-v2 paths continuously present with created_time=2026-04-24T17:53:11.602847833Z (Fernet) — created 10 s after the cleanup test's kubectl delete at 17:53:01 and exactly matching the keystone-brownfield-database step-2 apply at 17:53:11. The adjacent keystone-credential-rotation and keystone-concurrent-cr-conflicts CRs running in parallel contribute additional overlapping writers. Step-4's PushSecret NotFound assertion passes cleanly on the cleanup test's own backup objects — the leak is purely the shared remote path being re-created by a neighbour.

Previously observed on main in CI run 24839732116 at 14:31:30 (recorded in auto-memory project_shared_openbao_kv_path.md), 16 s after keystone-cleanup deletion, same failure mode. Both occurrences match the same timestamp-evidenced race; this is a reliably reproducible flake rather than a one-off. PR #271 (CC-0091) closed the independent adoption-race leak in the operator itself and added the bounded-retry loop step-6 depends on, but the loop cannot tolerate an external, legitimate writer regenerating the path — only this issue can.

Spin-off from analysing the red feature/CC-0088 E2E run; CC-0088 itself is unrelated (Envoy Gateway / HTTPRoute assertions).

Proposed approach:

  1. Switch both builders to a per-CR remote key layout, e.g. openstack/keystone/{keystone.Name}/fernet-keys and openstack/keystone/{keystone.Name}/credential-keys. The CR name is already in scope in both functions (fernetKeysPushSecret(keystone *keystonev1alpha1.Keystone) at reconcile_fernet.go:403, symmetric at reconcile_credential.go:409), so the change is a one-line fmt.Sprintf in each. A nested subpath keeps the convention of the sibling openstack/keystone/db path used by deploy/eso/externalsecrets/keystone-db.yaml:27 and preserves the existing wildcard reads granted by deploy/openbao/policies/eso-management.hcl:34 (kv-v2/data/openstack/keystone/*).

  2. Broaden the push-keystone-keys.hcl policy from two literal paths to two per-name patterns while still excluding /db:

    • kv-v2/{data,metadata}/openstack/keystone/+/fernet-keys
    • kv-v2/{data,metadata}/openstack/keystone/+/credential-keys

    The + single-segment glob (OpenBao/Vault capability template syntax) keeps the policy tight enough that a compromised ESO still cannot reach openstack/keystone/db, matching the push-keystone-keys.hcl:14-20 rationale. Verify the eso-management wildcard (kv-v2/data/openstack/keystone/*) still covers the new read paths so the redundant-read capability remains redundant rather than load-bearing.

  3. Update the step-6 chainsaw script in tests/e2e/keystone/deletion-cleanup/chainsaw-test.yaml:255-256 to target the CR-specific paths (kv-v2/openstack/keystone/keystone-cleanup/fernet-keys etc.), so the retry loop asserts exactly the path the cleanup CR owned and cannot be perturbed by parallel tests.

  4. Update both builder unit tests (reconcile_fernet_test.go:412, reconcile_credential_test.go:416) to the new expected RemoteKey string.

  5. Update reconciler docs: docs/reference/keystone-reconciler.md:763-764, :803, :1341, :1376, :1510, :1550, and architecture/docs/09-implementation/04-keystone-reconciler.md:279 to reference {keystone.Name} in the path.

Acceptance Criteria:

  • fernetKeysPushSecret and credentialKeysPushSecret emit per-CR RemoteKey values that embed keystone.Name.
  • push-keystone-keys.hcl grants create/update/read/delete on the per-name patterns only, still denies access to openstack/keystone/db, and does so under a policy comment that captures the rationale for including the name segment.
  • Two Keystone CRs named keystone-a and keystone-b in the same namespace produce four distinct populated KV-v2 paths (.../keystone-a/{fernet,credential}-keys, .../keystone-b/{fernet,credential}-keys) — verifiable by hand with bao kv list kv-v2/openstack/keystone/ or by extending tests/e2e/keystone/concurrent-cr-conflicts with a positive KV assertion.
  • chainsaw/keystone-deletion-cleanup step-6 asserts the keystone-cleanup-scoped paths and passes while any of keystone-autoscaling, keystone-credential-rotation, keystone-brownfield-database, or keystone-concurrent-cr-conflicts run in parallel (parallel: 4).
  • No regression in chainsaw/keystone-fernet-rotation or chainsaw/keystone-credential-rotation — PushSecret still reaches Ready=True on the new path.
  • Deletion-cleanup flake does not recur across 10 consecutive CI runs (manual gh workflow run e2e.yaml trigger) to confirm the race is closed, matching the validation bar used for Race between operator finalizer Delete and ESO PushSecret adoption leaks Fernet and credential keys in OpenBao #271.
  • Docs: the {keystone.Name} segment appears in every reference-doc location that previously listed the flat path, including the PushSecret table at docs/reference/keystone-reconciler.md:763-764 and the rotation sections at :1341/:1376/:1510/:1550.
  • Migration note added to docs/reference/keystone-reconciler.md: pre-existing clusters carrying data at kv-v2/openstack/keystone/{fernet,credential}-keys will see the operator write a new path on next reconcile; the old paths are orphaned but harmless (read-only from Keystone's perspective — the live Secret is driven from the local Kubernetes Secret, not the OpenBao backup) and can be hard-deleted manually. Call this out explicitly so upgraders don't interpret the stale path as a leak.

Refs: CC-0079 (openbao-finalizer), CC-0083 (PushSecret write policy), CC-0091 (adoption race), #271, #273. Auto-memory: project_shared_openbao_kv_path.md.

Labels: bug


Source: #279

berendt added a commit that referenced this pull request Apr 24, 2026
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The new fernet and credential RemoteKey integration tests in integration_test.go are nearly identical; consider extracting a shared helper or table-driven test to reduce duplication and keep the per-CR assertions in one place.
  • The new bao kv polling script in concurrent-cr-conflicts/chainsaw-test.yaml repeats a lot of setup and error-handling logic that is very similar to the deletion-cleanup chainsaw step; consider factoring the common BAO_ADDR/BAO_TOKEN/bootstrap and polling pattern into a reusable snippet or at least aligning variable names and structure to make future changes less error-prone.
  • In credentialKeysPushSecret and fernetKeysPushSecret, fmt.Sprintf("openstack/keystone/%s/...", keystone.Name) could be replaced with simple string concatenation to match the pattern used in the unit tests and avoid an unnecessary dependency on fmt in these hot paths.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new fernet and credential RemoteKey integration tests in `integration_test.go` are nearly identical; consider extracting a shared helper or table-driven test to reduce duplication and keep the per-CR assertions in one place.
- The new `bao kv` polling script in `concurrent-cr-conflicts/chainsaw-test.yaml` repeats a lot of setup and error-handling logic that is very similar to the deletion-cleanup chainsaw step; consider factoring the common BAO_ADDR/BAO_TOKEN/bootstrap and polling pattern into a reusable snippet or at least aligning variable names and structure to make future changes less error-prone.
- In `credentialKeysPushSecret` and `fernetKeysPushSecret`, `fmt.Sprintf("openstack/keystone/%s/...", keystone.Name)` could be replaced with simple string concatenation to match the pattern used in the unit tests and avoid an unnecessary dependency on `fmt` in these hot paths.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@berendt
Copy link
Copy Markdown
Contributor Author

berendt commented Apr 24, 2026

Reviewed by planwerk-review 889e22e with Claude Code

WARNING (2)

W-001: Migration note overstates operator intervention requirement

File: docs/reference/keystone-reconciler.md:793-795Fix: AUTO-FIX — Confidence: verified

Problem: The migration note claims the new per-CR path is applied automatically on the next reconcile after upgrade with no operator intervention required. This only holds for fresh deployments via hack/deploy-infra.sh, which runs openbao_bootstrap() → setup-policies.sh. For operators upgrading an existing cluster who do not re-run the bootstrap, the OpenBao ACL policy (deploy/openbao/policies/push-keystone-keys.hcl) is a filesystem artefact applied only via setup-policies.sh and is not reconciled by Flux — there is no Kubernetes object mapping the HCL file into OpenBao at runtime. Until the policy is re-applied, ESO will receive 403 on every push to the new per-CR paths, and FernetKeysReady / CredentialKeysReady will fail silently on the backup step.

Code:

`kv-v2/openstack/keystone/{keystone.Name}/credential-keys` — applied
automatically on the next reconcile after upgrade, with no operator
intervention required.

Action Required: Replace the first paragraph of the migration note to explicitly name the policy re-apply step, or tighten the claim to 'next reconcile after upgrading both the operator image and the OpenBao policy'.

Suggested Fix:

Before CC-0093, both backup PushSecrets wrote to the cluster-global, flat
KV-v2 paths `kv-v2/openstack/keystone/fernet-keys` and
`kv-v2/openstack/keystone/credential-keys`. Starting with CC-0093 the
operator writes to the per-CR-scoped paths
`kv-v2/openstack/keystone/{keystone.Name}/fernet-keys` and
`kv-v2/openstack/keystone/{keystone.Name}/credential-keys`.

The RemoteKey change lands the moment the Keystone operator is upgraded —
the next reconcile of each Keystone CR emits the new path. For existing
clusters the corresponding OpenBao ACL
(`deploy/openbao/policies/push-keystone-keys.hcl`) must also be re-applied
so ESO is authorised to write to the new paths; otherwise ESO will return
`403` on the backup step and `FernetKeysReady` / `CredentialKeysReady` will
flip to `False`. For kind/dev clusters this happens automatically when
`hack/deploy-infra.sh` (or `deploy/openbao/bootstrap/setup-policies.sh`)
is re-run; for production clusters managed outside the bootstrap flow the
equivalent is a single `bao policy write push-keystone-keys …` against the
updated HCL file.

Related: W-002


W-002: Policy comment inaccurately describes /db subtree exclusion

File: deploy/openbao/policies/push-keystone-keys.hcl:29-35Fix: AUTO-FIX — Confidence: verified

Problem: The comment claims 'db/anything' is not matched by the glob patterns, but this is not literally true: kv-v2/data/openstack/keystone/+/fernet-keys would match kv-v2/data/openstack/keystone/db/fernet-keys (with + binding to db), and likewise for db/credential-keys. A holder of this policy could write to those two specific sibling paths. The security invariant the paragraph actually wants to state — that the MariaDB credentials at kv-v2/data/openstack/keystone/db cannot be overwritten — still holds, because db and db/fernet-keys are independent keys in KV-v2 and writing to the latter does not modify the former. But the current wording invites a reviewer to conclude the whole db/ subtree is off-limits, which is not what the pattern enforces.

Code:

# The `/db`
# exclusion continues to hold under the new per-CR layout: `db` is a flat
# leaf with no child, and `+` here requires the literal `/fernet-keys` or
# `/credential-keys` suffix, so neither `kv-v2/…/keystone/db` nor
# `kv-v2/…/keystone/db/anything` is matched by these rules.

Action Required: Rewrite the comment to accurately describe the invariant: that the db leaf secret itself cannot be overwritten, while disclosing that a holder could write to the two sibling paths db/fernet-keys and db/credential-keys, which are independent KV-v2 keys that do not affect the db secret.

Suggested Fix:

# The read-only MariaDB credentials at `kv-v2/data/openstack/keystone/db`
# remain unwritable: `db` is a flat leaf and the patterns above require
# the literal `/fernet-keys` or `/credential-keys` suffix, so the `db`
# secret itself is not in the match set. A holder of this policy could
# write to the two sibling paths `kv-v2/…/keystone/db/fernet-keys` and
# `kv-v2/…/keystone/db/credential-keys` (via `+ = "db"`), but those are
# independent KV-v2 keys and writes to them do not affect the `db` secret;
# the "MariaDB credentials locked out of their own database" failure mode
# a trailing-star wildcard would enable is therefore still prevented.

Related: W-001


INFO (1)

I-001: Architecture submodule reference still carries the flat path

File: architecture/docs/09-implementation/04-keystone-reconciler.md:279Fix: ASK — Confidence: verified

Problem: The architecture submodule reference at architecture/docs/09-implementation/04-keystone-reconciler.md:279 still contains the pre-CC-0093 flat KV-v2 path. The commit log (7e8430d) correctly defers this to a separate upstream PR against github.com/C5C3/C5C3 per the repository's NO_SUBMODULE_MODIFICATIONS rule and flags the submodule-pin bump as a follow-up worktree. This is the right call — a submodule content change in this PR would violate repo policy.

Code:

architecture/docs/09-implementation/04-keystone-reconciler.md:279 (submodule — not modified in this PR per NO_SUBMODULE_MODIFICATIONS)

Action Required: Confirm that a separate follow-up ticket against C5C3/C5C3 exists to update the upstream architecture doc and subsequently bump the submodule pin. Open one if it does not already exist. No change required in this PR.

Suggested Fix:

Verify a follow-up ticket is tracked against C5C3/C5C3 to update line 279 to the per-CR path `kv-v2/openstack/keystone/{keystone.Name}/...` and plan a submodule pin bump in a subsequent PR.

Summary

The PR correctly fixes a concurrency bug by parameterizing the OpenBao RemoteKey with keystone.Name across six surfaces (PushSecret builders, OpenBao HCL policy, unit/envtest/chainsaw tests, and reference docs), with commit staging and scope kept clean. Test coverage is unusually strong: unit tests assert the legacy flat path is explicitly not produced, chainsaw asserts legacy paths remain absent, and the E2E scripts share a consistent bao-polling shape between concurrent-cr-conflicts and deletion-cleanup. Two documentation accuracy issues need fixing before merge — the migration note under-specifies the OpenBao policy re-apply step for brownfield upgrades, and the HCL policy comment mischaracterizes which paths the + glob excludes.

Severity Count
BLOCKING 0
CRITICAL 0
WARNING 2
INFO 1

Important

Recommendation: Merge after fixing WARNING findings 1 and 2; both are documentation-only corrections with drop-in replacements provided. Finding 3 is an acknowledged follow-up against the C5C3/C5C3 submodule and does not block this PR.

@berendt
Copy link
Copy Markdown
Contributor Author

berendt commented Apr 24, 2026

Regarding review finding I-001 (architecture submodule reference at architecture/docs/09-implementation/04-keystone-reconciler.md:279):

Per the repository's NO_SUBMODULE_MODIFICATIONS rule, the correction to line 279 cannot be applied from this worktree — it must land as a separate upstream PR against github.com/C5C3/C5C3, followed by a submodule-pin bump in C5C3/forge.

I searched the C5C3/C5C3 issue tracker (all states, ≤30 most recent) and did not find an existing ticket that tracks this change. Attempting to open one programmatically from this session returned GraphQL: Resource not accessible by personal access token (createIssue) — the PAT in use is scoped to the forge repo and cannot create issues on C5C3/C5C3.

Requested action: please open a tracking issue against C5C3/C5C3 (e.g. title: docs(keystone-reconciler): update KV-v2 path to per-CR layout (forge CC-0093 follow-up)) describing that line 279 must change from the flat path kv-v2/openstack/keystone/fernet-keys to the per-CR path kv-v2/openstack/keystone/{keystone.Name}/fernet-keys, with a subsequent submodule pin bump in this repository once that upstream PR lands.

The deferred follow-up is recorded in .planwerk/progress/CC-0093-parameterize-openbao-remotekey-per-keystone-cr.json under REQ-008 and task 5.x; no code change in this PR. Refs: review CC-0093-github-review-by-berendt-external-2.json.

- Scope fernet-keys backup PushSecret RemoteKey to
`openstack/keystone/{keystone.Name}/fernet-keys` so two Keystone CRs in
the same namespace cannot collide on a single OpenBao object
- Scope credential-keys backup PushSecret RemoteKey to
`openstack/keystone/{keystone.Name}/credential-keys` with the same
per-CR path layout
- Switch the `push-keystone-keys` OpenBao ACL from two literal leaf
paths to
`kv-v2/{data,metadata}/openstack/keystone/+/{fernet,credential}-keys`,
keeping `create/update/read/delete` on both data and metadata endpoints
for ESO PushSecret write + DeletionPolicy=Delete purges
- Preserve the existing lockout defence: the policy still enforces the
literal `/fernet-keys` or `/credential-keys` leaf suffix, so the
read-only MariaDB credentials at `kv-v2/data/openstack/keystone/db`
remain unwritable and a compromised ESO controller cannot overwrite them
- Extend envtest coverage in `reconcile_fernet_test.go` and
`reconcile_credential_test.go` to assert the `{name}` path segment and
verify two CRs in one namespace produce disjoint RemoteKeys
- Add integration-test cases that reconcile multiple Keystone CRs side
by side and assert per-CR RemoteKey isolation end to end
- Expand the `concurrent-cr-conflicts` chainsaw e2e to verify isolated
OpenBao paths under real reconciliation and tighten the
`deletion-cleanup` chainsaw to match the new per-CR path shape
- Document the per-CR path layout in
`docs/reference/keystone-reconciler.md`, including a migration note that
flags legacy flat paths as orphaned-but-harmless after upgrade and gives
the canonical `bao kv metadata delete` cleanup commands, plus an
explicit call-out that the updated policy HCL must be re-applied so ESO
retains write access

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant