From 13849aa8957958b3770a70ca98889d060d8ee63a Mon Sep 17 00:00:00 2001 From: siddhi bhor Date: Mon, 8 Jun 2026 12:47:14 +0530 Subject: [PATCH] Adds agents.md --- AGENTS.md | 204 ++++++++++++++++++++++++++++++ CLAUDE.md | 51 ++++++++ docs/api-contracts-guidelines.md | 157 +++++++++++++++++++++++ docs/error-handling-guidelines.md | 137 ++++++++++++++++++++ docs/integration-guidelines.md | 95 ++++++++++++++ docs/performance-guidelines.md | 92 ++++++++++++++ docs/security-guidelines.md | 125 ++++++++++++++++++ docs/testing-guidelines.md | 124 ++++++++++++++++++ 8 files changed, 985 insertions(+) create mode 100644 AGENTS.md create mode 100644 CLAUDE.md create mode 100644 docs/api-contracts-guidelines.md create mode 100644 docs/error-handling-guidelines.md create mode 100644 docs/integration-guidelines.md create mode 100644 docs/performance-guidelines.md create mode 100644 docs/security-guidelines.md create mode 100644 docs/testing-guidelines.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..c1c183842 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,204 @@ +# AGENTS.md + +## Docs Index + +Detailed domain-specific guidelines are in these files -- read them before working in the corresponding area: + +- [docs/security-guidelines.md](docs/security-guidelines.md) -- Container security, RBAC, TLS, network policies, webhook security +- [docs/performance-guidelines.md](docs/performance-guidelines.md) -- Cache architecture, watch predicates, reconciliation patterns, drift detection +- [docs/error-handling-guidelines.md](docs/error-handling-guidelines.md) -- ReconcileError types, retry logic, status conditions, logging +- [docs/api-contracts-guidelines.md](docs/api-contracts-guidelines.md) -- CRD types, kubebuilder markers, CEL validation, code generation +- [docs/testing-guidelines.md](docs/testing-guidelines.md) -- Unit/API/E2E test tiers, frameworks, test helpers, CI integration +- [docs/integration-guidelines.md](docs/integration-guidelines.md) -- Bindata pattern, cert-manager, OpenShift platform, Bitwarden plugin + +## Project Overview + +This is a Kubernetes operator (built with kubebuilder/controller-runtime) that deploys and manages the upstream [external-secrets](https://github.com/openshift/external-secrets) application on OpenShift clusters. The operator does not embed upstream code -- it manages upstream resources as static YAML manifests (bindata) applied imperatively. + +Three controllers run in a single binary: + +| Controller | Package | Watches | Purpose | +|---|---|---|---| +| `external-secrets-controller` | `pkg/controller/external_secrets/` | `ExternalSecretsConfig` CR + all managed resources | Installs/reconciles operand deployments, RBAC, services, webhooks, network policies | +| `external-secrets-manager` | `pkg/controller/external_secrets_manager/` | `ExternalSecretsManager` CR + `ExternalSecretsConfig` status | Aggregates controller statuses into a global status CR | +| `crd-annotator` | `pkg/controller/crd_annotator/` | ESO CRDs + `ExternalSecretsConfig` | Adds cert-manager CA injection annotations to operand CRDs (conditional) | + +## Project Structure + +``` +api/v1alpha1/ CRD type definitions, conditions, shared types, deepcopy +api/v1alpha1/tests/ Declarative YAML test suites for CRD validation +bindata/ Static operand YAML manifests (compiled into Go via go-bindata) +cmd/ Operator entrypoint (main.go) +config/ Kustomize manifests (CRDs, RBAC, manager, samples, bundle) +hack/ Shell scripts (codegen, verification, CI helpers) +images/ci/ CI Dockerfiles (coverage-instrumented builds) +pkg/controller/ Controller implementations + client/ CtrlClient interface + counterfeiter fakes + common/ Shared utilities (errors, constants, decode helpers, drift detection) + commontest/ Shared test fixtures (TestExternalSecretsConfig, TestExternalSecretsManager) + crd_annotator/ CRD annotation controller + external_secrets/ Main operand reconciliation controller + external_secrets_manager/ Status aggregation controller +pkg/operator/ Manager setup, controller registration +pkg/version/ Build-time version info (ldflags) +test/apis/ API integration tests (Ginkgo + envtest) +test/e2e/ End-to-end tests (Ginkgo + live cluster) +test/utils/ E2E test helpers +tools/ Go module for build-time tool dependencies +``` + +## Go Workspace and Module Layout + +The repo uses `go.work` with four modules: `.`, `./cmd/external-secrets-operator`, `./test`, `./tools`. This means: + +- `GOFLAGS` is cleared for test and fmt targets to avoid `-mod=vendor` conflicting with `go.work`. +- When adding a dependency, use `make update-dep PKG=pkg@version` to update across all modules, then `make update-vendor`. +- Vendoring is workspace-level (`go work vendor`), not per-module. +- All build-time tools (controller-gen, golangci-lint, ginkgo, etc.) are vendored and built from source via `go build -mod=vendor`. + +## Build System (Key Makefile Targets) + +| Target | What it does | +|---|---| +| `make build` | Full build: generate + manifests + fmt + vet + compile binary | +| `make build-operator` | Compile binary only (no codegen, fastest iteration) | +| `make test` | Run unit + API integration tests (no cluster needed) | +| `make test-unit` | Unit tests only | +| `make test-apis` | API validation tests via envtest | +| `make test-e2e` | E2E tests against a live cluster | +| `make lint` | Run golangci-lint with all configured linters | +| `make lint-fix` | Run golangci-lint with auto-fix | +| `make update` | Full regeneration: codegen + manifests + operand manifests + bindata + bundle + docs | +| `make verify` | Run vet + fmt + verify-deps + verify-bindata + verify-generated + govulncheck + check-git-diff | +| `make update-vendor` | Update vendor directory across all workspace modules | +| `make update-dep PKG=x@v` | Update a single dependency across all modules | +| `make manifests` | Regenerate CRD YAML, RBAC, webhook configs from kubebuilder markers | +| `make generate` | Regenerate DeepCopy methods | +| `make docs` | Regenerate API reference docs | +| `make clean` | Remove bin/, _output/, cover.out | + +After any code change, run `make update && make verify` to ensure all generated files are consistent. CI runs `check-git-diff` which fails if generated files are stale. + +## Code Style and Formatting + +### Import Order + +Imports must follow the order enforced by `gci` in `.golangci.yml`: + +1. Standard library +2. Third-party packages +3. `github.com/openshift/external-secrets-operator` (project-local) +4. Blank imports, dot imports, aliases, local module + +### Linting + +The repo uses golangci-lint v2 with a comprehensive set of linters (see `.golangci.yml`). Key rules: + +- **depguard** blocks `math/rand` (use `math/rand/v2`), `github.com/sirupsen/logrus`, and `github.com/pkg/errors` (use `errors`/`fmt`). +- **kubeapilinter** runs only on `api/v1alpha1/*` files. Use `//nolint:kubeapilinter` with a comment for intentional deviations. +- **golines** max line length is 200 characters. +- **gofmt** rewrites `interface{}` to `any` and `a[b:len(a)]` to `a[b:]`. +- Generated files are excluded in `lax` mode. +- Test files have relaxed rules for `gocyclo`, `errcheck`, `gosec`, `forcetypeassert`, and others. + +### File Headers + +All `.go` files must include the Apache 2.0 license header from `hack/boilerplate.go.txt` (year 2025). + +### FIPS Build + +Production builds use `hack/go-fips.sh` which enables `GOEXPERIMENT=strictfipsruntime` and build tags `strictfipsruntime,openssl` when the Go compiler supports it. Local dev builds without FIPS work but cannot be used in CI/production. + +## Naming Conventions + +### Go Packages + +Controller packages use `snake_case`: `external_secrets`, `external_secrets_manager`, `crd_annotator`. This matches kubebuilder conventions. + +### Constants + +- Controller names: `"external-secrets-controller"`, `"external-secrets-manager"`, `"crd-annotator"` (kebab-case in strings). +- Finalizer format: `./` (e.g., `externalsecretsconfigs.operator.openshift.io/external-secrets-controller`). +- Asset name constants: `_AssetName` in camelCase (e.g., `controllerDeploymentAssetName`). +- Environment variable constants: all-caps with suffix `EnvVarName` (e.g., `externalSecretsImageEnvVarName`). + +### Bindata YAML Files + +Files in `bindata/external-secrets/resources/` follow the pattern: `_.yml` (e.g., `deployment_external-secrets.yml`, `clusterrole_external-secrets-controller.yml`). Network policies in `bindata/external-secrets/` use `.yaml` extension. + +### CRD Object Names + +Both `ExternalSecretsConfig` and `ExternalSecretsManager` are singletons named `"cluster"` (enforced by CEL). Constants `ExternalSecretsConfigObjectName` and `ExternalSecretsManagerObjectName` in `pkg/controller/common/constants.go` hold this value. + +## Architectural Patterns + +### Reconciler Structure + +Every controller follows the same pattern: + +1. A `Reconciler` struct embedding `operatorclient.CtrlClient`. +2. A `New(ctx, mgr)` constructor that builds the reconciler and client(s). +3. A `SetupWithManager(mgr)` method that wires watches, predicates, and map functions. +4. A `Reconcile(ctx, req)` method that fetches the primary CR and delegates to `processReconcileRequest`. +5. An `updateStatus(ctx, obj)` method using `retry.RetryOnConflict`. + +### CtrlClient Interface + +All controllers interact with Kubernetes through `pkg/controller/client.CtrlClient`, not the raw `controller-runtime client.Client`. This interface adds `UpdateWithRetry`, `StatusUpdate`, and `Exists` methods. Unit tests use counterfeiter-generated fakes (`pkg/controller/client/fakes/`). + +To regenerate fakes after changing the interface: `go generate ./pkg/controller/client/...` + +### Resource Reconciliation Pattern + +For each resource type (deployments, services, RBAC, etc.), the same flow is used: + +1. Decode static YAML from bindata (`common.Decode*ObjBytes(assets.MustAsset(...))`). +2. Mutate the decoded object (set namespace, labels, annotations, images, env vars, security context). +3. Check existence with `r.Exists()`. +4. If new: `r.Create()`. If existing: compare with `common.HasObjectChanged()`, then `r.UpdateWithRetry()` if drifted. +5. Record Kubernetes events for create/update operations. +6. Wrap errors with `common.FromClientError()`. + +### Conditional Resources + +Resources that depend on CR configuration use a slice of `{assetName string, condition bool}` structs. Only assets with `condition: true` are applied. Follow this pattern when adding new conditionally-created resources. + +## Common Pitfalls + +1. **Never return both `RequeueAfter` and a non-nil error** from `Reconcile`. Return one or the other. +2. **Never edit generated files by hand**: `zz_generated.deepcopy.go`, `config/crd/bases/*.yaml`, `pkg/operator/assets/bindata.go`, `config/rbac/role.yaml`, `docs/api_reference.md`. Always use `make update`. +3. **Always run `make update && make verify`** after any code change. CI will reject PRs with stale generated files. +4. **Use the cached client for managed resources** (those with `app: external-secrets` label). Use the uncached client only for external resources like cert-manager Issuers or user-provided Secrets. +5. **Add new watched resources to both `controllerManagedResources` and `buildCacheObjectList()`** in `pkg/controller/external_secrets/controller.go`. +6. **Add new resource types to `HasObjectChanged`** in `pkg/controller/common/utils.go` with field-level comparison, not full `DeepEqual`. +7. **Decode helpers panic on failure** -- this is intentional for build-time-constant assets. Do not add error handling around `Decode*ObjBytes` calls. +8. **RBAC markers** (`+kubebuilder:rbac`) go in `pkg/controller/external_secrets/controller.go` for the operator's own permissions. Operand RBAC is in static YAML under `bindata/`. +9. **The `go.work` workspace** means you cannot use `-mod=vendor` for `go test` or `go fmt`. The Makefile already clears `GOFLAGS` for affected targets. +10. **Container tool defaults to `podman`** (`CONTAINER_TOOL ?= podman`), not Docker. Override with `CONTAINER_TOOL=docker` if needed. + +## PR and Contribution Expectations + +- Run `make lint` and `make test` locally before submitting. +- Run `make verify` to ensure generated files are in sync. +- Add unit tests for new reconciliation logic using table-driven tests and `FakeCtrlClient`. +- Add API test cases in `api/v1alpha1/tests//` for any new CRD field or validation rule. +- Add E2E test cases with appropriate Ginkgo labels for platform-specific tests. +- Follow existing error wrapping patterns: `common.FromClientError` for API calls, `common.NewIrrecoverableError` for config validation failures. +- Commit messages should reference the relevant Jira ticket (e.g., `OCPBUGS-12345: description`). +- PR reviewers/approvers are listed in `OWNERS`. + +## Environment Variables + +The operator reads these at runtime (typically set by OLM/CSV): + +| Variable | Purpose | +|---|---| +| `RELATED_IMAGE_EXTERNAL_SECRETS` | Container image for the external-secrets operand | +| `RELATED_IMAGE_BITWARDEN_SDK_SERVER` | Container image for the Bitwarden SDK server | +| `OPERAND_EXTERNAL_SECRETS_IMAGE_VERSION` | Version label for operand resources | +| `BITWARDEN_SDK_SERVER_IMAGE_VERSION` | Version label for Bitwarden resources | +| `OPERATOR_IMAGE_VERSION` | Operator version string | +| `HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` | Proxy fallback from OLM environment | + +For local development, set at minimum `RELATED_IMAGE_EXTERNAL_SECRETS` via `t.Setenv()` in tests or shell export for `make run`. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..588f94e42 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,51 @@ +# CLAUDE.md + +@AGENTS.md + +## Build and Test Commands + +When working in this repository, use these commands: + +### Before Committing +```bash +make update && make verify +``` +This regenerates all code, manifests, and bindata, then runs verification checks. CI will reject PRs with stale generated files. + +### Development Workflow +```bash +make build # Full build with codegen +make build-operator # Fast rebuild (binary only, no codegen) +make test # Run unit + API tests (no cluster needed) +make lint # Run golangci-lint +make lint-fix # Auto-fix linting issues +``` + +### Testing +```bash +make test-unit # Unit tests only +make test-apis # API validation tests (envtest) +make test-e2e # E2E tests (requires cluster) +``` + +### Dependency Management +```bash +make update-vendor # Sync vendor across all workspace modules +make update-dep PKG=package@version # Update a dependency in all modules +``` + +## Claude Code Preferences + +- Always run `make update && make verify` after code changes that affect: + - CRD definitions (`api/v1alpha1/`) + - Kubebuilder markers (`+kubebuilder:*`) + - Bindata YAML (`bindata/`) + - Generated code triggers + +- Use `make lint-fix` to automatically fix formatting and linting issues before suggesting manual fixes + +- The repository uses a Go workspace (`go.work`) with 4 modules. Never manually edit `GOFLAGS` or use `-mod=vendor` in test/fmt commands — the Makefile handles this + +- Container builds default to `podman`. Override with `CONTAINER_TOOL=docker` if needed + +- All build-time tools are vendored. Do not suggest installing tools globally diff --git a/docs/api-contracts-guidelines.md b/docs/api-contracts-guidelines.md new file mode 100644 index 000000000..708535ca1 --- /dev/null +++ b/docs/api-contracts-guidelines.md @@ -0,0 +1,157 @@ +# API Contracts Guidelines + +## API Group and Versioning + +- Group: `operator.openshift.io`, Version: `v1alpha1`. All types live in `api/v1alpha1/`. +- Only one version exists. The package-level marker `+groupName=operator.openshift.io` in `groupversion_info.go` sets the group. +- Every new type must be registered via `SchemeBuilder.Register()` in its `init()` function within the types file. + +## CRD Resources + +Two cluster-scoped singletons exist: + +| Kind | Plural | Short Names | Purpose | +|---|---|---|---| +| ExternalSecretsConfig | externalsecretsconfigs | esc, externalsecretsconfig, esconfig | Configures the external-secrets operand | +| ExternalSecretsManager | externalsecretsmanagers | esm, externalsecretsmanager, esmanager | Global operator-level config, auto-created at install | + +Both are enforced singletons via CEL: `self.metadata.name == 'cluster'`. Both use `+genclient:nonNamespaced` and `scope=Cluster`. + +## Required Kubebuilder Markers on CRD Types + +Every root CRD type must carry these markers: +``` ++genclient ++genclient:nonNamespaced ++k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object ++kubebuilder:object:root=true ++kubebuilder:subresource:status ++kubebuilder:resource:path=,scope=Cluster,categories={external-secrets-operator, external-secrets},shortName= ++kubebuilder:metadata:labels={"app.kubernetes.io/name=", "app.kubernetes.io/part-of=external-secrets-operator"} ++operator-sdk:csv:customresourcedefinitions:displayName="" +``` + +List types need `+kubebuilder:object:root=true` and `+k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object`. + +## Field-Level Validation Conventions + +### String Fields +- Set `+kubebuilder:validation:MinLength` and `+kubebuilder:validation:MaxLength` on spec string fields (e.g., 1-253 for names, 0-2048 for URLs, 0-4096 for noProxy). +- Status fields (e.g., `ExternalSecretsImage`, `BitwardenSDKServerImage`) and custom condition message fields typically omit validation markers. +- Use `+kubebuilder:validation:Pattern` for key-format constraints (e.g., ConfigMap keys: `^[-._a-zA-Z0-9]+$`). + +### Map Fields +- Set `+kubebuilder:validation:MinProperties` and `+kubebuilder:validation:MaxProperties` on spec maps (typical max: 20 for labels/annotations, 50 for nodeSelector). +- Set `+mapType=granular` for strategic-merge-patchable maps; `+mapType=atomic` for replace-on-update maps. + +### List Fields +- Set `+kubebuilder:validation:MinItems` and `+kubebuilder:validation:MaxItems` on spec lists. +- Status lists (e.g., `Conditions`) typically omit MinItems/MaxItems validation. +- For merge-patchable lists: use `+listType=map` with `+listMapKey=`. Composite keys are supported (e.g., networkPolicies uses both `name` and `componentName`). +- For replace-on-update lists: use `+listType=atomic`. +- Status subresource lists use `+patchMergeKey=` and `+patchStrategy=merge` in addition to listType markers for proper merge behavior. + +### Numeric Fields +- Use `+kubebuilder:validation:Minimum` and `+kubebuilder:validation:Maximum` (e.g., logLevel: 1-5, revisionHistoryLimit: 1-50). + +### Enum Fields +- Use `+kubebuilder:validation:Enum` for closed sets. Define a named Go type with exported constants (e.g., `Mode` with `Enabled`/`Disabled`, `ManagementState` with `Managed`/`Unmanaged`, `ComponentName` with four values). + +## CEL Validation Rules (XValidation) + +CEL rules via `+kubebuilder:validation:XValidation` are the primary mechanism for cross-field and immutability validation. Patterns used: + +- **Singleton enforcement**: `rule="self.metadata.name == 'cluster'"` on the root type. +- **Immutability**: `rule="self == oldSelf"` on individual fields (e.g., certManager `mode`, `issuerRef`, `injectAnnotations`). +- **List key immutability**: `rule="oldSelf.all(op, self.exists(p, p.name == op.name && p.componentName == op.componentName))"` to prevent renaming list entries. +- **Cross-field dependencies**: Complex rules on spec structs to enforce "if X is enabled, then Y or Z must be configured." +- **Annotation domain blocklists**: Regex-based `self.all(key, !key.matches(...))` rules blocking `kubernetes.io/`, `openshift.io/`, `k8s.io/`, `cert-manager.io/` prefixes. +- **Reserved env var names**: `self.all(e, !['PREFIX_'].exists(p, e.name.startsWith(p)) && e.name != 'EXACT_NAME')`. + +Always provide a `message` on every XValidation rule. + +## kubeapilinter Nolint Directives + +The repo uses `//nolint:kubeapilinter` comments for intentional deviations. Document the reason inline. Known patterns: +- `listMapKey` fields must NOT have `omitempty` (for proper patch merge identification). Apply `//nolint:kubeapilinter // is a listMapKey and must not have omitempty`. +- Custom `Condition` type (on ExternalSecretsManager) that intentionally omits some `metav1.Condition` fields. +- `metav1.Duration` fields retained despite linter preference, annotated with `// Duration type retained to avoid breaking API change`. + +## Status Subresource Patterns + +### ExternalSecretsConfig Status +- Embeds `ConditionalStatus` (from `meta.go`) which holds `[]metav1.Condition` with standard merge markers (`+listType=map`, `+listMapKey=type`, `+patchMergeKey=type`, `+patchStrategy=merge`). +- Uses standard `metav1.Condition` with `ObservedGeneration` set to `esc.GetGeneration()`. +- Condition types: `Ready` and `Degraded` (defined as constants in `conditions.go`). +- Reasons: `Failed`, `Ready`, `Progressing`, `Completed` (also constants, with `Progressing` defined as constant `ReasonInProgress`). +- Both conditions are set atomically before a single status update call. +- Additional status fields: `externalSecretsImage`, `bitwardenSDKServerImage`. + +### ExternalSecretsManager Status +- Uses a custom `ControllerStatus` list (`+listType=map`, `+listMapKey=name`) tracking per-controller conditions. +- Each `ControllerStatus` contains a custom `[]Condition` type (Type, Status, Message -- no Reason or LastTransitionTime) and `ObservedGeneration`. +- Top-level `lastTransitionTime` is set on every condition change. + +### Status Update Pattern +All controllers follow the same retry-on-conflict pattern: +1. Fetch current object from API server. +2. DeepCopy the changed status into the fetched object. +3. Call `StatusUpdate` (status subresource update). +4. Wrap in `retry.RetryOnConflict(retry.DefaultRetry, ...)`. + +## Defaults + +Use `+kubebuilder:default` for server-side defaulting: +- `logLevel`: 1 +- `mode`: `Disabled` +- `injectAnnotations`: `"false"` +- `certificateCheckInterval`: `"5m"` +- `certificateDuration`: `"8760h"` +- `certificateRenewBefore`: `"30m"` +- `revisionHistoryLimit`: 10 +- `networkPolicyProvisioning`: `Managed` +- `key` (ConfigMapKeyReference): `"ca-bundle.crt"` + +## Shared Types and Composition + +- `CommonConfigs` (in `meta.go`) is embedded via `json:",inline"` in both `ApplicationConfig` and `GlobalConfig`. It provides `logLevel`, `resources`, `affinity`, `tolerations`, `nodeSelector`, `proxy`. +- `ObjectReference`, `SecretReference`, `ConfigMapKeyReference`, `ConditionalStatus` are reusable building blocks in `meta.go`. +- Named types (`Mode`, `ManagementState`, `ComponentName`) with constants must be used instead of raw strings for enum fields. + +## Code Generation Pipeline + +After any API type change, run: +1. `make generate` -- regenerates `zz_generated.deepcopy.go` via controller-gen. +2. `make manifests` -- regenerates CRD YAMLs in `config/crd/bases/`, RBAC, and webhook configs. +3. `make update` -- full pipeline: generate + manifests + operand manifests + bindata + bundle + docs. +4. `make docs` -- regenerates `docs/api_reference.md` from `api/v1alpha1/` using `crd-ref-docs`. +5. `make verify` -- runs `check-git-diff` to ensure all generated files are committed. + +Never edit `zz_generated.deepcopy.go` or CRD YAML files by hand. + +## API Integration Tests + +API validation is tested via declarative YAML test suites in `api/v1alpha1/tests//`. Each `.testsuite.yaml` file defines `onCreate` and `onUpdate` test cases specifying `initial`, `expected`, and `expectedError` YAML. The test framework (in `test/apis/`) installs CRDs into a real envtest API server (requires Kube >= 1.25 for CEL). + +When adding a new CEL validation rule or field constraint, add corresponding test cases to the relevant `.testsuite.yaml` covering: +- Valid creation (initial + expected) +- Invalid creation (initial + expectedError with substring match) +- Immutability on update (initial + updated + expectedError) +- Boundary values (min/max lengths, min/max items) + +Run API tests with `make test-apis` or via the general `make test`. + +## Webhook Architecture + +This operator does NOT use kubebuilder-generated admission webhooks for its own CRDs. CEL-based CRD validation handles all admission validation. The webhook code in `pkg/controller/external_secrets/` manages the upstream external-secrets operand's `ValidatingWebhookConfiguration` resources (not the operator's own API validation). + +## Adding New API Fields Checklist + +1. Add the Go field to the appropriate struct in `api/v1alpha1/`. +2. Add validation markers (min/max, enum, XValidation) for spec fields; status fields typically omit validation. +3. Use pointer types for optional sub-structs; value types for required scalars with defaults. +4. Add `+optional` or `+required` marker. Use `json:"fieldName,omitempty"` for optional fields. +5. Exception: listMapKey fields must NOT have `omitempty`; add `//nolint:kubeapilinter` with reason. +6. Write test cases in the corresponding `.testsuite.yaml` for both valid and invalid inputs. +7. Run `make update && make verify` to regenerate and validate all artifacts. +8. If the field adds a new condition type or reason, add constants in `conditions.go`. diff --git a/docs/error-handling-guidelines.md b/docs/error-handling-guidelines.md new file mode 100644 index 000000000..2eb97326c --- /dev/null +++ b/docs/error-handling-guidelines.md @@ -0,0 +1,137 @@ +# Error Handling Guidelines + +## Custom Error Type: `ReconcileError` + +Resource reconciliation errors flow through the `ReconcileError` type defined in `pkg/controller/common/errors.go`. It carries a `Reason` (`ErrorReason` string) that determines requeue behavior, a human-readable `Message`, and the underlying `Err`. Setup errors (finalizers, client construction, CR fetching) use plain `fmt.Errorf` wrapping as described below. + +### Two Error Reasons + +| Reason | Constant | Requeue? | When to use | +|---|---|---|---| +| `IrrecoverableError` | `common.IrrecoverableError` | No | Invalid config, missing env vars, permission errors, bad requests | +| `RetryRequiredError` | `common.RetryRequiredError` | Yes (30s) | Transient network issues, resource conflicts, timeouts, not-found | + +### Constructor Functions + +- `common.NewIrrecoverableError(err, messageFmt, args...)` -- for errors that cannot be fixed by retrying. +- `common.NewRetryRequiredError(err, messageFmt, args...)` -- for transient errors worth retrying. +- `common.FromClientError(err, messageFmt, args...)` -- auto-classifies Kubernetes API errors: `Unauthorized`, `Forbidden`, `Invalid`, and `BadRequest` become irrecoverable; everything else becomes retry-required. +- All three return `nil` when the input `err` is `nil`. Always check this at call sites when the constructor is the last expression before return. + +### Checking Error Type + +Use `common.IsIrrecoverableError(err)` to check whether an error is irrecoverable. It uses `errors.As` to traverse wrapped error chains. There is no corresponding `IsRetryRequiredError` -- the convention is: if it is not irrecoverable and is a `ReconcileError`, it is retryable. + +## Error-to-Reconcile-Result Mapping + +The main reconcile dispatcher in `processReconcileRequest` (in `pkg/controller/external_secrets/controller.go`) maps errors to `ctrl.Result` as follows: + +``` +err == nil -> Result{}, nil (success, no requeue) +IsIrrecoverableError -> Result{}, errUpdate (no requeue; only status update error propagates) +retryable error -> Result{RequeueAfter: 30s}, nil (manual requeue, no error to controller-runtime) +status update failure -> Result{}, errUpdate (let controller-runtime handle backoff) +NotFound on primary CR -> Result{}, nil (skip reconciliation silently) +``` + +Key rule: never return both `RequeueAfter` and a non-nil error. For recoverable errors, return `RequeueAfter` with `nil` error. For irrecoverable errors, return empty `Result` so no requeue happens. + +The default requeue interval is `common.DefaultRequeueTime` (30 seconds), defined in `pkg/controller/common/constants.go`. + +## Wrapping Errors from Kubernetes API Calls + +For any Kubernetes client operation (Get, Create, Update, Delete, Patch, Exists), wrap the error with `common.FromClientError`: + +```go +if err := r.Create(r.ctx, obj); err != nil { + return common.FromClientError(err, "failed to create %s deployment resource", name) +} +``` + +This is the standard pattern across all resource reconciliation files (rbacs.go, deployments.go, services.go, secret.go, certificate.go, configmap.go, networkpolicy.go, validatingwebhook.go, serviceaccounts.go). `FromClientError` auto-classifies the API error so callers do not manually pick irrecoverable vs retryable. + +For non-API errors that are definitively unrecoverable (e.g., missing environment variable, invalid configuration, failed validation), use `common.NewIrrecoverableError` directly: + +```go +return common.NewIrrecoverableError( + fmt.Errorf("ENV_VAR not set"), + "failed to update image in %s deployment object", name, +) +``` + +## `fmt.Errorf` with `%w` for Non-Reconcile Errors + +For errors outside the reconcile-result classification path (setup, client construction, utility functions), use standard `fmt.Errorf` with `%w` wrapping: + +```go +return fmt.Errorf("failed to create uncached client: %w", err) +``` + +Do not wrap these in `ReconcileError` -- they propagate as plain errors up through controller setup or `Reconcile` return. + +## Retry Logic + +### `UpdateWithRetry` (Client-Level Retry) + +All resource updates that may hit conflicts use `r.UpdateWithRetry(ctx, obj)` instead of `r.Update(ctx, obj)`. This method (in `pkg/controller/client/client.go`) wraps `retry.RetryOnConflict(retry.DefaultRetry, ...)`, re-fetching the latest resource version before each attempt. + +Use `UpdateWithRetry` for updating managed resources (Deployments, RBAC, Services, etc.) and for finalizer updates. + +### `retry.RetryOnConflict` (Status Updates) + +Status subresource updates use `retry.RetryOnConflict(retry.DefaultRetry, ...)` directly. The pattern is: fetch latest, deep-copy desired status into it, then call `StatusUpdate`. This appears identically in all three controllers. + +### `retry.OnError` with Custom Predicate + +The ESM default resource creation (`pkg/controller/external_secrets_manager/externalsecretsmanager.go`) uses `retry.OnError` with a custom `shouldRetryOnError` predicate that stops retrying on `AlreadyExists`, `Conflict`, `Invalid`, `BadRequest`, `Unauthorized`, `Forbidden`, and `TooManyRequests`. + +## Status Condition Updates + +### Condition Types + +| Type | Defined In | Used By | +|---|---|---| +| `Degraded` | `api/v1alpha1/conditions.go` | external_secrets controller | +| `Ready` | `api/v1alpha1/conditions.go` | external_secrets controller | +| `UpdateAnnotation` | `api/v1alpha1/conditions.go` | crd_annotator controller | + +### Condition Update Rules + +1. On irrecoverable error: set `Degraded=True/Failed` and `Ready=False/Ready`. +2. On retryable error: set `Degraded=False/Ready` and `Ready=False/Progressing` with the error message. +3. On success: set `Degraded=False/Ready` and `Ready=True/Ready`. +4. Set both conditions atomically via `apimeta.SetStatusCondition` before calling `updateStatus`. +5. Only call `updateStatus` when `SetStatusCondition` returns `true` (condition actually changed). +6. Always include `ObservedGeneration` from the CR's current `.metadata.generation`. + +### Error Aggregation on Status Update Failure + +When the status update itself fails alongside a reconciliation error, aggregate both using `utilerrors.NewAggregate([]error{err, errUpdate})`. This pattern exists in both the external_secrets and crd_annotator controllers. + +## Kubernetes Event Recording + +- Use `corev1.EventTypeNormal` + reason `"Reconciled"` for successful create/update of resources. +- Use `corev1.EventTypeWarning` + reason `"ResourceAlreadyExists"` when a resource exists during initial creation reconciliation. +- Use `corev1.EventTypeWarning` + reason `"RemoveDeployment"` on CR deletion. +- Use `corev1.EventTypeWarning` + reason `"Read"` on fetch failures (ESM controller). +- Event messages must include the resource name (namespace/name format). + +## Logging Conventions + +- `r.log.V(1).Info(...)` -- operational events (reconcile start, resource modifications, label/annotation changes). +- `r.log.V(4).Info(...)` -- detailed debug (resource already in expected state, status update attempts, metadata builds). +- `r.log.Error(err, "message")` -- always log the error at the point it occurs in `reconcileExternalSecretsDeployment`, before returning it upward. The top-level `processReconcileRequest` also logs the error again with `r.log.Error(err, "failed to reconcile external-secrets deployment")`. +- `ctrl.Log.V(1).WithName("subsystem")` -- for setup-time logging outside reconciler instances (cache setup, CRD discovery). +- Use structured key-value pairs (`"name"`, `"request"`, `"key"`, `"namespace"`, `"error"`, `"installed"`), never string interpolation in log messages. + +## `Exists` Helper + +The `CtrlClient.Exists(ctx, key, obj)` method converts `NotFound` errors to `(false, nil)`, passing all other errors through. Use this instead of raw `Get` + manual `IsNotFound` checks when you only need existence. + +## Panics (Decode Helpers Only) + +The `Decode*ObjBytes` functions in `pkg/controller/common/utils.go` panic on decode failure or type assertion failure. These are called only with compile-time-known static assets (`assets.MustAsset`), so panics indicate a build-time bug, not a runtime error. Never use panic for runtime error handling elsewhere. + +## NotFound Handling in Reconcile + +When the primary CR (`ExternalSecretsConfig` or `ExternalSecretsManager`) is not found, return `ctrl.Result{}, nil` -- do not requeue. Log at V(1) and skip reconciliation. When a secondary/dependent CR is not found, either skip gracefully (ESM looking for ESC) or wrap in `fmt.Errorf` and return for requeue. diff --git a/docs/integration-guidelines.md b/docs/integration-guidelines.md new file mode 100644 index 000000000..2f9692502 --- /dev/null +++ b/docs/integration-guidelines.md @@ -0,0 +1,95 @@ +# Integration Guidelines + +## Architecture Overview + +This operator does NOT embed or fork the upstream external-secrets project. It manages the upstream operand as a set of static YAML manifests (bindata) that are decoded at runtime, mutated with operator-controlled labels/annotations/env/images, and applied to the cluster. There is no Helm chart integration; resource creation is fully imperative via controller-runtime `Create`/`Update`. + +## CRD Hierarchy and Singleton Pattern + +- **ExternalSecretsConfig** (`externalsecretsconfigs.operator.openshift.io`): primary CR, singleton named `cluster`. Controls operand installation, cert-manager toggle, Bitwarden plugin, proxy, network policies, and per-component overrides. +- **ExternalSecretsManager** (`externalsecretsmanagers.operator.openshift.io`): global config CR, also singleton named `cluster`. Provides lower-priority defaults (labels, resources, proxy, tolerations, affinity, nodeSelector, logLevel) that ExternalSecretsConfig overrides. +- Precedence chain for shared fields: `ExternalSecretsConfig > ExternalSecretsManager > OLM environment variables` (proxy only). + +## Static Manifest (Bindata) Pattern + +All operand Kubernetes resources live as YAML in `bindata/external-secrets/` and `bindata/external-secrets/resources/`. They are compiled into Go via `pkg/operator/assets/bindata.go` (generated with `go-bindata`). At reconcile time: +1. Decode bytes with typed decoders in `pkg/controller/common/utils.go` (e.g., `DecodeDeploymentObjBytes`, `DecodeServiceObjBytes`). +2. Mutate the decoded object: set namespace, apply labels/annotations via `common.ApplyResourceMetadata`, update container images/args, inject proxy env vars, mount CA bundles. +3. Check existence with `r.Exists()`, compare with `common.HasObjectChanged()`, then `Create` or `UpdateWithRetry`. + +When adding a new resource: add the YAML to `bindata/`, add a constant in `constants.go` mapping the asset path, regenerate bindata, and follow the existing `createOrApply*` pattern. + +## Reconciliation Order + +Resources are created in strict dependency order within `reconcileExternalSecretsDeployment`: +Namespace -> NetworkPolicies -> ServiceAccounts -> Certificates -> Secrets -> TrustedCA ConfigMap -> RBAC -> Services -> Deployments -> ValidatingWebhooks -> CR annotation tracking. + +Never reorder. CR annotation tracking (managed-annotations) is patched last to ensure obsolete annotations are removed from resources before tracking is updated. + +## Conditional Resource Creation + +Many resources are conditionally created based on CR spec. The pattern uses a slice of `{assetName, condition}` structs: +- **cert-controller deployment/service**: created when cert-manager is NOT enabled (`!isCertManagerConfigEnabled(esc)`) +- **bitwarden deployment/service/certificate/network-policy**: created when Bitwarden IS enabled (`isBitwardenConfigEnabled(esc)`) +- **webhook TLS secret**: created only when cert-manager is NOT enabled (cert-manager manages the secret otherwise, named `external-secrets-webhook-cm` to avoid collision) + +## cert-manager Integration + +- cert-manager is an optional dependency detected at startup via CRD discovery (`isCRDInstalled` checks `cert-manager.io/v1` `certificates` resource). +- When detected, the Certificate CRD informer is registered with the manager cache. When not detected, the cache omits it entirely to prevent startup failures. +- The `CertManagerConfig` in ExternalSecretsConfig spec requires an `issuerRef` (Issuer or ClusterIssuer). The operator validates the issuer exists via `assertIssuerRefExists` using the uncached client before creating Certificate resources. +- When cert-manager is enabled with `injectAnnotations: "true"`, the `crd-annotator` controller patches all ESO CRDs with `cert-manager.io/inject-ca-from: external-secrets/external-secrets-webhook`. +- Webhook ValidatingWebhookConfigurations get the `cert-manager.io/inject-ca-from` annotation via `withCertManagerAnnotation()`, tracked in managed-annotations for proper lifecycle. + +## OpenShift Platform Integrations + +- **Trusted CA bundle**: when proxy is configured, a ConfigMap `external-secrets-trusted-ca-bundle` is created with label `config.openshift.io/inject-trusted-cabundle: "true"`. OpenShift's Cluster Network Operator (CNO) injects cluster CA certs into it. The ConfigMap data is owned by CNO; the operator only manages labels/annotations. Mounted at `/etc/pki/tls/certs` in all containers. +- **Proxy**: both uppercase and lowercase variants (`HTTP_PROXY`/`http_proxy`, etc.) are set on all containers and init containers. Proxy is removed cleanly when config is cleared. +- **Network policies**: default deny-all is always applied. Static allow policies for API server egress, webhook, DNS, and optionally cert-controller/bitwarden are applied from bindata. Users add custom egress-only policies via `controllerConfig.networkPolicies` with component targeting. +- **Security context**: all containers get hardened `SecurityContext` (non-root, read-only root FS, drop ALL capabilities, seccomp RuntimeDefault). +- **Console capability annotation**: ConsoleYAMLSample resources require `capability.openshift.io/name: Console` annotation. + +## Bitwarden Plugin Integration + +- Bitwarden is the only currently supported plugin. Enabled via `spec.plugins.bitwardenSecretManagerProvider.mode: Enabled`. +- Requires either a `secretRef` (pre-existing TLS secret with keys `tls.crt`, `tls.key`, `ca.crt`) or cert-manager configuration for automated TLS. +- The bitwarden-sdk-server runs as a separate deployment in the operand namespace. Its image comes from env var `RELATED_IMAGE_BITWARDEN_SDK_SERVER`. +- Network policies must explicitly allow egress from the core controller to bitwarden-sdk-server on port 9998. +- ClusterSecretStore for Bitwarden requires: `bitwardenServerSDKURL`, `caBundle` (base64 CA cert), `organizationID`, `projectID`, and auth `secretRef.credentials` pointing to an access token secret. + +## Image Management + +Container images in static YAML manifests (`bindata/`) have placeholder values that are replaced at runtime. The reconciler reads images from environment variables set on the operator pod (typically by OLM/CSV): +- `RELATED_IMAGE_EXTERNAL_SECRETS`: external-secrets operand image +- `RELATED_IMAGE_BITWARDEN_SDK_SERVER`: bitwarden-sdk-server image +- `OPERAND_EXTERNAL_SECRETS_IMAGE_VERSION`: version label for operand resources +- `BITWARDEN_SDK_SERVER_IMAGE_VERSION`: version label for bitwarden resources +- `OPERATOR_IMAGE_VERSION`: operator version + +The reconciler returns an irrecoverable error if image env vars are empty. + +## Client Architecture + +- **Cached client** (`r.CtrlClient`): reads from manager cache filtered by `app=external-secrets` label selector. Used for all managed resources. +- **Uncached client** (`r.UncachedClient`): reads directly from API server. Used for objects NOT managed by the controller (issuer validation, secret ref existence checks). +- **UpdateWithRetry**: all mutations use `UpdateWithRetry` which fetches latest resourceVersion before update, wrapped in `retry.RetryOnConflict`. + +## Label and Annotation Conventions + +- All managed resources carry `app: external-secrets` label (cache filter key). +- Default labels: `app`, `app.kubernetes.io/version`, `app.kubernetes.io/managed-by: external-secrets-operator`, `app.kubernetes.io/part-of: external-secrets-operator`. +- Disallowed label prefixes (rejected silently): `app.kubernetes.io/`, `external-secrets.io/`, `rbac.authorization.k8s.io/`, `servicebinding.io/controller`, `app`. +- Disallowed annotation domain prefixes (rejected by CRD CEL validation): `kubernetes.io/`, `k8s.io/`, `openshift.io/`, `cert-manager.io/`. +- Managed annotation keys are tracked in a base64-encoded JSON array on the CR annotation `externalsecretsconfig.operator.openshift.io/managed-annotations`. + +## Adding a New Provider Plugin + +Follow the Bitwarden pattern: +1. Add new `Mode`-gated fields to `PluginsConfig` in `api/v1alpha1/external_secrets_config_types.go`. +2. Add static YAML manifests (deployment, service, service account, network policy) to `bindata/external-secrets/resources/`. +3. Add asset name constants to `constants.go`. +4. Add conditional entries to `createOrApplyDeployments`, `createOrApplyServices`, `createOrApplyServiceAccounts`, `createOrApplyNetworkPolicies` using the `{assetName, condition}` struct pattern. +5. Add image env var constants and wire them into `getDeploymentObject`. +6. Register the new component name in the `ComponentName` enum for network policy targeting. +7. Add the container name mapping in `getComponentNameFromAsset`. +8. Add e2e tests with appropriate Ginkgo labels and credential secret conventions. diff --git a/docs/performance-guidelines.md b/docs/performance-guidelines.md new file mode 100644 index 000000000..cbd4c439c --- /dev/null +++ b/docs/performance-guidelines.md @@ -0,0 +1,92 @@ +# Performance Guidelines + +## Cache Architecture + +### Manager-Level Label-Filtered Cache +The operator uses a single manager-level cache with per-object label selectors configured via `NewCacheBuilder()` in `pkg/controller/external_secrets/controller.go`. All controller-managed resources (Deployments, Services, RBAC, etc.) are filtered by `app=external-secrets` label. Own CRs (`ExternalSecretsConfig`, `ExternalSecretsManager`) are cached without label filters. + +- When adding a new watched resource type, register it in both `controllerManagedResources` (line ~68) and `buildCacheObjectList()` with the appropriate label selector. +- Never create a second manager-level cache for the main controller. The `crd_annotator` controller is the only one that builds its own custom cache (`BuildCustomClient`), because it watches a disjoint resource set (`CustomResourceDefinition` with different label selectors). +- Set `ReaderFailOnMissingInformer: true` on custom caches (as `crd_annotator` does) to fail fast on missing informers rather than silently returning empty results. + +### Uncached Client Usage +The operator maintains a separate uncached `CtrlClient` (`r.UncachedClient`) for objects not tracked by the cache, specifically cert-manager Issuers/ClusterIssuers and user-provided Secrets (Bitwarden secretRef). Use uncached reads only for: +- Validating existence of external resources (e.g., `assertIssuerRefExists`, `assertSecretRefExists`) +- One-time bootstrap operations (e.g., `CreateDefaultESMResource` in `setup_manager.go`) + +Never use the uncached client for objects in `controllerManagedResources` -- the cache is designed for those. + +### Conditional CRD Cache Registration +Before creating the cache, `NewCacheBuilder` checks whether the cert-manager CRD exists via `isCRDInstalled`. If absent, `certmanagerv1.Certificate` is excluded from the cache object list to prevent startup failure. When the CRD exists, an informer is explicitly registered via `mgr.GetCache().GetInformer()`. Follow this pattern for any future optional CRD dependencies. + +## Watch and Predicate Patterns + +### Predicate Strategy by Resource Type +The controller uses different predicate combinations per resource type to minimize reconciliation: + +| Resource | Watch Method | Predicates | Rationale | +|---|---|---|---| +| ExternalSecretsConfig (primary) | `For()` | `GenerationChangedPredicate` | Ignore status-only updates | +| Deployments | `Watches()` | `GenerationChangedPredicate` + `managedResources` | Skip status/replica updates | +| Secrets | `WatchesMetadata()` | `LabelChangedPredicate` | Avoid caching Secret data in memory | +| Other managed resources | `Watches()` | `managedResources` (label filter) | Only watch operator-owned objects | +| ExternalSecretsManager | `Watches()` | `GenerationChangedPredicate` + `managedResources` | Skip status-only changes | +| CRDs (crd_annotator) | `WatchesMetadata()` | `AnnotationChangedPredicate` + label filter | Only metadata matters | + +When adding new watched resources, always apply at minimum the `managedResources` predicate to avoid reconciling unrelated objects. Use `WatchesMetadata()` when only labels/annotations matter -- it avoids fetching full object bodies. + +### Map Function Convention +All controllers map events to a single reconcile key: the CR name (`common.ExternalSecretsConfigObjectName` = "cluster"). The map function checks `obj.GetLabels()[requestEnqueueLabelKey] == requestEnqueueLabelValue` and returns an empty slice for non-matching objects. Never enqueue multiple requests from a single event. + +## Reconciliation Patterns + +### Error Classification and Requeue Strategy +The operator classifies errors into two categories via `common.ReconcileError`: + +- **IrrecoverableError**: Config validation failures, permission errors, bad requests. Returns `ctrl.Result{}` with no requeue. Examples: missing CRD, invalid cert-manager config, missing env var. +- **RetryRequiredError**: Transient API failures, conflicts. Requeues with `DefaultRequeueTime` (30s). The helper `common.FromClientError()` auto-classifies based on API error type. + +Convention: never return both a `RequeueAfter` result and an error simultaneously. Return the error alone and let controller-runtime handle backoff, or return `RequeueAfter` with nil error. + +### Condition Update Optimization +Status conditions are updated only when they actually change. The pattern uses `apimeta.SetStatusCondition()` which returns a boolean, and the controller checks both `degradedChanged` and `readyChanged` before issuing a status update call. Follow this pattern to avoid unnecessary API writes. + +### CR Annotation Patching +After all resources are reconciled, CR annotations (managed-annotations tracking + processed annotation) are updated via a single `MergePatch` rather than a full object update. This reduces conflict risk and avoids overwriting spec changes made concurrently. + +## Retry and Conflict Handling + +### UpdateWithRetry Pattern +Regular object updates (non-status) use `retry.RetryOnConflict(retry.DefaultRetry, ...)`. The `UpdateWithRetry` method on `CtrlClient` follows the read-modify-write pattern: it fetches the latest `ResourceVersion`, sets it on the object, then updates. Use `UpdateWithRetry` instead of bare `Update` for any object that might be modified concurrently. + +### Status Update Pattern +Status subresource updates use `retry.RetryOnConflict(retry.DefaultRetry, ...)` directly. The pattern is: fetch latest, deep-copy desired status into it, then call `StatusUpdate`. This appears identically in all three controllers. + +### Bootstrap Resource Creation +`CreateDefaultESMResource` uses `retry.OnError` with a custom `shouldRetryOnError` function that stops on permanent errors (AlreadyExists, Conflict, Invalid, BadRequest, Unauthorized, Forbidden, TooManyRequests) and retries on transient ones. Follow this pattern for one-time resource creation at startup. + +## Concurrency Primitives + +### Resettable sync.Once (`common.Now`) +The `Now` type in `pkg/controller/common/utils.go` extends `sync.Once` with a `Reset()` method using `sync.Mutex` + `atomic.Uint32` with double-checked locking. It is used in the `external_secrets_manager` controller to emit a warning event only once per error cycle, resetting when the error resolves. Use this type when you need one-shot behavior that can be re-armed. + +No goroutines are spawned directly by controller code -- all concurrency is handled by the controller-runtime framework. Do not introduce raw goroutines in reconcile loops. + +## Drift Detection + +### HasObjectChanged Pattern +The `common.HasObjectChanged()` function uses type-specific field comparison (not full `reflect.DeepEqual` on the entire object) to detect drift. Each resource type has a dedicated comparison: +- Deployments: compares replicas, containers, volumes, affinity, tolerations, node selector, env vars individually +- RBAC: compares only Rules (Roles) or RoleRef+Subjects (Bindings) +- Services: compares Type, Ports, Selector +- Webhooks: compares individual webhook entries by name + +Annotations are compared using managed-key tracking (`annotationMapsModified`) which only checks keys the operator manages, ignoring annotations set by external controllers (e.g., `deployment.kubernetes.io/revision`). This prevents infinite reconcile loops. + +When adding a new managed resource type, add a case to the `HasObjectChanged` switch and implement field-level comparison rather than using full `DeepEqual` on the entire spec. + +## Asset Decoding +Static manifests are embedded via go-bindata (`pkg/operator/assets/bindata.go`) and decoded at reconcile time using typed `Decode*ObjBytes` helpers. Each decode call allocates a new object. These helpers `panic` on decode failure since the assets are build-time constants. Do not add error handling around these -- a panic here indicates a build or manifest corruption problem. + +## E2E Test Polling +E2E tests use `wait.PollUntilContextTimeout` with 5-second intervals and 2-minute default timeouts. Bitwarden-related tests use 4-minute timeouts due to SDK server startup latency. Follow these conventions for new e2e tests rather than using arbitrary sleep durations. diff --git a/docs/security-guidelines.md b/docs/security-guidelines.md new file mode 100644 index 000000000..af391d8a1 --- /dev/null +++ b/docs/security-guidelines.md @@ -0,0 +1,125 @@ +# Security Guidelines + +## Container Security Context (Hardened Defaults) + +All operand containers are hardened programmatically via `updateContainerSecurityContext()` in `pkg/controller/external_secrets/deployments.go`. This function is called for every container (controller, webhook, cert-controller, bitwarden-sdk-server) during deployment reconciliation. The enforced settings are: + +- `AllowPrivilegeEscalation: false` +- `Capabilities.Drop: ["ALL"]` +- `ReadOnlyRootFilesystem: true` +- `RunAsNonRoot: true` +- `RunAsUser: nil` (defers to pod-level or image default; static manifests use `1000`) +- `SeccompProfile.Type: RuntimeDefault` + +When adding a new container or deployment, always call `updateContainerSecurityContext(&container)` on the container spec. Never set `RunAsUser` to `0` or add capabilities. The reconciler drift-detects security context changes and reverts them. + +Static deployment manifests in `bindata/external-secrets/resources/` also set `hostNetwork: false` explicitly. Never change this. + +## Container Image (Non-root) + +All Dockerfiles (`Dockerfile`, `images/ci/Dockerfile`, `images/ci/operand.Dockerfile`) run as `USER 65534:65534` (nobody). Never change the USER to root or remove this line. The base image is `ubi9-minimal`. + +## HTTP/2 Disabled by Default + +In `cmd/external-secrets-operator/main.go`, HTTP/2 is disabled for both metrics and webhook servers to mitigate known HTTP/2 vulnerabilities. The `--enable-http2` flag defaults to `false`. The implementation sets `c.NextProtos = []string{"http/1.1"}` on the TLS config. Do not change this default. + +## Metrics Endpoint Security + +The metrics server uses `filters.WithAuthenticationAndAuthorization` as `FilterProvider`, enforcing authn/authz on the metrics endpoint. Secure metrics serving (HTTPS on `:8443`) is enabled by default (`--metrics-secure=true`). The OpenShift service CA (`/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt`) is loaded for client verification. When modifying metrics configuration, preserve these protections. + +## RBAC Architecture (Three Tiers) + +The repo manages RBAC at three distinct levels. Changing one level does not affect the others. + +**1. Operator's own ClusterRole** (`config/rbac/role.yaml`): Grants permissions the operator process needs. Generated from `+kubebuilder:rbac` markers in `pkg/controller/external_secrets/controller.go`. When adding new API interactions, add markers there and run `make manifests`. + +**2. Operand ClusterRoles** (static YAML in `bindata/external-secrets/resources/`): +- `clusterrole_external-secrets-controller.yml` -- main controller (secrets CRUD, ESO CRDs) +- `clusterrole_external-secrets-cert-controller.yml` -- cert-controller (webhook configs, secrets for TLS, CRDs) +- `clusterrole_external-secrets-view.yml` -- aggregated to `view`/`edit`/`admin` (read-only ESO resources) +- `clusterrole_external-secrets-edit.yml` -- aggregated to `edit`/`admin` (mutate ESO resources) +- `clusterrole_external-secrets-servicebindings.yml` -- service bindings integration (read-only ESO resources with servicebinding.io/controller label) + +The cert-controller role uses `resourceNames` restrictions on webhook configurations (`secretstore-validate`, `externalsecret-validate`) for its update permissions. Follow this pattern of scoping when adding new permissions. + +**3. Namespace-scoped roles**: `role_external-secrets-leaderelection.yml` restricts leader election to the operand namespace. + +Key convention: The controller ClusterRole grants `secrets` verbs `get/list/watch/create/update/delete/patch`. This is inherently powerful. Never broaden this (e.g., never add `*` verbs or `*` resources). + +## RBAC Label Protection + +The regex `disallowedLabelMatcher` in `install_external_secrets.go` prevents users from overriding internal labels via `ControllerConfig.Labels`: +``` +^app.kubernetes.io\/|^external-secrets.io\/|^rbac.authorization.k8s.io\/|^servicebinding.io\/controller$|^app$ +``` +When adding new internal labels, add them to this regex. + +## Annotation Domain Restrictions + +The CRD validation (CEL rules in `external_secrets_config_types.go`) blocks user annotations with reserved domain prefixes: `kubernetes.io/`, `openshift.io/`, `k8s.io/`, `cert-manager.io/`. These rules are enforced at admission time. Do not weaken them. When the operator adds its own annotations (e.g., `cert-manager.io/inject-ca-from`), it does so programmatically, bypassing user-facing validation. + +## Environment Variable Reservation + +The `ComponentConfig.OverrideEnv` field uses CEL validation to block reserved env var names/prefixes: `KUBERNETES_*`, `EXTERNAL_SECRETS_*`, `HOSTNAME`, `SSL_CERT_DIR`, `SSL_CERT_FILE`. When adding new operator-managed env vars that users should not override, add them to the CEL rule in the CRD type definition. + +## TLS Certificate Management (Dual Mode) + +Two certificate provisioning paths exist, controlled by `spec.controllerConfig.certProvider.certManager.mode`: + +**Built-in cert-controller** (default): A separate deployment (`external-secrets-cert-controller`) manages TLS certificates for the webhook. The cert-controller has its own RBAC and creates/updates the `external-secrets-webhook` Secret. The webhook reads certs from `/tmp/certs` volume mount. + +**cert-manager integration**: When enabled, `Certificate` resources are created from templates in `bindata/`. The `issuerRef` is validated to exist before creating the Certificate (via `assertIssuerRefExists`). The webhook secret name changes to `external-secrets-webhook-cm` to avoid clashing with the cert-controller secret. The `cert-manager.io/inject-ca-from` annotation is conditionally added to webhook configurations. + +Key constraint: `certManager.mode` and `issuerRef` are immutable once set (enforced via `XValidation:rule="self == oldSelf"`). + +## Bitwarden TLS Requirements + +When `BitwardenSecretManagerProvider` is enabled, TLS certificates are mandatory. Users must provide either: +- A `secretRef` pointing to a K8s Secret with keys `tls.crt`, `tls.key`, `ca.crt` (validated via `assertSecretRefExists` using an uncached client) +- cert-manager configuration to auto-generate certificates + +This is enforced at the CRD level via CEL and in controller code. The bitwarden deployment mounts certs at `/certs` and probes use `scheme: HTTPS`. + +## Network Policy Architecture (Default-Deny) + +The operator deploys a **deny-all** base NetworkPolicy (`networkpolicy_deny-all.yaml`) that blocks all ingress and egress for all pods in the operand namespace. Specific allow-policies are then layered: + +- `allow-api-server-egress-for-main-controller-traffic` -- egress to API server (port 6443) +- `allow-api-server-and-webhook-traffic` -- egress to API server + ingress on webhook port 10250 and metrics port 8080 (from monitoring namespace only) +- `allow-api-server-egress-for-cert-controller-traffic` -- only when cert-controller is active +- `allow-api-server-egress-for-bitwarden-sever` -- only when Bitwarden is enabled (note: file has typo "sever" instead of "server") +- `allow-dns` -- DNS egress + +When adding new components or network requirements, follow this pattern: add a conditional static policy in `bindata/` and register it in `createOrApplyStaticNetworkPolicies()` with the appropriate condition. + +User-defined network policies via `spec.controllerConfig.networkPolicies` are prefixed with `eso-user-` and restricted to `Egress` policy type only. The operator auto-selects pods via component label. + +## Webhook Security Configuration + +Validating webhooks use: +- `failurePolicy: Fail` (on the ExternalSecret webhook) -- rejecting requests if the webhook is unavailable +- `sideEffects: None` +- `timeoutSeconds: 5` +- Webhook listens on port `10250` (not the default 443) + +The SecretStore webhook does not explicitly set `failurePolicy`, which defaults to `Fail` in Kubernetes. Maintain `Fail` for security-critical validations. + +## Singleton Pattern (Cluster-Scoped CRs) + +Both `ExternalSecretsConfig` and `ExternalSecretsManager` are singletons enforced via CEL: `self.metadata.name == 'cluster'`. This prevents privilege escalation through multiple conflicting configurations. Never remove this validation. + +## Trusted CA Bundle Handling + +When proxy configuration is present, a ConfigMap labeled `config.openshift.io/inject-trusted-cabundle: "true"` is created. The Cluster Network Operator (CNO) populates it with cluster-wide CA certificates. The operator mounts this at `/etc/pki/tls/certs` (Go's default cert path) as a read-only volume. The operator never writes CA data directly -- it only manages the label and lets CNO handle the content. + +## Sensitive Data in Tests + +E2E tests reference AWS credentials via a well-known Secret (`aws-creds` in `kube-system`). Credential keys are `aws_access_key_id` and `aws_secret_access_key`. These are read from the cluster, never hardcoded. When adding e2e tests for new providers, follow this pattern: reference credentials from pre-existing cluster Secrets, never embed them in test code or YAML fixtures. + +## Reconciler Drift Detection + +The operator reconciles all managed resources back to desired state. RBAC rules, deployments, webhook configurations, and network policies are compared field-by-field (via `HasObjectChanged` in `pkg/controller/common/utils.go`). If any resource is externally modified (e.g., someone manually adds permissions to a ClusterRole), the operator detects the drift and reverts it. This is a critical security property -- do not disable drift detection for security-sensitive resources. + +## Error Classification for Security + +`FromClientError` in `pkg/controller/common/errors.go` classifies `Unauthorized` and `Forbidden` API errors as `IrrecoverableError`, meaning the operator will not retry. This prevents retry storms against permission boundaries. Maintain this classification when adding new API interactions. diff --git a/docs/testing-guidelines.md b/docs/testing-guidelines.md new file mode 100644 index 000000000..270acef99 --- /dev/null +++ b/docs/testing-guidelines.md @@ -0,0 +1,124 @@ +# Testing Guidelines + +## Test Architecture + +This repo has three test tiers, each in a separate Go module boundary (the repo uses `go.work` with modules at `.`, `./test`, `./cmd/external-secrets-operator`, `./tools`): + +| Tier | Location | Make Target | Framework | Cluster Required | +|------|----------|-------------|-----------|-----------------| +| Unit | `pkg/**/*_test.go` | `make test-unit` | stdlib `testing` + table-driven | No | +| API Integration | `test/apis/` | `make test-apis` | Ginkgo/Gomega + envtest | No (envtest) | +| E2E | `test/e2e/` | `make test-e2e` | Ginkgo/Gomega + live cluster | Yes | + +`make test` runs both `test-apis` and `test-unit` (no cluster needed). + +## Unit Tests (`pkg/`) + +### Conventions +- Files live alongside source in the same package (e.g., `deployments.go` / `deployments_test.go`). +- Use stdlib `testing` package exclusively; no Ginkgo. +- Follow table-driven test pattern: define a `tests` slice of structs with `name`, `preReq`, `wantErr`, and optional validation funcs, then iterate with `t.Run`. +- No build tags; unit tests compile under normal `go test`. + +### Fake Client Pattern +- Use `pkg/controller/client/fakes/FakeCtrlClient` (generated by counterfeiter) to mock the controller-runtime client. +- Set up behavior per test via `mock.ExistsCalls(...)`, `mock.CreateCalls(...)`, `mock.UpdateWithRetryCalls(...)`, etc. +- Assign mock to `r.CtrlClient = mock` on the test reconciler. + +### Test Helpers +- `pkg/controller/external_secrets/test_utils.go` provides `testReconciler(t)`, `testDeployment(name)`, `testService(name)`, `testResourceMetadata(esc)`, and typed decode helpers (`testClusterRole`, `testSecret`, etc.). +- `pkg/controller/commontest/utils.go` provides `TestExternalSecretsConfig()`, `TestExternalSecretsManager()`, `ErrTestClient`, and standard test constants (`TestExternalSecretsImageName`, `TestBitwardenImageName`, `TestExternalSecretsNamespace`). + +### Error Assertion +- Compare exact error strings: `if err == nil || err.Error() != tt.wantErr`. +- Use `t.Setenv()` for environment variables needed by the code under test (e.g., `RELATED_IMAGE_EXTERNAL_SECRETS`). + +### Adding a New Unit Test +1. Create or edit `*_test.go` next to the source file, same package. +2. Define a table-driven test with `name`, mock setup (`preReq`), optional input mutation (`updateExternalSecretsConfig`), and `wantErr` string. +3. Use `testReconciler(t)` for the reconciler, `fakes.FakeCtrlClient{}` for the client. +4. Assert with `t.Errorf`; avoid external assertion libraries in unit tests. + +## API Integration Tests (`test/apis/`) + +### Data-Driven Test Suites +- Tests are defined in YAML files at `api/v1alpha1/tests//.testsuite.yaml`. +- Each file specifies `crdName`, `tests.onCreate` (create + validate or expect error), and `tests.onUpdate` (create, update, validate or expect error). +- The Go code in `test/apis/generator.go` auto-generates Ginkgo specs from these YAML files using `DescribeTable`. + +### envtest Setup +- Suite file: `test/apis/suite_test.go`. +- Bootstraps `envtest.Environment` with CRDs from `config/crd/bases/`. +- Uses Kubernetes API version >= 1.25 (required for CEL validation). +- Uses `komega` for object comparison with `IgnoreAutogeneratedMetadata`. + +### Running +- `make test-apis` invokes `hack/test-apis.sh`, which runs Ginkgo with `--randomize-all --randomize-suites --keep-going --timeout=30m`. +- In CI (`OPENSHIFT_CI=true`), JUnit and coverage artifacts are written to `ARTIFACT_DIR`. + +### Adding a New API Test +1. Create or edit a `.testsuite.yaml` file under `api/v1alpha1/tests//`. +2. Define `onCreate` entries with `initial` YAML (input) and either `expected` YAML (success) or `expectedError` (failure substring). +3. Define `onUpdate` entries with `initial`, `updated`, `expected`/`expectedError`/`expectedStatusError`. Optionally use `initialCRDPatches` for ratcheting tests. +4. The generator picks them up automatically; no Go code changes needed. + +## E2E Tests (`test/e2e/`) + +### Build Tag +- All files in `test/e2e/` and `test/utils/` require `//go:build e2e`. The Makefile passes `-tags e2e` via `go test -C test`. + +### Label-Based Platform Selection +Tests are filtered by Ginkgo labels. The default filter is `"Platform: isSubsetOf {AWS}"`. + +| Label | Scope | Required Secrets | +|-------|-------|-----------------| +| `Platform:AWS` | AWS SecretStore + ExternalSecret + PushSecret | `aws-creds` in `kube-system` | +| `CrossPlatform:GCP-AWS` | GCP cluster using AWS Secrets Manager | `aws-creds` in `kube-system` | +| `Provider:Bitwarden` | Bitwarden ESO provider via ClusterSecretStore | `bitwarden-creds` in `external-secrets-operator` | +| `API:Bitwarden` | Direct HTTP tests to bitwarden-sdk-server | `bitwarden-creds` in `external-secrets-operator` | + +Run a specific suite: `make test-e2e E2E_GINKGO_LABEL_FILTER="Provider:Bitwarden"`. +Run all: `make test-e2e E2E_GINKGO_LABEL_FILTER=""`. + +### E2E Structure +- Suite setup (`e2e_suite_test.go`): initializes K8s clients (`kubernetes.Clientset`, `dynamic.DynamicClient`, `controller-runtime client.Client`), sets timeout with 5-minute cleanup buffer, configures JSON/JUnit reports. +- `BeforeAll`: creates test namespace (prefix `external-secrets-e2e-test-`), creates/verifies `ExternalSecretsConfig` CR, waits for operator/operand pods. +- `BeforeEach`: verifies operand pods are ready. +- `AfterEach`: on failure, dumps artifacts via `utils.DumpE2EArtifacts()` to `ARTIFACT_DIR` or `_output/`. +- `AfterAll`: deletes test namespace and cluster CR. + +### Test Data +- YAML fixtures embedded via `//go:embed testdata/*` in `e2e_test.go`. +- Pattern substitution in YAML via `utils.ReplacePatternInAsset("${PATTERN}", "value")`. +- Fixtures live in `test/e2e/testdata/`. + +### E2E Test Utilities (`test/utils/`) +Key helpers (all require `e2e` build tag): +- `DynamicResourceLoader`: create/delete K8s resources from YAML files or `Unstructured` objects. +- `VerifyPodsReadyByPrefix(ctx, clientset, namespace, prefixes)`: poll until pods with given name prefixes are Running + Ready. +- `WaitForESOResourceReady(ctx, dynamicClient, gvr, namespace, name, timeout)`: poll until ESO resource has `Ready=True` condition. +- `WaitForExternalSecretsConfigReady(ctx, dynamicClient, name, timeout)`: wait for both `Ready=True` and `Degraded=False`. +- `CleanupESOOperandAndRelated(ctx, cfg)`: best-effort cleanup of operand CRs, webhooks, RBAC, namespace. +- `GetRandomString(n)`: random alphanumeric string for unique resource names. + +### Adding a New E2E Test +1. Add `//go:build e2e` at the top of any new file. +2. Place test in `test/e2e/` package; use existing `suiteClientset`, `suiteDynamicClient`, `suiteRuntimeClient`. +3. Apply a Ginkgo `Label(...)` to the `Context` or `Describe` for platform filtering. +4. Use `Ordered` for tests with shared state dependencies. +5. Create test namespace with `GenerateName: testNamespacePrefix`; clean up in `AfterAll`. +6. Use `Eventually(...).Should(Succeed())` for async assertions with polling. +7. Use `defer loader.DeleteFromFile(...)` for resource cleanup. +8. Add YAML fixtures to `test/e2e/testdata/`. + +## Coverage + +- `make test-unit` writes `cover.out` (Go coverage profile) to the repo root. +- E2E coverage uses a special instrumented image: `make docker-build-coverage` builds from `images/ci/Dockerfile.coverage`. `make e2e-coverage-collect` gathers profiles from the running operator and optionally uploads to Codecov. + +## CI Notes + +- `GOFLAGS` is cleared for test targets (line: `fmt vet test test-unit test-e2e run update-vendor update-dep: GOFLAGS=`) to avoid conflicts with `go.work`. +- envtest uses Kubernetes `1.32.0` binaries (`ENVTEST_K8S_VERSION`). +- E2E timeout defaults to `1h` (`E2E_TIMEOUT`); Ginkgo suite timeout is set to `E2E_TIMEOUT - 5min`. +- `ARTIFACT_DIR` controls where CI artifacts (JUnit XML, JSON reports, failure dumps) are written.