Allow Fleet Server to reload TLS certificates without restarting #6838
ycombinator wants to merge 26 commits into elastic:main
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
blakerouse
left a comment
How would you feel about changing this implementation? We talked about this at a team sync meeting and I was supposed to make an issue for it, but honestly I forgot until I was reading this PR.
That design of config reload is something we discussed in the call. Overall we liked the simple design: no need for fsnotify or anything. We discussed adding it to elastic-agent-libs so it could be used in multiple places.
Maybe we could do the implementation here first in Fleet Server and then split it out later.
@blakerouse Sure, I like the simplicity of the OTel implementation. It means having to wait up to
This pull request is now in conflicts. Could you fix it @ycombinator? 🙏
@blakerouse Check out elastic/elastic-agent-libs#404. Also, I've updated this PR here to use
swiatekm
left a comment
LGTM, one non-blocking question.
Force-pushed from cc1fa83 to a6c9f08.
Add github.com/fsnotify/fsnotify v1.9.0 as a direct dependency. This will be used to watch TLS certificate and key files for changes, enabling hot-reload without server restart (issue elastic#6433). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce ServerTLSConfig that wraps tlscommon.ServerConfig with an additional CertificateReload field, enabling opt-in TLS certificate hot-reload. The new config lives under ssl.certificate_reload.enabled and defaults to false (disabled). Update Server.TLS field type from *tlscommon.ServerConfig to *ServerTLSConfig. Promoted methods (IsEnabled, Validate, DiagCerts) continue to work through embedding. Update call sites in server.go and server_test.go to use the new wrapper type. Part of elastic#6433. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new reload/tls package with a CertReloader type that watches TLS certificate and key files on disk using fsnotify and atomically reloads them when changes are detected.
Key design decisions:
- Watches parent directories (not files directly) to handle atomic rename/replace used by cert-manager and Kubernetes secret mounts
- Debounces file change events (default 5s) to handle non-atomic writes of cert and key as separate operations
- Validates new cert/key pair with tls.LoadX509KeyPair before swapping; invalid pairs keep old cert active and log an error
- Uses atomic.Pointer[tls.Certificate] for lock-free concurrent reads
- Exposes GetCertificate callback for tls.Config.GetCertificate
Part of elastic#6433. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover:
- Valid and invalid cert pair loading
- Missing files and empty paths
- Certificate change detection and reload after debounce
- Invalid new cert keeps old cert active
- Debounce timer reset on additional file changes
- Clean shutdown on context cancellation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ssl.certificate_reload.enabled is true, the server now creates a CertReloader that watches cert/key files for changes. The reloader's GetCertificate callback is set on the tls.Config so that every new TLS handshake serves the latest certificate. The static Certificates slice is cleared to ensure Go's TLS library always uses GetCertificate. The reloader goroutine is tied to the server's context and shuts down cleanly when the server stops. When the feature is disabled (default), the code path is unchanged. Part of elastic#6433. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Test_server_TLSCertReload that verifies end-to-end certificate rotation through a running server:
1. Starts server with certificate_reload enabled
2. Makes HTTPS request, captures server cert from TLS handshake
3. Writes a new cert/key pair to disk
4. Waits for debounce period
5. Makes another HTTPS request and asserts the cert has changed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new ssl.certificate_reload.enabled setting in the reference configuration file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run mage check:imports to fix struct field alignment in input.go. Run mage check:notice to regenerate NOTICE files after adding fsnotify as a direct dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improve readability of test cases by adding comments that explain the setup, the action being tested, and what each assertion verifies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tests Replace fixed-duration sleeps with polling assertions to make tests less flaky and faster. Also reduces the debounce period in TestReload_Debounce from 500ms to 200ms for quicker test execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the debounce=0 sentinel value with a WithDebounce option, making the default debounce implicit rather than relying on a magic zero value. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a `debounce` duration field to `ssl.certificate_reload` so users can tune the delay between detecting a file change and reloading the cert/key pair. Defaults to 5s when not specified. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the local internal/pkg/reload/tls package with the new CertReloader from elastic-agent-libs/transport/tlscommon. This uses a simpler polling-based design (inspired by the OTel Collector's configtls) instead of fsnotify, removing the need for file watchers and extra goroutines. Also renames the config field from `debounce` to `reload_interval` to better reflect the new semantics. Note: go.mod has a temporary replace directive pointing to the local elastic-agent-libs checkout. This will be updated to a released version once elastic/elastic-agent-libs#404 is merged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the local filesystem replace directive with a pseudo-version pointing to ycombinator/elastic-agent-libs@74c467d3bcab (the CertReloader commit from elastic-agent-libs PR elastic#404). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CertificateReload config has been moved into tlscommon.ServerConfig in elastic-agent-libs, removing the need for the ServerTLSConfig wrapper and custom Unpack method in fleet-server. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CertificateReload is now enabled by default (Enabled is *bool where nil means enabled). Use the new IsEnabled() helper and remove explicit Enabled = true in tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
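The "nil means enabled" convention from this commit can be illustrated with a small sketch. The struct below is a hypothetical mirror of the config, written only to show the pointer-to-bool pattern; the real type in elastic-agent-libs may have different fields, tags, and receiver types.

```go
package main

import "fmt"

// certificateReload is a hypothetical mirror of the config described above.
type certificateReload struct {
	Enabled *bool
}

// IsEnabled treats an unset (nil) Enabled as true, so reload is on by default.
func (c certificateReload) IsEnabled() bool {
	return c.Enabled == nil || *c.Enabled
}

func main() {
	off := false
	fmt.Println(certificateReload{}.IsEnabled())              // unset
	fmt.Println(certificateReload{Enabled: &off}.IsEnabled()) // explicitly disabled
}
```

A plain `bool` could not distinguish "not set" from "set to false", which is why a pointer is used when the default is true.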
Remove fork replace directive now that CertReloader and CertificateReload config are available in the official v0.37.0 release. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…diness Use require.EventuallyWithT with an HTTPS request to /api/status instead of a 500ms sleep to detect when the server is ready. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What is the problem this PR solves?
Fleet Server currently requires an explicit restart to reload TLS certificates used for serving HTTPS requests to Elastic Agents. In environments with frequent certificate rotation — Kubernetes (where Secrets are remounted into pods), serverless (where cert-manager manages per-project certificates), and general ops — this creates operational burden and requires external orchestration to coordinate cert rotation with server restarts.
How does this PR solve the problem?
When TLS is enabled, Fleet Server now periodically re-reads the configured cert/key files from disk and hot-swaps them into the running TLS listener without a restart. This is enabled by default and can be disabled by setting `ssl.certificate_reload.enabled: false`.
This follows the OTel Collector's `configtls` approach: on each TLS handshake, if the configured reload interval has elapsed, the cert/key files are re-read from disk and validated before being swapped in. No file watchers, no extra goroutines.
The core `CertReloader` implementation and the `CertificateReload` configuration both live in `elastic-agent-libs` v0.37.0 (in the `tlscommon` package), as part of `ServerConfig`, so any component using `tlscommon` gets cert reload support. Fleet Server simply wires it up.
Key design decisions:
- No `fsnotify` dependency, no goroutines, no debounce logic. On each `GetCertificate` call, checks if the reload interval has elapsed; if so, re-reads files from disk
- Reload interval set via `ssl.certificate_reload.reload_interval`. Automated tooling (cert-manager, K8s secret mounts) can use the default or a shorter value, while manual rotation workflows may benefit from a longer window
- Validates with `tls.LoadX509KeyPair` before swapping; invalid pairs keep the old cert active and log an error
- `tls.Config.GetCertificate` callback with `sync.RWMutex` and double-checked locking for concurrent reads on every TLS handshake
Modified files:
- `internal/pkg/api/server.go`: wires `CertReloader` into TLS setup when the feature is enabled
- `internal/pkg/api/server_test.go`: adds end-to-end cert reload test
- `fleet-server.reference.yml`: documents the new config option
How to test this PR locally
1. Build Fleet Server
2. Generate a CA and server cert/key pair
3. Create a config file
4. Start Fleet Server
bin/fleet-server -c /tmp/tls-reload-test/fleet-server.yml \
  -E logging.to_files=true \
  -E logging.to_stderr=false \
  -E logging.files.path=/tmp/tls-reload-test \
  -E logging.files.name=fleet-server.log &
Wait for the "Listening on localhost:8220" log line:
5. Make a TLS connection and note the server cert fingerprint
6. Replace the cert/key files with a new pair (same CA)
7. Wait ~5 seconds for the reload interval, then make another TLS connection
The fingerprint should differ from step 5.
8. Stop Fleet Server
kill %1
Automated tests:
`go test -run Test_server_TLSCertReload ./internal/pkg/api/...`: end-to-end integration test
Design Checklist
Checklist
`./changelog/fragments` using the changelog tool
Related issues