Allow Fleet Server to reload TLS certificates without restarting#6838

Open
ycombinator wants to merge 26 commits into elastic:main from ycombinator:pr/6835

@ycombinator ycombinator commented Apr 15, 2026

What is the problem this PR solves?

Fleet Server currently requires an explicit restart to reload TLS certificates used for serving HTTPS requests to Elastic Agents. In environments with frequent certificate rotation — Kubernetes (where Secrets are remounted into pods), serverless (where cert-manager manages per-project certificates), and general ops — this creates operational burden and requires external orchestration to coordinate cert rotation with server restarts.

How does this PR solve the problem?

When TLS is enabled, Fleet Server now periodically re-reads the configured cert/key files from disk and hot-swaps them into the running TLS listener without a restart. This behavior is enabled by default and can be disabled by setting ssl.certificate_reload.enabled: false.

This follows the OTel Collector's configtls approach: on each TLS handshake, if the configured reload interval has elapsed, the cert/key files are re-read from disk and validated before being swapped in. No file watchers, no extra goroutines.

The core CertReloader implementation and the CertificateReload configuration both live in elastic-agent-libs v0.37.0 (in the tlscommon package), as part of ServerConfig, so any component using tlscommon gets cert reload support. Fleet Server simply wires it up.

Key design decisions:

  • Polling instead of file watching: Inspired by OTel Collector's approach. Simpler, no fsnotify dependency, no goroutines, no debounce logic. On each GetCertificate call, checks if the reload interval has elapsed; if so, re-reads files from disk
  • Enabled by default: Certificate reload is on unless explicitly disabled, matching the OTel Collector pattern
  • Reload interval: Defaults to 5 seconds, configurable via ssl.certificate_reload.reload_interval. Automated tooling (cert-manager, K8s secret mounts) can use the default or a shorter value, while manual rotation workflows may benefit from a longer window
  • Validation: Validates the new cert/key pair with tls.LoadX509KeyPair before swapping; invalid pairs keep the old cert active and log an error
  • Serving: Uses tls.Config.GetCertificate callback with sync.RWMutex and double-checked locking for concurrent reads on every TLS handshake
  • Existing connections: Unaffected — only new TLS handshakes see the new cert

Modified files:

  • internal/pkg/api/server.go — wires CertReloader into TLS setup when feature is enabled
  • internal/pkg/api/server_test.go — adds end-to-end cert reload test
  • fleet-server.reference.yml — documents the new config option

How to test this PR locally

1. Build Fleet Server

mage build:local

2. Generate a CA and server cert/key pair

mkdir -p /tmp/tls-reload-test
cd /tmp/tls-reload-test

# Generate CA
openssl req -x509 -newkey rsa:2048 -keyout ca-key.pem -out ca.pem -days 1 -nodes -subj "/CN=Test CA"

# Generate server cert signed by the CA
openssl req -newkey rsa:2048 -keyout server-key.pem -out server.csr -nodes \
  -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
openssl x509 -req -in server.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial \
  -out server-cert.pem -days 1 -copy_extensions copyall

3. Create a config file

cat > /tmp/tls-reload-test/fleet-server.yml << 'CONF'
output:
  elasticsearch:
    hosts: ['localhost:9200']
    service_token: 'fake-token'

fleet:
  agent:
    id: test-agent-id

inputs:
  - type: fleet-server
    server:
      host: localhost
      port: 8220
      ssl:
        enabled: true
        certificate: /tmp/tls-reload-test/server-cert.pem
        key: /tmp/tls-reload-test/server-key.pem
CONF

4. Start Fleet Server

bin/fleet-server -c /tmp/tls-reload-test/fleet-server.yml \
  -E logging.to_files=true \
  -E logging.to_stderr=false \
  -E logging.files.path=/tmp/tls-reload-test \
  -E logging.files.name=fleet-server.log &

Wait for the "Listening on localhost:8220" log line:

grep "Listening" /tmp/tls-reload-test/fleet-server.log-*.ndjson

5. Make a TLS connection and note the server cert fingerprint

openssl s_client -connect localhost:8220 -CAfile /tmp/tls-reload-test/ca.pem \
  -servername localhost </dev/null 2>&1 | openssl x509 -noout -serial -fingerprint

6. Replace the cert/key files with a new pair (same CA)

# Generate the new cert/key outside the directory Fleet Server reads from
openssl req -newkey rsa:2048 -keyout /tmp/server-key2.pem -out /tmp/server2.csr -nodes \
  -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
openssl x509 -req -in /tmp/server2.csr -CA /tmp/tls-reload-test/ca.pem \
  -CAkey /tmp/tls-reload-test/ca-key.pem -CAcreateserial \
  -out /tmp/server-cert2.pem -days 1 -copy_extensions copyall

# Move the new files over the paths Fleet Server is configured to read
mv /tmp/server-cert2.pem /tmp/tls-reload-test/server-cert.pem
mv /tmp/server-key2.pem /tmp/tls-reload-test/server-key.pem

7. Wait ~5 seconds for the reload interval, then make another TLS connection

openssl s_client -connect localhost:8220 -CAfile /tmp/tls-reload-test/ca.pem \
  -servername localhost </dev/null 2>&1 | openssl x509 -noout -serial -fingerprint

The fingerprint should differ from step 5.
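If you would rather script the comparison in steps 5 and 7 than eyeball openssl output, a small Go client along these lines could capture the leaf certificate fingerprint. This is a sketch and not part of the PR: leafFingerprint and fingerprint are hypothetical helpers, and the server name is assumed to be localhost to match the test certificate's SAN.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"fmt"
	"os"
)

// fingerprint returns the hex-encoded SHA-256 digest of a DER certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// leafFingerprint completes a TLS handshake with addr and returns the
// fingerprint of the leaf certificate the server presented.
func leafFingerprint(addr, serverName string, roots *x509.CertPool) (string, error) {
	conn, err := tls.Dial("tcp", addr, &tls.Config{
		RootCAs:    roots,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	})
	if err != nil {
		return "", err
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", fmt.Errorf("server presented no certificate")
	}
	return fingerprint(certs[0].Raw), nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: fingerprint <ca.pem> <host:port>")
		return
	}
	caPEM, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		fmt.Fprintln(os.Stderr, "no CA certificates found in", os.Args[1])
		os.Exit(1)
	}
	fp, err := leafFingerprint(os.Args[2], "localhost", roots)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(fp)
}
```

Run it once before and once after replacing the files, passing /tmp/tls-reload-test/ca.pem and localhost:8220, and compare the two lines. The digest here is SHA-256, so it will not match the openssl output unless -sha256 is added to the openssl x509 command; only the before/after difference matters.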

8. Stop Fleet Server

kill %1

Automated tests:

  • go test -run Test_server_TLSCertReload ./internal/pkg/api/... — end-to-end integration test

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@ycombinator ycombinator added the enhancement New feature or request label Apr 15, 2026

mergify Bot commented Apr 15, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 15, 2026
@ycombinator ycombinator requested a review from swiatekm April 15, 2026 18:48
@ycombinator ycombinator marked this pull request as ready for review April 15, 2026 18:48
@ycombinator ycombinator requested a review from a team as a code owner April 15, 2026 18:48
@ycombinator ycombinator requested a review from blakerouse April 15, 2026 18:48

@blakerouse blakerouse left a comment

How would you feel about changing this implementation? We talked about this at a team sync meeting and I was supposed to make an issue for it, but honestly I forgot until I was reading this PR.

https://github.com/open-telemetry/opentelemetry-collector/blob/7201768d04fb628e65cb4e3d5ead69d17c43d59d/config/configtls/configtls.go#L152

That design for config reload is something we discussed on the call. Overall, we liked the simple design: no need for fsnotify or anything. We also discussed adding it to elastic-agent-libs so it could be used in multiple places.

Maybe we could do the implementation here first in Fleet Server and then split it out later.

@ycombinator

@blakerouse Sure, I like the simplicity of the OTel implementation. It means having to wait up to ReloadInterval for the new cert to be picked up, but I think that's an acceptable tradeoff for the simpler implementation. I'll implement it in elastic-agent-libs and then update this PR here to use that implementation.

mergify Bot commented Apr 18, 2026

This pull request is now in conflicts. Could you fix it @ycombinator? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b pr/6835 upstream/pr/6835
git merge upstream/main
git push upstream pr/6835

Comment thread go.mod Outdated
@ycombinator

@blakerouse Check out elastic/elastic-agent-libs#404. Also, I've updated this PR here to use CertReloader from that PR.

Comment thread internal/pkg/config/input.go Outdated
Comment thread internal/pkg/config/input.go Outdated
blakerouse previously approved these changes Apr 22, 2026

@blakerouse blakerouse left a comment

Nice! Looks good.

swiatekm previously approved these changes Apr 22, 2026

@swiatekm swiatekm left a comment

LGTM, one non-blocking question.

Comment thread internal/pkg/api/server_test.go Outdated
@ycombinator ycombinator dismissed stale reviews from swiatekm and blakerouse via be9513a April 22, 2026 14:12
@ycombinator ycombinator force-pushed the pr/6835 branch 5 times, most recently from cc1fa83 to a6c9f08 Compare April 22, 2026 14:24
ycombinator and others added 26 commits April 24, 2026 14:38
Add github.com/fsnotify/fsnotify v1.9.0 as a direct dependency.
This will be used to watch TLS certificate and key files for changes,
enabling hot-reload without server restart (issue elastic#6433).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce ServerTLSConfig that wraps tlscommon.ServerConfig with an
additional CertificateReload field, enabling opt-in TLS certificate
hot-reload. The new config lives under ssl.certificate_reload.enabled
and defaults to false (disabled).

Update Server.TLS field type from *tlscommon.ServerConfig to
*ServerTLSConfig. Promoted methods (IsEnabled, Validate, DiagCerts)
continue to work through embedding. Update call sites in server.go
and server_test.go to use the new wrapper type.

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new reload/tls package with a CertReloader type that watches TLS
certificate and key files on disk using fsnotify and atomically reloads
them when changes are detected.

Key design decisions:
- Watches parent directories (not files directly) to handle atomic
  rename/replace used by cert-manager and Kubernetes secret mounts
- Debounces file change events (default 5s) to handle non-atomic writes
  of cert and key as separate operations
- Validates new cert/key pair with tls.LoadX509KeyPair before swapping;
  invalid pairs keep old cert active and log an error
- Uses atomic.Pointer[tls.Certificate] for lock-free concurrent reads
- Exposes GetCertificate callback for tls.Config.GetCertificate

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover:
- Valid and invalid cert pair loading
- Missing files and empty paths
- Certificate change detection and reload after debounce
- Invalid new cert keeps old cert active
- Debounce timer reset on additional file changes
- Clean shutdown on context cancellation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ssl.certificate_reload.enabled is true, the server now creates a
CertReloader that watches cert/key files for changes. The reloader's
GetCertificate callback is set on the tls.Config so that every new TLS
handshake serves the latest certificate. The static Certificates slice
is cleared to ensure Go's TLS library always uses GetCertificate.

The reloader goroutine is tied to the server's context and shuts down
cleanly when the server stops.

When the feature is disabled (default), the code path is unchanged.

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Test_server_TLSCertReload that verifies end-to-end certificate
rotation through a running server:
1. Starts server with certificate_reload enabled
2. Makes HTTPS request, captures server cert from TLS handshake
3. Writes a new cert/key pair to disk
4. Waits for debounce period
5. Makes another HTTPS request and asserts the cert has changed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new ssl.certificate_reload.enabled setting in the
reference configuration file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run mage check:imports to fix struct field alignment in input.go.
Run mage check:notice to regenerate NOTICE files after adding
fsnotify as a direct dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improve readability of test cases by adding comments that explain
the setup, the action being tested, and what each assertion verifies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tests

Replace fixed-duration sleeps with polling assertions to make tests
less flaky and faster. Also reduces the debounce period in
TestReload_Debounce from 500ms to 200ms for quicker test execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the debounce=0 sentinel value with a WithDebounce option,
making the default debounce implicit rather than relying on a magic
zero value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a `debounce` duration field to `ssl.certificate_reload` so users
can tune the delay between detecting a file change and reloading the
cert/key pair. Defaults to 5s when not specified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the local internal/pkg/reload/tls package with the new
CertReloader from elastic-agent-libs/transport/tlscommon. This uses
a simpler polling-based design (inspired by the OTel Collector's
configtls) instead of fsnotify, removing the need for file watchers
and extra goroutines.

Also renames the config field from `debounce` to `reload_interval`
to better reflect the new semantics.

Note: go.mod has a temporary replace directive pointing to the
local elastic-agent-libs checkout. This will be updated to a
released version once elastic/elastic-agent-libs#404 is merged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the local filesystem replace directive with a pseudo-version
pointing to ycombinator/elastic-agent-libs@74c467d3bcab (the
CertReloader commit from elastic-agent-libs PR elastic#404).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CertificateReload config has been moved into tlscommon.ServerConfig in
elastic-agent-libs, removing the need for the ServerTLSConfig wrapper
and custom Unpack method in fleet-server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CertificateReload is now enabled by default (Enabled is *bool where nil
means enabled). Use the new IsEnabled() helper and remove explicit
Enabled = true in tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove fork replace directive now that CertReloader and CertificateReload
config are available in the official v0.37.0 release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…diness

Use require.EventuallyWithT with an HTTPS request to /api/status instead
of a 500ms sleep to detect when the server is ready.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Labels

enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Development

Successfully merging this pull request may close these issues.

Allow Fleet Server to reload TLS certificates without restarting

3 participants