Fix redis split-brain after pod-0 restart during failover #563
lmiccini wants to merge 1 commit into openstack-k8s-operators:main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: lmiccini. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8f83828570fd4223bc37828313598095
❌ openstack-k8s-operators-content-provider FAILURE in 3m 57s
6464ca2 to 618c399 (Compare)
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/e900803aad9e46ae987e0791bb2d0767
✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 55m 05s

recheck
fi
ordinal=$((ordinal + 1))
done
log "Attempt $i/$retries: no valid master found, retrying in ${delay}s..."
I'm not sure what would happen at the very first start? It looks like all pods would retry for a long time because there's no master sentinel yet?
+1, changed to return if no peers are reachable.
618c399 to 0133db9 (Compare)
# If a peer still reports US as master (stale info before
# down-after-milliseconds triggers failover), keeps retrying until
# failover completes and a different master is elected.
# If no peers are reachable at all (first deployment), returns
I get the improvement to not wait, but I have the feeling this suffers from the same issue as the previous revision.
On the initial deployment, as long as only one pod is started, it will be the master. However, if for some weird reason all pods restart after some run (e.g. you forcibly stop redis on all pods at once), 3 pods will restart concurrently, so every pod will determine that no sentinel is running and start as master, leaving a split brain.
Did I get that right, or am I missing the point entirely?
As far as I understood it, it should work like this:
- All 3 pods restart, all call wait_for_master
- No peers are reachable → all return failure immediately
- Pod-0: is_bootstrap_pod → true → bootstraps as master
- Pod-1, Pod-2: is_bootstrap_pod → false → exit 1, restart
- On their next restart, pod-0's sentinel is up → they join as replicas
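A minimal sketch of that decision flow, assuming a start script along these lines (wait_for_master and is_bootstrap_pod are the helpers from this PR; the config path and the "host port" output format are assumptions):

#!/bin/bash
# Sketch only: the startup decision flow described above.
if master=$(wait_for_master); then
    # A peer sentinel reported a live master that is not us: join as replica.
    set -- $master    # assumed "host port" output from wait_for_master
    exec redis-server /etc/redis/redis.conf --replicaof "$1" "$2"
elif is_bootstrap_pod; then
    # No peers reachable and we are pod-0: bootstrap as the initial master.
    exec redis-server /etc/redis/redis.conf
else
    # No peers reachable and we are not pod-0: exit and let kubelet restart us;
    # by the next attempt pod-0's sentinel should be up and wait_for_master succeeds.
    exit 1
fi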
oh I missed the is_bootstrap_pod. OK this makes sense to me now, thanks.
// TODO might need tuning
TimeoutSeconds:      5,
PeriodSeconds:       5,
InitialDelaySeconds: 5,
I think we want to keep the probes for the main container, if only for the specific TCP port 6379?
I am just removing these unused variables, the actual probe is left intact (it is using the inline exec-based probes calling redis_probe.sh)
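For context, an exec-based probe along these lines usually reduces to a redis-cli liveness check; this is an illustrative sketch, not the actual redis_probe.sh from the repo:

#!/bin/bash
# Illustrative sketch only, not the repo's redis_probe.sh.
# An exec probe passes when the script exits 0.
response=$(timeout 3 redis-cli -h 127.0.0.1 -p 6379 PING 2>/dev/null)
[ "$response" = "PONG" ] || exit 1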
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c614e3f0329b4c54be01aa9eaea155d1
✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 50m 39s

/test infra-operator-build-deploy-kuttl

recheck

Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/82bf35073ce24401ba52f1acfce617cb
✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 58m 44s
0133db9 to 5fc77f3 (Compare)
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5747e97a4c924e13b704cb4c1e107bf5
✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 31m 44s
5fc77f3 to bb1e9bc (Compare)
When redis-redis-0 (the bootstrap pod) is deleted during a failover, it restarts and tries to contact sentinel to find the current master. Three problems caused it to fall through to the bootstrap path and start a new independent master, creating a split-brain:

1. Single-try timeout: if sentinel was momentarily unreachable (e.g. the sentinel container on pod-0 itself was still starting), the 3-second timeout expired and pod-0 immediately bootstrapped.
2. Headless service DNS: with PublishNotReadyAddresses: true, the headless service DNS can resolve to pod-0's own IP, so redis-cli connects to its own uninitialized sentinel instead of a peer.
3. Stale master identity: even when contacting a peer sentinel, it may still report the restarting pod as master (within the down-after-milliseconds window before failover completes).

Fix by adding a wait_for_master() function in common.sh that:

- Contacts each peer pod individually by FQDN (skipping self)
- Retries up to 10 times (30s total) before allowing bootstrap
- Rejects answers where the peer still thinks we are master

Also increase InitialDelaySeconds to 40s on all redis and sentinel probes so Kubernetes doesn't kill the pod before the retry loop completes, and remove unused TCP probe variables that were never referenced by the redis container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
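From that description, the shape of wait_for_master() is roughly the following (a sketch reconstructed from the commit message, not the actual patch; the master name "redis", the FQDN pattern, and the POD_IP/NAMESPACE variables are assumptions, while log is the existing common.sh helper):

wait_for_master() {
    local retries=10 delay=3 i ordinal addr reachable
    for i in $(seq 1 "$retries"); do
        reachable=0
        for ordinal in 0 1 2; do
            # Contact each peer pod individually by FQDN, skipping self.
            [ "redis-redis-${ordinal}" = "$HOSTNAME" ] && continue
            addr=$(timeout 3 redis-cli \
                -h "redis-redis-${ordinal}.redis-redis.${NAMESPACE}.svc.cluster.local" \
                -p 26379 SENTINEL get-master-addr-by-name redis 2>/dev/null)
            [ -n "$addr" ] || continue
            reachable=1
            # Reject stale answers that still name this pod as master.
            echo "$addr" | grep -q "^${POD_IP}$" && continue
            echo "$addr"
            return 0
        done
        # No peers reachable at all (first deployment): fail fast so the
        # caller can fall through to the is_bootstrap_pod check.
        [ "$reachable" -eq 0 ] && return 1
        log "Attempt $i/$retries: no valid master found, retrying in ${delay}s..."
        sleep "$delay"
    done
    return 1
}

With retries=10 and delay=3 the loop gives up after roughly 30s, which is why the probes' InitialDelaySeconds is raised to 40s: the pod must survive long enough for the full retry window.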
bb1e9bc to 8bb9d35 (Compare)
/test infra-operator-build-deploy-kuttl

1 similar comment

/test infra-operator-build-deploy-kuttl