
Fix redis split-brain after pod-0 restart during failover#563

Open
lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:fix-redis-bootstrap-race

Conversation

@lmiccini
Contributor

@lmiccini lmiccini commented Apr 13, 2026

When redis-redis-0 (the bootstrap pod) is deleted during a failover,
it restarts and tries to contact sentinel to find the current master.
Three problems caused it to fall through to the bootstrap path and
start a new independent master, creating a split-brain:

  1. Single-try timeout: if sentinel was momentarily unreachable (e.g.
    the sentinel container on pod-0 itself was still starting), the
    3-second timeout expired and pod-0 immediately bootstrapped.

  2. Headless service DNS: with PublishNotReadyAddresses: true, the
    headless service DNS can resolve to pod-0's own IP, so redis-cli
    connects to its own uninitialized sentinel instead of a peer.

  3. Stale master identity: even when contacting a peer sentinel, it
    may still report the restarting pod as master (within the
    down-after-milliseconds window before failover completes).

Fix by adding a wait_for_master() function in common.sh that:

  • Contacts each peer pod individually by FQDN (skipping self)
  • Retries up to 10 times (30s total) before allowing bootstrap
  • Rejects answers where the peer still thinks we are master

Also increase InitialDelaySeconds to 40s on all redis and sentinel
probes so Kubernetes doesn't kill the pod before the retry loop
completes, and remove unused TCP probe variables that were never
referenced by the redis container.
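The retry loop described above can be sketched roughly as follows. This is an illustration only: `query_peer`, the `WFM_*` knobs, and the peer-list handling are assumptions made for this sketch, not the exact code in common.sh.

```shell
#!/bin/sh
# query_peer FQDN - prints the master host that the peer's sentinel
# reports, failing if the peer is unreachable or silent. The master
# name "mymaster" is an assumption; redis-cli prints the ip and port
# on two lines, so we keep only the first.
query_peer() {
    timeout 3 redis-cli -h "$1" -p 26379 \
        sentinel get-master-addr-by-name "${MASTER_NAME:-mymaster}" \
        2>/dev/null | head -n1 | grep .
}

# wait_for_master SELF PEER...
# Prints the current master host and returns 0, or returns 1 when the
# caller should fall through (pod-0 may then bootstrap).
wait_for_master() {
    self="$1"; shift
    retries="${WFM_RETRIES:-10}"
    delay="${WFM_DELAY:-3}"
    i=1
    while [ "$i" -le "$retries" ]; do
        reachable=0
        for peer in "$@"; do
            # Skip ourselves: our own sentinel may still be starting
            # and would give a useless, self-referential answer.
            if [ "$peer" = "$self" ]; then continue; fi
            if master=$(query_peer "$peer"); then
                reachable=1
                # Stale answer: the peer still reports the restarting
                # pod (us) as master; wait for failover to complete.
                if [ "$master" = "$self" ]; then continue; fi
                echo "$master"
                return 0
            fi
        done
        if [ "$reachable" -eq 0 ]; then
            # No peer reachable at all (e.g. first deployment):
            # return immediately and let the caller decide.
            return 1
        fi
        echo "Attempt $i/$retries: no valid master found, retrying in ${delay}s..." >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

With 10 attempts and a 3-second delay this bounds the wait at roughly 30 seconds, which is why the probe `InitialDelaySeconds` has to be raised past that window.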

@openshift-ci openshift-ci Bot requested review from dprince and stuggi April 13, 2026 09:34
@openshift-ci
Contributor

openshift-ci Bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8f83828570fd4223bc37828313598095

❌ openstack-k8s-operators-content-provider FAILURE in 3m 57s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 6464ca2 to 618c399 Compare April 14, 2026 12:08
@lmiccini lmiccini requested review from dciabrin and removed request for dprince April 14, 2026 12:26
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/e900803aad9e46ae987e0791bb2d0767

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 55m 05s
❌ podified-multinode-edpm-deployment-crc FAILURE in 23m 51s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 39m 21s

@lmiccini
Contributor Author

recheck

Comment thread: templates/redis/bin/common.sh (outdated)
fi
ordinal=$((ordinal + 1))
done
log "Attempt $i/$retries: no valid master found, retrying in ${delay}s..."
Contributor


I'm not sure what would happen at the very first start? It looks like all pods would retry for a long time because there's no master sentinel yet?

Contributor Author


+1, changed to return if no peers are reachable.

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 618c399 to 0133db9 Compare April 17, 2026 11:40
Comment thread: templates/redis/bin/common.sh (outdated)
# If a peer still reports US as master (stale info before
# down-after-milliseconds triggers failover), keeps retrying until
# failover completes and a different master is elected.
# If no peers are reachable at all (first deployment), returns
Contributor


I get the improvement of not waiting, but I have the feeling this suffers from the same issue as the previous revision.
On the initial deployment, as long as only one pod is started, it will become the master. However, if for some odd reason all pods restart after some run (e.g. you forcibly stop redis on all pods at once), the 3 pods will restart concurrently, so every pod will determine that no sentinel is running, and they will all start as master, leaving a split brain.
Did I get that right, or am I missing the point entirely?

Contributor Author


As far as I understood it, it should work like this:

  1. All 3 pods restart, all call wait_for_master
  2. No peers are reachable → all return failure immediately
  3. Pod-0: is_bootstrap_pod → true → bootstraps as master
  4. Pod-1, Pod-2: is_bootstrap_pod → false → exit 1, restart
  5. On their next restart, pod-0's sentinel is up → they join as replicas
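The decision in steps 3 and 4 can be sketched like this. It is illustrative only: `POD_NAME`, `decide_role`, and the stubbed `wait_for_master` are assumptions for this sketch, not the actual script contents.

```shell
# Stub: in the real script this queries the peer sentinels with retries.
wait_for_master() {
    return 1
}

is_bootstrap_pod() {
    # pod-0 of the StatefulSet is the designated bootstrap pod
    [ "${POD_NAME:-}" = "redis-redis-0" ]
}

# Prints the action this pod takes at startup.
decide_role() {
    if master=$(wait_for_master); then
        echo "replicaof $master"    # a master exists: join as replica
    elif is_bootstrap_pod; then
        echo "bootstrap"            # pod-0: start as the initial master
    else
        # pods 1/2: exit so the kubelet restarts the container; by the
        # next attempt pod-0's sentinel should be answering
        echo "exit-and-restart"
    fi
}
```

The asymmetry is the point: only one pod (pod-0) is ever allowed to bootstrap when no master can be found, so the concurrent-restart case converges instead of splitting.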

Contributor


oh I missed the is_bootstrap_pod. OK this makes sense to me now, thanks.

// TODO might need tuning
TimeoutSeconds: 5,
PeriodSeconds: 5,
InitialDelaySeconds: 5,
Contributor


I think we want to keep the probes for the main container, if only for the specific TCP ports 6379?

Contributor Author


I am just removing these unused variables; the actual probe is left intact (it uses the inline exec-based probes calling redis_probe.sh).
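The contents of redis_probe.sh are not shown in this excerpt; as a rough illustration of what an exec probe of this kind checks (the function name and the stub-client parameter here are hypothetical):

```shell
# Hypothetical liveness check: succeed only when the local redis
# answers PING. The kubelet acts on the exit status, so no TCP probe
# variables are needed.
check_local_redis() {
    # "$1" lets a test substitute a stub client; defaults to redis-cli
    "${1:-redis-cli}" -h 127.0.0.1 -p 6379 ping 2>/dev/null | grep -q '^PONG$'
}
```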

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c614e3f0329b4c54be01aa9eaea155d1

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 50m 39s
❌ podified-multinode-edpm-deployment-crc RETRY_LIMIT in 21m 54s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 35m 54s

@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl

@lmiccini
Contributor Author

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/82bf35073ce24401ba52f1acfce617cb

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 58m 44s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 24m 26s
❌ cifmw-crc-podified-edpm-baremetal FAILURE in 39m 13s

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 0133db9 to 5fc77f3 Compare April 20, 2026 06:35
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5747e97a4c924e13b704cb4c1e107bf5

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 31m 44s
❌ podified-multinode-edpm-deployment-crc FAILURE in 32m 23s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 26m 18s

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 5fc77f3 to bb1e9bc Compare April 20, 2026 09:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from bb1e9bc to 8bb9d35 Compare April 20, 2026 12:33
@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl

1 similar comment
@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl
