
Fix redis split-brain after pod-0 restart during failover#563

Open
lmiccini wants to merge 1 commit into openstack-k8s-operators:main from lmiccini:fix-redis-bootstrap-race

Conversation

@lmiccini
Contributor

@lmiccini lmiccini commented Apr 13, 2026

When redis-redis-0 (the bootstrap pod) is deleted during a failover,
it restarts and tries to contact sentinel to find the current master.
Three problems caused it to fall through to the bootstrap path and
start a new independent master, creating a split-brain:

  1. Single-try timeout: if sentinel was momentarily unreachable (e.g.
    the sentinel container on pod-0 itself was still starting), the
    3-second timeout expired and pod-0 immediately bootstrapped.

  2. Headless service DNS: with PublishNotReadyAddresses: true, the
    headless service DNS can resolve to pod-0's own IP, so redis-cli
    connects to its own uninitialized sentinel instead of a peer.

  3. Stale master identity: even when contacting a peer sentinel, it
    may still report the restarting pod as master (within the
    down-after-milliseconds window before failover completes).

Fix by adding a wait_for_master() function in common.sh that:

  • Contacts each peer pod individually by FQDN (skipping self)
  • Retries up to 10 times (30s total) before allowing bootstrap
  • Rejects answers where the peer still thinks we are master

Also increase InitialDelaySeconds to 40s on all redis and sentinel
probes so Kubernetes doesn't kill the pod before the retry loop
completes, and remove unused TCP probe variables that were never
referenced by the redis container.
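The retry loop described above can be sketched roughly as follows. This is an illustration only: `query_peer`, the `WFM_*` knobs, and the peer-list handling are assumptions made for this sketch, not the exact code in common.sh.

```shell
#!/bin/sh
# query_peer FQDN - prints the master host that the peer's sentinel
# reports, failing if the peer is unreachable or silent. The master
# name "mymaster" is an assumption; redis-cli prints the ip and port
# on two lines, so we keep only the first.
query_peer() {
    timeout 3 redis-cli -h "$1" -p 26379 \
        sentinel get-master-addr-by-name "${MASTER_NAME:-mymaster}" \
        2>/dev/null | head -n1 | grep .
}

# wait_for_master SELF PEER...
# Prints the current master host and returns 0, or returns 1 when the
# caller should fall through (pod-0 may then bootstrap).
wait_for_master() {
    self="$1"; shift
    retries="${WFM_RETRIES:-10}"
    delay="${WFM_DELAY:-3}"
    i=1
    while [ "$i" -le "$retries" ]; do
        reachable=0
        for peer in "$@"; do
            # Skip ourselves: our own sentinel may still be starting
            # and would give a useless, self-referential answer.
            if [ "$peer" = "$self" ]; then continue; fi
            if master=$(query_peer "$peer"); then
                reachable=1
                # Stale answer: the peer still reports the restarting
                # pod (us) as master; wait for failover to complete.
                if [ "$master" = "$self" ]; then continue; fi
                echo "$master"
                return 0
            fi
        done
        if [ "$reachable" -eq 0 ]; then
            # No peer reachable at all (e.g. first deployment):
            # return immediately and let the caller decide.
            return 1
        fi
        echo "Attempt $i/$retries: no valid master found, retrying in ${delay}s..." >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

With 10 attempts and a 3-second delay this bounds the wait at roughly 30 seconds, which is why the probe `InitialDelaySeconds` has to be raised past that window.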

@openshift-ci openshift-ci Bot requested review from dprince and stuggi April 13, 2026 09:34
@openshift-ci
Contributor

openshift-ci Bot commented Apr 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8f83828570fd4223bc37828313598095

❌ openstack-k8s-operators-content-provider FAILURE in 3m 57s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 6464ca2 to 618c399 Compare April 14, 2026 12:08
@lmiccini lmiccini requested review from dciabrin and removed request for dprince April 14, 2026 12:26
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/e900803aad9e46ae987e0791bb2d0767

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 55m 05s
❌ podified-multinode-edpm-deployment-crc FAILURE in 23m 51s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 39m 21s

@lmiccini
Contributor Author

recheck

Comment thread: templates/redis/bin/common.sh (outdated)
fi
ordinal=$((ordinal + 1))
done
log "Attempt $i/$retries: no valid master found, retrying in ${delay}s..."
Contributor


I'm not sure what would happen at the very first start? It looks like all pods would retry for a long time because there's no master sentinel yet?

Contributor Author


+1, changed to return if no peers are reachable.

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 618c399 to 0133db9 Compare April 17, 2026 11:40
Comment thread: templates/redis/bin/common.sh (outdated)
# If a peer still reports US as master (stale info before
# down-after-milliseconds triggers failover), keeps retrying until
# failover completes and a different master is elected.
# If no peers are reachable at all (first deployment), returns
Contributor


I get the improvement of not waiting, but I have the feeling this suffers from the same issue as the previous revision.
On the initial deployment, as long as only one pod is started, it will become the master. However, if for some odd reason all pods restart after some run (e.g. you forcibly stop redis on all pods at once), the 3 pods will restart concurrently, so every pod will determine that no sentinel is running, and they will all start as master, leaving a split brain.
Did I get that right, or am I missing the point entirely?

Contributor Author


As far as I understood it, it should work like this:

  1. All 3 pods restart, all call wait_for_master
  2. No peers are reachable → all return failure immediately
  3. Pod-0: is_bootstrap_pod → true → bootstraps as master
  4. Pod-1, Pod-2: is_bootstrap_pod → false → exit 1, restart
  5. On their next restart, pod-0's sentinel is up → they join as replicas
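The decision in steps 3 and 4 can be sketched like this. It is illustrative only: `POD_NAME`, `decide_role`, and the stubbed `wait_for_master` are assumptions for this sketch, not the actual script contents.

```shell
# Stub: in the real script this queries the peer sentinels with retries.
wait_for_master() {
    return 1
}

is_bootstrap_pod() {
    # pod-0 of the StatefulSet is the designated bootstrap pod
    [ "${POD_NAME:-}" = "redis-redis-0" ]
}

# Prints the action this pod takes at startup.
decide_role() {
    if master=$(wait_for_master); then
        echo "replicaof $master"    # a master exists: join as replica
    elif is_bootstrap_pod; then
        echo "bootstrap"            # pod-0: start as the initial master
    else
        # pods 1/2: exit so the kubelet restarts the container; by the
        # next attempt pod-0's sentinel should be answering
        echo "exit-and-restart"
    fi
}
```

The asymmetry is the point: only one pod (pod-0) is ever allowed to bootstrap when no master can be found, so the concurrent-restart case converges instead of splitting.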

Contributor


oh I missed the is_bootstrap_pod. OK this makes sense to me now, thanks.

// TODO might need tuning
TimeoutSeconds: 5,
PeriodSeconds: 5,
InitialDelaySeconds: 5,
Contributor


I think we want to keep the probes for the main container, if only for the specific TCP ports 6379?

Contributor Author


I am just removing these unused variables; the actual probe is left intact (it uses the inline exec-based probes calling redis_probe.sh).
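The contents of redis_probe.sh are not shown in this excerpt; as a rough illustration of what an exec probe of this kind checks (the function name and the stub-client parameter here are hypothetical):

```shell
# Hypothetical liveness check: succeed only when the local redis
# answers PING. The kubelet acts on the exit status, so no TCP probe
# variables are needed.
check_local_redis() {
    # "$1" lets a test substitute a stub client; defaults to redis-cli
    "${1:-redis-cli}" -h 127.0.0.1 -p 6379 ping 2>/dev/null | grep -q '^PONG$'
}
```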

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c614e3f0329b4c54be01aa9eaea155d1

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 50m 39s
❌ podified-multinode-edpm-deployment-crc RETRY_LIMIT in 21m 54s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 35m 54s

@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl

@lmiccini
Contributor Author

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/82bf35073ce24401ba52f1acfce617cb

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 58m 44s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 24m 26s
❌ cifmw-crc-podified-edpm-baremetal FAILURE in 39m 13s

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 0133db9 to 5fc77f3 Compare April 20, 2026 06:35
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5747e97a4c924e13b704cb4c1e107bf5

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 31m 44s
❌ podified-multinode-edpm-deployment-crc FAILURE in 32m 23s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 26m 18s

@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from 5fc77f3 to bb1e9bc Compare April 20, 2026 09:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lmiccini lmiccini force-pushed the fix-redis-bootstrap-race branch from bb1e9bc to 8bb9d35 Compare April 20, 2026 12:33
@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl

1 similar comment
@lmiccini
Contributor Author

/test infra-operator-build-deploy-kuttl
