
SOLR-18203 Improve ChaosMonkeySafeLeaderWithPullReplicasTest frequent failures #4312

Draft
janhoy wants to merge 2 commits into apache:main from janhoy:SOLR-18203-chaos-monkey-503-wait-for-leader

Conversation

@janhoy
Contributor

@janhoy janhoy commented Apr 21, 2026

https://issues.apache.org/jira/browse/SOLR-18203

The problem/fix description below, as well as the fix itself, was produced by Claude AI. I have limited insight into the actual chaos test and CloudSolrClient's retry logic, so the problem description could be mistaken, and even if the fix seems to work, there could exist a more correct fix. Please give your 5 cents.

Problem

ChaosMonkeySafeLeaderWithPullReplicasTest has been failing at an 82–100% rate for months (fucit report).
(Screenshot 2026-04-21 at 15:00: failure-rate report)

The test asserts zero update exceptions during chaos monkey (random node kills). The failure chain:

  1. Chaos monkey kills the shard leader → node enters graceful Jetty shutdown (15s timeout, introduced in SOLR-17744 April 2025), physically alive but logically dead
  2. CloudSolrClient routes the next update to the cached (dead) leader → 503 SERVICE_UNAVAILABLE
  3. requestWithRetryOnStaleState() sees RouteException(503) and retries — but all 5 retries fire within milliseconds of each other, using the same stale routes
  4. A 3-second backoff suppresses the ZK state refresh, so every retry hits the same dead leader → all 503 → update fails → test assertion assertEquals(0, getFailCount()) fails

Note that even though SOLR_JETTY_GRACEFUL is false by default in production, the test runner always enables it, so this bug would only appear in tests.
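As a toy model of the failure mode in steps 3–4 (cachedLeader and the simulated 503 check are illustrative, not actual Solr internals; MAX_STALE_RETRIES = 5 per the description), every attempt reads the same stale route because nothing pauses between retries:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the failure mode: with no wait between retries, every
// attempt reads the same cached (stale) route before any background
// cluster-state refresh can land.
public class StaleRetrySketch {
  static final int MAX_STALE_RETRIES = 5;

  public static void main(String[] args) {
    AtomicReference<String> cachedLeader = new AtomicReference<>("deadNode:8983");
    int failedAttempts = 0;
    for (int attempt = 0; attempt < MAX_STALE_RETRIES; attempt++) {
      // No pause here, so a refresh that takes ~100ms+ never gets a chance
      // to replace the cached route between attempts.
      String target = cachedLeader.get();
      if (target.startsWith("deadNode")) {
        failedAttempts++; // simulated 503 SERVICE_UNAVAILABLE from the dead leader
      }
    }
    System.out.println(failedAttempts);
  }
}
```

All five attempts fail before a refresh could ever swap in the new leader, which is exactly why the assertion on zero update failures trips.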

Fix

In the 503 retry path, wait for a ZooKeeper cluster-state refresh (blocking, ~100ms) before each retry attempt. This spreads retries over the leader election period so the next retry routes to the newly elected leader.

Also bypass the markMaybeStaleIfOutsideBackoff 3-second backoff for 503 errors, since the per-retry wait already throttles the rate naturally.

The waitForCollectionRefresh / triggerCollectionRefresh pattern is already used for INVALID_STATE/404 stale-state handling — this extends it to the 503 case.
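A minimal sketch of the bounded refresh-wait described above, assuming a hypothetical triggerRefresh helper standing in for triggerCollectionRefresh (in the real client the future completes once fresh cluster state has been read from ZooKeeper; here it completes immediately):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the bounded refresh-wait on the 503 retry path. triggerRefresh
// is an illustrative stand-in, not the actual Solr API.
public class RefreshWaitSketch {

  static CompletableFuture<Void> triggerRefresh(String collection) {
    return CompletableFuture.completedFuture(null); // simplified: instant refresh
  }

  // Returns true if the refresh completed in time; false means we fall back
  // to retrying with whatever cached state we have (the pre-fix behavior).
  static boolean awaitRefreshBounded(String collection, long timeoutSec) {
    try {
      triggerRefresh(collection).get(timeoutSec, TimeUnit.SECONDS);
      return true;
    } catch (TimeoutException te) {
      return false; // refresh too slow: log and proceed with current state
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException ee) {
      return false; // refresh itself failed; retry with current state
    }
  }

  public static void main(String[] args) {
    System.out.println(awaitRefreshBounded("collection1", 10));
  }
}
```

The actual patch applies this per requested collection inside the retry loop, with a 10-second bound per wait.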

Testing

Ran ChaosMonkeySafeLeaderWithPullReplicasTest several times both on main and on this feature branch. On main I got 8 failures out of 20 runs. On this PR branch I got 5 failures out of 20 runs. Thus the fix is not 100%, or there may be other code paths in play.

janhoy added 2 commits April 21, 2026 13:45
Without a timeout, waitForCollectionRefresh() would block indefinitely
if ZooKeeper is slow or leader election gets stuck. Now uses a bounded
10s get() on the CompletableFuture; on TimeoutException, logs a warning
and proceeds with the retry using current cached state — no worse than
the pre-fix behavior and avoids any possibility of hanging forever.
@janhoy janhoy requested a review from hossman April 21, 2026 14:39
@janhoy janhoy marked this pull request as draft April 21, 2026 14:39
@janhoy janhoy requested review from Copilot and markrmiller April 21, 2026 14:40
Contributor

Copilot AI left a comment


Pull request overview

Improves SolrJ’s stale-state retry handling for update requests that fail with 503 SERVICE_UNAVAILABLE (often during leader loss / restart), aiming to reduce flakiness in ChaosMonkeySafeLeaderWithPullReplicasTest and similar chaos scenarios.

Changes:

  • Adjusts CloudSolrClient.requestWithRetryOnStaleState() to immediately mark collection state as stale for 503 and wait for a collection state refresh before retrying.
  • Bypasses the existing stale-state backoff logic for 503 retries, relying on the refresh-wait instead.
  • Adds an unreleased changelog entry documenting the behavior change.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java Adds a blocking collection refresh-wait on the RouteException(503) retry path and changes stale-marking/backoff behavior.
changelog/unreleased/SOLR-18203-chaos-monkey-503-wait-for-leader.yml Documents the new 503 retry behavior for CloudSolrClient.


Comment on lines +1311 to +1317
// For 503 "no leader" errors (not plain comm errors where the node is fully dead),
// wait for the cluster state to refresh from ZooKeeper before retrying.
// Without this wait, all MAX_STALE_RETRIES retries fire within milliseconds of each
// other using the same stale routes, hitting the same dead leader repeatedly and
// exhausting all retries before leader election can complete.
// We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry

Copilot AI Apr 21, 2026


There are existing SolrJ retry-related tests (e.g., solr/solrj/src/test/org/apache/solr/client/solrj/impl/CloudHttp2SolrClientRetryTest.java), but this new 503-specific refresh-wait behavior isn’t covered. Adding a focused test that simulates a RouteException(503) and asserts that a collection refresh is triggered and awaited (or that retries are delayed until state changes) would help prevent regressions and confirm the intended behavior.

Comment on lines +1316 to +1336
      // We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        for (DocCollection ext : requestedCollections) {
          try {
            triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
          } catch (TimeoutException te) {
            log.warn(
                "Timed out waiting for cluster state refresh for collection {} before retry; "
                    + "proceeding with retry using current state",
                ext.getName());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException ee) {
            log.warn(
                "Error refreshing cluster state for collection {} before retry",
                ext.getName(),
                ee.getCause());
          }
        }

Copilot AI Apr 21, 2026


The 503 retry path blocks up to 10s per collection per retry and does the waits sequentially. If an alias resolves to multiple collections, this can multiply into a long stall (N collections × 10s × MAX_STALE_RETRIES) on a single caller thread. Consider using a single overall deadline (or CompletableFuture.allOf over the refresh futures) and waiting in parallel with remaining-time budgeting so the total wait is bounded.

Suggested change
      // We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        for (DocCollection ext : requestedCollections) {
          try {
            triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
          } catch (TimeoutException te) {
            log.warn(
                "Timed out waiting for cluster state refresh for collection {} before retry; "
                    + "proceeding with retry using current state",
                ext.getName());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException ee) {
            log.warn(
                "Error refreshing cluster state for collection {} before retry",
                ext.getName(),
                ee.getCause());
          }
        }
      // We use a bounded wait (10 s total) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        List<CompletableFuture<?>> refreshFutures = new ArrayList<>(requestedCollections.size());
        for (DocCollection ext : requestedCollections) {
          final String collectionName = ext.getName();
          refreshFutures.add(
              triggerCollectionRefresh(collectionName)
                  .handle(
                      (ignored, throwable) -> {
                        if (throwable != null) {
                          Throwable cause =
                              throwable.getCause() != null ? throwable.getCause() : throwable;
                          log.warn(
                              "Error refreshing cluster state for collection {} before retry",
                              collectionName,
                              cause);
                        }
                        return null;
                      }));
        }
        try {
          CompletableFuture.allOf(refreshFutures.toArray(new CompletableFuture<?>[0]))
              .get(10, TimeUnit.SECONDS);
        } catch (TimeoutException te) {
          log.warn(
              "Timed out waiting for cluster state refresh for collections {} before retry; "
                  + "proceeding with retry using current state",
              requestedCollections.stream()
                  .map(DocCollection::getName)
                  .collect(Collectors.toList()));
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }

+ "proceeding with retry using current state",
ext.getName());
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();

Copilot AI Apr 21, 2026


On InterruptedException during the refresh wait you re-interrupt the thread but then continue into the recursive retry. This effectively ignores cancellation and can lead to surprising behavior (and is inconsistent with waitForCollectionRefresh(), which throws on interrupt). Consider aborting retries and propagating an exception once interrupted, or returning/throwing after restoring the interrupt flag.

Suggested change
Thread.currentThread().interrupt();
Thread.currentThread().interrupt();
throw new SolrServerException(
"Interrupted while waiting for cluster state refresh before retry for collection "
+ ext.getName(),
ie);

Comment on lines +1288 to +1291
// For both comm errors and 503 (no leader), mark state as stale immediately.
// For 503 we bypass the backoff since we will wait for the refresh below
// before retrying, which naturally throttles the retry rate.
cacheEntry.maybeStale = true;

Copilot AI Apr 21, 2026


This change removes the existing retryExpiryTimeNano backoff for 503 errors by unconditionally setting maybeStale=true (instead of using markMaybeStaleIfOutsideBackoff). That means a burst of 503s can now trigger much more frequent state refreshes (and refresh-waits) across many concurrent client threads. If the goal is specifically to handle leader-election windows, consider keeping some throttling/jitter (or only bypassing backoff for the first 503 in a retry chain) to avoid ZK/cluster-state refresh storms under sustained 503 conditions.

Suggested change
// For both comm errors and 503 (no leader), mark state as stale immediately.
// For 503 we bypass the backoff since we will wait for the refresh below
// before retrying, which naturally throttles the retry rate.
cacheEntry.maybeStale = true;
// Throttle stale marking so repeated comm errors or 503s do not cause
// excessive concurrent cluster-state refresh attempts.
markMaybeStaleIfOutsideBackoff(cacheEntry);
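For reference, the throttling behavior that markMaybeStaleIfOutsideBackoff is meant to preserve can be sketched roughly like this (the 3-second window matches the PR description and the field names follow the quoted diff, but the exact logic is an illustrative guess, not the actual Solr implementation):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a backoff-gated stale marker: at most one
// stale-mark (and hence one triggered state refresh) per 3-second window.
public class BackoffSketch {
  static final long BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(3);

  long retryExpiryTimeNano = Long.MIN_VALUE; // end of the current backoff window
  boolean maybeStale = false;

  synchronized boolean markMaybeStaleIfOutsideBackoff() {
    long now = System.nanoTime();
    if (now < retryExpiryTimeNano) {
      return false; // inside the window: suppress another refresh
    }
    retryExpiryTimeNano = now + BACKOFF_NANOS;
    maybeStale = true; // outside the window: mark stale, restart the window
    return true;
  }

  public static void main(String[] args) {
    BackoffSketch entry = new BackoffSketch();
    System.out.println(entry.markMaybeStaleIfOutsideBackoff()); // first 503 marks stale
    System.out.println(entry.markMaybeStaleIfOutsideBackoff()); // burst within 3 s is throttled
  }
}
```

This shows why unconditionally setting maybeStale = true removes the throttle: a burst of 503s across many client threads could each trigger a refresh within the same window.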

@markrmiller
Member

Please give your 5 cents.

I think Claude likely botched this and it's not the right fix.

You can give something like this a shot and see if the tests reliably pass for you.

I kind of prefer this, but I suppose one could also make sure the sequential, parallel, and fallback paths all throw a RouteException rather than just the parallel path.

Subject: [PATCH] try this
---
Index: solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java b/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java
--- a/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java	(revision fb7377cc11c2cbcd7f7259427699690d9ff15a50)
+++ b/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java	(date 1776789070865)
@@ -246,6 +246,7 @@
       zkServer.shutdown();
       zkServer = new ZkTestServer(zkServer.getZkDir(), zkServer.getPort());
       zkServer.run(false);
+      waitForRecoveriesToFinish(true, 30);
     }
 
     try (CloudSolrClient client = createCloudClient("collection1")) {
Index: solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
--- a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java	(revision fb7377cc11c2cbcd7f7259427699690d9ff15a50)
+++ b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java	(date 1776788165074)
@@ -1270,10 +1270,7 @@
       final boolean wasCommError = wasCommError(rootCause);
 
       if (wasCommError
-          || (exc instanceof RouteException
-              && (errorCode == 503)) // 404 because the core does not exist 503 service unavailable
-      // TODO there are other reasons for 404. We need to change the solr response format from HTML
-      // to structured data to know that
+          || errorCode == 503 // service unavailable - Loss of a node can cause this via any path
       ) {
         // it was a communication error. it is likely that
         // the node to which the request to be sent is down . So , expire the state

@markrmiller
Member

"Note that even if SOLR_JETTY_GRACEFUL is false by default in production, the Test runner always enables it. So this bug would only appear in tests."

I don't think that is right either. Graceful shutdown may make it easier to see a 503, but you will still see them without it as well.

