
SOLR-18203 Improve ChaosMonkeySafeLeaderWithPullReplicasTest frequent failures #4312

Draft
janhoy wants to merge 2 commits into apache:main from janhoy:SOLR-18203-chaos-monkey-503-wait-for-leader

Conversation

@janhoy
Contributor

@janhoy janhoy commented Apr 21, 2026

https://issues.apache.org/jira/browse/SOLR-18203

The problem/fix description below, as well as the fix itself, was produced by Claude AI. I have limited insight into the actual chaos test and CloudSolrClient's retry logic, so the problem description could be mistaken, and even if the fix seems to work, there could exist a more correct fix. Please give your 5 cents.

Problem

ChaosMonkeySafeLeaderWithPullReplicasTest has been failing at an 82–100% rate for months (fucit report).
(Screenshot 2026-04-21 at 15:00: failure-rate report)

The test asserts zero update exceptions during chaos monkey (random node kills). The failure chain:

  1. Chaos monkey kills the shard leader → node enters graceful Jetty shutdown (15s timeout, introduced in SOLR-17744 April 2025), physically alive but logically dead
  2. CloudSolrClient routes the next update to the cached (dead) leader → 503 SERVICE_UNAVAILABLE
  3. requestWithRetryOnStaleState() sees RouteException(503) and retries — but all 5 retries fire within milliseconds of each other, using the same stale routes
  4. A 3-second backoff suppresses the ZK state refresh, so every retry hits the same dead leader → all 503 → update fails → test assertion assertEquals(0, getFailCount()) fails

Note that even though SOLR_JETTY_GRACEFUL is false by default in production, the test runner always enables it, so this bug would only appear in tests.
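As a toy model of the failure mode in steps 3–4 (cachedLeader and the simulated 503 check are illustrative, not actual Solr internals; MAX_STALE_RETRIES = 5 per the description), every attempt reads the same stale route because nothing pauses between retries:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the failure mode: with no wait between retries, every
// attempt reads the same cached (stale) route before any background
// cluster-state refresh can land.
public class StaleRetrySketch {
  static final int MAX_STALE_RETRIES = 5;

  public static void main(String[] args) {
    AtomicReference<String> cachedLeader = new AtomicReference<>("deadNode:8983");
    int failedAttempts = 0;
    for (int attempt = 0; attempt < MAX_STALE_RETRIES; attempt++) {
      // No pause here, so a refresh that takes ~100ms+ never gets a chance
      // to replace the cached route between attempts.
      String target = cachedLeader.get();
      if (target.startsWith("deadNode")) {
        failedAttempts++; // simulated 503 SERVICE_UNAVAILABLE from the dead leader
      }
    }
    System.out.println(failedAttempts);
  }
}
```

All five attempts fail before a refresh could ever swap in the new leader, which is exactly why the assertion on zero update failures trips.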

Fix

In the 503 retry path, wait for a ZooKeeper cluster-state refresh (blocking, ~100ms) before each retry attempt. This spreads retries over the leader election period so the next retry routes to the newly elected leader.

Also bypass the markMaybeStaleIfOutsideBackoff 3-second backoff for 503 errors, since the per-retry wait already throttles the rate naturally.

The waitForCollectionRefresh / triggerCollectionRefresh pattern is already used for INVALID_STATE/404 stale-state handling — this extends it to the 503 case.
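A minimal sketch of the bounded refresh-wait described above, assuming a hypothetical triggerRefresh helper standing in for triggerCollectionRefresh (in the real client the future completes once fresh cluster state has been read from ZooKeeper; here it completes immediately):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the bounded refresh-wait on the 503 retry path. triggerRefresh
// is an illustrative stand-in, not the actual Solr API.
public class RefreshWaitSketch {

  static CompletableFuture<Void> triggerRefresh(String collection) {
    return CompletableFuture.completedFuture(null); // simplified: instant refresh
  }

  // Returns true if the refresh completed in time; false means we fall back
  // to retrying with whatever cached state we have (the pre-fix behavior).
  static boolean awaitRefreshBounded(String collection, long timeoutSec) {
    try {
      triggerRefresh(collection).get(timeoutSec, TimeUnit.SECONDS);
      return true;
    } catch (TimeoutException te) {
      return false; // refresh too slow: log and proceed with current state
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      return false;
    } catch (ExecutionException ee) {
      return false; // refresh itself failed; retry with current state
    }
  }

  public static void main(String[] args) {
    System.out.println(awaitRefreshBounded("collection1", 10));
  }
}
```

The actual patch applies this per requested collection inside the retry loop, with a 10-second bound per wait.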

Testing

Ran ChaosMonkeySafeLeaderWithPullReplicasTest several times both on main and on this feature branch. On main I got 8 failures out of 20 runs. On this PR branch I got 5 failures out of 20 runs. Thus the fix is not 100%, or there may be other code paths in play.

janhoy added 2 commits April 21, 2026 13:45
Without a timeout, waitForCollectionRefresh() would block indefinitely
if ZooKeeper is slow or leader election gets stuck. Now uses a bounded
10s get() on the CompletableFuture; on TimeoutException, logs a warning
and proceeds with the retry using current cached state — no worse than
the pre-fix behavior and avoids any possibility of hanging forever.
@janhoy janhoy requested a review from hossman April 21, 2026 14:39
@janhoy janhoy marked this pull request as draft April 21, 2026 14:39
@janhoy janhoy requested review from Copilot and markrmiller April 21, 2026 14:40
Contributor

Copilot AI left a comment


Pull request overview

Improves SolrJ’s stale-state retry handling for update requests that fail with 503 SERVICE_UNAVAILABLE (often during leader loss / restart), aiming to reduce flakiness in ChaosMonkeySafeLeaderWithPullReplicasTest and similar chaos scenarios.

Changes:

  • Adjusts CloudSolrClient.requestWithRetryOnStaleState() to immediately mark collection state as stale for 503 and wait for a collection state refresh before retrying.
  • Bypasses the existing stale-state backoff logic for 503 retries, relying on the refresh-wait instead.
  • Adds an unreleased changelog entry documenting the behavior change.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java Adds a blocking collection refresh-wait on the RouteException(503) retry path and changes stale-marking/backoff behavior.
changelog/unreleased/SOLR-18203-chaos-monkey-503-wait-for-leader.yml Documents the new 503 retry behavior for CloudSolrClient.


Comment on lines +1311 to +1317
// For 503 "no leader" errors (not plain comm errors where the node is fully dead),
// wait for the cluster state to refresh from ZooKeeper before retrying.
// Without this wait, all MAX_STALE_RETRIES retries fire within milliseconds of each
// other using the same stale routes, hitting the same dead leader repeatedly and
// exhausting all retries before leader election can complete.
// We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry

Copilot AI Apr 21, 2026


There are existing SolrJ retry-related tests (e.g., solr/solrj/src/test/org/apache/solr/client/solrj/impl/CloudHttp2SolrClientRetryTest.java), but this new 503-specific refresh-wait behavior isn’t covered. Adding a focused test that simulates a RouteException(503) and asserts that a collection refresh is triggered and awaited (or that retries are delayed until state changes) would help prevent regressions and confirm the intended behavior.

Comment on lines +1316 to +1336
      // We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        for (DocCollection ext : requestedCollections) {
          try {
            triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
          } catch (TimeoutException te) {
            log.warn(
                "Timed out waiting for cluster state refresh for collection {} before retry; "
                    + "proceeding with retry using current state",
                ext.getName());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException ee) {
            log.warn(
                "Error refreshing cluster state for collection {} before retry",
                ext.getName(),
                ee.getCause());
          }
        }

Copilot AI Apr 21, 2026


The 503 retry path blocks up to 10s per collection per retry and does the waits sequentially. If an alias resolves to multiple collections, this can multiply into a long stall (N collections × 10s × MAX_STALE_RETRIES) on a single caller thread. Consider using a single overall deadline (or CompletableFuture.allOf over the refresh futures) and waiting in parallel with remaining-time budgeting so the total wait is bounded.

Suggested change
      // We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        for (DocCollection ext : requestedCollections) {
          try {
            triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
          } catch (TimeoutException te) {
            log.warn(
                "Timed out waiting for cluster state refresh for collection {} before retry; "
                    + "proceeding with retry using current state",
                ext.getName());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
          } catch (ExecutionException ee) {
            log.warn(
                "Error refreshing cluster state for collection {} before retry",
                ext.getName(),
                ee.getCause());
          }
        }
      // We use a bounded wait (10 s total) so a stuck ZK / stalled election cannot block
      // the caller thread indefinitely; on timeout we log and continue with the retry
      // using whatever state is currently available.
      if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
        List<CompletableFuture<?>> refreshFutures = new ArrayList<>(requestedCollections.size());
        for (DocCollection ext : requestedCollections) {
          final String collectionName = ext.getName();
          refreshFutures.add(
              triggerCollectionRefresh(collectionName)
                  .handle(
                      (ignored, throwable) -> {
                        if (throwable != null) {
                          Throwable cause =
                              throwable.getCause() != null ? throwable.getCause() : throwable;
                          log.warn(
                              "Error refreshing cluster state for collection {} before retry",
                              collectionName,
                              cause);
                        }
                        return null;
                      }));
        }
        try {
          CompletableFuture.allOf(refreshFutures.toArray(new CompletableFuture<?>[0]))
              .get(10, TimeUnit.SECONDS);
        } catch (TimeoutException te) {
          log.warn(
              "Timed out waiting for cluster state refresh for collections {} before retry; "
                  + "proceeding with retry using current state",
              requestedCollections.stream()
                  .map(DocCollection::getName)
                  .collect(Collectors.toList()));
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }

+ "proceeding with retry using current state",
ext.getName());
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();

Copilot AI Apr 21, 2026


On InterruptedException during the refresh wait you re-interrupt the thread but then continue into the recursive retry. This effectively ignores cancellation and can lead to surprising behavior (and is inconsistent with waitForCollectionRefresh(), which throws on interrupt). Consider aborting retries and propagating an exception once interrupted, or returning/throwing after restoring the interrupt flag.

Suggested change
Thread.currentThread().interrupt();
Thread.currentThread().interrupt();
throw new SolrServerException(
"Interrupted while waiting for cluster state refresh before retry for collection "
+ ext.getName(),
ie);

Comment on lines +1288 to +1291
// For both comm errors and 503 (no leader), mark state as stale immediately.
// For 503 we bypass the backoff since we will wait for the refresh below
// before retrying, which naturally throttles the retry rate.
cacheEntry.maybeStale = true;

Copilot AI Apr 21, 2026


This change removes the existing retryExpiryTimeNano backoff for 503 errors by unconditionally setting maybeStale=true (instead of using markMaybeStaleIfOutsideBackoff). That means a burst of 503s can now trigger much more frequent state refreshes (and refresh-waits) across many concurrent client threads. If the goal is specifically to handle leader-election windows, consider keeping some throttling/jitter (or only bypassing backoff for the first 503 in a retry chain) to avoid ZK/cluster-state refresh storms under sustained 503 conditions.

Suggested change
// For both comm errors and 503 (no leader), mark state as stale immediately.
// For 503 we bypass the backoff since we will wait for the refresh below
// before retrying, which naturally throttles the retry rate.
cacheEntry.maybeStale = true;
// Throttle stale marking so repeated comm errors or 503s do not cause
// excessive concurrent cluster-state refresh attempts.
markMaybeStaleIfOutsideBackoff(cacheEntry);
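For reference, the throttling behavior that markMaybeStaleIfOutsideBackoff is meant to preserve can be sketched roughly like this (the 3-second window matches the PR description and the field names follow the quoted diff, but the exact logic is an illustrative guess, not the actual Solr implementation):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a backoff-gated stale marker: at most one
// stale-mark (and hence one triggered state refresh) per 3-second window.
public class BackoffSketch {
  static final long BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(3);

  long retryExpiryTimeNano = Long.MIN_VALUE; // end of the current backoff window
  boolean maybeStale = false;

  synchronized boolean markMaybeStaleIfOutsideBackoff() {
    long now = System.nanoTime();
    if (now < retryExpiryTimeNano) {
      return false; // inside the window: suppress another refresh
    }
    retryExpiryTimeNano = now + BACKOFF_NANOS;
    maybeStale = true; // outside the window: mark stale, restart the window
    return true;
  }

  public static void main(String[] args) {
    BackoffSketch entry = new BackoffSketch();
    System.out.println(entry.markMaybeStaleIfOutsideBackoff()); // first 503 marks stale
    System.out.println(entry.markMaybeStaleIfOutsideBackoff()); // burst within 3 s is throttled
  }
}
```

This shows why unconditionally setting maybeStale = true removes the throttle: a burst of 503s across many client threads could each trigger a refresh within the same window.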

@markrmiller
Member

Please give your 5 cents.

I think Claude likely botched this and it's not the right fix.

You can give something like this a shot and see if the tests reliably pass for you.

I kind of prefer this, but I suppose one could also make sure the sequential, parallel, and fallback paths all throw a RouteException rather than just the parallel path.

Subject: [PATCH] try this
---
Index: solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java b/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java
--- a/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java	(revision fb7377cc11c2cbcd7f7259427699690d9ff15a50)
+++ b/solr/core/src/test/org/apache/solr/cloud/ChaosMonkeySafeLeaderWithPullReplicasTest.java	(date 1776789070865)
@@ -246,6 +246,7 @@
       zkServer.shutdown();
       zkServer = new ZkTestServer(zkServer.getZkDir(), zkServer.getPort());
       zkServer.run(false);
+      waitForRecoveriesToFinish(true, 30);
     }
 
     try (CloudSolrClient client = createCloudClient("collection1")) {
Index: solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java
--- a/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java	(revision fb7377cc11c2cbcd7f7259427699690d9ff15a50)
+++ b/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java	(date 1776788165074)
@@ -1270,10 +1270,7 @@
       final boolean wasCommError = wasCommError(rootCause);
 
       if (wasCommError
-          || (exc instanceof RouteException
-              && (errorCode == 503)) // 404 because the core does not exist 503 service unavailable
-      // TODO there are other reasons for 404. We need to change the solr response format from HTML
-      // to structured data to know that
+          || errorCode == 503 // service unavailable - Loss of a node can cause this via any path
       ) {
         // it was a communication error. it is likely that
         // the node to which the request to be sent is down . So , expire the state

@markrmiller
Member

"Note that even if SOLR_JETTY_GRACEFUL is false by default in production, the Test runner always enables it. So this bug would only appear in tests."

I don't think that is right either. Graceful shutdown may make it easier to see a 503, but you will still see them without it as well.

