SOLR-18203 Improve ChaosMonkeySafeLeaderWithPullReplicasTest frequent failures #4312
janhoy wants to merge 2 commits into apache:main
Conversation
Without a timeout, waitForCollectionRefresh() would block indefinitely if ZooKeeper is slow or leader election gets stuck. Now uses a bounded 10s get() on the CompletableFuture; on TimeoutException, logs a warning and proceeds with the retry using current cached state — no worse than the pre-fix behavior and avoids any possibility of hanging forever.
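The bounded-wait-then-fall-back pattern described above can be illustrated standalone. This is only a sketch: `triggerCollectionRefresh` here is a hypothetical stand-in that never completes (simulating a stuck ZooKeeper watch or stalled election), and the timeout is shortened to 100 ms instead of the patch's 10 s:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedRefreshWait {
  // Hypothetical stand-in for the client's refresh future; never completes,
  // simulating a stuck ZooKeeper watch or a stalled leader election.
  static CompletableFuture<Void> triggerCollectionRefresh(String collection) {
    return new CompletableFuture<>();
  }

  public static void main(String[] args) {
    try {
      // Bounded get(): blocks at most 100 ms here (the patch uses 10 s).
      triggerCollectionRefresh("collection1").get(100, TimeUnit.MILLISECONDS);
      System.out.println("state refreshed");
    } catch (TimeoutException te) {
      // Timeout is non-fatal: fall through and retry with the cached state,
      // which is no worse than the pre-fix behavior.
      System.out.println("timed out; retrying with cached state");
    } catch (ExecutionException ee) {
      System.out.println("refresh failed: " + ee.getCause());
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }
}
```

The key property is that the caller thread can never block longer than the bound, regardless of what the refresh future does.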
Pull request overview
Improves SolrJ’s stale-state retry handling for update requests that fail with 503 SERVICE_UNAVAILABLE (often during leader loss / restart), aiming to reduce flakiness in ChaosMonkeySafeLeaderWithPullReplicasTest and similar chaos scenarios.
Changes:
- Adjusts `CloudSolrClient.requestWithRetryOnStaleState()` to immediately mark collection state as stale for `503` and wait for a collection state refresh before retrying.
- Bypasses the existing stale-state backoff logic for `503` retries, relying on the refresh-wait instead.
- Adds an unreleased changelog entry documenting the behavior change.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java | Adds a blocking collection refresh-wait on the RouteException(503) retry path and changes stale-marking/backoff behavior. |
| changelog/unreleased/SOLR-18203-chaos-monkey-503-wait-for-leader.yml | Documents the new 503 retry behavior for CloudSolrClient. |
```java
// For 503 "no leader" errors (not plain comm errors where the node is fully dead),
// wait for the cluster state to refresh from ZooKeeper before retrying.
// Without this wait, all MAX_STALE_RETRIES retries fire within milliseconds of each
// other using the same stale routes, hitting the same dead leader repeatedly and
// exhausting all retries before leader election can complete.
// We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry
```
There are existing SolrJ retry-related tests (e.g., solr/solrj/src/test/org/apache/solr/client/solrj/impl/CloudHttp2SolrClientRetryTest.java), but this new 503-specific refresh-wait behavior isn’t covered. Adding a focused test that simulates a RouteException(503) and asserts that a collection refresh is triggered and awaited (or that retries are delayed until state changes) would help prevent regressions and confirm the intended behavior.
```java
// We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry
// using whatever state is currently available.
if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
  for (DocCollection ext : requestedCollections) {
    try {
      triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
    } catch (TimeoutException te) {
      log.warn(
          "Timed out waiting for cluster state refresh for collection {} before retry; "
              + "proceeding with retry using current state",
          ext.getName());
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    } catch (ExecutionException ee) {
      log.warn(
          "Error refreshing cluster state for collection {} before retry",
          ext.getName(),
          ee.getCause());
    }
  }
```
The 503 retry path blocks up to 10s per collection per retry and does the waits sequentially. If an alias resolves to multiple collections, this can multiply into a long stall (N collections × 10s × MAX_STALE_RETRIES) on a single caller thread. Consider using a single overall deadline (or CompletableFuture.allOf over the refresh futures) and waiting in parallel with remaining-time budgeting so the total wait is bounded.
Suggested change, from the sequential per-collection waits:

```java
// We use a bounded wait (10 s) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry
// using whatever state is currently available.
if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
  for (DocCollection ext : requestedCollections) {
    try {
      triggerCollectionRefresh(ext.getName()).get(10, TimeUnit.SECONDS);
    } catch (TimeoutException te) {
      log.warn(
          "Timed out waiting for cluster state refresh for collection {} before retry; "
              + "proceeding with retry using current state",
          ext.getName());
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    } catch (ExecutionException ee) {
      log.warn(
          "Error refreshing cluster state for collection {} before retry",
          ext.getName(),
          ee.getCause());
    }
  }
```

to triggering all refreshes up front and waiting on them in parallel under a single deadline:

```java
// We use a bounded wait (10 s total) so a stuck ZK / stalled election cannot block
// the caller thread indefinitely; on timeout we log and continue with the retry
// using whatever state is currently available.
if (!wasCommError && requestedCollections != null && !requestedCollections.isEmpty()) {
  List<CompletableFuture<?>> refreshFutures = new ArrayList<>(requestedCollections.size());
  for (DocCollection ext : requestedCollections) {
    final String collectionName = ext.getName();
    refreshFutures.add(
        triggerCollectionRefresh(collectionName)
            .handle(
                (ignored, throwable) -> {
                  if (throwable != null) {
                    Throwable cause =
                        throwable.getCause() != null ? throwable.getCause() : throwable;
                    log.warn(
                        "Error refreshing cluster state for collection {} before retry",
                        collectionName,
                        cause);
                  }
                  return null;
                }));
  }
  try {
    CompletableFuture
        .allOf(refreshFutures.toArray(new CompletableFuture<?>[0]))
        .get(10, TimeUnit.SECONDS);
  } catch (TimeoutException te) {
    log.warn(
        "Timed out waiting for cluster state refresh for collections {} before retry; "
            + "proceeding with retry using current state",
        requestedCollections.stream()
            .map(DocCollection::getName)
            .collect(Collectors.toList()));
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
  }
```
```java
              + "proceeding with retry using current state",
          ext.getName());
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
```
On InterruptedException during the refresh wait you re-interrupt the thread but then continue into the recursive retry. This effectively ignores cancellation and can lead to surprising behavior (and is inconsistent with waitForCollectionRefresh(), which throws on interrupt). Consider aborting retries and propagating an exception once interrupted, or returning/throwing after restoring the interrupt flag.
Suggested change, from:

```java
Thread.currentThread().interrupt();
```

to restoring the interrupt flag and then aborting the retry:

```java
Thread.currentThread().interrupt();
throw new SolrServerException(
    "Interrupted while waiting for cluster state refresh before retry for collection "
        + ext.getName(),
    ie);
```
| // For both comm errors and 503 (no leader), mark state as stale immediately. | ||
| // For 503 we bypass the backoff since we will wait for the refresh below | ||
| // before retrying, which naturally throttles the retry rate. | ||
| cacheEntry.maybeStale = true; |
This change removes the existing retryExpiryTimeNano backoff for 503 errors by unconditionally setting maybeStale=true (instead of using markMaybeStaleIfOutsideBackoff). That means a burst of 503s can now trigger much more frequent state refreshes (and refresh-waits) across many concurrent client threads. If the goal is specifically to handle leader-election windows, consider keeping some throttling/jitter (or only bypassing backoff for the first 503 in a retry chain) to avoid ZK/cluster-state refresh storms under sustained 503 conditions.
Suggested change, from:

```java
// For both comm errors and 503 (no leader), mark state as stale immediately.
// For 503 we bypass the backoff since we will wait for the refresh below
// before retrying, which naturally throttles the retry rate.
cacheEntry.maybeStale = true;
```

to:

```java
// Throttle stale marking so repeated comm errors or 503s do not cause
// excessive concurrent cluster-state refresh attempts.
markMaybeStaleIfOutsideBackoff(cacheEntry);
```
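For context, the backoff the suggestion restores works roughly like the following sketch. This is illustrative only, not Solr's actual implementation; the names and the 3-second window mirror the `retryExpiryTimeNano` behavior discussed above. The idea: only the first error in each window marks the cache entry stale, so a burst of 503s triggers at most one refresh per window.

```java
import java.util.concurrent.TimeUnit;

public class StaleBackoffGate {
  static final long BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(3);

  static class CacheEntry {
    volatile boolean maybeStale;
    volatile long retryExpiryTimeNano;
  }

  // Illustrative version of markMaybeStaleIfOutsideBackoff: only the first
  // error in each 3 s window marks the entry stale and arms a new window.
  static boolean markMaybeStaleIfOutsideBackoff(CacheEntry e, long nowNanos) {
    if (nowNanos < e.retryExpiryTimeNano) {
      return false; // still inside the backoff window; skip another refresh
    }
    e.retryExpiryTimeNano = nowNanos + BACKOFF_NANOS;
    e.maybeStale = true;
    return true;
  }

  public static void main(String[] args) {
    CacheEntry e = new CacheEntry();
    System.out.println(markMaybeStaleIfOutsideBackoff(e, 0L));            // first 503 marks stale
    System.out.println(markMaybeStaleIfOutsideBackoff(e, 1_000_000L));    // 1 ms later: throttled
    System.out.println(markMaybeStaleIfOutsideBackoff(e, BACKOFF_NANOS)); // window expired: marks again
  }
}
```

Unconditionally setting `maybeStale = true` removes exactly this gate, which is why sustained 503s could then refresh on every failing request.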
I think Claude likely botched this and it's not the right fix. You can give something like this a shot and see if the tests reliably pass for you. I kind of prefer this, but I suppose one could also make sure the sequential, parallel, and fallback paths all throw a RouteException rather than just the parallel path.
"Note that even if SOLR_JETTY_GRACEFUL is false by default in production, the Test runner always enables it. So this bug would only appear in tests." I don't think that is right either. Graceful shutdown may make it easier to see a 503, but you will still see them without it as well.
https://issues.apache.org/jira/browse/SOLR-18203
The problem/fix description below and the fix itself were produced by Claude AI. I have limited insight into the actual chaos test and the CloudSolrClient's retry logic, so the problem description could be mistaken, and even if the fix seems to work, a more correct fix may exist. Please give your 5 cents.
Problem
`ChaosMonkeySafeLeaderWithPullReplicasTest` has been failing at 82–100% for months (fucit report). The test asserts zero update exceptions during chaos monkey (random node kills). The failure chain:

- `CloudSolrClient` routes the next update to the cached (dead) leader → 503 SERVICE_UNAVAILABLE
- `requestWithRetryOnStaleState()` sees `RouteException(503)` and retries — but all 5 retries fire within milliseconds of each other, using the same stale routes
- `assertEquals(0, getFailCount())` fails

Note that even if `SOLR_JETTY_GRACEFUL` is `false` by default in production, the test runner always enables it. So this bug would only appear in tests.

Fix
In the 503 retry path, wait for a ZooKeeper cluster-state refresh (blocking, ~100ms) before each retry attempt. This spreads retries over the leader election period so the next retry routes to the newly elected leader.
Also bypass the `markMaybeStaleIfOutsideBackoff` 3-second backoff for 503 errors, since the per-retry wait already throttles the rate naturally.

The `waitForCollectionRefresh`/`triggerCollectionRefresh` pattern is already used for `INVALID_STATE`/404 stale-state handling — this extends it to the 503 case.

Testing
Ran `ChaosMonkeySafeLeaderWithPullReplicasTest` several times both on main and this feature branch. On main I got 8 failures out of 20 runs. On this PR branch I got 5 failures out of 20 runs. Thus the fix is not 100%, or there may be other code paths in play.
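For what it's worth, 8/20 vs 5/20 failures is within statistical noise for samples this small. A quick two-proportion z-test on those counts (not part of the PR, just a sanity check on the numbers above) gives z ≈ 1.0, well below the ~1.96 threshold for 95% confidence:

```java
public class FailureRateCheck {
  public static void main(String[] args) {
    int failMain = 8, runsMain = 20; // failures observed on main
    int failPr = 5, runsPr = 20;     // failures observed on the PR branch

    double p1 = (double) failMain / runsMain;
    double p2 = (double) failPr / runsPr;
    // Pooled proportion under the null hypothesis of equal failure rates.
    double pooled = (double) (failMain + failPr) / (runsMain + runsPr);
    double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / runsMain + 1.0 / runsPr));
    double z = (p1 - p2) / se;

    System.out.printf("z = %.2f%n", z);
    System.out.println(z > 1.96 ? "significant at 95%" : "not significant at 95%");
  }
}
```

So many more runs per branch would be needed before concluding the fix actually moved the failure rate.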