SOLR-17821: Fix error scenario for ShardInstall or Restore#3434

Open
HoustonPutman wants to merge 88 commits into apache:main from HoustonPutman:fix-restore-partial-success

@HoustonPutman
Contributor

@HoustonPutman HoustonPutman commented Jul 21, 2025

https://issues.apache.org/jira/browse/SOLR-17821

The scenario:

  • A restore or shard install is called on a shard
  • A non-leader replica succeeds; all others fail

Currently, the following happens:

  • The ZK Shard terms are updated to ensure that all terms are non-zero
  • A failure is returned
  • But the cluster state is unchanged, and all shards are still in the state they started in, even though not all of them have the same index

What we want to happen:

  • The ZK Shard terms are updated such that the successful replica(s) are the highest terms
  • Since the leader is no longer the highest term, it should give up leadership
  • All failing replicas should go into Leader-Initiated-Recovery
  • Once recovery has started, our InstallShard/Restore command can succeed, since the end result will be what the user expects
    • We can add a waitForAllReplicasToBeHealthy option to wait for the recoveries to finish
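
To make the intended term handling concrete, here is a minimal sketch of the logic described above. It is only an illustrative model that treats shard terms as a plain map; `ShardTermsSketch` and all of its method names are hypothetical and are not Solr's `ZkShardTerms` API or the code in this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative model only: shard terms as a plain map of replica name to term. */
final class ShardTermsSketch {
  private ShardTermsSketch() {}

  /** Highest term currently held by any replica, or 0 if there are none. */
  static long highestTerm(Map<String, Long> terms) {
    return terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
  }

  /** Returns a copy in which every successful replica holds a term above the previous maximum. */
  static Map<String, Long> promoteSuccessful(Map<String, Long> terms, Set<String> successful) {
    long promoted = highestTerm(terms) + 1;
    Map<String, Long> updated = new HashMap<>(terms);
    for (String replica : successful) {
      if (updated.containsKey(replica)) {
        updated.put(replica, promoted);
      }
    }
    return updated;
  }

  /** A leader whose term is below the highest term should step down. */
  static boolean shouldGiveUpLeadership(Map<String, Long> terms, String leader) {
    return terms.getOrDefault(leader, 0L) < highestTerm(terms);
  }

  /** Replicas below the highest term are the ones expected to recover (sorted by name). */
  static List<String> replicasNeedingRecovery(Map<String, Long> terms) {
    long highest = highestTerm(terms);
    return terms.entrySet().stream()
        .filter(e -> e.getValue() < highest)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }
}
```

With three replicas at the same term and only one non-leader succeeding, `promoteSuccessful` leaves that replica alone at the top, `shouldGiveUpLeadership` becomes true for the old leader, and `replicasNeedingRecovery` returns the two replicas that failed.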

This requires a few changes:

  • Obviously we want to fix the Restore and InstallShard commands to update shard terms correctly
  • Leadership should be given up when the shard term is lower than the highest shard term
  • Recovery should succeed even though the collection is in read-only mode
  • The tests should be able to test that the leader fails, and all other replicas succeed
  • Restore and InstallShard should manipulate the responses, so that the AsyncTracker does not think we are unsuccessful when replicas are put into recovery
  • We should add flags so that the user can control which replicas to download to, and when the response should be sent back (after recovery or not). This is tracked separately in SOLR-18205
  • The CollectionHandlingUtils need to encode and save coreName with requests/responses, in order to distinguish multiple core requests sent to the same node.
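
As a rough illustration of the last point, the sketch below shows why a node-only key is not enough when two cores on the same node each get their own request. It is an assumption-laden toy, not the actual `CollectionHandlingUtils` change: `CoreResponseTrackerSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: tracks per-core results keyed by (nodeName, coreName). */
final class CoreResponseTrackerSketch {
  // A two-element list has value equality, so it works as a composite map key.
  private final Map<List<String>, Boolean> results = new LinkedHashMap<>();

  /** Records the outcome of one core-level request. */
  void record(String nodeName, String coreName, boolean success) {
    results.put(List.of(nodeName, coreName), success);
  }

  /** Returns the recorded outcome, or null if no response was recorded for that core. */
  Boolean result(String nodeName, String coreName) {
    return results.get(List.of(nodeName, coreName));
  }

  /** Number of distinct (node, core) responses recorded. */
  int size() {
    return results.size();
  }
}
```

Keyed this way, a failure for one core and a success for another core on the same node are kept as two separate entries instead of one overwriting the other.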

@HoustonPutman
Contributor Author

Some of the code is kind of hacky right now. But the bad stuff shouldn't be too hard to clean up.

@HoustonPutman HoustonPutman changed the title SOLR-17821: Fix error scenario for ShardInstall or Recover SOLR-17821: Fix error scenario for ShardInstall or Restore Jul 21, 2025
@github-actions github-actions Bot added the tests label Jul 21, 2025
@HoustonPutman
Contributor Author

The implementation works for InstallShard, but we need to add this same functionality to Restore as well.

@github-actions

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions Bot added the stale PR not updated in 60 days label Sep 20, 2025
@github-actions

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions Bot added the closed-stale Closed after being stale for 60 days label Nov 19, 2025
@github-actions github-actions Bot closed this Nov 19, 2025
@HoustonPutman HoustonPutman reopened this Nov 19, 2025
@github-actions github-actions Bot removed closed-stale Closed after being stale for 60 days stale PR not updated in 60 days labels Nov 20, 2025
@HoustonPutman
Contributor Author

This is now ready with all of the prerequisite PRs being merged.

@gerlowskija and I were talking, and we really do need to converge on a common language between install and restore, because they now mean the same thing and share the same code. So Restore/Install collection/shard/core should all generally be sharing the same language. We can probably do that in a future PR, but it should be called out. (Created a JIRA for this: SOLR-18204)

I have also created a separate JIRA (SOLR-18205) to handle configuring which replicas actually download/install data from the backup repository. (And then Solr can handle the rest of the replication)
