Routing table retains stale entries for short-lived shards #6324

@ncoiffier-celonis

Description

Describe the bug

In my setup, I run a suite of unit tests, which does the following:

  • create an index
  • ingest some logs
  • query some logs
  • delete the index
  • run more tests with the same setup

On the second test (and all subsequent tests), ingestion fails with a 503: "ingest service is unavailable (no shards available)".

Steps to reproduce

Here is a small script to reproduce the problem:

reproduce-503.sh
#!/usr/bin/env bash
# Reproduces the 503 "no shards available" bug (commits 92a526b → e1732a7).
# Creates index → ingests → deletes → re-creates → ingests again (should 503 on buggy builds).
set -euo pipefail

IMAGE="${QUICKWIT_IMAGE:-quickwit/quickwit:v0.9.0-rc}"
CONTAINER="qw-repro-503-$$"
URL="http://localhost:7280"
INDEX="test-index-repro"

trap 'echo "Cleaning up..."; docker rm -f "$CONTAINER" >/dev/null 2>&1 || true' EXIT

NOW=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
PAYLOAD='{"timestamp":"'"$NOW"'","message":"msg1"}
{"timestamp":"'"$NOW"'","message":"msg2"}
{"timestamp":"'"$NOW"'","message":"msg3"}'

INDEX_CONFIG='{"version":"0.8","index_id":"'"$INDEX"'","doc_mapping":{"field_mappings":[{"name":"timestamp","type":"datetime","input_formats":["iso8601"],"fast":true},{"name":"message","type":"text"}],"timestamp_field":"timestamp"}}'

# Start Quickwit
echo "Starting Quickwit ($IMAGE)..."
docker run -d --name "$CONTAINER" -p 7280:7280 -e QW_DISABLE_TELEMETRY=true -e RUST_LOG=debug "$IMAGE" run >/dev/null

echo "Waiting for readiness..."
for i in $(seq 1 60); do
    curl -sf "$URL/health/readyz" >/dev/null 2>&1 && break
    [ "$i" -eq 60 ] && { echo "FAIL: not ready after 60s"; exit 1; }
    sleep 1
done
echo "Ready."

# Round 1: create → ingest → delete
echo "--- Round 1 ---"
curl -sf -X POST "$URL/api/v1/indexes" -H "Content-Type: application/json" -d "$INDEX_CONFIG" >/dev/null
echo "Index created."

curl -sf -X POST "$URL/api/v1/$INDEX/ingest" -H "Content-Type: application/x-ndjson" -d "$PAYLOAD" >/dev/null
echo "Ingest OK."

curl -sf -X DELETE "$URL/api/v1/indexes/$INDEX" >/dev/null
echo "Index deleted."

# Wait
echo "Waiting 10s..."
sleep 10

# Round 2: re-create → ingest (expected to fail with bug)
echo "--- Round 2 ---"
curl -sf -X POST "$URL/api/v1/indexes" -H "Content-Type: application/json" -d "$INDEX_CONFIG" >/dev/null
echo "Index re-created."

# If we wait here, the bug doesn't reproduce (maybe because the shards have time to be marked as failed and removed from the cluster state?).
# sleep 10

echo "Ingesting (round 2)..."
curl -sv -X POST "$URL/api/v1/$INDEX/ingest" -H "Content-Type: application/x-ndjson" -d "$PAYLOAD"
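
For anyone triaging this, it may help to compare the cluster state just before and just after the round-2 index re-creation, to see whether the stale shard entries are still present. A hedged sketch of two diagnostic helpers; the GET /api/v1/cluster endpoint and the shape of its response are assumptions on my side, not something I verified:

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic helpers (not part of the repro): capture the cluster
# state into a file, then diff two captures to look for stale shard entries.
set -euo pipefail

snapshot() {      # snapshot <base-url> <outfile>
    # Pretty-print the JSON so the diff below is line-based and readable.
    curl -s "$1/api/v1/cluster" | python3 -m json.tool > "$2"
}

snapshot_diff() { # snapshot_diff <before-file> <after-file>
    # diff exits non-zero when files differ; that is the interesting case here.
    diff "$1" "$2" || true
}
```

Usage would be something like `snapshot "$URL" before.json`, re-create the index, `snapshot "$URL" after.json`, then `snapshot_diff before.json after.json`.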

Expected behavior

Ingestion should succeed after the index is re-created; short-lived shards should not leave stale routing entries behind that cause a 503.

Additional information

  • If I add a sleep 10 between the index re-creation and the ingest, the problem disappears (see the comment in the reproduction script).
  • The problem was introduced between commits 92a526b and e1732a7.
  • From my understanding, it might be related to the recent routing change introduced by @nadav-govari in Merge feature node based routing #6203.
  • I believe the problem could be an edge case between two ticks of the BroadcastIngesterCapacityScoreTask background task.
  • I'm not sure whether such short-lived shards can happen in a "real" production setup (for example, with a really low commitTimeOut), or whether this is a unit-test-only problem.
  • Here are some debug logs of the problem: logs-debug.txt.zip
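
As a stopgap for test suites hitting this, retrying the first post-re-creation ingest with a short backoff works better than a fixed sleep 10, since the window seems to close as soon as the stale entries are evicted. A minimal sketch; the attempt count and delay are arbitrary values I picked, not tuned:

```bash
#!/usr/bin/env bash
# Workaround sketch: retry a command until it succeeds, with a fixed delay
# between attempts, instead of sleeping an arbitrary 10 seconds.
retry() {  # retry <attempts> <delay-seconds> <command...>
    attempts=$1 delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then return 0; fi
        echo "attempt $i/$attempts failed; retrying in ${delay}s..." >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Hypothetical usage for the round-2 ingest in the repro script above:
# retry 10 1 curl -sf -X POST "$URL/api/v1/$INDEX/ingest" \
#     -H "Content-Type: application/x-ndjson" -d "$PAYLOAD"
```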

Configuration:

quickwit version: 0.9.0 (aarch64-unknown-linux-gnu 2026-04-19T08:54:33Z e1732a7)

Metadata

Labels: bug (Something isn't working)