Routing table retains stale entries for short-lived shards #6324

@ncoiffier-celonis

Description

Describe the bug

In my setup, I run a suite of unit tests, which does the following:

  • create an index
  • ingest some logs
  • query some logs
  • delete the index
  • run more tests with the same setup

On the second test (and all subsequent tests), ingestion fails with a 503: "ingest service is unavailable (no shards available)".

Steps to reproduce

Here is a small script to reproduce the problem:

reproduce-503.sh
#!/usr/bin/env bash
# Reproduces the 503 "no shards available" bug (commits 92a526b → e1732a7).
# Creates index → ingests → deletes → re-creates → ingests again (should 503 on buggy builds).
set -euo pipefail

IMAGE="${QUICKWIT_IMAGE:-quickwit/quickwit:v0.9.0-rc}"
CONTAINER="qw-repro-503-$$"
URL="http://localhost:7280"
INDEX="test-index-repro"

trap 'echo "Cleaning up..."; docker rm -f "$CONTAINER" >/dev/null 2>&1 || true' EXIT

NOW=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
PAYLOAD='{"timestamp":"'"$NOW"'","message":"msg1"}
{"timestamp":"'"$NOW"'","message":"msg2"}
{"timestamp":"'"$NOW"'","message":"msg3"}'

INDEX_CONFIG='{"version":"0.8","index_id":"'"$INDEX"'","doc_mapping":{"field_mappings":[{"name":"timestamp","type":"datetime","input_formats":["iso8601"],"fast":true},{"name":"message","type":"text"}],"timestamp_field":"timestamp"}}'

# Start Quickwit
echo "Starting Quickwit ($IMAGE)..."
docker run -d --name "$CONTAINER" -p 7280:7280 -e QW_DISABLE_TELEMETRY=true -e RUST_LOG=debug "$IMAGE" run >/dev/null

echo "Waiting for readiness..."
for i in $(seq 1 60); do
    curl -sf "$URL/health/readyz" >/dev/null 2>&1 && break
    [ "$i" -eq 60 ] && { echo "FAIL: not ready after 60s"; exit 1; }
    sleep 1
done
echo "Ready."

# Round 1: create → ingest → delete
echo "--- Round 1 ---"
curl -sf -X POST "$URL/api/v1/indexes" -H "Content-Type: application/json" -d "$INDEX_CONFIG" >/dev/null
echo "Index created."

curl -sf -X POST "$URL/api/v1/$INDEX/ingest" -H "Content-Type: application/x-ndjson" -d "$PAYLOAD" >/dev/null
echo "Ingest OK."

curl -sf -X DELETE "$URL/api/v1/indexes/$INDEX" >/dev/null
echo "Index deleted."

# Wait
echo "Waiting 10s..."
sleep 10

# Round 2: re-create → ingest (expected to fail with bug)
echo "--- Round 2 ---"
curl -sf -X POST "$URL/api/v1/indexes" -H "Content-Type: application/json" -d "$INDEX_CONFIG" >/dev/null
echo "Index re-created."

# If we wait here, the bug doesn't reproduce (maybe because the shards have time to be marked as failed and removed from the cluster state?).
# sleep 10

echo "Ingesting (round 2)..."
curl -sv -X POST "$URL/api/v1/$INDEX/ingest" -H "Content-Type: application/x-ndjson" -d "$PAYLOAD"
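
For anyone triaging this, it may help to compare the cluster state just before and just after the round-2 index re-creation, to see whether the stale shard entries are still present. A hedged sketch of two diagnostic helpers; the GET /api/v1/cluster endpoint and the shape of its response are assumptions on my side, not something I verified:

```bash
#!/usr/bin/env bash
# Hypothetical diagnostic helpers (not part of the repro): capture the cluster
# state into a file, then diff two captures to look for stale shard entries.
set -euo pipefail

snapshot() {      # snapshot <base-url> <outfile>
    # Pretty-print the JSON so the diff below is line-based and readable.
    curl -s "$1/api/v1/cluster" | python3 -m json.tool > "$2"
}

snapshot_diff() { # snapshot_diff <before-file> <after-file>
    # diff exits non-zero when files differ; that is the interesting case here.
    diff "$1" "$2" || true
}
```

Usage would be something like `snapshot "$URL" before.json`, re-create the index, `snapshot "$URL" after.json`, then `snapshot_diff before.json after.json`.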

Expected behavior

Ingestion should succeed after the index is re-created; short-lived shards should not leave stale routing entries behind that cause a 503.

Additional information

  • If I add a sleep 10 between the index re-creation and the ingest, the problem disappears (see the comment in the reproduction script).
  • The problem was introduced between commits 92a526b and e1732a7.
  • From my understanding, it might be related to the recent routing change introduced by @nadav-govari in Merge feature node based routing #6203.
  • I believe the problem could be an edge case between two ticks of the BroadcastIngesterCapacityScoreTask background task.
  • I'm not sure whether such short-lived shards can happen in a "real" production setup (for example, with a really low commitTimeOut), or whether this is a unit-test-only problem.
  • Here are some debug logs of the problem: logs-debug.txt.zip
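
As a stopgap for test suites hitting this, retrying the first post-re-creation ingest with a short backoff works better than a fixed sleep 10, since the window seems to close as soon as the stale entries are evicted. A minimal sketch; the attempt count and delay are arbitrary values I picked, not tuned:

```bash
#!/usr/bin/env bash
# Workaround sketch: retry a command until it succeeds, with a fixed delay
# between attempts, instead of sleeping an arbitrary 10 seconds.
retry() {  # retry <attempts> <delay-seconds> <command...>
    attempts=$1 delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then return 0; fi
        echo "attempt $i/$attempts failed; retrying in ${delay}s..." >&2
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Hypothetical usage for the round-2 ingest in the repro script above:
# retry 10 1 curl -sf -X POST "$URL/api/v1/$INDEX/ingest" \
#     -H "Content-Type: application/x-ndjson" -d "$PAYLOAD"
```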

Configuration:

quickwit version: 0.9.0 (aarch64-unknown-linux-gnu 2026-04-19T08:54:33Z e1732a7)

Metadata

Labels: bug (Something isn't working)