
OTLP HTTP endpoint hangs in certain conditions #6326


Description

@oleksandr-zhyhalo

Summary

In a single-node Quickwit 0.9.0-nightly deployment with a Postgres metastore and an SQS file source running at ~1500 files/minute, the OTLP HTTP endpoint (POST /api/v1/otlp/v1/logs) hangs indefinitely. Quickwit logs report:

ERROR quickwit_serve::otlp_api::rest_handler:
  otlp internal error: status: 'The service is currently unavailable',
  self: "ingest service is unavailable (no shards available)"

ERROR quickwit_ingest::ingest_v2::router:
  ingest request should not timeout as there is a timeout on independent ingest requests too.
  timeout after 35000

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-..."

However, the chitchat state shows that an _ingest-source shard IS created and assigned to the indexer, and ingester.status=ready, yet the router does not consider the shard assignable.
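
For reference, this is how the chitchat state can be inspected on the node (a minimal sketch; I am assuming the cluster state endpoint at /api/v1/cluster on the default REST port 7280, and the grep only pulls out the relevant entries):

# Dump the cluster / chitchat state of the local node and look for the
# ingester status and the _ingest-source shard entries described above.
curl -s http://localhost:7280/api/v1/cluster > cluster-state.json
grep -Ei 'ingester|_ingest-source|shard' cluster-state.json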

Environment

Item | Value
---- | -----
Image | quickwit/quickwit:v0.9.0-rc (published 2026-04-19 on Docker Hub)
Deployment | Single-node via docker-compose, command: run
Host | AWS EC2, 4 vCPUs, not resource-constrained (CPU/memory under-utilised)
Target | aarch64-unknown-linux-gnu
Metastore | PostgreSQL
Storage | S3 (s3://…/indexes/, region eu-west-1)
enabled_services (chitchat) | metastore,searcher,control_plane,janitor,indexer
ingester.status (chitchat) | ready
readiness (chitchat) | READY

(Yes, I did restart the container.)

Workload

  • Noisy index: ***-logs with an SQS file source (***-sqs-filesource) consuming S3 notifications.

  • Sustained rate: ~1500 files per minute. Each S3 file becomes a distinct shard in the metastore:

    INFO quickwit_metastore::metastore::postgres::metastore:
      opened shard index_uid=***-logs:01KNF3635YNGTBZCWQEY6943JP
      source_id=***-sqs-filesource
      shard_id=s3://***-logs-prod/.../1776710923-xxxxx.ndjson.gz
      leader_id= follower_id=None
    

    at roughly 15–20 log lines per second.

  • Target index for OTLP: otel-logs-v0_9, with ingest_settings.min_shards=1, and _ingest-source (ingest-v2) present alongside _ingest-api-source (see the check sketched after this list).
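
A quick way to confirm the OTLP target index setup, assuming the standard index management endpoint at /api/v1/indexes/<index-id> (the grep just pulls out the fields mentioned above):

# Fetch the metadata of the OTLP target index and check its ingest sources
# and the min_shards setting.
curl -s http://localhost:7280/api/v1/indexes/otel-logs-v0_9 > otel-logs-index.json
grep -E '_ingest-source|_ingest-api-source|min_shards' otel-logs-index.json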

Recurring error pattern in Quickwit logs

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-purple-6GTo"
ERROR quickwit_serve::otlp_api::rest_handler: otlp internal error: ... "ingest service is unavailable (no shards available)"
ERROR quickwit_ingest::ingest_v2::router: ingest request should not timeout... timeout after 35000
ERROR quickwit_indexing::source::queue_sources::shared_state: failed to prune shards error=TooManyRequests
ERROR quickwit_serve::rest: failed to serve connection: connection closed before message completed

The TooManyRequests from queue-sources correlates with the S3/SQS shard churn.

Repro against Quickwit directly (bypassing any proxy)

curl -i --max-time 20 -X POST 'http://localhost:7280/api/v1/otlp/v1/logs' \
  -H 'content-type: application/x-protobuf' \
  -H 'qw-otel-logs-index: otel-logs-v0_9' \
  --data-binary @valid-otlp.bin
# curl: (28) Operation timed out after 20001 milliseconds with 0 bytes received

A malformed or truncated protobuf body (a single \x00 byte, or the first 50/100 bytes of a valid body) returns 400 "failed to decode Protobuf message" instantly, which proves the request reaches Quickwit and that the parse path is fast. Only complete, valid bodies hang.
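
For reproducibility, the truncated-body test looks like this (truncated.bin being an arbitrary prefix of the same valid OTLP body):

# Send only the first 100 bytes of the valid body: Quickwit answers immediately
# with 400 "failed to decode Protobuf message", so the request reaches the
# server and the decode path is fast.
head -c 100 valid-otlp.bin > truncated.bin
curl -i --max-time 20 -X POST 'http://localhost:7280/api/v1/otlp/v1/logs' \
  -H 'content-type: application/x-protobuf' \
  -H 'qw-otel-logs-index: otel-logs-v0_9' \
  --data-binary @truncated.bin
# Only the full, valid body hangs.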

Disabling the SQS file source unblocks OTLP

I temporarily disabled the SQS file source, then retried the Node OTLP client. OTLP started working within seconds: the log record landed in otel-logs-v0_9 on the first attempt.

After re-enabling the SQS source, the endpoint is currently still responsive; I will post any additional updates.
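
For anyone reproducing the workaround, the source can be toggled through the REST API rather than by editing the index config (a sketch, assuming Quickwit's source toggle endpoint; the redacted index and source ids are the ones above):

# Disable the noisy SQS file source (disabling it is what unblocked OTLP in my test)...
curl -X PUT 'http://localhost:7280/api/v1/indexes/***-logs/sources/***-sqs-filesource/toggle' \
  -H 'content-type: application/json' \
  -d '{"enable": false}'

# ...and re-enable it afterwards.
curl -X PUT 'http://localhost:7280/api/v1/indexes/***-logs/sources/***-sqs-filesource/toggle' \
  -H 'content-type: application/json' \
  -d '{"enable": true}'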
