
OTLP HTTP endpoint hangs in certain conditions #6326


Description

@oleksandr-zhyhalo

Summary

In a single-node Quickwit 0.9.0-nightly deployment with a Postgres metastore and an SQS file source running at ~1500 files/minute, the OTLP HTTP endpoint (POST /api/v1/otlp/v1/logs) hangs indefinitely. Quickwit logs report:

ERROR quickwit_serve::otlp_api::rest_handler:
  otlp internal error: status: 'The service is currently unavailable',
  self: "ingest service is unavailable (no shards available)"

ERROR quickwit_ingest::ingest_v2::router:
  ingest request should not timeout as there is a timeout on independent ingest requests too.
  timeout after 35000

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-..."

However, the chitchat state shows that an _ingest-source shard IS created and assigned to the indexer, and ingester.status=ready, yet the router does not consider the shard assignable.
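
For reference, this is how the chitchat state can be inspected on the node (a minimal sketch; I am assuming the cluster state endpoint at /api/v1/cluster on the default REST port 7280, and the grep only pulls out the relevant entries):

# Dump the cluster / chitchat state of the local node and look for the
# ingester status and the _ingest-source shard entries described above.
curl -s http://localhost:7280/api/v1/cluster > cluster-state.json
grep -Ei 'ingester|_ingest-source|shard' cluster-state.json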

Environment

Item | Value
---- | -----
Image | quickwit/quickwit:v0.9.0-rc (published 2026-04-19 on Docker Hub)
Deployment | Single-node via docker-compose, command: run
Host | AWS EC2, 4 vCPUs, not resource-constrained (CPU/memory under-utilised)
Target | aarch64-unknown-linux-gnu
Metastore | PostgreSQL
Storage | S3 (s3://…/indexes/, region eu-west-1)
enabled_services (chitchat) | metastore,searcher,control_plane,janitor,indexer
ingester.status (chitchat) | ready
readiness (chitchat) | READY

(Yes, I did restart the container.)

Workload

  • Noisy index: ***-logs with an SQS file source (***-sqs-filesource) consuming S3 notifications.

  • Sustained rate: ~1500 files per minute. Each S3 file becomes a distinct shard in the metastore:

    INFO quickwit_metastore::metastore::postgres::metastore:
      opened shard index_uid=***-logs:01KNF3635YNGTBZCWQEY6943JP
      source_id=***-sqs-filesource
      shard_id=s3://***-logs-prod/.../1776710923-xxxxx.ndjson.gz
      leader_id= follower_id=None
    

    at roughly 15–20 log lines per second.

  • Target index for OTLP: otel-logs-v0_9, with ingest_settings.min_shards=1, and _ingest-source (ingest-v2) present alongside _ingest-api-source (see the check sketched after this list).
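
A quick way to confirm the OTLP target index setup, assuming the standard index management endpoint at /api/v1/indexes/<index-id> (the grep just pulls out the fields mentioned above):

# Fetch the metadata of the OTLP target index and check its ingest sources
# and the min_shards setting.
curl -s http://localhost:7280/api/v1/indexes/otel-logs-v0_9 > otel-logs-index.json
grep -E '_ingest-source|_ingest-api-source|min_shards' otel-logs-index.json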

Recurring error pattern in Quickwit logs

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-purple-6GTo"
ERROR quickwit_serve::otlp_api::rest_handler: otlp internal error: ... "ingest service is unavailable (no shards available)"
ERROR quickwit_ingest::ingest_v2::router: ingest request should not timeout... timeout after 35000
ERROR quickwit_indexing::source::queue_sources::shared_state: failed to prune shards error=TooManyRequests
ERROR quickwit_serve::rest: failed to serve connection: connection closed before message completed

The TooManyRequests from queue-sources correlates with the S3/SQS shard churn.

Repro against Quickwit directly (bypassing any proxy)

curl -i --max-time 20 -X POST 'http://localhost:7280/api/v1/otlp/v1/logs' \
  -H 'content-type: application/x-protobuf' \
  -H 'qw-otel-logs-index: otel-logs-v0_9' \
  --data-binary @valid-otlp.bin
# curl: (28) Operation timed out after 20001 milliseconds with 0 bytes received

A malformed or truncated protobuf body (a single \x00 byte, or the first 50/100 bytes of a valid body) returns 400 "failed to decode Protobuf message" instantly, which proves the request reaches Quickwit and that the parse path is fast. Only complete, valid bodies hang.
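
For reproducibility, the truncated-body test looks like this (truncated.bin being an arbitrary prefix of the same valid OTLP body):

# Send only the first 100 bytes of the valid body: Quickwit answers immediately
# with 400 "failed to decode Protobuf message", so the request reaches the
# server and the decode path is fast.
head -c 100 valid-otlp.bin > truncated.bin
curl -i --max-time 20 -X POST 'http://localhost:7280/api/v1/otlp/v1/logs' \
  -H 'content-type: application/x-protobuf' \
  -H 'qw-otel-logs-index: otel-logs-v0_9' \
  --data-binary @truncated.bin
# Only the full, valid body hangs.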

Disabling the SQS file source unblocks OTLP

I temporarily disabled the SQS file source, then retried the Node OTLP client. OTLP started working within seconds: the log record landed in otel-logs-v0_9 on the first attempt.

After re-enabling the SQS source, the endpoint is currently still responsive; I will post any additional updates.
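
For anyone reproducing the workaround, the source can be toggled through the REST API rather than by editing the index config (a sketch, assuming Quickwit's source toggle endpoint; the redacted index and source ids are the ones above):

# Disable the noisy SQS file source (disabling it is what unblocked OTLP in my test)...
curl -X PUT 'http://localhost:7280/api/v1/indexes/***-logs/sources/***-sqs-filesource/toggle' \
  -H 'content-type: application/json' \
  -d '{"enable": false}'

# ...and re-enable it afterwards.
curl -X PUT 'http://localhost:7280/api/v1/indexes/***-logs/sources/***-sqs-filesource/toggle' \
  -H 'content-type: application/json' \
  -d '{"enable": true}'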
