spike: telemetry-driven adaptive pool + worker supervisor #237

@FumingPower3925

Summary

Explore adding a supervisor that watches live telemetry from the HTTP engine
and driver pools and adapts sizing automatically, so operators don't have
to hand-tune Workers, MaxOpen, and per-driver pool bounds for a moving
workload.

This spike should decide (a) whether to build it, (b) which of three
candidate mechanisms to build first, and (c) how the control loop and the
knobs look.

Why

Today the three sizing knobs that matter — celeris.Config.Workers,
postgres.WithMaxOpen, redis.WithMaxOpen — are static. An operator
guesses a value at startup and lives with it through every traffic
pattern. Real apps have:

  • Diurnal load. Night-time idle at 10% of peak; daytime peaks
    that would benefit from a fatter Postgres pool.
  • Burst events. A cron job, a campaign, a publish event that
    transiently spikes one subsystem.
  • Workload drift. An app that was 80% Redis at launch ends up 80%
    Postgres 18 months later.

The v1.4.0 design deliberately shares event-loop workers between the
HTTP core and drivers (single-goroutine locality from HTTP → DB → HTTP).
A naive "give Postgres X% of CPUs, celeris Y%" knob would break that
locality. The right knobs to tune adaptively are the pool connection
caps (and possibly the total worker count), not CPU allocation.

Spike questions to answer

1. Which mechanism is worth building?

Three candidates, ranked by architectural fit. The spike should confirm
or reorder:

  1. Adaptive connection-pool sizing (primary candidate). Controller
    watches per-driver acquire-wait latency, idle ratio, p99 query
    latency, and error rate; grows MaxOpen when acquire-wait climbs,
    shrinks when idle ratio stays high. Operates entirely within the
    existing driver/internal/async.Pool lifecycle (which already
    supports dynamic open/close). No CPU partitioning, no FD migration.
  2. Adaptive total worker count. Grow/shrink the HTTP + driver
    shared worker pool. Harder because FDs are pinned to a worker for
    life — shrinking requires draining or accepting drain-on-next-idle
    semantics. Grow-only is simpler and probably enough.
  3. Hot-worker FD rebalancing. When the FD→worker hash happens to
    pile the heaviest connections on one worker while others sit idle,
    migrate some. Narrow use case, complex because io_uring FD
    registration is worker-local.

Which of these does a realistic workload actually benefit from?
Benchmarks on the benchmarks/ rig with skewed traffic should answer
this.

2. What telemetry signals drive the controller?

Candidate signals (each to be evaluated for signal-to-noise ratio and
controllability):

  • Driver side: Pool.Stats() already exposes Open, Idle, Waiting,
    WaitCount, WaitDuration (per driver/internal/async/pool.go).
    Add per-bucket query p99 latency + error rate if missing.
  • Engine side: per-worker CPU time (from /proc/self/task/*/stat or
    Go's pprof goroutine scheduler), per-worker event-batch size,
    per-worker pending-flush queue depth.
  • Go runtime: runtime.NumGoroutine(), GC pause duration, GOMAXPROCS.
  • Application: request rate, p99 response time (already in
    observe.Collector).

Decide which signals carry the highest information content for the
chosen mechanism. Overly rich signals risk overfitting the controller;
too thin a signal set risks oscillation.

3. What's the control algorithm?

Options to compare:

  • Threshold-based hysteresis. Simplest. "If acquire-wait p99 > X
    for T seconds, grow by Δ; if idle ratio > Y for T seconds, shrink
    by Δ." Deterministic, easy to reason about, easy to bound.
  • PID controller. Classic feedback loop on acquire-wait-time as
    error term. More tunable but requires careful damping.
  • EWMA + trend detection. Exponential weighted moving averages
    over 1s / 10s / 60s windows, grow/shrink on sustained trends.
  • Learning approaches (contextual bandit, RL). Almost certainly
    overkill and hard to reason about in a supervisor loop — document
    as explicit non-goal for v1.5.

Recommend the simplest option that passes a load-test rig simulating
diurnal + burst + drift patterns.

4. What are the user-facing knobs?

The supervisor must be bounded by user-set safety rails. Draft API:

type PoolAutoscale struct {
    Enabled  bool
    Min      int           // floor, e.g. 2
    Max      int           // ceiling, e.g. 100
    Target   TargetSignal  // AcquireWaitP99 | IdleRatio | Custom
    Window   time.Duration // averaging window (default 30s)
    Step     int           // change per adjustment (default 1)
    Cooldown time.Duration // min interval between adjustments (default 5s)
}

// postgres.WithAutoscale(cfg)
// redis.WithAutoscale(cfg)

Validate via spike: is this the minimum surface, or are there missing
controls?

5. How does it interact with existing config?

  • If the user sets MaxOpen explicitly and Autoscale.Enabled,
    what wins? Proposal: MaxOpen becomes the initial value; Min/Max
    bound subsequent adjustments. But spike should confirm this matches
    operator expectations.
  • Does SetMaxOpenConns (on database/sql.DB) need to flow through?
  • What's the graceful-shutdown / restart behavior?
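For the precedence question, the proposal in the first bullet can be pinned down as a tiny clamp function. This is a hypothetical helper sketching the proposed semantics, not existing celeris code:

```go
package main

import "fmt"

// initialMaxOpen sketches the proposed precedence: an explicitly set
// MaxOpen becomes the starting value, clamped into the autoscale
// [min, max] band; if MaxOpen was never set (0), start at the floor.
func initialMaxOpen(explicit, min, max int) int {
	switch {
	case explicit == 0:
		return min
	case explicit < min:
		return min
	case explicit > max:
		return max
	default:
		return explicit
	}
}

func main() {
	fmt.Println(initialMaxOpen(25, 2, 100))  // explicit value wins: 25
	fmt.Println(initialMaxOpen(0, 2, 100))   // unset: start at the floor, 2
	fmt.Println(initialMaxOpen(500, 2, 100)) // above ceiling: clamped to 100
}
```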

6. What could go wrong?

  • Oscillation. Controller grows → latency drops → controller
    shrinks → latency rises → controller grows. Hysteresis + cooldowns
    are the usual mitigation; verify empirically.
  • Adversarial workloads. A retry storm looks like a traffic spike.
    Growing the pool helps if the downstream is healthy, but can make
    things worse if the downstream is already overloaded. Is there a
    graceful degradation mode?
  • Observability/debuggability. Operators will want to know why
    the supervisor made each adjustment. Logging + observe.Collector
    integration is table stakes.
  • Testing. Time-based behaviors are hard to unit-test. Need a
    fake clock and replay fixtures for traffic patterns.

Acceptance criteria for the spike

  • Written design document (in the spike's PR description or a
    docs/ folder inside the benchmarks repo, not celeris itself
    per project convention).
  • Recommendation on which of the three candidate mechanisms (1/2/3
    in the list above) to build first, with evidence from benchmarks.
  • Proof-of-concept implementation of adaptive-pool-sizing for
    Postgres (assuming it wins), using a fake clock and synthetic
    traffic generator. Lives on a branch, not merged.
  • Controller behavior characterized on at least these patterns:
    steady-state, diurnal sine wave, square-wave burst, slow drift.
    Graph + benchstat-style summary.
  • User-facing API proposal (PoolAutoscale struct or equivalent)
    with a one-page godoc draft.
  • Estimate of effort for production implementation + open
    questions list for the v1.5 implementation ticket.

Out of scope for this spike

  • CPU-core partitioning between drivers and the HTTP core (different
    design goal, conflicts with shared-event-loop locality).
  • Any form of ML-based controller (RL, contextual bandit, etc.) —
    deterministic control first; learn-based approaches belong in a
    later milestone if at all.
  • Cross-process coordination (k8s HPA integration, multiple celeris
    instances sharing state). Single-process only.
  • Auto-scaling the number of HTTP listeners or TCP sockets — that's
    a deployment concern, not a celeris concern.

Time box

Two weeks of engineering time. If the spike can't answer "which
mechanism to build" after two weeks, default to shipping only
adaptive-pool-sizing as v1.5 and defer the other two.

Labels: area/driver (Database/cache driver infrastructure),
area/engine (Engine interface or implementation),
area/observability (Metrics, logging, debug endpoints)
