spike: telemetry-driven adaptive pool + worker supervisor #237

@FumingPower3925

Summary

Explore adding a supervisor that watches live telemetry from the HTTP engine
and driver pools and adapts sizing automatically, so operators don't have
to hand-tune Workers, MaxOpen, and per-driver pool bounds for a moving
workload.

This spike should decide (a) whether to build it, (b) which of three
candidate mechanisms to build first, and (c) how the control loop and the
knobs look.

Why

Today the three sizing knobs that matter — celeris.Config.Workers,
postgres.WithMaxOpen, redis.WithMaxOpen — are static. An operator
guesses a value at startup and lives with it through every traffic
pattern. Real apps have:

  • Diurnal load. Night-time idle at 10% of peak; daytime peaks
    that would benefit from a fatter Postgres pool.
  • Burst events. A cron job, a campaign, a publish event that
    transiently spikes one subsystem.
  • Workload drift. An app that was 80% Redis at launch ends up 80%
    Postgres 18 months later.

The v1.4.0 design deliberately shares event-loop workers between the
HTTP core and drivers (single-goroutine locality from HTTP → DB → HTTP).
A naive "give Postgres X% of CPUs, celeris Y%" knob would break that
locality. The right knobs to tune adaptively are the pool connection
caps (and possibly the total worker count), not CPU allocation.

Spike questions to answer

1. Which mechanism is worth building?

Three candidates, ranked by architectural fit. The spike should confirm
or reorder:

  1. Adaptive connection-pool sizing (primary candidate). Controller
    watches per-driver acquire-wait latency, idle ratio, p99 query
    latency, and error rate; grows MaxOpen when acquire-wait climbs,
    shrinks when idle ratio stays high. Operates entirely within the
    existing driver/internal/async.Pool lifecycle (which already
    supports dynamic open/close). No CPU partitioning, no FD migration.
  2. Adaptive total worker count. Grow/shrink the HTTP + driver
    shared worker pool. Harder because FDs are pinned to a worker for
    life — shrinking requires draining or accepting drain-on-next-idle
    semantics. Grow-only is simpler and probably enough.
  3. Hot-worker FD rebalancing. When the FD→worker hash happens to
    pile the heaviest connections on one worker while others sit idle,
    migrate some. Narrow use case, complex because io_uring FD
    registration is worker-local.

Which of these does a realistic workload actually benefit from?
Benchmarks on the benchmarks/ rig with skewed traffic should answer
this.

2. What telemetry signals drive the controller?

Candidate signals (each to be evaluated for signal-to-noise ratio and
controllability):

  • Driver side: Pool.Stats() already exposes Open, Idle, Waiting,
    WaitCount, WaitDuration (per driver/internal/async/pool.go).
    Add per-bucket query p99 latency + error rate if missing.
  • Engine side: per-worker CPU time (from /proc/self/task/*/stat or
    Go's pprof goroutine scheduler), per-worker event-batch size,
    per-worker pending-flush queue depth.
  • Go runtime: runtime.NumGoroutine(), GC pause duration, GOMAXPROCS.
  • Application: request rate, p99 response time (already in
    observe.Collector).

Decide which signals carry the highest information content for the
chosen mechanism. Overly rich signals risk overfitting the controller;
too thin a signal set risks oscillation.

3. What's the control algorithm?

Options to compare:

  • Threshold-based hysteresis. Simplest. "If acquire-wait p99 > X
    for T seconds, grow by Δ; if idle ratio > Y for T seconds, shrink
    by Δ." Deterministic, easy to reason about, easy to bound.
  • PID controller. Classic feedback loop on acquire-wait-time as
    error term. More tunable but requires careful damping.
  • EWMA + trend detection. Exponential weighted moving averages
    over 1s / 10s / 60s windows, grow/shrink on sustained trends.
  • Learning approaches (contextual bandit, RL). Almost certainly
    overkill and hard to reason about in a supervisor loop — document
    as explicit non-goal for v1.5.

Recommend the simplest option that passes a load-test rig simulating
diurnal + burst + drift patterns.

4. What are the user-facing knobs?

The supervisor must be bounded by user-set safety rails. Draft API:

type PoolAutoscale struct {
    Enabled  bool
    Min      int           // floor, e.g. 2
    Max      int           // ceiling, e.g. 100
    Target   TargetSignal  // AcquireWaitP99 | IdleRatio | Custom
    Window   time.Duration // averaging window (default 30s)
    Step     int           // change per adjustment (default 1)
    Cooldown time.Duration // min interval between adjustments (default 5s)
}

// postgres.WithAutoscale(cfg)
// redis.WithAutoscale(cfg)

Validate via spike: is this the minimum surface, or are there missing
controls?

5. How does it interact with existing config?

  • If the user sets MaxOpen explicitly and Autoscale.Enabled,
    what wins? Proposal: MaxOpen becomes the initial value; Min/Max
    bound subsequent adjustments. But spike should confirm this matches
    operator expectations.
  • Does SetMaxOpenConns (on database/sql.DB) need to flow through?
  • What's the graceful-shutdown / restart behavior?
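For the precedence question, the proposal in the first bullet can be pinned down as a tiny clamp function. This is a hypothetical helper sketching the proposed semantics, not existing celeris code:

```go
package main

import "fmt"

// initialMaxOpen sketches the proposed precedence: an explicitly set
// MaxOpen becomes the starting value, clamped into the autoscale
// [min, max] band; if MaxOpen was never set (0), start at the floor.
func initialMaxOpen(explicit, min, max int) int {
	switch {
	case explicit == 0:
		return min
	case explicit < min:
		return min
	case explicit > max:
		return max
	default:
		return explicit
	}
}

func main() {
	fmt.Println(initialMaxOpen(25, 2, 100))  // explicit value wins: 25
	fmt.Println(initialMaxOpen(0, 2, 100))   // unset: start at the floor, 2
	fmt.Println(initialMaxOpen(500, 2, 100)) // above ceiling: clamped to 100
}
```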

6. What could go wrong?

  • Oscillation. Controller grows → latency drops → controller
    shrinks → latency rises → controller grows. Hysteresis + cooldowns
    are the usual mitigation; verify empirically.
  • Adversarial workloads. A retry storm looks like a traffic spike.
    Growing the pool helps if the downstream is healthy, but can make
    things worse if the downstream is already overloaded. Is there a
    graceful degradation mode?
  • Observability/debuggability. Operators will want to know why
    the supervisor made each adjustment. Logging + observe.Collector
    integration is table stakes.
  • Testing. Time-based behaviors are hard to unit-test. Need a
    fake clock and replay fixtures for traffic patterns.

Acceptance criteria for the spike

  • Written design document (in the spike's PR description or a
    docs/ folder inside the benchmarks repo, not celeris itself
    per project convention).
  • Recommendation on which of the three candidate mechanisms (1/2/3
    in the list above) to build first, with evidence from benchmarks.
  • Proof-of-concept implementation of adaptive-pool-sizing for
    Postgres (assuming it wins), using a fake clock and synthetic
    traffic generator. Lives on a branch, not merged.
  • Controller behavior characterized on at least these patterns:
    steady-state, diurnal sine wave, square-wave burst, slow drift.
    Graph + benchstat-style summary.
  • User-facing API proposal (PoolAutoscale struct or equivalent)
    with a one-page godoc draft.
  • Estimate of effort for production implementation + open
    questions list for the v1.5 implementation ticket.

Out of scope for this spike

  • CPU-core partitioning between drivers and the HTTP core (different
    design goal, conflicts with shared-event-loop locality).
  • Any form of ML-based controller (RL, contextual bandit, etc.) —
    deterministic control first; learn-based approaches belong in a
    later milestone if at all.
  • Cross-process coordination (k8s HPA integration, multiple celeris
    instances sharing state). Single-process only.
  • Auto-scaling the number of HTTP listeners or TCP sockets — that's
    a deployment concern, not a celeris concern.

Time box

Two weeks of engineering time. If the spike can't answer "which
mechanism to build" after two weeks, default to shipping only
adaptive-pool-sizing as v1.5 and defer the other two.

Labels: area/driver (Database/cache driver infrastructure),
area/engine (Engine interface or implementation),
area/observability (Metrics, logging, debug endpoints)
