Summary
Explore adding a supervisor that watches live telemetry from the HTTP engine
and driver pools and adapts sizing automatically, so operators don't have
to hand-tune Workers, MaxOpen, and per-driver pool bounds for a moving
workload.
This spike should decide (a) whether to build it, (b) which of three
candidate mechanisms to build first, and (c) how the control loop and the
knobs look.
Why
Today the three sizing knobs that matter — celeris.Config.Workers,
postgres.WithMaxOpen, and redis.WithMaxOpen — are static. An operator
guesses a value at startup and lives with it through every traffic
pattern. Real apps have:
- Diurnal load. Night-time idle at 10% of peak; daytime peaks
that would benefit from a fatter Postgres pool.
- Burst events. A cron job, a campaign, a publish event that
transiently spikes one subsystem.
- Workload drift. An app that was 80% Redis at launch ends up 80%
Postgres 18 months later.
The v1.4.0 design deliberately shares event-loop workers between the
HTTP core and drivers (single-goroutine locality from HTTP → DB → HTTP).
A naive "give Postgres X% of CPUs, celeris Y%" knob would break that
locality. The right knobs to tune adaptively are the pool connection
caps (and possibly the total worker count), not CPU allocation.
Spike questions to answer
1. Which mechanism is worth building?
Three candidates, ranked by architectural fit. The spike should confirm
or reorder:
- Adaptive connection-pool sizing (primary candidate). Controller
watches per-driver acquire-wait latency, idle ratio, p99 query
latency, and error rate; grows MaxOpen when acquire-wait climbs,
shrinks when idle ratio stays high. Operates entirely within the
existing driver/internal/async.Pool lifecycle (which already
supports dynamic open/close). No CPU partitioning, no FD migration.
- Adaptive total worker count. Grow/shrink the HTTP + driver
shared worker pool. Harder because FDs are pinned to a worker for
life — shrinking requires draining or accepting drain-on-next-idle
semantics. Grow-only is simpler and probably enough.
- Hot-worker FD rebalancing. When the FD→worker hash happens to
pile the heaviest connections on one worker while others sit idle,
migrate some. Narrow use case, complex because io_uring FD
registration is worker-local.
Which of these does a realistic workload actually benefit from?
Benchmarks on the benchmarks/ rig with skewed traffic should answer
this.
2. What telemetry signals drive the controller?
Candidate signals (each to be evaluated for signal-to-noise ratio and
controllability):
- Driver side:
Pool.Stats() already exposes Open, Idle, Waiting,
WaitCount, WaitDuration (per driver/internal/async/pool.go).
Add per-bucket query p99 latency + error rate if missing.
- Engine side: per-worker CPU time (from
/proc/self/task/*/stat or the Go runtime's execution tracer),
per-worker event-batch size, per-worker pending-flush queue depth.
- Go runtime:
runtime.NumGoroutine(), GC pause duration, GOMAXPROCS.
- Application: request rate, p99 response time (already in
observe.Collector).
Decide which signals have the highest information content for the
chosen mechanism. Overly rich signals risk overfitting the controller;
overly thin ones risk oscillation.
3. What's the control algorithm?
Options to compare:
- Threshold-based hysteresis. Simplest. "If acquire-wait p99 > X
for T seconds, grow by Δ; if idle ratio > Y for T seconds, shrink
by Δ." Deterministic, easy to reason about, easy to bound.
- PID controller. Classic feedback loop on acquire-wait-time as
error term. More tunable but requires careful damping.
- EWMA + trend detection. Exponential weighted moving averages
over 1s / 10s / 60s windows, grow/shrink on sustained trends.
- Learning approaches (contextual bandit, RL). Almost certainly
overkill and hard to reason about in a supervisor loop — document
as explicit non-goal for v1.5.
Recommend the simplest algorithm that passes a load-test rig simulating
diurnal and burst traffic patterns.
4. What are the user-facing knobs?
The supervisor must be bounded by user-set safety rails. Draft API:
type PoolAutoscale struct {
    Enabled  bool
    Min      int           // floor, e.g. 2
    Max      int           // ceiling, e.g. 100
    Target   TargetSignal  // AcquireWaitP99 | IdleRatio | Custom
    Window   time.Duration // averaging window (default 30s)
    Step     int           // change per adjustment (default 1)
    Cooldown time.Duration // min interval between adjustments (default 5s)
}
// postgres.WithAutoscale(cfg)
// redis.WithAutoscale(cfg)
Validate via spike: is this the minimum surface, or are there missing
controls?
5. How does it interact with existing config?
- If the user sets
MaxOpen explicitly and Autoscale.Enabled,
what wins? Proposal: MaxOpen becomes the initial value; Min/Max
bound subsequent adjustments. But spike should confirm this matches
operator expectations.
- Does
SetMaxOpenConns (on database/sql.DB) need to flow through?
- What's the graceful-shutdown / restart behavior?
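The precedence proposal in the first bullet could be pinned down roughly as follows; resolveInitialMaxOpen is a hypothetical helper for illustration, not existing celeris API:

```go
package main

import "fmt"

// PoolAutoscale matches the draft struct in question 4; only the fields that
// matter for precedence are shown.
type PoolAutoscale struct {
	Enabled  bool
	Min, Max int
}

// resolveInitialMaxOpen applies the proposal: an explicitly set MaxOpen is
// the starting value, clamped into [Min, Max] when autoscale is on; with
// autoscale off it is used as-is, exactly as today.
func resolveInitialMaxOpen(maxOpen int, as PoolAutoscale) int {
	if !as.Enabled {
		return maxOpen
	}
	if maxOpen < as.Min {
		return as.Min
	}
	if maxOpen > as.Max {
		return as.Max
	}
	return maxOpen
}

func main() {
	as := PoolAutoscale{Enabled: true, Min: 2, Max: 100}
	fmt.Println(resolveInitialMaxOpen(150, as)) // explicit value above ceiling → 100
	fmt.Println(resolveInitialMaxOpen(25, as))  // in-bounds value kept as the start
}
```

Whether clamping a user-set MaxOpen silently is acceptable, or should instead be a startup validation error, is one of the operator-expectation questions the spike should answer.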
6. What could go wrong?
- Oscillation. Controller grows → latency drops → controller
shrinks → latency rises → controller grows. Hysteresis + cooldowns
are the usual mitigation; verify empirically.
- Adversarial workloads. A retry storm looks like a traffic spike.
Growing the pool helps if the downstream is healthy, but can make
things worse if the downstream is already overloaded. Is there a
graceful degradation mode?
- Observability/debuggability. Operators will want to know why
the supervisor made each adjustment. Logging + observe.Collector
integration is table stakes.
- Testing. Time-based behaviors are hard to unit-test. Need a
fake clock and replay fixtures for traffic patterns.
Acceptance criteria for the spike
- A written decision doc (in the docs/ folder inside the benchmarks repo,
not celeris itself, per project convention).
- A recommendation on which mechanism to build first, with evidence from
benchmarks.
- A prototype of adaptive pool sizing for Postgres (assuming it wins),
using a fake clock and synthetic traffic generator. Lives on a branch,
not merged.
- Benchmark results across four traffic shapes: steady-state, diurnal
sine wave, square-wave burst, slow drift. Graph + benchstat-style
summary.
- A draft public API (the PoolAutoscale struct or equivalent) with a
one-page godoc draft.
- An open-questions list for the v1.5 implementation ticket.
Out of scope for this spike
- CPU-core partitioning between drivers and the HTTP core (different
design goal, conflicts with shared-event-loop locality).
- Any form of ML-based controller (RL, contextual bandit, etc.) —
deterministic control first; learn-based approaches belong in a
later milestone if at all.
- Cross-process coordination (k8s HPA integration, multiple celeris
instances sharing state). Single-process only.
- Auto-scaling the number of HTTP listeners or TCP sockets — that's
a deployment concern, not a celeris concern.
Time box
Two weeks of engineering time. If the spike can't answer "which
mechanism to build" after two weeks, default to shipping only
adaptive-pool-sizing as v1.5 and defer the other two.