Add /system/standby API for out-of-band scale-to-zero pin #237
sjmiller609 wants to merge 5 commits into
Adds two new endpoints to the kernel-images server:
- POST /system/standby/disable: pins scale-to-zero off until released
- POST /system/standby/enable: releases the pin

The pin lives alongside the existing request-driven middleware refcount in DebouncedController: scale-to-zero stays disabled while either in-flight requests hold it or the pin is held. Request-driven Enable calls do not release the pin, so a pinned VM survives idle periods. Releasing the pin honors any configured re-enable cooldown.

This is the in-VM surface for future control-plane integrations (e.g. a hot-pool controller reserving a VM until it is claimed). Control-plane wiring will follow in metro-api and the API server.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spins up the headless image via testcontainers and exercises:
- Idempotent disable (two consecutive 204s)
- A normal request flows while pinned (middleware coexistence)
- Idempotent enable (two consecutive 204s)

The unikraft control file does not exist inside the Docker test container, so the underlying scale-to-zero write is a no-op. The test validates HTTP wiring and handler/middleware coexistence; the deep pin semantics are covered by unit tests against DebouncedController.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address review feedback:
- Path /system/standby/* implied VM-state mutation; rename to
/scaletozero/{pin,unpin} so the operation is specific to the
scale-to-zero gate.
- Interface methods DisablePin/EnablePin read as inverted; rename to
Pin/Unpin for clarity.
- Rewrite openapi summary/description to be caller-focused (what it
does, when to call, what pairs with).
Match user-facing terminology to the action ("disable scale to zero")
rather than the internal pin mechanism. Internal PinnedController.Pin/Unpin
methods retain pin/unpin naming since they're distinct from the refcounted
Controller.Disable/Enable.
Rename refcounted hold methods to Acquire/Release so that Disable/Enable
can carry the idempotent persistent-toggle semantics defined by the
/scaletozero/{disable,enable} API. Split the low-level direct toggle out
into a separate Toggler interface (unikraftCloudToggler) wrapped by
DebouncedController.
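The split described in the last commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the Toggler method name SetEnabled, the fakeToggler type, and the Controller fields are all assumptions, context parameters are dropped, and debouncing and the re-enable cooldown are left out. The real controller may also avoid redundant writes to the toggler, which this sketch does not.

```go
package main

import (
	"fmt"
	"sync"
)

// Toggler is the low-level switch for scale-to-zero. The single-method
// shape is an assumption made for this sketch.
type Toggler interface {
	SetEnabled(enabled bool) error
}

// fakeToggler remembers the last value written so the example can print it.
type fakeToggler struct{ enabled bool }

func (f *fakeToggler) SetEnabled(enabled bool) error {
	f.enabled = enabled
	return nil
}

// Controller layers refcounted holds and a persistent override on a Toggler.
// Debouncing and the re-enable cooldown are intentionally left out.
type Controller struct {
	mu       sync.Mutex
	toggler  Toggler
	holders  int  // Acquire calls that have not been released yet
	disabled bool // persistent override set by Disable, cleared by Enable
}

func NewController(t Toggler) *Controller { return &Controller{toggler: t} }

// applyLocked writes the derived state: scale-to-zero is on only when no
// holder remains and the override is clear. The caller must hold c.mu.
func (c *Controller) applyLocked() error {
	return c.toggler.SetEnabled(c.holders == 0 && !c.disabled)
}

// Acquire takes one refcounted hold (for example, one in-flight request).
func (c *Controller) Acquire() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.holders++
	return c.applyLocked()
}

// Release drops one hold; it never clears the persistent override.
func (c *Controller) Release() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.holders > 0 {
		c.holders--
	}
	return c.applyLocked()
}

// Disable sets the override. Calling it twice is the same as calling it once.
func (c *Controller) Disable() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.disabled = true
	return c.applyLocked()
}

// Enable clears the override; scale-to-zero resumes only if no holds remain.
func (c *Controller) Enable() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.disabled = false
	return c.applyLocked()
}

// demo returns the toggler state after an idle period under the override,
// and again after the override is cleared.
func demo() (whilePinned, afterEnable bool) {
	tog := &fakeToggler{enabled: true}
	ctl := NewController(tog)
	_ = ctl.Disable()
	_ = ctl.Disable() // idempotent
	_ = ctl.Acquire() // a request arrives...
	_ = ctl.Release() // ...and finishes; the VM is idle again
	whilePinned = tog.enabled
	_ = ctl.Enable()
	return whilePinned, tog.enabled
}

func main() {
	fmt.Println(demo()) // false true
}
```

Deriving the toggler state from both inputs in one place (applyLocked) is what keeps Release from undoing Disable: neither method writes the switch directly.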
Monitoring Plan:
  /system/standby/disable:
    post:
      summary: Pin scale-to-zero off until /system/standby/enable is called
      description: >
        Disables scale-to-zero out-of-band of the request-driven middleware,
        holding it disabled across idle periods. The pin is independent of the
        inflight-request refcount: request-driven Enable calls will not release
        it. Idempotent — repeated calls have no additional effect.
      operationId: disableStandby
      responses:
        "204":
          description: Standby pinned disabled
        "500":
          $ref: "#/components/responses/InternalError"
  /system/standby/enable:
    post:
      summary: Release the standby pin set by /system/standby/disable
      description: >
        Releases the out-of-band scale-to-zero pin. If no request-driven
        holders remain, scale-to-zero re-enables (honoring any configured
        cooldown). Idempotent — calling without a held pin has no effect.
pls deslop descriptions. these read like internal implementation notes rather than information a caller needs: what does calling this endpoint do for me, when do I call it, and what pairs with it
Rewrote both summary/description to be caller-focused — what the endpoint does, when to call it, and what pairs with it. See a41fed3.
hold on, I'm manually reviewing these now
            text/event-stream:
              schema:
                $ref: "#/components/schemas/PublishedEnvelope"
  /system/standby/disable:
would recommend making the path/operation more specific to scaletozero since the current naming could imply it's actually mutating the state of the vm, which is not what we're doing
Renamed paths to /scaletozero/pin and /scaletozero/unpin, operationIds to pinScaleToZero/unpinScaleToZero. a41fed3.
    // DisablePin pins scale-to-zero disabled until EnablePin is called. The
    // pin is a boolean, not a counter: repeated calls are idempotent.
    DisablePin(ctx context.Context) error
    // EnablePin releases the pin. If no request-driven holders remain,
    // scale-to-zero is re-enabled (honoring any configured cooldown).
    EnablePin(ctx context.Context) error
these operations seem inverted. I had to re-read this a few times to wrap my head around it. would recommend revisiting semantics here to simplify. even Pin / Unpin would be clearer
Renamed to Pin/Unpin on the PinnedController interface, DebouncedController, and NoopController. a41fed3.
            text/event-stream:
              schema:
                $ref: "#/components/schemas/PublishedEnvelope"
  /scaletozero/pin:
I think the path like enable / disable is more clear
      factory recorder.FFmpegRecorderFactory,
      upstreamMgr *devtoolsproxy.UpstreamManager,
    - stz scaletozero.Controller,
    + stz scaletozero.PinnedController,
getting rid of "pin" terminology
Summary
Adds two new endpoints to the kernel-images server:
- POST /system/standby/disable: pins scale-to-zero off until released
- POST /system/standby/enable: releases the pin

The pin lives alongside the existing request-driven middleware refcount inside DebouncedController:
- Enable (from middleware) does not release the pin

DebouncedController and NoopController now also implement a new PinnedController sub-interface (Controller + DisablePin/EnablePin). The pin is a boolean: DisablePin/EnablePin are idempotent.

Why
This is the in-VM surface needed for a future control-plane integration: an external system (e.g. a hot-pool controller) needs to hold a VM out of standby while it sits idle in a pool, then release the hold when the VM is claimed.
The existing middleware refcount only works for inflight HTTP requests, so it can't hold a VM disabled across an idle period.
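To see why a request-scoped hold cannot cover an idle period, here is a rough sketch of such a middleware. The holder interface, the counter type, and the hold function are hypothetical names for illustration; the refcounted methods are called Acquire/Release here, following the rename made later in this PR, and the real middleware's signature may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// holder is a hypothetical stand-in for the refcounted side of the
// scale-to-zero controller.
type holder interface {
	Acquire()
	Release()
}

// counter tracks how many requests are currently in flight.
type counter struct{ n atomic.Int64 }

func (c *counter) Acquire() { c.n.Add(1) }
func (c *counter) Release() { c.n.Add(-1) }

// hold wraps next so that the hold lasts exactly as long as the request.
func hold(h holder, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h.Acquire()
		defer h.Release()
		next.ServeHTTP(w, r)
	})
}

// serveOnce sends one request through the middleware and reports the hold
// count observed inside the handler and after the request has finished.
func serveOnce() (during, after int64) {
	c := &counter{}
	h := hold(c, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		during = c.n.Load()
		w.WriteHeader(http.StatusNoContent)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return during, c.n.Load()
}

func main() {
	during, after := serveOnce()
	fmt.Println(during, after) // 1 0
}
```

Once the last request returns, the count is back at zero and nothing keeps scale-to-zero disabled, which is the gap the out-of-band hold is meant to fill.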
Notes for reviewers
NewDebouncedController* widened from Controller to *DebouncedController so callers can access the pin methods. Only existing caller is cmd/api/main.go, which is unaffected since *DebouncedController still satisfies Controller for recorder.NewFFmpegRecorderFactory and scaletozero.Middleware.

Test plan
- go test -race ./lib/scaletozero/... passes (6 new tests covering pin semantics)
- go test -race ./cmd/api/... passes
- go vet ./... clean
- go build ./... clean
- lib/oapi/oapi.go

Note
Medium Risk
Updates core scale-to-zero coordination logic (new refcounted Acquire/Release plus persistent Disable/Enable) and threads it through request middleware and ffmpeg recording, so regressions could affect VM standby behavior and recording reliability.

Overview
Adds new out-of-band API endpoints POST /scaletozero/disable and POST /scaletozero/enable (OpenAPI + generated oapi client/server + ApiService handlers) to persistently pin scale-to-zero off/on.

Refactors scale-to-zero control to separate a low-level Toggler from a higher-level Controller that now supports both refcounted holds (Acquire/Release, used by HTTP middleware and ffmpeg) and an idempotent persistent override (Disable/Enable, used by the new API), including cooldown-aware re-enable logic.

Updates the ffmpeg recorder and scaletozero middleware to use Acquire/Release instead of directly toggling, and adds unit + e2e coverage to validate idempotency and that normal requests continue to work while pinned.

Reviewed by Cursor Bugbot for commit 7d26ce6.
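From a caller's side, using the two endpoints named in the overview might look like the sketch below. Only the paths and the 204 success status come from this PR; post204, holdWhile, and the stub server are hypothetical names, and the stub merely stands in for the kernel-images server so the example runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// post204 sends a body-less POST and fails unless the reply is 204 No Content.
func post204(client *http.Client, url string) error {
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("POST %s: unexpected status %d", url, resp.StatusCode)
	}
	return nil
}

// holdWhile keeps scale-to-zero disabled for the duration of fn.
func holdWhile(client *http.Client, baseURL string, fn func() error) error {
	if err := post204(client, baseURL+"/scaletozero/disable"); err != nil {
		return err
	}
	fnErr := fn()
	// Always try to re-enable, even if fn failed.
	if err := post204(client, baseURL+"/scaletozero/enable"); err != nil && fnErr == nil {
		return err
	}
	return fnErr
}

// stub answers every POST with 204 and records the paths it was asked for.
type stub struct {
	mu    sync.Mutex
	paths []string
}

func (s *stub) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	s.mu.Lock()
	s.paths = append(s.paths, r.URL.Path)
	s.mu.Unlock()
	w.WriteHeader(http.StatusNoContent)
}

// demo runs holdWhile against the stub and returns the recorded call order.
func demo() ([]string, error) {
	s := &stub{}
	srv := httptest.NewServer(s)
	defer srv.Close()
	err := holdWhile(srv.Client(), srv.URL, func() error { return nil })
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.paths...), err
}

func main() {
	paths, err := demo()
	fmt.Println(paths, err)
}
```

Because both endpoints are idempotent, a caller that is unsure whether an earlier disable or enable went through can simply repeat the call.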