fix(state): pending sessions — stop holding the exclusive lock through provisioning #20
Merged
cmd_up held the exclusive state lock across the entire bring_up (docker pulls, compose up, worktree creation, and user hooks with no timeout), while lock acquisition gives up after 10s. Parallel `up`s serialized and failed with LockTimeout, and every read command (ls/env/status/shell) failed while any up was in flight. down and shutdown additionally showed interactive prompts while holding the lock.

Sessions now carry a status (active | pending, serde-compatible with old state files). up reserves the slug + slot by committing a pending session in a short exclusive section, releases the lock, provisions, then re-acquires to finalize (or drops the reservation on failure; bring_up already rolled back the resources). Resume and down mark the session pending the same way and restore it on failure; shutdown resolves prompts and runs teardown outside the lock per session.

Pending sessions reserve their slot, show as '(pending)' in ls, are refused by up/env/status/shell with an actionable error, and can be cleaned with `ecluse down <slug>` if the owning operation crashed.

Fixes #1

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
Fixes #1
Problem

`cmd_up` acquired the exclusive state lock and held it through the entire `bring_up` (docker pulls, `compose up`, worktree creation, and user-defined hooks with no timeout), and lock acquisition times out after 10s. Consequences for the core use case (N agents in parallel):

- `ecluse up`s serialized; the second failed with `LockTimeout` whenever the first took >10s (first-time image pull, migration hook)
- `ls`/`env`/`status`/`shell` (shared lock) failed while any `up` was in flight
- `down`/`shutdown` showed interactive prompts while holding the exclusive lock; an unanswered prompt blocked every ecluse command in the repo

Design
Sessions get a `status: active | pending` field (`#[serde(default)]` + skipped when active, so existing `state.json` files round-trip byte-identical).

- up (new session): short exclusive section → allocate slot → commit a pending session (reserves slug + slot against concurrent `up`s) → release lock → provision → re-acquire → swap in the real session, or drop the reservation on failure (`bring_up` already rolled its resources back, see #2/#19).
- up (resume): same pattern: mark pending, release, health-check + start services, re-acquire, replace or restore.
- down: resolve target from a shared-lock snapshot, mark pending in a short exclusive section, prompt and tear down without the lock, then remove (or restore as active if teardown failed).
- shutdown: per-session prompts and teardown all run outside the lock, with re-verification under the lock before each marking.
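The reserve, provision, finalize sequence for a new session can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the PR's code: `std::sync::Mutex` stands in for ecluse's state lock, and `Session`, `State`, `reserve`, `finalize`, and this `up` signature are hypothetical names chosen for the example. Resource rollback inside `bring_up` is not modeled.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
enum Status {
    Active,
    Pending,
}

// Hypothetical, simplified session record.
#[derive(Clone, Debug, PartialEq)]
struct Session {
    slot: u16,
    status: Status,
}

// Hypothetical state: slug -> session. The Mutex plays the role of the state lock.
type State = BTreeMap<String, Session>;

/// Phase 1: short exclusive section that reserves slug + slot as pending.
fn reserve(state: &Mutex<State>, slug: &str) -> Result<u16, String> {
    let mut st = state.lock().unwrap();
    if st.contains_key(slug) {
        return Err(format!("session '{slug}' already exists or is in progress"));
    }
    // Lowest slot not taken by any session; pending sessions count as taken.
    let slot = (0u16..)
        .find(|s| st.values().all(|x| x.slot != *s))
        .unwrap();
    st.insert(slug.to_string(), Session { slot, status: Status::Pending });
    Ok(slot)
} // the guard is dropped here, so the lock is released before provisioning

/// Phase 3: re-acquire, then either activate the session or drop the reservation.
fn finalize(state: &Mutex<State>, slug: &str, provisioned: bool) {
    let mut st = state.lock().unwrap();
    if provisioned {
        if let Some(s) = st.get_mut(slug) {
            s.status = Status::Active;
        }
    } else {
        st.remove(slug);
    }
}

fn up<F>(state: &Mutex<State>, slug: &str, provision: F) -> Result<u16, String>
where
    F: FnOnce(u16) -> Result<(), String>,
{
    let slot = reserve(state, slug)?;
    let outcome = provision(slot); // phase 2: runs without the lock held
    finalize(state, slug, outcome.is_ok());
    outcome.map(|()| slot)
}

fn main() {
    let state = Mutex::new(State::new());

    // While provisioning runs, a reader can take the lock and sees "pending".
    let slot = up(&state, "feat-a", |_slot| {
        let snapshot = state.lock().unwrap().clone();
        assert_eq!(snapshot["feat-a"].status, Status::Pending);
        Ok(())
    });
    assert_eq!(slot, Ok(0));

    // A failed provision leaves no reservation behind.
    let failed = up(&state, "feat-b", |_slot| Err("compose up failed".to_string()));
    assert!(failed.is_err());
    assert!(!state.lock().unwrap().contains_key("feat-b"));

    println!("{:?}", state.lock().unwrap());
}
```

The point of the sketch is that the guard only lives inside `reserve` and `finalize`, so slow provisioning never extends the time the lock is held.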
Pending sessions:

- reserve their slot (`used_slots` unchanged)
- show as `slug (pending)` in `ls` (and `"status": "pending"` in `--json`)
- `up`/`env`/`status`/`shell` against them error actionably ("operation in progress… run `ecluse down <slug>` if it crashed")
- `down` works on them; that's the recovery path for an `up` that crashed between reserve and finalize

Tests

… `post_up = "sleep 3"` hook):

- `ls_works_while_up_is_provisioning`: `ls` returns promptly with the `(pending)` marker while `up` sleeps; this timed out after 10s before the fix
- `up_on_pending_session_errors_actionably`: concurrent `up` on the same slug can't race the reservation
- `failed_up_removes_pending_reservation`: failed provisioning frees the slot and leaves `ls` empty

`cargo fmt --check`, `cargo clippy -- -D warnings`, `cargo test` (373 + 21) green.

Note: this updates the AGENTS.md-stated invariant "the lock is held for the entire duration of up and down" in spirit; the docs file itself is fixed in the docs-drift PR for #15. `cmd_sync` still does its (fast, discovery-only) work under the lock; unchanged here.

https://claude.ai/code/session_017UcuvzMKHVfyBCcq8ipAko
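The pending-session rules described in this PR (which commands are refused, how `ls` labels a session) can be illustrated with a small sketch. The names `SessionStatus`, `Command`, `check`, and `ls_label` are hypothetical, and the error string only approximates the quoted message, whose full wording is elided in the description.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SessionStatus {
    Active,
    Pending,
}

// Hypothetical command enum covering the commands named in the description.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Command {
    Up,
    Env,
    Status,
    Shell,
    Ls,
    Down,
}

/// Refuse up/env/status/shell on a pending session; allow everything else.
/// `down` stays allowed because it is the recovery path for a crashed `up`.
fn check(cmd: Command, slug: &str, status: SessionStatus) -> Result<(), String> {
    match (status, cmd) {
        (
            SessionStatus::Pending,
            Command::Up | Command::Env | Command::Status | Command::Shell,
        ) => Err(format!(
            // Illustrative wording only, not the tool's actual message.
            "session '{slug}': operation in progress; run `ecluse down {slug}` if it crashed"
        )),
        _ => Ok(()),
    }
}

/// Label used when listing sessions: pending ones get a "(pending)" marker.
fn ls_label(slug: &str, status: SessionStatus) -> String {
    match status {
        SessionStatus::Active => slug.to_string(),
        SessionStatus::Pending => format!("{slug} (pending)"),
    }
}

fn main() {
    let slug = "feat-a";
    for cmd in [Command::Up, Command::Env, Command::Status, Command::Shell, Command::Ls, Command::Down] {
        println!("{:?} on pending: {:?}", cmd, check(cmd, slug, SessionStatus::Pending));
    }
    println!("{}", ls_label(slug, SessionStatus::Pending));
}
```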
Generated by Claude Code