scheduler: make gang plugin configurable and fix is_ready/is_fulfilled semantics#500
scheduler: make gang plugin configurable and fix is_ready/is_fulfilled semantics#500jinzhejz wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the scheduler's plugin readiness checks (is_ready and is_fulfilled) to return Option<bool> instead of bool, allowing the scheduler to distinguish between a plugin having no opinion (None) and explicitly confirming batch completion (Some(true)). This fixes issues where sessions would be skipped when gang scheduling was disabled. However, the code review identified a critical bug in the new formulas for is_ready and is_fulfilled in gang.rs. Because they return Some(true) at the start of a cycle if a completed batch already exists, the allocation and dispatch actions will prematurely skip the session, preventing it from scaling up to subsequent batches when there are pending tasks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> { | ||
| let state = self.ssn_state.get(&ssn.id)?; | ||
| let total = state.allocated + state.pipelined; | ||
| Some(state.pipelined > 0 && total % state.batch_size == 0) | ||
| Some(total > 0 && total % state.batch_size == 0) | ||
| } |
There was a problem hiding this comment.
Critical Bug: Gang Sessions Cannot Scale Beyond the First Batch
The new formula total > 0 && total % state.batch_size == 0 has a critical flaw: if a session already has a completed batch of executors (e.g., allocated = 2 for batch_size = 2), is_ready will return Some(true) at the very start of the scheduling cycle (when pipelined is 0).
Because is_ready returns Some(true), the pre-check in AllocateAction::execute will immediately skip the session:
if fulfilled == Some(true) || ready == Some(true) {
continue; // Skips allocation entirely!
}This means the session can never scale up to allocate a second or subsequent batch of executors, even if there are many pending tasks!
Suggested Fix
To allow scaling up while still preventing duplicate allocations during on_session_enter (when the pending tasks are already covered by Binding executors), we should check if the total number of executors (allocated + pipelined) is already sufficient to cover the active tasks (pending + running). If not, we should only consider the batch complete if we have actually allocated new executors (pipelined > 0) and the total is a multiple of batch_size.
| fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> { | |
| let state = self.ssn_state.get(&ssn.id)?; | |
| let total = state.allocated + state.pipelined; | |
| Some(state.pipelined > 0 && total % state.batch_size == 0) | |
| Some(total > 0 && total % state.batch_size == 0) | |
| } | |
| fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> { | |
| let state = self.ssn_state.get(&ssn.id)?; | |
| let total = state.allocated + state.pipelined; | |
| let pending = ssn.tasks_status.get(&common::apis::TaskState::Pending).copied().unwrap_or(0); | |
| let running = ssn.tasks_status.get(&common::apis::TaskState::Running).copied().unwrap_or(0); | |
| let needed = pending + running; | |
| Some(total > 0 && total % state.batch_size == 0 && (total >= needed || state.pipelined > 0)) | |
| } |
| fn is_fulfilled(&self, ssn: &SessionInfoPtr) -> Option<bool> { | ||
| let state = self.ssn_state.get(&ssn.id)?; | ||
| let total = state.allocated + state.bound; | ||
| Some(state.bound > 0 && total % state.batch_size == 0) | ||
| Some(total > 0 && total % state.batch_size == 0) | ||
| } |
There was a problem hiding this comment.
Critical Bug: Gang Sessions Cannot Dispatch Beyond the First Batch
Similarly to is_ready, the new formula for is_fulfilled (total > 0 && total % state.batch_size == 0) will return Some(true) at the start of the scheduling cycle if the session already has a completed batch of executors (e.g., allocated = 2 for batch_size = 2).
This causes DispatchAction::execute to skip dispatching idle executors to the session entirely:
if ctx.is_fulfilled(&ssn)? == Some(true) {
continue; // Skips dispatch entirely!
}As a result, the session can never bind more idle executors beyond the first batch.
Suggested Fix
Apply the same logic as is_ready by checking if the total executors (allocated + bound) cover the active tasks, or if we have bound new executors (bound > 0) and completed a batch.
fn is_fulfilled(&self, ssn: &SessionInfoPtr) -> Option<bool> {
let state = self.ssn_state.get(&ssn.id)?;
let total = state.allocated + state.bound;
let pending = ssn.tasks_status.get(&common::apis::TaskState::Pending).copied().unwrap_or(0);
let running = ssn.tasks_status.get(&common::apis::TaskState::Running).copied().unwrap_or(0);
let needed = pending + running;
Some(total > 0 && total % state.batch_size == 0 && (total >= needed || state.bound > 0))
}
Can we keep bool? for is_ready/is_fulfilled in each plugin, None means this plugin should be ignore. |
5450736 to
be7d549
Compare
fixed it |
4fcf1bf to
1b5d5f8
Compare
…ch semantics
Three interrelated bugs prevented correct executor allocation depending
on which scheduler plugins were configured.
Bug 1 — executors never allocated without gang (DRF-only, priority-only)
PluginManager aggregated is_ready/is_fulfilled with unwrap_or(true),
meaning when no plugin had a batch opinion the result was true.
AllocateAction's pre-check (`if fulfilled || ready { skip }`) and
DispatchAction's commit condition (`if fulfilled { commit }`) therefore
skipped or no-op'd every session, leaving tasks permanently pending.
Fix: Plugin trait defaults for is_ready/is_fulfilled changed from false
to true (no opinion = do not block). PluginManager aggregates with
all() — returns true only when every plugin agrees; a single false vetos.
When no gang plugin is loaded, PluginManager tracks per-session op counts
(no_gang_alloc_ops / no_gang_bind_ops) and returns false before the first
op and true after, so callers use a uniform `if is_ready() { break; }`
without needing to know whether gang is configured.
Bug 2 — gang only ever created 1 executor regardless of task count
The old readiness formula `total % batch_size == 0` was true at
every multiple, so after one allocation (total=1, batch_size=1) the
loop broke and no further executors were created for the remaining
pending tasks.
Fix: introduce `incomplete_tasks` in GangState (sampled once per cycle
as Pending + Running task count) and compute:
needed = div_ceil(incomplete_tasks, batch_size) * batch_size
ready = needed == 0 || total >= needed
This creates enough executors to cover all incomplete tasks rounded up
to the nearest full batch per scheduling cycle.
Bug 3 — duplicate executor creation when Dispatch and Allocate ran in the
same cycle for a gang session
DispatchAction lacked a pre-session guard: a session whose executor
was still in Binding state (counted in Gang's `allocated`) received
additional bindings every cycle.
Fix: check is_fulfilled before entering the bind loop; skip the session
if it is already satisfied.
Additional changes:
- Remove "drf" from DEFAULT_POLICIES; DRF remains opt-in via
cluster.policies. Default stack is now priority + gang + shim.
- Allow listing "shim" in cluster.policies without error (log warning,
ignore); shim is always-on.
- Regression tests added for all affected configurations:
· drf-only and priority-only: executor is allocated
· gang batch_size=1: one executor per incomplete task per cycle
· gang batch_size=2: exactly two executors per cycle
· no-gang: at most one executor per cycle
· dispatch without gang: idle executor is bound
· plugin trait default aggregation edge cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixed #498
Three interrelated bugs prevented correct executor allocation depending
on which scheduler plugins were configured.
Bug 1 — executors never allocated without gang (DRF-only, priority-only)
PluginManager aggregated is_ready/is_fulfilled with unwrap_or(true),
meaning when no plugin had a batch opinion the result was true.
AllocateAction's pre-check (
if fulfilled || ready { skip }) andDispatchAction's commit condition (
if fulfilled { commit }) thereforeskipped or no-op'd every session, leaving tasks permanently pending.
Fix: Plugin trait defaults for is_ready/is_fulfilled changed from false
to true (no opinion = do not block). PluginManager aggregates with
all() — returns true only when every plugin agrees; a single false vetos.
When no gang plugin is loaded, PluginManager tracks per-session op counts
(no_gang_alloc_ops / no_gang_bind_ops) and returns false before the first
op and true after, so callers use a uniform
if is_ready() { break; }without needing to know whether gang is configured.
Bug 2 — gang only ever created 1 executor regardless of task count
The old readiness formula
total % batch_size == 0was true atevery multiple, so after one allocation (total=1, batch_size=1) the
loop broke and no further executors were created for the remaining
pending tasks.
Fix: introduce
incomplete_tasksin GangState (sampled once per cycleas Pending + Running task count) and compute:
needed = div_ceil(incomplete_tasks, batch_size) * batch_size
ready = needed == 0 || total >= needed
This creates enough executors to cover all incomplete tasks rounded up
to the nearest full batch per scheduling cycle.
Bug 3 — duplicate executor creation when Dispatch and Allocate ran in the
same cycle for a gang session
DispatchAction lacked a pre-session guard: a session whose executor
was still in Binding state (counted in Gang's
allocated) receivedadditional bindings every cycle.
Fix: check is_fulfilled before entering the bind loop; skip the session
if it is already satisfied.
Additional changes:
cluster.policies. Default stack is now priority + gang + shim.
ignore); shim is always-on.
· drf-only and priority-only: executor is allocated
· gang batch_size=1: one executor per incomplete task per cycle
· gang batch_size=2: exactly two executors per cycle
· no-gang: at most one executor per cycle
· dispatch without gang: idle executor is bound
· plugin trait default aggregation edge cases