Skip to content

scheduler: make gang plugin configurable and fix is_ready/is_fulfilled semantics#500

Open
jinzhejz wants to merge 1 commit into
xflops:mainfrom
jinzhejz:bug_fix
Open

scheduler: make gang plugin configurable and fix is_ready/is_fulfilled semantics#500
jinzhejz wants to merge 1 commit into
xflops:mainfrom
jinzhejz:bug_fix

Conversation

@jinzhejz

@jinzhejz jinzhejz commented Jun 11, 2026

Copy link
Copy Markdown

Fixed #498

Three interrelated bugs prevented correct executor allocation depending
on which scheduler plugins were configured.

Bug 1 — executors never allocated without gang (DRF-only, priority-only)

PluginManager aggregated is_ready/is_fulfilled with unwrap_or(true),
meaning when no plugin had a batch opinion the result was true.
AllocateAction's pre-check (if fulfilled || ready { skip }) and
DispatchAction's commit condition (if fulfilled { commit }) therefore
skipped or no-op'd every session, leaving tasks permanently pending.

Fix: Plugin trait defaults for is_ready/is_fulfilled changed from false
to true (no opinion = do not block). PluginManager aggregates with
all() — returns true only when every plugin agrees; a single false vetos.
When no gang plugin is loaded, PluginManager tracks per-session op counts
(no_gang_alloc_ops / no_gang_bind_ops) and returns false before the first
op and true after, so callers use a uniform if is_ready() { break; }
without needing to know whether gang is configured.

Bug 2 — gang only ever created 1 executor regardless of task count

The old readiness formula total % batch_size == 0 was true at
every multiple, so after one allocation (total=1, batch_size=1) the
loop broke and no further executors were created for the remaining
pending tasks.

Fix: introduce incomplete_tasks in GangState (sampled once per cycle
as Pending + Running task count) and compute:
needed = div_ceil(incomplete_tasks, batch_size) * batch_size
ready = needed == 0 || total >= needed
This creates enough executors to cover all incomplete tasks rounded up
to the nearest full batch per scheduling cycle.

Bug 3 — duplicate executor creation when Dispatch and Allocate ran in the
same cycle for a gang session

DispatchAction lacked a pre-session guard: a session whose executor
was still in Binding state (counted in Gang's allocated) received
additional bindings every cycle.

Fix: check is_fulfilled before entering the bind loop; skip the session
if it is already satisfied.

Additional changes:

  • Remove "drf" from DEFAULT_POLICIES; DRF remains opt-in via
    cluster.policies. Default stack is now priority + gang + shim.
  • Allow listing "shim" in cluster.policies without error (log warning,
    ignore); shim is always-on.
  • Regression tests added for all affected configurations:
    · drf-only and priority-only: executor is allocated
    · gang batch_size=1: one executor per incomplete task per cycle
    · gang batch_size=2: exactly two executors per cycle
    · no-gang: at most one executor per cycle
    · dispatch without gang: idle executor is bound
    · plugin trait default aggregation edge cases

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the scheduler's plugin readiness checks (is_ready and is_fulfilled) to return Option<bool> instead of bool, allowing the scheduler to distinguish between a plugin having no opinion (None) and explicitly confirming batch completion (Some(true)). This fixes issues where sessions would be skipped when gang scheduling was disabled. However, the code review identified a critical bug in the new formulas for is_ready and is_fulfilled in gang.rs. Because they return Some(true) at the start of a cycle if a completed batch already exists, the allocation and dispatch actions will prematurely skip the session, preventing it from scaling up to subsequent batches when there are pending tasks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 89 to 93
fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> {
let state = self.ssn_state.get(&ssn.id)?;
let total = state.allocated + state.pipelined;
Some(state.pipelined > 0 && total % state.batch_size == 0)
Some(total > 0 && total % state.batch_size == 0)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Bug: Gang Sessions Cannot Scale Beyond the First Batch

The new formula total > 0 && total % state.batch_size == 0 has a critical flaw: if a session already has a completed batch of executors (e.g., allocated = 2 for batch_size = 2), is_ready will return Some(true) at the very start of the scheduling cycle (when pipelined is 0).

Because is_ready returns Some(true), the pre-check in AllocateAction::execute will immediately skip the session:

if fulfilled == Some(true) || ready == Some(true) {
    continue; // Skips allocation entirely!
}

This means the session can never scale up to allocate a second or subsequent batch of executors, even if there are many pending tasks!

Suggested Fix

To allow scaling up while still preventing duplicate allocations during on_session_enter (when the pending tasks are already covered by Binding executors), we should check if the total number of executors (allocated + pipelined) is already sufficient to cover the active tasks (pending + running). If not, we should only consider the batch complete if we have actually allocated new executors (pipelined > 0) and the total is a multiple of batch_size.

Suggested change
fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> {
let state = self.ssn_state.get(&ssn.id)?;
let total = state.allocated + state.pipelined;
Some(state.pipelined > 0 && total % state.batch_size == 0)
Some(total > 0 && total % state.batch_size == 0)
}
fn is_ready(&self, ssn: &SessionInfoPtr) -> Option<bool> {
let state = self.ssn_state.get(&ssn.id)?;
let total = state.allocated + state.pipelined;
let pending = ssn.tasks_status.get(&common::apis::TaskState::Pending).copied().unwrap_or(0);
let running = ssn.tasks_status.get(&common::apis::TaskState::Running).copied().unwrap_or(0);
let needed = pending + running;
Some(total > 0 && total % state.batch_size == 0 && (total >= needed || state.pipelined > 0))
}

Comment on lines 100 to 104
fn is_fulfilled(&self, ssn: &SessionInfoPtr) -> Option<bool> {
let state = self.ssn_state.get(&ssn.id)?;
let total = state.allocated + state.bound;
Some(state.bound > 0 && total % state.batch_size == 0)
Some(total > 0 && total % state.batch_size == 0)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Critical Bug: Gang Sessions Cannot Dispatch Beyond the First Batch

Similarly to is_ready, the new formula for is_fulfilled (total > 0 && total % state.batch_size == 0) will return Some(true) at the start of the scheduling cycle if the session already has a completed batch of executors (e.g., allocated = 2 for batch_size = 2).

This causes DispatchAction::execute to skip dispatching idle executors to the session entirely:

if ctx.is_fulfilled(&ssn)? == Some(true) {
    continue; // Skips dispatch entirely!
}

As a result, the session can never bind more idle executors beyond the first batch.

Suggested Fix

Apply the same logic as is_ready by checking if the total executors (allocated + bound) cover the active tasks, or if we have bound new executors (bound > 0) and completed a batch.

    fn is_fulfilled(&self, ssn: &SessionInfoPtr) -> Option<bool> {
        let state = self.ssn_state.get(&ssn.id)?;
        let total = state.allocated + state.bound;
        let pending = ssn.tasks_status.get(&common::apis::TaskState::Pending).copied().unwrap_or(0);
        let running = ssn.tasks_status.get(&common::apis::TaskState::Running).copied().unwrap_or(0);
        let needed = pending + running;
        Some(total > 0 && total % state.batch_size == 0 && (total >= needed || state.bound > 0))
    }

@k82cn

k82cn commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

The root fix is changing PluginManager::is_ready and is_fulfilled to return Option instead of bool:

  • None : no plugin has an opinion (gang not loaded); callers treat this as "no batch constraint — 1 op per cycle is enough"
  • Some(false): a plugin explicitly says the batch is incomplete
  • Some(true) : all opinionated plugins agree the batch is complete

Can we keep bool? for is_ready/is_fulfilled in each plugin, None means this plugin should be ignore.

@jinzhejz jinzhejz force-pushed the bug_fix branch 3 times, most recently from 5450736 to be7d549 Compare June 12, 2026 04:05
@jinzhejz

Copy link
Copy Markdown
Author

The root fix is changing PluginManager::is_ready and is_fulfilled to return Option instead of bool:

  • None : no plugin has an opinion (gang not loaded); callers treat this as "no batch constraint — 1 op per cycle is enough"
  • Some(false): a plugin explicitly says the batch is incomplete
  • Some(true) : all opinionated plugins agree the batch is complete

Can we keep bool? for is_ready/is_fulfilled in each plugin, None means this plugin should be ignore.

fixed it

@jinzhejz jinzhejz force-pushed the bug_fix branch 9 times, most recently from 4fcf1bf to 1b5d5f8 Compare June 16, 2026 09:24
…ch semantics

Three interrelated bugs prevented correct executor allocation depending
on which scheduler plugins were configured.

Bug 1 — executors never allocated without gang (DRF-only, priority-only)

  PluginManager aggregated is_ready/is_fulfilled with unwrap_or(true),
  meaning when no plugin had a batch opinion the result was true.
  AllocateAction's pre-check (`if fulfilled || ready { skip }`) and
  DispatchAction's commit condition (`if fulfilled { commit }`) therefore
  skipped or no-op'd every session, leaving tasks permanently pending.

  Fix: Plugin trait defaults for is_ready/is_fulfilled changed from false
  to true (no opinion = do not block).  PluginManager aggregates with
  all() — returns true only when every plugin agrees; a single false vetos.
  When no gang plugin is loaded, PluginManager tracks per-session op counts
  (no_gang_alloc_ops / no_gang_bind_ops) and returns false before the first
  op and true after, so callers use a uniform `if is_ready() { break; }`
  without needing to know whether gang is configured.

Bug 2 — gang only ever created 1 executor regardless of task count

  The old readiness formula `total % batch_size == 0` was true at
  every multiple, so after one allocation (total=1, batch_size=1) the
  loop broke and no further executors were created for the remaining
  pending tasks.

  Fix: introduce `incomplete_tasks` in GangState (sampled once per cycle
  as Pending + Running task count) and compute:
    needed = div_ceil(incomplete_tasks, batch_size) * batch_size
    ready  = needed == 0 || total >= needed
  This creates enough executors to cover all incomplete tasks rounded up
  to the nearest full batch per scheduling cycle.

Bug 3 — duplicate executor creation when Dispatch and Allocate ran in the
  same cycle for a gang session

  DispatchAction lacked a pre-session guard: a session whose executor
  was still in Binding state (counted in Gang's `allocated`) received
  additional bindings every cycle.

  Fix: check is_fulfilled before entering the bind loop; skip the session
  if it is already satisfied.

Additional changes:

- Remove "drf" from DEFAULT_POLICIES; DRF remains opt-in via
  cluster.policies. Default stack is now priority + gang + shim.
- Allow listing "shim" in cluster.policies without error (log warning,
  ignore); shim is always-on.
- Regression tests added for all affected configurations:
    · drf-only and priority-only: executor is allocated
    · gang batch_size=1: one executor per incomplete task per cycle
    · gang batch_size=2: exactly two executors per cycle
    · no-gang: at most one executor per cycle
    · dispatch without gang: idle executor is bound
    · plugin trait default aggregation edge cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multiple executors when only one task in the session

2 participants