Skip to content

feat: add queue-level scheduling policy#4380

Open
TriangleWuYQ wants to merge 1 commit into
volcano-sh:masterfrom
TriangleWuYQ:queue-level-scheduling
Open

feat: add queue-level scheduling policy#4380
TriangleWuYQ wants to merge 1 commit into
volcano-sh:masterfrom
TriangleWuYQ:queue-level-scheduling

Conversation

@TriangleWuYQ

Copy link
Copy Markdown
Contributor

What type of PR is this?

Add queue-level scheduling policies

What this PR does / why we need it:

This PR introduces queue-level scheduling policy. As outlined in the implementation section of the design document(#4163), it enables configuring different Actions and Plugins per queue using the SchedulingPolicy. This design also allows for straightforward extension to job-level policies in the future.

Note:

Which issue(s) this PR fixes:

Fixes ##3992

Special notes for your reviewer:

@Monokaix
@JesseStutler

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 17, 2025
@JesseStutler

Copy link
Copy Markdown
Member

/cc

@googs1025

Copy link
Copy Markdown
Member

/cc
will review this week 😄

@volcano-sh-bot volcano-sh-bot requested a review from googs1025 June 18, 2025 07:33
@Monokaix Monokaix requested a review from Copilot June 18, 2025 08:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for queue-level scheduling policies by introducing a SchedulingPolicy construct that allows configuring actions and plugins per queue. Key changes include:

  • Extending the scheduler to load multiple configuration files (global vs. per-queue), storing them in schedulingPolicies, and passing them into each session.
  • Enhancing Session and framework.OpenSession to register policy-specific plugins/actions, and updating action execution loops to respect per-job or global policy.
  • Adding utility methods (UnionActions, getSchedulingPolicyConf, feature-gated checks) and updating tests to account for the new signature of OpenSession.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/scheduler/scheduler.go Load per-queue config files, manage schedulingPolicies, pass into sessions, implement UnionActions.
pkg/scheduler/framework/session.go Add SchedulingPolicy and PluginRegistry to Session, implement getters and policy-based HasAction.
pkg/scheduler/framework/framework.go Update OpenSession/CloseSession to initialize/teardown policy plugins.
pkg/scheduler/framework/util.go Add helpers to copy and clear session functions into policies.
pkg/features/volcano_features.go Introduce SchedulingPolicy feature gate.
cmd/scheduler/app/options/options.go Change --scheduler-conf to accept a directory.
Various *_test.go Updated OpenSession calls to new signature.
Comments suppressed due to low confidence (3)

pkg/scheduler/scheduler.go:286

  • [nitpick] Consider adding unit tests for UnionActions, verifying that global and policy-specific actions merge correctly under various feature-gate and policy scenarios.
func (pc *Scheduler) UnionActions() []framework.Action {

pkg/scheduler/scheduler.go:222

  • [nitpick] Add a doc comment explaining the purpose and return values of getSchedulingPolicyConf to clarify what the two maps represent.
func (pc *Scheduler) getSchedulingPolicyConf() (map[string][]string, map[string][]string) {

pkg/scheduler/scheduler.go:53

  • [nitpick] Rename PolicyMutex to schedulingPoliciesMutex (or similar) to clearly associate it with the schedulingPolicies field.
	PolicyMutex        sync.Mutex

Comment on lines +113 to 115
schedulingPolicies := pc.schedulingPolicies
pc.mutex.Unlock()

Copilot AI Jun 18, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access to pc.schedulingPolicies here is not protected by PolicyMutex, leading to a potential race. Consider locking PolicyMutex around reads (and writes) of schedulingPolicies.

Suggested change
schedulingPolicies := pc.schedulingPolicies
pc.mutex.Unlock()
pc.mutex.Unlock()
pc.PolicyMutex.Lock()
schedulingPolicies := pc.schedulingPolicies
pc.PolicyMutex.Unlock()

Copilot uses AI. Check for mistakes.
func (pc *Scheduler) getSchedulingPolicyConf() (map[string][]string, map[string][]string) {
policyActions := make(map[string][]string)
policyPlugins := make(map[string][]string)

Copilot AI Jun 18, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating over pc.schedulingPolicies without holding PolicyMutex can race with concurrent updates. Acquire the mutex before looping through the map.

Suggested change
pc.PolicyMutex.Lock()
defer pc.PolicyMutex.Unlock()

Copilot uses AI. Check for mistakes.
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2025
@TriangleWuYQ TriangleWuYQ force-pushed the queue-level-scheduling branch from 80e19dd to 6e039f8 Compare June 20, 2025 11:30
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2025
@TriangleWuYQ TriangleWuYQ force-pushed the queue-level-scheduling branch from 5889ace to da8eb06 Compare June 20, 2025 11:54
@TriangleWuYQ

Copy link
Copy Markdown
Contributor Author

Made the following changes based on the weekly meeting discussion:

  • Moved the assignment of jobinfo.schedulingPolicy, taskinfo.schedulingPolicy, queueinfo.schedulingPolicy to the openSession phase.
  • Updated the logging related to reading the SchedulingPolicy.

Additionally, it was found that using DefaultSchedulingPolicy conflicts with the extension points in ssn, therefore, the global configuration is still kept in ssn.

@kingeasternsun

Copy link
Copy Markdown
Contributor

Great work!

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2025
@TriangleWuYQ

Copy link
Copy Markdown
Contributor Author

/assign @JesseStutler
/assign @Monokaix

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2025
@Monokaix

Copy link
Copy Markdown
Member

Please rebase master and solve code conflict.

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign monokaix
You can assign the PR to them by writing /assign @monokaix in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2025
Signed-off-by: Yuqi Wu <wuyuqi22@mails.ucas.ac.cn>
}

// GetSchedulingPolicyFromJob return the SchedulingPolicy from the job.
func GetSchedulingPolicyFromJob(job *api.JobInfo) *SchedulingPolicy {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the mutable nature of SchedulingPolicy, GetSchedulingPolicyFromTask must retrieve the most up-to-date policy from the Job's associated queue on each call. An alternative is to refresh job.SchedulingPolicy with the current Queue.SchedulingPolicy when OpenSession is invoked.
The same to GetSchedulingPolicyFromQueue and GetSchedulingPolicyFromTask.

If we opt to reset xxx.SchedulingPolicy during OpenSession, it's crucial to set queue.SchedulingPolicy first, followed by jobInfo.SchedulingPolicy and then taskInfo.SchedulingPolicy.

What's your take on this? @JesseStutler .

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2025
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@ElectricFish7: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kingeasternsun

Copy link
Copy Markdown
Contributor

Hi @DavidYQWu , Thanks for your work, could u continue finish this ?

@kingeasternsun kingeasternsun left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work.

@TriangleWuYQ

Copy link
Copy Markdown
Contributor Author

Hi @DavidYQWu , Thanks for your work, could u continue finish this ?你好,感谢你的工作,你能继续写完吗?

I’ll address the conflicts and add test files, but could you take a look at the implementation in this PR first and let me know if you notice any issues? I can fix them together with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants