
[docs] update megatron docs#9350

Merged
Jintao-Huang merged 1 commit into modelscope:main from Jintao-Huang:update_megatron_docs_0514
May 15, 2026

Conversation

@Jintao-Huang
Collaborator

No description provided.

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the documentation to warn about moe_aux_loss behavior with padding tokens in older megatron-core versions, and modifies megatron_args.py to support mlp_padding_free with context parallelism by requiring mcore-bridge >= 1.4.0.dev. Reviewers noted that the mcore-bridge version requirement was incorrectly placed under the moe_aux_loss_coeff documentation and should be moved to the mlp_padding_free section for clarity.
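The version guard described above can be sketched as follows. This is a hypothetical illustration, not the actual megatron_args.py code: the function and argument names (`check_mlp_padding_free`, `context_parallel_size`, `bridge_version`) are assumptions for demonstration.

```python
# Hypothetical sketch of the kind of guard this PR is described as adding:
# mlp_padding_free combined with context parallelism (CP) requires
# mcore-bridge >= 1.4.0.dev. Names are illustrative, not taken from the PR.

def _base_version(version: str) -> tuple:
    """Extract the numeric (major, minor, patch) prefix, ignoring
    suffixes such as the '.dev' in '1.4.0.dev'."""
    parts = []
    for p in version.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts[:3])

def check_mlp_padding_free(mlp_padding_free: bool,
                           context_parallel_size: int,
                           bridge_version: str) -> None:
    """Raise if mlp_padding_free is combined with CP on an old mcore-bridge."""
    if mlp_padding_free and context_parallel_size > 1:
        if _base_version(bridge_version) < (1, 4, 0):
            raise ValueError(
                f"mlp_padding_free with context parallelism requires "
                f"mcore-bridge >= 1.4.0.dev, but found {bridge_version}")
```

The guard only fires when both conditions hold, so single-rank CP runs and non-padding-free runs are unaffected regardless of the installed bridge version.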

Comment on lines +195 to +196
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: for moe_aux_loss, when `padding_free` is False, "megatron-core<0.16" computes routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use "mcore-bridge>=1.4.0.dev"; see [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
Contributor


medium

The note regarding mcore-bridge>=1.4.0.dev and the link to the context parallel (CP) support PR (PR 79) is misplaced under the moe_aux_loss_coeff parameter. Based on the code changes in megatron_args.py, this requirement is specifically for mlp_padding_free when used with CP.

Additionally, please use backticks (`) instead of double quotes for version numbers and package names to maintain consistency with the rest of the documentation.

Consider moving the bridge version requirement to the mlp_padding_free section (around line 66) and updating the "Temporarily incompatible with CP" statement there.

Suggested change
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: for moe_aux_loss, when `padding_free` is False, "megatron-core<0.16" computes routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use "mcore-bridge>=1.4.0.dev"; see [this PR](https://github.com/modelscope/mcore-bridge/pull/79)
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: for moe_aux_loss, when `padding_free` is False, `megatron-core < 0.16` computes routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142)

Comment on lines +206 to +207
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use `mcore-bridge >= 1.4.0.dev`. Please refer to [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
Contributor


medium

The addition of the mcore-bridge>=1.4.0.dev requirement and the link to the CP support PR (PR 79) under moe_aux_loss_coeff is confusing. This version requirement is related to mlp_padding_free support for context parallelism, not the auxiliary loss coefficient.

Consider moving this information to the mlp_padding_free section (around line 69) and updating the note about CP incompatibility there.

Suggested change
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use `mcore-bridge >= 1.4.0.dev`. Please refer to [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142).
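The padding-token issue the documentation warns about can be made concrete with a toy sketch. This is illustrative only, not Megatron code: the `aux_loss` function below is a simplified load-balancing proxy (variance of per-expert token fractions), invented here to show why counting padding tokens skews routing statistics.

```python
# Toy illustration of the padding-token routing-loss issue: padding tokens
# are typically routed degenerately, so including them in the aux loss
# makes a balanced routing of real tokens look imbalanced.

def aux_loss(expert_ids, num_experts, mask=None):
    """Simplified balancing proxy: variance of per-expert token fractions
    (0.0 when perfectly balanced). mask[i] is False for padding tokens."""
    counts = [0] * num_experts
    total = 0
    for i, e in enumerate(expert_ids):
        if mask is not None and not mask[i]:
            continue  # exclude padding tokens (the corrected behavior)
        counts[e] += 1
        total += 1
    mean = 1 / num_experts
    return sum((c / total - mean) ** 2 for c in counts)

# Real tokens balanced across 2 experts; the 4 padding tokens all
# land on expert 0.
experts = [0, 1, 0, 1, 0, 0, 0, 0]   # last 4 entries are padding
mask    = [True] * 4 + [False] * 4

masked   = aux_loss(experts, 2, mask)  # padding excluded -> 0.0
unmasked = aux_loss(experts, 2)        # padding included -> > 0.0
```

With the mask, routing is perfectly balanced and the loss is zero; without it, the degenerate padding routes dominate and the loss spuriously penalizes an otherwise balanced router.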

@Jintao-Huang
Collaborator Author

@Jintao-Huang Jintao-Huang merged commit cdb20f6 into modelscope:main May 15, 2026
1 of 3 checks passed
