[docs] update megatron docs #9350
Conversation
Code Review
This pull request updates the documentation to include warnings about `moe_aux_loss` behavior with padding tokens in older `megatron-core` versions, and modifies `megatron_args.py` to support `mlp_padding_free` with context parallelism by requiring `mcore-bridge >= 1.4.0.dev`. Reviewers noted that the version requirement for `mcore-bridge` was incorrectly placed under the `moe_aux_loss_coeff` documentation and should be moved to the `mlp_padding_free` section for clarity.
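For illustration, here is a minimal sketch of what such a version guard could look like. It is not the actual `megatron_args.py` code; the argument names (`mlp_padding_free`, `context_parallel_size`) and the helper are assumptions, and `packaging` is used only for the version comparison:

```python
# Hedged sketch only: the real check in megatron_args.py may differ in naming and placement.
# Assumed (hypothetical) attributes: args.mlp_padding_free, args.context_parallel_size.
from packaging import version


def check_mlp_padding_free_with_cp(args, bridge_version: str) -> None:
    """Reject mlp_padding_free + context parallelism when mcore-bridge is too old."""
    if args.mlp_padding_free and args.context_parallel_size > 1:
        if version.parse(bridge_version) < version.parse('1.4.0.dev'):
            raise ValueError(
                'mlp_padding_free with context parallelism requires mcore-bridge >= 1.4.0.dev '
                '(see https://github.com/modelscope/mcore-bridge/pull/79).')
```

The exact condition and error message in the repository may differ; the point is only that the newer bridge version is what unlocks CP support for `mlp_padding_free`.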
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: For moe_aux_loss, when `padding_free` is False, "megatron-core<0.16" computes the routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use "mcore-bridge>=1.4.0.dev"; see [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
The note regarding `mcore-bridge>=1.4.0.dev` and the link to the context parallel (CP) support PR (PR 79) are misplaced under the `moe_aux_loss_coeff` parameter. Based on the code changes in `megatron_args.py`, this requirement applies specifically to `mlp_padding_free` when used with CP.
Additionally, please use backticks (`) instead of double quotes for version numbers and package names to maintain consistency with the rest of the documentation.
Consider moving the bridge version requirement to the `mlp_padding_free` section (around line 66) and updating the "Temporarily incompatible with CP" statement there.
Suggested change:
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: For moe_aux_loss, when `padding_free` is False, `megatron-core < 0.16` computes the routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142).
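To see why the `megatron-core < 0.16` behavior described in the note matters, here is a small illustrative sketch of a generic Switch-Transformer-style load-balancing auxiliary loss; it is not the Megatron-LM implementation, and the function and parameter names are assumptions. Without masking, padding tokens are counted in the expert-usage statistics and skew the loss:

```python
# Illustrative only: a generic Switch-style MoE aux loss, not the Megatron-LM code path.
import torch


def moe_aux_loss(router_probs: torch.Tensor, expert_index: torch.Tensor,
                 num_experts: int, valid_mask: torch.Tensor = None) -> torch.Tensor:
    """router_probs: [tokens, experts] softmax outputs; expert_index: [tokens] top-1 expert ids."""
    if valid_mask is not None:
        # Drop padding positions so they do not contribute to the load-balancing statistics.
        router_probs = router_probs[valid_mask]
        expert_index = expert_index[valid_mask]
    tokens_per_expert = torch.bincount(expert_index, minlength=num_experts).float()
    fraction_tokens = tokens_per_expert / tokens_per_expert.sum()  # f_i: share of tokens routed to expert i
    fraction_probs = router_probs.mean(dim=0)                      # P_i: mean router probability for expert i
    return num_experts * torch.sum(fraction_tokens * fraction_probs)
```

The `valid_mask` branch corresponds to the fixed behavior described in the note above; the unmasked path mirrors the old behavior where padded positions influence the routing statistics.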
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use `mcore-bridge >= 1.4.0.dev`. Please refer to [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
The addition of the `mcore-bridge>=1.4.0.dev` requirement and the link to the CP support PR (PR 79) under `moe_aux_loss_coeff` is confusing. This version requirement relates to `mlp_padding_free` support for context parallelism, not the auxiliary loss coefficient.
Consider moving this information to the `mlp_padding_free` section (around line 69) and updating the note about CP incompatibility there.
Suggested change:
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142).