[docs] update megatron docs #9350
Conversation
Code Review
This pull request updates the documentation to include warnings about `moe_aux_loss` behavior with padding tokens in older `megatron-core` versions, and modifies `megatron_args.py` to support `mlp_padding_free` with context parallelism by requiring `mcore-bridge >= 1.4.0.dev`. Reviewers noted that the version requirement for `mcore-bridge` was incorrectly placed under the `moe_aux_loss_coeff` documentation and should be moved to the `mlp_padding_free` section for clarity.
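For illustration, here is a minimal sketch of what such a version guard could look like. It is not the actual `megatron_args.py` code; the argument names (`mlp_padding_free`, `context_parallel_size`) and the helper are assumptions, and `packaging` is used only for the version comparison:

```python
# Hedged sketch only: the real check in megatron_args.py may differ in naming and placement.
# Assumed (hypothetical) attributes: args.mlp_padding_free, args.context_parallel_size.
from packaging import version


def check_mlp_padding_free_with_cp(args, bridge_version: str) -> None:
    """Reject mlp_padding_free + context parallelism when mcore-bridge is too old."""
    if args.mlp_padding_free and args.context_parallel_size > 1:
        if version.parse(bridge_version) < version.parse('1.4.0.dev'):
            raise ValueError(
                'mlp_padding_free with context parallelism requires mcore-bridge >= 1.4.0.dev '
                '(see https://github.com/modelscope/mcore-bridge/pull/79).')
```

The exact condition and error message in the repository may differ; the point is only that the newer bridge version is what unlocks CP support for `mlp_padding_free`.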
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: For moe_aux_loss, when `padding_free` is False, "megatron-core<0.16" computes the routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use "mcore-bridge>=1.4.0.dev"; see [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
The note regarding `mcore-bridge>=1.4.0.dev` and the link to the context parallel (CP) support PR (PR 79) are misplaced under the `moe_aux_loss_coeff` parameter. Based on the code changes in `megatron_args.py`, this requirement applies specifically to `mlp_padding_free` when used with CP.
Additionally, please use backticks (`) instead of double quotes for version numbers and package names to maintain consistency with the rest of the documentation.
Consider moving the bridge version requirement to the `mlp_padding_free` section (around line 66) and updating the "Temporarily incompatible with CP" statement there.
Suggested change:
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning aux_loss is not used. Generally, the larger this value, the worse the training performance but the more balanced the MoE load; please choose an appropriate value based on experimental results.
- Note: For moe_aux_loss, when `padding_free` is False, `megatron-core < 0.16` computes the routing loss on padding tokens; see [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142).
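To see why the `megatron-core < 0.16` behavior described in the note matters, here is a small illustrative sketch of a generic Switch-Transformer-style load-balancing auxiliary loss; it is not the Megatron-LM implementation, and the function and parameter names are assumptions. Without masking, padding tokens are counted in the expert-usage statistics and skew the loss:

```python
# Illustrative only: a generic Switch-style MoE aux loss, not the Megatron-LM code path.
import torch


def moe_aux_loss(router_probs: torch.Tensor, expert_index: torch.Tensor,
                 num_experts: int, valid_mask: torch.Tensor = None) -> torch.Tensor:
    """router_probs: [tokens, experts] softmax outputs; expert_index: [tokens] top-1 expert ids."""
    if valid_mask is not None:
        # Drop padding positions so they do not contribute to the load-balancing statistics.
        router_probs = router_probs[valid_mask]
        expert_index = expert_index[valid_mask]
    tokens_per_expert = torch.bincount(expert_index, minlength=num_experts).float()
    fraction_tokens = tokens_per_expert / tokens_per_expert.sum()  # f_i: share of tokens routed to expert i
    fraction_probs = router_probs.mean(dim=0)                      # P_i: mean router probability for expert i
    return num_experts * torch.sum(fraction_tokens * fraction_probs)
```

The `valid_mask` branch corresponds to the fixed behavior described in the note above; the unmasked path mirrors the old behavior where padded positions influence the routing statistics.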
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142). Additionally, please use `mcore-bridge >= 1.4.0.dev`. Please refer to [this PR](https://github.com/modelscope/mcore-bridge/pull/79).
The addition of the `mcore-bridge>=1.4.0.dev` requirement and the link to the CP support PR (PR 79) under `moe_aux_loss_coeff` is confusing. This version requirement relates to `mlp_padding_free` support for context parallelism, not the auxiliary loss coefficient.
Consider moving this information to the `mlp_padding_free` section (around line 69) and updating the note about CP incompatibility there.
Suggested change:
- 🔥moe_aux_loss_coeff: Defaults to 0, meaning the auxiliary loss is not used. Generally, a higher value leads to worse training performance but more balanced MoE expert utilization. Please choose an appropriate value based on experimental results.
- Note: For `moe_aux_loss`, when `padding_free` is set to False, versions of `megatron-core < 0.16` have an issue where routing loss is computed on padding tokens. Please refer to [this PR](https://github.com/NVIDIA/Megatron-LM/pull/2142).