Co-authored-by: Dev Patel <dev.patel@berkeley.edu>
…nning with tp + ep for megatron
Current state: for small-scale tests, routing replay appears to be working as shown above - tested only with TP=8 serving and TP=4/EP=8 training. What's not working:
TODOs:
Solved! The issue was that in our test we were setting …

Before:
vLLM logprobs - mean: -2.564655
Megatron (replay) - mean: -9.231623
Megatron (no replay) - mean: -9.647593

After setting …:
vLLM logprobs - mean: -0.223607, std: 0.674102
Megatron (replay) - mean: -0.223626, std: 0.674850
Megatron (no replay) - mean: -0.224379, std: 0.677036

With replay - logprob diff mean: 0.006648, std: 0.021737
Without replay - logprob diff mean: 0.011115, std: 0.035957
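The per-token comparison above can be reproduced with a small helper. This is an illustrative sketch (the function name and toy values are not from this PR), computing the mean and std of absolute per-token logprob differences between the rollout engine and the trainer:

```python
import math

def logprob_diff_stats(rollout_logprobs, trainer_logprobs):
    """Mean/std of absolute per-token logprob differences between two engines."""
    diffs = [abs(a - b) for a, b in zip(rollout_logprobs, trainer_logprobs)]
    mean = sum(diffs) / len(diffs)
    # Population variance of the per-token differences.
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return mean, math.sqrt(var)

# Toy example: trainer logprobs drift slightly from the rollout logprobs.
vllm_lp = [-0.20, -1.10, -0.45, -2.30]
megatron_replay_lp = [-0.21, -1.09, -0.46, -2.31]
mean, std = logprob_diff_stats(vllm_lp, megatron_replay_lp)
```

A small diff mean/std (as in the "after" numbers above) indicates the trainer is scoring the rollout tokens nearly identically to the inference engine.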
Verified that cherry-picking the changes from #1300 to use the mp backend allows us to work around the compiled graph timeout.



Overview
This PR adds support for Rollout Routing Replay (R3) (see paper).
See #815 for tracking of future tasks to fully support routing replay in all settings.
We add the following flags to enable R3:
- `cfg.generator.inference_engine.enable_return_routed_experts=True` is a pass-through argument to vLLM, which records expert router indices (returning a list of dimension `(batch_size, seq_len, num_layers, top_k)`). We then pass this `rollout_expert_indices` list through to Megatron's native `RouterReplay` feature (link).
- When `cfg.trainer.policy.megatron_config.moe_enable_routing_replay` is set to `true`, Megatron initializes an instance of `RouterReplay` on each training worker rank. `RouterReplay.set_replay_data(per_layer_data)` can be used to set router decisions, and `RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_FORWARD)` and `RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_BACKWARD)` can be used to set the routing mode to forward or backward.

Supported Settings
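Before handing the recorded indices to `RouterReplay.set_replay_data(per_layer_data)`, the per-request indices of shape `(batch_size, seq_len, num_layers, top_k)` have to be regrouped by layer. The following is a minimal sketch of that regrouping; the helper name and plain nested-list representation are illustrative, not the PR's actual code:

```python
def to_per_layer_data(rollout_expert_indices):
    """Regroup routed-expert indices from (batch, seq, layer, top_k)
    nesting into per-layer data: layer -> (batch, seq, top_k)."""
    num_layers = len(rollout_expert_indices[0][0])
    per_layer = [[] for _ in range(num_layers)]
    for batch_entry in rollout_expert_indices:  # iterate over the batch
        for layer in range(num_layers):
            # Collect this layer's top-k expert ids for every token in the sequence.
            per_layer[layer].append([tok[layer] for tok in batch_entry])
    return per_layer

# Toy example: batch_size=1, seq_len=2, num_layers=2, top_k=2.
indices = [[[[0, 3], [1, 2]],   # token 0: layer 0 routes to experts {0,3}, layer 1 to {1,2}
            [[2, 1], [0, 3]]]]  # token 1
per_layer = to_per_layer_data(indices)
```

Per the PR description, the per-layer structure is what Megatron's replay consumes, with `set_global_router_replay_action` selecting whether the forward or backward pass replays the recorded decisions.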
Router Replay is supported for the following settings:
Generator Settings
Generator Settings

Supported:
- `use_conversation_multi_turn=True` and `use_conversation_multi_turn=False`
- `batched=False` and `batched=True`
- `async_engine=True` and `async_engine=False`

Not currently supported:
- `retokenize_chat_history` mode - i.e. `self.use_conversation_multi_turn and self.custom_chat_template`
- `self.generator_cfg.step_wise_trajectories` - this should be possible to add but is not currently added/tested
- `step_wise_training`

Inference Engine Settings
- `_SKYRL_USE_NEW_INFERENCE` is not supported - this will be added in a follow-up PR.
- `cfg.generator.distributed_executor_backend` must be set to `mp` - a hang related to a Ray Compiled Graph issue occurs when using vLLM's default `ray` distributed executor backend (see vllm-project/vllm#36237, "[Bug]: Generation hangs until RAY_CGRAPH_get_timeout (300s) with Ray compiled DAG executor", for details on the error that comes up). `mp` also means that serving must be single-node per engine until we add support for using the mp backend with multi-node serving - progress tracked in #1309, "[vllm] Fully enable mp distributed executor backend in vLLM".

Trainer Settings
Custom Generator support
Rollout Routing Replay
Relevant resources:
vLLM PR: vllm-project/vllm#28284
Verl PR: verl-project/verl#4101
Mindlab blog: https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
Megatron-LM API guide: https://github.com/NVIDIA/Megatron-LM/blob/main/docs/api-guide/router_replay.md
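Putting the pieces together, a run enabling R3 would set overrides along these lines. This is an illustrative config fragment assembled only from the flags named above; exact option paths may differ in a given setup:

```
generator.inference_engine.enable_return_routed_experts=True
generator.distributed_executor_backend=mp
trainer.policy.megatron_config.moe_enable_routing_replay=true
```

The `mp` backend is required to avoid the Ray Compiled Graph hang noted above, at the cost of single-node-per-engine serving for now.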