Summary
Today two replay paths are supported: one for comms only and one for compute+comms. This has several drawbacks:
- Code exists in two directories - compute and comms
- Changes on one side are not tested for compatibility with the other.
- @TaekyungHeo has observed bugs and crashes in the replay when both compute and comms are enabled.
Crashes
@TaekyungHeo to add more info on how to reproduce issues
Code unification
The basic idea is to move both replay paths into a single replay directory and unify the code.
Details TBD
Integration testing
Ensure changes are covered by unit tests to avoid impact on external users.
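As a rough illustration only, a test along the following lines could exercise both modes through one entry point, so that a change to one path is always checked against the other. Everything here is hypothetical: `Op`, `replay`, its `compute`/`comms` flags, and the sample trace are placeholders for whatever the unified code ends up exposing, not the existing API.

```python
import unittest
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    """Hypothetical trace entry; kind is either "compute" or "comms"."""
    name: str
    kind: str


def replay(trace: List[Op], compute: bool, comms: bool) -> List[str]:
    """Hypothetical unified entry point.

    Returns the names of the ops that would be replayed, in trace order,
    given which kinds of ops are enabled.
    """
    enabled = set()
    if compute:
        enabled.add("compute")
    if comms:
        enabled.add("comms")
    return [op.name for op in trace if op.kind in enabled]


# Made-up sample trace used only by the tests below.
TRACE = [
    Op("all_reduce", "comms"),
    Op("matmul", "compute"),
    Op("all_gather", "comms"),
]


class ReplayModeTest(unittest.TestCase):
    def test_comms_only(self):
        self.assertEqual(
            replay(TRACE, compute=False, comms=True),
            ["all_reduce", "all_gather"],
        )

    def test_compute_and_comms(self):
        self.assertEqual(
            replay(TRACE, compute=True, comms=True),
            ["all_reduce", "matmul", "all_gather"],
        )

    def test_comms_order_is_the_same_in_both_modes(self):
        # The comms ops replayed in compute+comms mode should match the
        # comms-only mode exactly, including their order.
        comms_only = replay(TRACE, compute=False, comms=True)
        full = replay(TRACE, compute=True, comms=True)
        self.assertEqual([n for n in full if n in comms_only], comms_only)
```

Saved to a file, this can be run with `python -m unittest <file>`. The third test is the one that targets the compatibility gap described in the summary: both modes go through the same function, so a regression in either shows up in a single test run.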