TransferBench v1.67.0 #273

Open
nileshnegi wants to merge 2 commits into develop from merge/TransferBench-v1.67.0
@nileshnegi nileshnegi commented Apr 27, 2026

Motivation

TransferBench v1.67.0 release

Technical Details

Test Plan

Test Result

Submission Checklist

@nileshnegi nileshnegi requested a review from a team as a code owner April 27, 2026 05:43
Copilot AI review requested due to automatic review settings April 27, 2026 05:43
Copilot AI left a comment

Pull request overview

Release PR for TransferBench v1.67.0, adding new presets and pod-aware multi-rank capabilities alongside build-system and usability improvements.

Changes:

  • Adds multiple new presets (pod p2p/a2a, hbm, gfx/a2a sweeps, wallclock, smoketest, bmasweep) and expands preset/help/envvar UX.
  • Introduces/extends pod detection/grouping utilities and uniformity checks across ranks.
  • Modernizes build configuration (CMake + Makefile feature probes/flags) and updates docs/changelog for the release.
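The pod grouping and uniformity checks mentioned in the list above can be illustrated with a minimal sketch. It assumes the per-rank values have already been gathered into a vector (for example by an MPI all-gather, which is omitted here). `IsUniformAcrossRanks` and `GroupRanksByPod` are hypothetical names chosen for illustration; they are not taken from `Utilities.hpp`.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (not TransferBench's actual API): returns true when
// every rank reported the same value. perRankValues[i] is assumed to hold
// the value gathered from rank i.
template <typename T>
bool IsUniformAcrossRanks(std::vector<T> const& perRankValues)
{
  for (std::size_t i = 1; i < perRankValues.size(); ++i) {
    if (!(perRankValues[i] == perRankValues[0])) return false;
  }
  return true;  // Empty or single-rank input is trivially uniform
}

// Hypothetical helper: builds a map from pod index to the ranks in that pod.
// podIndexOfRank[i] is assumed to be the pod index detected for rank i.
std::map<int, std::vector<int>> GroupRanksByPod(std::vector<int> const& podIndexOfRank)
{
  std::map<int, std::vector<int>> ranksPerPod;
  for (std::size_t rank = 0; rank < podIndexOfRank.size(); ++rank) {
    ranksPerPod[podIndexOfRank[rank]].push_back(static_cast<int>(rank));
  }
  return ranksPerPod;
}
```

With pod indices `{0, 0, 1, 1}`, this sketch places ranks 0 and 1 in pod 0 and ranks 2 and 3 in pod 1. A preset could then refuse to run when, say, the per-rank GPU counts are not uniform.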

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
toolchain-linux.cmake Removes legacy CMake toolchain file (logic moved into CMakeLists).
src/client/Utilities.hpp Updates rank grouping to use pod index, adds rank-per-pod map, adds uniformity helper/macros, and table sizing fixes.
src/client/Topology.hpp Adjusts multi-rank topology display to show POD index and updated columns.
src/client/Presets/WallClock.hpp Adds new wallclock preset for XCC wallclock consistency detection.
src/client/Presets/Sweep.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/SmokeTest.hpp Adds smoketest correctness preset spanning DMA/GFX operations.
src/client/Presets/Schmoo.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/Scaling.hpp Updates scaling preset to use CPU/GPU mem type env vars and deprecates USE_FINE_GRAIN.
src/client/Presets/Presets.hpp Adds new presets, new preset listing output, and passes bytesSpecified into presets.
src/client/Presets/PodPeerToPeer.hpp Adds pod-aware peer-to-peer bandwidth preset.
src/client/Presets/PodAllToAll.hpp Adds pod-aware all-to-all preset with grouping/stride scheduling.
src/client/Presets/PeerToPeer.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/OneToAll.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/NicRings.hpp Preset signature updated; minor numeric_limits fix; error message env var rename.
src/client/Presets/NicPeerToPeer.hpp Preset signature updated; minor formatting/spaces; error message env var rename.
src/client/Presets/Help.hpp Adds help preset describing transfer/config formats and examples.
src/client/Presets/HealthCheck.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/HbmBandwidth.hpp Adds hbm preset to sweep/read HBM bandwidth with wallclock/event timing.
src/client/Presets/GfxSweep.hpp Adds gfxsweep preset to sweep GFX kernel parameters for a transfer.
src/client/Presets/EnvVarsList.hpp Adds envvars preset to print environment variable list.
src/client/Presets/BmaSweep.hpp Adds bmasweep preset comparing DMA vs batched DMA executor.
src/client/Presets/AllToAllSweep.hpp Refactors a2asweep output formatting and options (MEM_TYPE, NUM_SUB_EXECS, timing mode).
src/client/Presets/AllToAllN.hpp Preset signature updated to include bytesSpecified.
src/client/Presets/AllToAll.hpp Preset signature updated to include bytesSpecified.
src/client/EnvVars.hpp Adds NIC CQ poll batch env var, expands env var listing, and adds string-array env parsing helper.
src/client/Client.cpp Updates default CLI behavior and usage text to reference new help/envvars/presets commands and multi-rank usage.
examples/example.cfg Updates documentation to include new executors (Batched DMA).
Makefile Improves compiler detection, adds feature probes (NIC/MPI/POD/NVML/AMD-SMI), and clarifies build output.
CMakeLists.txt Modernizes CMake (min version, ROCm detection, feature probes, options for NIC/DMA-BUF/POD/AMD-SMI, target linking).
CHANGELOG.md Adds v1.67.00 release notes covering new presets/features and behavior changes.
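
The string-array env parsing helper listed for `src/client/EnvVars.hpp` is not shown in this PR page. A minimal sketch of what such a helper might look like follows. The names `SplitList` and `GetEnvVarArray`, the comma delimiter, and the skipping of empty items are all assumptions made for illustration, not the helper's actual behavior.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text on a delimiter and drops empty items.
// The comma default is an assumption, not TransferBench's documented format.
std::vector<std::string> SplitList(std::string const& text, char delim = ',')
{
  std::vector<std::string> items;
  std::stringstream stream(text);
  std::string item;
  while (std::getline(stream, item, delim)) {
    if (!item.empty()) items.push_back(item);
  }
  return items;
}

// Hypothetical helper: reads an environment variable as a list of strings,
// falling back to defaultValue when the variable is not set.
std::vector<std::string> GetEnvVarArray(char const* name,
                                        std::vector<std::string> const& defaultValue)
{
  char const* raw = std::getenv(name);
  if (raw == nullptr) return defaultValue;
  return SplitList(raw);
}
```

Returning the default only when the variable is unset means that a variable set to an empty string yields an empty list, which lets a caller tell "not configured" apart from "explicitly empty".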


Comment thread src/client/Topology.hpp
Comment thread src/client/Presets/WallClock.hpp
Comment thread src/client/Presets/WallClock.hpp
Comment thread src/client/EnvVars.hpp Outdated
Comment thread src/client/Presets/AllToAllSweep.hpp Outdated
Comment thread src/client/Presets/AllToAllSweep.hpp Outdated
Comment thread src/client/Topology.hpp
@AtlantaPepsi AtlantaPepsi left a comment

Separate cuMem compilation from pod enablement

Comment thread Makefile
Comment thread Makefile Outdated
Comment thread Makefile Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp
@AtlantaPepsi AtlantaPepsi dismissed their stale review April 30, 2026 06:39

Merged into candidate branch

Copilot AI review requested due to automatic review settings May 2, 2026 05:23
@nileshnegi nileshnegi force-pushed the merge/TransferBench-v1.67.0 branch from e5d151b to 6249ec6 Compare May 2, 2026 05:23
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- Fix hang when NVML is present but fabricmanager isn't (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
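
The SHOW_PERCENTILES entry in the commit message above refers to extra per-iteration statistics. The PR page does not show how these are computed, so the following is only a sketch of one common definition (the nearest-rank percentile) applied to a list of per-iteration timings. The function name `Percentile` and the choice of the nearest-rank method are assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: nearest-rank percentile of per-iteration samples.
// Takes the samples by value so the caller's ordering is left untouched.
double Percentile(std::vector<double> samples, double pct)
{
  if (samples.empty()) return 0.0;  // Assumed convention for "no data"
  std::sort(samples.begin(), samples.end());

  // Clamp the requested percentile to [0, 100]
  double const clamped = std::min(100.0, std::max(0.0, pct));

  // Nearest-rank: smallest 1-based rank r with r >= pct/100 * N
  std::size_t rank = static_cast<std::size_t>(
      std::ceil(clamped / 100.0 * static_cast<double>(samples.size())));
  if (rank == 0) rank = 1;  // The 0th percentile maps to the minimum
  return samples[rank - 1];
}
```

Nearest-rank always returns a value that was actually measured, which can be easier to interpret for timing data than an interpolated value.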
@nileshnegi nileshnegi force-pushed the merge/TransferBench-v1.67.0 branch from 6249ec6 to 29efe12 Compare May 2, 2026 05:27
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.



Comment thread src/client/Presets/Rings.hpp Outdated
Comment on lines +116 to +118
std::vector<std::vector<std::vector<uint64_t>>> results(numGpuDevices,
    std::vector<std::vector<uint64_t>>(ev.numIterations,
        std::vector<uint64_t>(numXccs, 0)));
int useDmaExec = EnvVars::GetEnvVar("USE_DMA_EXEC" , 0);
int useRemoteRead = EnvVars::GetEnvVar("USE_REMOTE_READ", 0);
int stride = EnvVars::GetEnvVar("STRIDE" , 1);
int groupSize = EnvVars::GetEnvVar("GROUP_SIZE" , numRanks * numDetectedGpus);
Comment on lines +121 to +123
if (numRanks * numDetectedGpus % groupSize) {
  Utils::Print("[ERROR] Group size %d cannot evenly divide %d total devices from %d ranks.\n", groupSize, numRanks * numDetectedGpus, numRanks);
  return ERR_FATAL;
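
The divisibility check in the excerpt above can be sketched as a standalone function. The names `GroupSizeIsValid` and `PartitionIntoGroups` are hypothetical, and the contiguous assignment of devices to groups is an assumption, since the excerpt does not show how groups are formed. The sketch also rejects a non-positive group size, because a `GROUP_SIZE` of 0 would otherwise reach the modulo as a division by zero. Whether the surrounding code already guards against that is not visible here.

```cpp
#include <vector>

// Hypothetical helper: true when the devices across all ranks split into
// equal groups of groupSize. Rejecting groupSize <= 0 avoids modulo by zero.
bool GroupSizeIsValid(int numRanks, int gpusPerRank, int groupSize)
{
  if (groupSize <= 0) return false;
  return (numRanks * gpusPerRank) % groupSize == 0;
}

// Hypothetical helper: assigns global device indices 0..totalDevices-1 to
// contiguous groups of groupSize (contiguity is an assumption of this sketch).
// Returns an empty result when the group size does not divide evenly.
std::vector<std::vector<int>> PartitionIntoGroups(int totalDevices, int groupSize)
{
  std::vector<std::vector<int>> groups;
  if (groupSize <= 0 || totalDevices % groupSize != 0) return groups;
  for (int device = 0; device < totalDevices; ++device) {
    if (device % groupSize == 0) groups.emplace_back();  // Start a new group
    groups.back().push_back(device);
  }
  return groups;
}
```

For example, 2 ranks with 8 GPUs each accept a group size of 4 (four groups) but not 5.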
4 participants