Run benchmarks in parallel with tests, not sequentially #1130
base: master
Conversation
Move benchmark jobs from bench.yml into test.yml so tests and benchmarks run in parallel instead of sequentially, reducing merge time from ~10hrs to ~5hrs. Add cross-cancellation jobs so a failure in either group cancels the entire workflow run. Docs-only PRs continue to skip both. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development; to avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: Removed the standalone benchmark workflow; its jobs now run inside the main test workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant GH as GitHub Actions (test.yml)
    participant Repo as Repository (PR & Master refs)
    participant Login as Login Node (build)
    participant SLURM as SLURM Scheduler
    participant Node as Compute Node (per-config)
    participant Diff as bench_diff / Poster
    rect rgba(93,156,236,0.5)
        GH->>Repo: clone PR and master refs
    end
    rect rgba(96,169,79,0.5)
        GH->>Login: run per-config builds (parallel)
        Login-->>GH: build artifacts/logs
    end
    rect rgba(236,151,31,0.5)
        GH->>SLURM: submit multi-node sbatch (run_frontier_all_*.sh)
        SLURM->>Node: allocate nodes and SSH per-config tasks
        Node->>Node: run frontier_test_config / frontier_bench_config
        Node-->>SLURM: write per-config YAML outputs
        SLURM-->>GH: job completion and collected outputs
    end
    rect rgba(178,120,255,0.5)
        GH->>Diff: run bench_diff & generate comment
        Diff-->>GH: diff results / comment payload
        GH->>Repo: post comment & upload logs/artifacts
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
CodeAnt AI finished reviewing your PR.
📝 Walkthrough: The benchmark workflow has been consolidated from a standalone file into the main test workflow.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around lines 285-287: The bench job is missing the NODE_OPTIONS env used by the
self job, which can cause OOMs on Phoenix self-hosted runners. Update the bench
job's env block (the same place where ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION
and ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION are set) to include NODE_OPTIONS:
'--max-old-space-size=2048' so Phoenix bench runners get the same Node memory
limit as the self job.
🧹 Nitpick comments (2)
.github/workflows/test.yml (2)
301-306: Parallel build: first failure orphans the second background job. GitHub Actions' default shell uses `set -eo pipefail`. If `wait %1` returns non-zero, the shell exits immediately and `wait %2` is never called — the second build process keeps running in the background on the self-hosted runner. While the step correctly fails, the orphaned process wastes resources. A more robust pattern waits for both jobs unconditionally and then checks both exit codes:

Proposed fix

```diff
 - name: Setup & Build
   if: matrix.build_script != ''
   run: |
     (cd pr && ${{ matrix.build_script }}) &
     (cd master && ${{ matrix.build_script }}) &
-    wait %1 && wait %2
+    wait -n 2 || true  # suppress set -e for individual waits
+    wait -n 2 || true
+    wait %1; s1=$?
+    wait %2; s2=$?
+    if [ "$s1" -ne 0 ] || [ "$s2" -ne 0 ]; then exit 1; fi
```

Actually, a simpler approach since both jobs have already finished:

Simpler alternative

```diff
 - name: Setup & Build
   if: matrix.build_script != ''
   run: |
+    set +e
     (cd pr && ${{ matrix.build_script }}) &
     (cd master && ${{ matrix.build_script }}) &
-    wait %1 && wait %2
+    wait %1; s1=$?
+    wait %2; s2=$?
+    [ "$s1" -eq 0 ] && [ "$s2" -eq 0 ]
```
311-314: `bench_diff` step missing an `if: always()` or `if: success()` guard. If the "Bench (Master v. PR)" step fails, the "Generate & Post Comment" step is skipped by default (GitHub Actions skips subsequent steps on failure). If you want the diff comment posted even on partial benchmark failure (e.g., to report which benchmarks regressed), consider adding `if: always()`. If the current behavior (skip on failure) is intended, this is fine as-is.
⚠️ Caution: Review failed (failed to post review comments).
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Benchmark failures (e.g. transient network issues) should not cancel correctness tests. Keep only cancel-on-test-failure so that test failures still cancel benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
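For reference, a sketch of what the retained test-failure cancellation job could look like; the job name and `needs` list are assumptions, while the `gh run cancel` invocation follows the one quoted in the review below:

```yaml
  cancel-on-test-failure:
    needs: [self]            # hypothetical: the test job(s) this guards
    if: failure()
    runs-on: ubuntu-latest
    steps:
      - name: Cancel Workflow Run
        run: gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```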
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #1130   +/-   ##
=======================================
  Coverage   44.03%   44.03%
=======================================
  Files          70       70
  Lines       20649    20649
  Branches     2053     2053
=======================================
  Hits         9093     9093
  Misses      10368    10368
  Partials     1188     1188
=======================================
```

☔ View full report in Codecov by Sentry.
Reduces Frontier queue depth from 11 SLURM jobs to 2 by consolidating test configs into a single 5-node allocation and bench configs into a single 6-node allocation. Builds remain on login nodes; tests/benchmarks run on compute nodes via ssh to allocated nodes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
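Schematically, the consolidation replaces one SLURM job per config with a single allocation that fans out over its nodes. A rough bash sketch, not the actual script; the config list and ssh fan-out details are assumptions:

```bash
#!/bin/bash
#SBATCH --nodes=5               # one allocation for all test configs

# Map each config to one node of the allocation and run it there over ssh.
mapfile -t nodes < <(scontrol show hostnames "$SLURM_JOB_NODELIST")
configs=("cpu" "gpu")           # hypothetical; the real script defines more
for i in "${!configs[@]}"; do
  ssh "${nodes[$i]}" "cd $PWD && ./frontier_test_config.sh ${configs[$i]}" &
done
wait                            # collect all per-config exit statuses
```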
CodeAnt AI incremental review completed.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/scripts/frontier_bench_post.sh:
- Around lines 34-35: The module environment sourced by `. ./mfc.sh load -c "$flag" -m g` is lost because it runs in a separate subshell from the subsequent `./mfc.sh bench_diff` call. Fix it by running both commands in the same subshell, or by sourcing the loader in the parent shell so the environment persists — e.g., combine the two separate `(cd pr && ...)` invocations into a single `(cd pr && . ./mfc.sh load -c "$flag" -m g && ./mfc.sh bench_diff "../$master_yaml" "../$pr_yaml")` (or alternatively run `cd pr`, then source `mfc.sh load` in the current shell before invoking `./mfc.sh bench_diff`), ensuring the loader (./mfc.sh load) and bench_diff run in the same shell/session.
In @.github/scripts/frontier_test_config.sh:
- Around line 28-34: The GPU probe using rocm-smi (the gpus and ngpus
assignments) is executed unconditionally and will cause CPU-only runs to fail;
move the rocm-smi call and the ngpus calculation inside the branch that runs
when device equals "gpu". Specifically, relocate the lines that set
gpus=$(rocm-smi ...) and ngpus=$(...) into the if [ "$device" = "gpu" ] branch
before invoking ./mfc.sh, and add a safe fallback for ngpus (e.g., default to 1
or exit with a clear error) if rocm-smi yields no results; ensure you reference
the same variable names (gpus, ngpus, device) and preserve the existing ./mfc.sh
invocation when fixing.
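A sketch of the relocated probe; the rocm-smi flags, output parsing, and mfc.sh invocations are illustrative assumptions, not the script's exact code:

```bash
if [ "$device" = "gpu" ]; then
  # Probe GPUs only for GPU configs, so CPU-only nodes without ROCm
  # never execute rocm-smi (which would trip set -e).
  gpus=$(rocm-smi --showid)                 # illustrative; exact flags assumed
  ngpus=$(printf '%s\n' "$gpus" | grep -c '^GPU')
  if [ "$ngpus" -lt 1 ]; then
    echo "error: rocm-smi reported no GPUs on $(hostname)" >&2
    exit 1
  fi
  ./mfc.sh test -j "$ngpus" --gpu           # hypothetical invocation
else
  ./mfc.sh test -j "$(nproc)"               # hypothetical invocation
fi
```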
In @.github/workflows/test.yml:
- Around line 270-275: The "Setup & Build" workflow step is dead because
matrix.build_script is set to the empty string for every matrix entry, so the
conditional if: matrix.build_script != '' is never true; either remove the
entire "Setup & Build" step from the workflow or populate the matrix entries
with non-empty build_script values for the matrix entries that need building so
the condition can be true, and update any references to that step accordingly
(look for the job name/step "Setup & Build" and the matrix variable
matrix.build_script).
- Around line 310-319: The workflow only implements one-directional cancellation
via the job named cancel-on-test-failure; to make cancellation bidirectional add
a symmetric job (e.g., cancel-on-bench-failure) that uses needs: [bench], if:
failure(), runs-on: ubuntu-latest and the same steps (Cancel Workflow Run using
gh run cancel ${{ github.run_id }} --repo ${{ github.repository }} with GH_TOKEN
env) so benchmarks failing will cancel the tests; ensure the new job mirrors
cancel-on-test-failure but references needs: [bench] and an appropriate name and
position in the YAML.
🧹 Nitpick comments (4)
.github/scripts/frontier_bench_config.sh (1)

29-33: Quote `$(nproc)` to prevent word splitting (ShellCheck SC2046). While `nproc` returns a single integer in practice, quoting the command substitution is the safer shell idiom and silences the linter.

Proposed fix

```diff
-    ./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c "$cluster" $device_opts -n $n_ranks
+    ./mfc.sh bench --mem 1 -j "$(nproc)" -o "$job_slug.yaml" -- -c "$cluster" $device_opts -n $n_ranks
```

.github/scripts/run_frontier_all_benchmarks.sh (2)

78-86: Duplicated config table inside the heredoc is a maintenance hazard. The config array on lines 79–86 must be kept in exact sync with lines 15–22. If someone adds, removes, or reorders a config in one place but not the other, the node-to-config mapping silently breaks (wrong benchmark runs on wrong node, or index out of bounds). The same pattern appears in `run_frontier_all_tests.sh` lines 81–88. Consider generating the inner config table dynamically (e.g., write it to a file in Phase 1 and source it inside the heredoc), or at minimum add a CI-time assertion that both arrays match.

111-118: `$?` after a failed `if wait` may not carry the original exit code in all shells. In bash, `$?` in the `else` branch of `if wait "$pid"` correctly reflects the child's exit status. However, this is a bash-specific behavior — the shebang (#!/bin/bash) ensures it, but worth a brief comment for future maintainers, since it's a subtle pattern.

.github/scripts/run_frontier_all_tests.sh (1)

56-56: Hardcoded "5-node" in the echo; the benchmarks script uses `$num_nodes`. Minor inconsistency — `run_frontier_all_benchmarks.sh` line 53 uses `"Submitting ${num_nodes}-node SLURM job..."` while this script hardcodes the number. Using `$num_nodes` keeps it self-documenting if configs change.

Proposed fix

```diff
-echo "All builds complete. Submitting 5-node SLURM job..."
+echo "All builds complete. Submitting ${num_nodes}-node SLURM job..."
```
4 issues found across 6 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/scripts/frontier_test_config.sh">
<violation number="1" location=".github/scripts/frontier_test_config.sh:28">
P1: The `rocm-smi` command is invoked unconditionally before the device type check. With `set -e` enabled, this will cause CPU-only tests to fail immediately on nodes without ROCm installed, as the script will terminate before reaching the `else` branch. Move the GPU discovery logic inside the `if [ "$device" = "gpu" ]` block.</violation>
</file>
<file name=".github/scripts/run_frontier_all_benchmarks.sh">
<violation number="1" location=".github/scripts/run_frontier_all_benchmarks.sh:140">
P1: The job ID extraction using `grep -oE '[0-9]+'` is fragile and can incorrectly match numbers from error messages (e.g., account names like 'ENG160' or error codes). This could cause the script to proceed with an invalid job ID when `sbatch` actually failed. Parse only the specific 'Submitted batch job' line to ensure a real job ID is captured.</violation>
</file>
<file name=".github/scripts/frontier_bench_config.sh">
<violation number="1" location=".github/scripts/frontier_bench_config.sh:13">
P2: `frontier_bench_config.sh` always loads GPU mode (`-m g`) even when `device=cpu`, unlike the test config. CPU benchmark runs will get the wrong module set; mirror the device-based mode selection.</violation>
</file>
<file name=".github/scripts/frontier_bench_post.sh">
<violation number="1" location=".github/scripts/frontier_bench_post.sh:35">
P1: The environment setup and `bench_diff` invocation are in separate subshells, so the environment changes from `. ./mfc.sh load` are not visible to `bench_diff`. Both commands should run in the same subshell to ensure the loaded environment is applied.</violation>
</file>
Use rsync with --link-dest and --exclude to avoid copying the target directories back into themselves when creating per-config source copies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
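The pattern is roughly the following; directory names are hypothetical:

```bash
# Copy the source tree into a per-config directory without recursing into
# the other per-config copies; --link-dest hardlinks unchanged files so
# each copy is cheap.
rsync -a \
  --exclude='/build-*/' \
  --link-dest="$PWD/src" \
  src/ "build-$config/"
```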
- Move rocm-smi GPU discovery inside the gpu branch so CPU configs don't fail with set -e (frontier_test_config.sh)
- Combine mfc.sh load and bench_diff into a single subshell so the loaded module environment persists (frontier_bench_post.sh)
- Use device-based mode selection in bench config for consistency (frontier_bench_config.sh)
- Parse the 'Submitted batch job' line instead of a bare grep for digits (run_frontier_all_tests.sh, run_frontier_all_benchmarks.sh)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
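The job-ID fix in the last bullet amounts to anchoring on sbatch's success line instead of grepping for any digits, along these lines (a sketch, not the scripts' exact code):

```bash
out=$(sbatch "$script")        # prints "Submitted batch job <id>" on success
job_id=$(printf '%s\n' "$out" | awk '/^Submitted batch job/ {print $4}')
if [ -z "$job_id" ]; then
  echo "sbatch submission failed: $out" >&2
  exit 1
fi
```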
Prevents rapid pushes from cancelling unrelated jobs. Each job type (github, self, bench) now has its own concurrency group keyed by ref and matrix params, so a new push only cancels stale instances of the same job — not all jobs across the workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
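Job-level concurrency groups in GitHub Actions look roughly like this; the matrix keys shown are illustrative, not the exact ones in test.yml:

```yaml
jobs:
  self:
    concurrency:
      # Keyed by ref plus matrix params: a new push cancels only the stale
      # instance of this same job, not every job in the workflow.
      group: self-${{ github.ref }}-${{ matrix.device }}
      cancel-in-progress: true
```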
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/test.yml">
<violation number="1" location=".github/workflows/test.yml:68">
P2: The concurrency group for the github matrix job omits matrix.precision, so the default-precision job and the `single` precision job share the same group and can cancel each other when `cancel-in-progress: true` is enabled. Include precision in the concurrency key to avoid unintended cancellations and preserve parallelism.</violation>
</file>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Eliminate config table duplication by writing configs to a file and reading it inside the sbatch heredoc (now unquoted for variable expansion)
- Add EXIT trap in sbatch scripts to kill orphaned SSH processes on job cancellation
- Add per-config timeout (90 min tests, 120 min benchmarks) to prevent a single hanging config from consuming the full walltime
- Extract SLURM account, partition, walltime, and node count into variables at the top of each orchestration script
- Validate GPU count from rocm-smi (1-16 range) with diagnostic output on failure in both test and benchmark compute-node scripts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
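A sketch of the trap-and-timeout pieces of that list, with the ssh command shape schematic and the arrays assumed defined earlier in the script:

```bash
# Inside the generated sbatch script: kill any still-running per-config SSH
# children if SLURM cancels the job or it exits early.
trap 'kill $(jobs -p) 2>/dev/null || true' EXIT

for i in "${!configs[@]}"; do
  # Cap each config (90 min for tests, 120 for benchmarks) so one hang
  # cannot consume the full walltime.
  timeout 90m ssh "${nodes[$i]}" "./frontier_test_config.sh ${configs[$i]}" &
done
wait
```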
CodeAnt AI incremental review completed.
Running all 5-6 builds simultaneously on a Frontier login node causes memory pressure that crashes the Cray compiler (optcg segfault). Cap concurrent builds at 2 using a polling semaphore to stay within login node resource limits. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI incremental review completed.
Hardlink copies (cp -al / rsync --link-dest) share the same .git/index inode across all config directories. When parallel builds invoke setuptools_scm, concurrent git status calls on the shared index cause timeouts on Lustre. Bypass git entirely by setting the pretend version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
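setuptools_scm supports bypassing VCS discovery via an environment variable; a minimal sketch, with a placeholder version string:

```bash
# Tell setuptools_scm the version up front so it never shells out to git,
# avoiding concurrent `git status` calls on the shared .git/index over Lustre.
export SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0"   # placeholder value
```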
The throttle loop blocks the main process while waiting for build slots, so the post-loop heartbeat subprocess never starts until late. Add inline heartbeat every 120s during throttle waits showing the last log line of each active build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
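Combined with the build cap from the earlier commit, the throttle-with-heartbeat loop might look like this; the log file naming is an assumption:

```bash
MAX_PARALLEL=2
waited=0
while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
  sleep 5
  waited=$((waited + 5))
  if [ $((waited % 120)) -eq 0 ]; then
    # Heartbeat: surface the last log line of each active build.
    for log in build-*.log; do
      [ -e "$log" ] && echo "[$log] $(tail -n 1 "$log")"
    done
  fi
done
```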
CodeAnt AI incremental review completed.
read -t 0.1 (sub-second timeout) in a loop with process substitution file descriptors triggers a bash internal error (unwind_frame_run: read_builtin: frame not found) leading to a segfault. Use integer timeout (read -t 1) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
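In context, the change is just the timeout argument on the read from the process-substitution file descriptor; the producer shown here is hypothetical:

```bash
# Integer timeout; read -t 0.1 triggered the bash internal error described above.
exec 3< <(tail -f "$build_log")          # hypothetical producer
while read -r -t 1 -u 3 line || [ $? -gt 128 ]; do
  [ -n "$line" ] && echo "$line"         # status >128 means timeout: keep polling
done
```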
Hardlinked source copies caused deterministic Cray compiler segfaults in optcg during IPA when compiling m_phase_change.fpp.f90 with OpenACC. Switching to real copies avoids inode sharing on Lustre that triggers the compiler bug. Also bump MAX_PARALLEL from 2 to 3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeAnt AI incremental review completed.
Phoenix runners work fine without the max-old-space-size override. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The actions/checkout clean step fails with ESTALE errors on NFS-backed storage when build artifacts from previous runs have stale file handles. Pre-clean with rm -rf (which tolerates stale handles) and disable checkout's built-in clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
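Schematically — checkout's `clean` input is a real option, while the step placement and action version are assumptions:

```yaml
      - name: Pre-clean workspace
        # rm -rf tolerates stale NFS handles that make git clean fail
        run: rm -rf "$GITHUB_WORKSPACE"/* || true
      - uses: actions/checkout@v4
        with:
          clean: false   # skip checkout's built-in git clean/reset
```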
CodeAnt-AI Description
Run Frontier tests and benchmarks in consolidated multi-node SLURM jobs and run benchmarks in parallel with tests
Impact
✅ Faster merges (shorter CI turnaround)
✅ Fewer Frontier SLURM jobs in the queue
✅ Fewer correctness runs cancelled by benchmark flakes