Add dep check for ptq and runtime check for evaluation/deployment#1240
Add dep check for ptq and runtime check for evaluation/deployment#1240
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds pre-submission container registry authentication checks and gating for SLURM jobs, inserts SLURM-auth verification steps in deployment/evaluation guides, adds PTQ model-specific dependency detection, and enables optional extra pip installs via Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as Deployment CLI
participant SLURM
participant Node as Compute Node
participant Registry as Container Registry
User->>CLI: invoke deployment/evaluation that will submit a SLURM job
CLI->>CLI: detect runtime (enroot/docker) and derive registry from image URI
CLI->>CLI: check local credential locations (enroot/docker paths)
alt credentials missing
CLI->>User: output remediation commands or recommend authenticated image (e.g., NGC)
User->>CLI: run credential setup locally or choose alternate image
end
CLI->>SLURM: submit job only after credentials verified
SLURM->>Node: schedule job to compute node
Node->>Registry: attempt image pull
Registry-->>Node: image response (success or auth error)
Node->>Node: run container if pull succeeds
sequenceDiagram
participant User
participant PTQGuide as PTQ Workflow
participant Model as Model Config
participant Scanner as Import Scanner
participant Launcher as PTQ Launcher Script
participant Env as Execution Environment
User->>PTQGuide: start PTQ workflow
PTQGuide->>Model: inspect config.json for trust_remote_code
alt trust_remote_code = true
PTQGuide->>Scanner: scan modeling_*.py for imports
Scanner->>Scanner: map imports to pip packages (known mappings)
Scanner-->>User: list required EXTRA_PIP_DEPS
User->>Launcher: set EXTRA_PIP_DEPS and run launcher
end
Launcher->>Env: check EXTRA_PIP_DEPS
alt EXTRA_PIP_DEPS non-empty
Env->>Env: unset PIP_CONSTRAINT
Env->>Env: pip install $EXTRA_PIP_DEPS
end
Launcher->>Launcher: proceed with model download and PTQ execution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1240 +/- ##
==========================================
- Coverage 75.58% 75.10% -0.48%
==========================================
Files 459 459
Lines 48613 48613
==========================================
- Hits 36745 36512 -233
- Misses 11868 12101 +233
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
dcc8c9a to
b1ac809
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/common/slurm-setup.md:
- Around line 235-256: The snippets that directly cat credential files (the
commands referencing ~/.config/enroot/.credentials, ~/.docker/config.json, and
~/.singularity/docker-config.json or ~/.apptainer/docker-config.json) must not
be left as-is; change them to commands that only enumerate registry keys or
"machine <registry>" entries and redact any secrets instead of printing full
files. Locate the three occurrences (the enroot credentials check, the Docker
config json check, and the Singularity/Apptainer check) and replace the raw-dump
approach with parsing that extracts and prints only registry hostnames/machine
names (e.g., use a JSON parser like jq or a simple grep/awk pipeline) and ensure
token/auth fields are omitted or redacted in the output. Ensure the help text
describes that the check shows only registry keys/machine entries, not secrets.
- Around line 287-305: Replace the CLI password flags that expose secrets in the
commands "docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>" and
"singularity remote login --username '$oauthtoken' --password <ngc_api_key>
docker://nvcr.io" with stdin-based or interactive login flows: for Docker use
the --password-stdin flow (pipe the API key into docker login) and for
Singularity/Apptainer omit the --password flag to trigger an interactive prompt
or use the tool's supported stdin/token mechanism or environment variable helper
so secrets are not passed as command-line arguments.
In @.claude/skills/evaluation/SKILL.md:
- Line 31: The checklist was updated to add "Step 7.5: Check container registry
auth (SLURM only)" but the subsequent "Now, copy this checklist" block in
SKILL.md still skips 7.5; open .claude/skills/evaluation/SKILL.md, find the
"Now, copy this checklist" section and add the corresponding checkbox line for
Step 7.5 (matching the exact wording "Step 7.5: Check container registry auth
(SLURM only)") so the copyable checklist and any progress-tracking list include
that step and numbering remains consistent with the original checklist.
- Around line 202-214: The decision flow contradicts the pre-submit auth gate:
update the flow in the "Decision flow" section so authentication is verified and
handled before any job submission rather than submitting and reacting to a 401;
specifically, replace the current steps with: check DockerHub credentials first
(per the pre-submit auth gate), attempt to add DockerHub creds if missing, if
creds cannot be provided then set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and only then submit the job, and keep the "Do
not retry more than once" rule; ensure references to "Default behavior",
"pre-submit auth gate", "deployment.image", and the NGC image string are updated
accordingly.
- Around line 199-200: Replace the unsafe raw credential dump command ssh <host>
"cat ~/.config/enroot/.credentials 2>/dev/null" with a non-leaking check that
only verifies presence of registry/machine entries or keys rather than printing
values; update the SKILL.md example so it inspects the file structure (e.g.,
look for "machine" entries or specific registry keys) or parses JSON/YAML to
report existence/status and masks or omits secret values, ensuring no full
credentials are echoed or logged.
- Line 80: The file has conflicting guidance: one sentence ("DON'T ALLOW FOR ANY
OTHER OPTIONS") hard-restricts accepted categories while another warns that CLI
options "may differ from this list" and says "Use the CLI's current options, not
this list"; reconcile by removing or softening the hard restriction and making
the doc authoritative only as an example, e.g., replace "DON'T ALLOW FOR ANY
OTHER OPTIONS" with a consistent instruction that callers should validate
against the CLI output (nel skills build-config --help) and accept CLI-provided
categories rather than rejecting options not listed in the doc; update the text
around the example categories and any validation guidance to explicitly prefer
the CLI's current options and to allow other valid CLI values.
In `@tools/launcher/common/hf/ptq.sh`:
- Around line 29-33: The pip install invocation uses unquoted $EXTRA_PIP_DEPS
which risks word-splitting/globbing; split the value into a shell array and
install using the array expansion to preserve each dependency token. For
example, parse EXTRA_PIP_DEPS into an array (e.g., read -r -a extras <<<
"$EXTRA_PIP_DEPS" or IFS=' ' read -r -a extras <<< "$EXTRA_PIP_DEPS") and
replace pip install $EXTRA_PIP_DEPS with pip install "${extras[@]}", keeping the
existing unset PIP_CONSTRAINT and condition around EXTRA_PIP_DEPS.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5232f7df-8043-4cd3-85fd-ec40f3a1bfcc
📒 Files selected for processing (5)
.claude/skills/common/slurm-setup.md.claude/skills/deployment/SKILL.md.claude/skills/evaluation/SKILL.md.claude/skills/ptq/SKILL.mdtools/launcher/common/hf/ptq.sh
|
|
||
| --- | ||
|
|
||
| ## 6. Container Registry Authentication |
There was a problem hiding this comment.
This also applies for local container setup, as I suggested in @Edwardf0t1 's PR, maybe we can put this in evn-setup.md for setting HF_TOKEN, docker login token, and ngc token.
There was a problem hiding this comment.
This is a structural reorganization and I’m fine with either dirs.
There was a problem hiding this comment.
Check the issue again. We already have one environment-setup.md, adding another env-setup.md seems to be a bit confusing. Besides, the auth section is specifically about container registry credentials for SLURM job submission, such as detecting runtimes, checking credentials per runtime, and fixing auth before sbatch. That fits SLURM setup, not general environment setup. So I prefer to keep them in slurm-setup.md. cc @Edwardf0t1
| | Runtime | Typical clusters | SLURM integration | | ||
| | --- | --- | --- | | ||
| | **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` | | ||
| | **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script | |
There was a problem hiding this comment.
About singularity and apptainer, seems we don't have clusters to test. Our overall workflow didn't consider this, we added only template for pyxis and docker, so maybe we can wait for community contribution about this, wdyt?
There was a problem hiding this comment.
Sure, I've removed them.
Edwardf0t1
left a comment
There was a problem hiding this comment.
LGTM in general, left some comments.
b1ac809 to
64d98f7
Compare
Edwardf0t1
left a comment
There was a problem hiding this comment.
Left 1 more minor comment.
|
|
||
| Determine the registry from the image URI: | ||
|
|
||
| | Image pattern | Registry | |
There was a problem hiding this comment.
Add SGLang docker:lmsysorg/sglang:...
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
64d98f7 to
0b54c33
Compare
What does this PR do?
Type of change: ?
PTQ: model-specific dependency support
ptq.shso models requiring extra pip packages (e.g.,mamba-ssmfor hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.Container registry auth checks
Usage
Set EXTRA_PIP_DEPS in the launcher YAML's environment section:
Testing
Tested end-to-end: NVFP4 quantization of
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16on a B200 cluster via the launcher. Job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor shards).Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit
Documentation
New Features