Add dep check for ptq and runtime check for evaluation/deployment by kaix-nv · Pull Request #1240 · NVIDIA/Model-Optimizer

kaix-nv · 2026-04-12T23:05:21Z

What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

Add EXTRA_PIP_DEPS support to the launcher's ptq.sh so models requiring extra pip packages (e.g., mamba-ssm for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

Add new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer. Includes credential locations, how to add them, and common failure modes.
Add Step 7.5 with NEL default image table, DockerHub-first strategy with NGC fallback, and build-config CLI note.
Add auth check before remote SLURM deployment.

Usage

Set EXTRA_PIP_DEPS in the launcher YAML's environment section:

task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"

Testing

Tested end-to-end: NVFP4 quantization of NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 on a B200 cluster via the launcher. Job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor shards).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Documentation
- Added container registry authentication verification for SLURM deployments: credential checks, verification commands, common failure symptoms, remediation, and blocking job submission until auth is confirmed.
- Added SLURM-only pre-submission verification and image-fallback recommendations.
- Added model dependency check for trust_remote_code models and tightened build-config guidance.
- Updated PTQ launcher documentation to reference the new PTQ wrapper.
New Features
- Support for specifying extra pip dependencies via an environment variable before model processing.

copy-pr-bot · 2026-04-12T23:05:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-04-12T23:05:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7d987757-ea4d-47ea-9e61-4175c280987d

📥 Commits

Reviewing files that changed from the base of the PR and between 64d98f7 and 0b54c33.

📒 Files selected for processing (6)

.claude/skills/common/slurm-setup.md
.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md
.claude/skills/ptq/SKILL.md
.claude/skills/ptq/references/launcher-guide.md
tools/launcher/common/hf/ptq.sh

✅ Files skipped from review due to trivial changes (3)

.claude/skills/ptq/references/launcher-guide.md
.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md

🚧 Files skipped from review as they are similar to previous changes (3)

.claude/skills/common/slurm-setup.md
.claude/skills/ptq/SKILL.md
tools/launcher/common/hf/ptq.sh

📝 Walkthrough

Walkthrough

Adds pre-submission container registry authentication checks and gating for SLURM jobs, inserts SLURM-auth verification steps in deployment/evaluation guides, adds PTQ model-specific dependency detection, and enables optional extra pip installs via EXTRA_PIP_DEPS in the PTQ launcher script.

Changes

Cohort / File(s)	Summary
SLURM container auth docs & gating `.claude/skills/common/slurm-setup.md`, `.claude/skills/deployment/SKILL.md`, `.claude/skills/evaluation/SKILL.md`	Added "Container Registry Authentication" section: runtime detection (enroot/docker), registry derivation from image URIs, checks for local credential files, non-job-submitting credential setup commands, verification commands (including singularity dry-run), common pull-failure diagnostics, and explicit gating that blocks SLURM submission until auth verified.
PTQ dependency detection & launcher changes `.claude/skills/ptq/SKILL.md`, `.claude/skills/ptq/references/launcher-guide.md`, `tools/launcher/common/hf/ptq.sh`	Inserted Step 2.5 to detect `trust_remote_code` models (inspect `config.json` and scan `modeling_*.py`), map imports to pip packages, document handling via `EXTRA_PIP_DEPS`; updated references to use `common/hf/ptq.sh`; added pre-execution `EXTRA_PIP_DEPS` pip-install logic in the launcher script.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as Deployment CLI
    participant SLURM
    participant Node as Compute Node
    participant Registry as Container Registry

    User->>CLI: invoke deployment/evaluation that will submit a SLURM job
    CLI->>CLI: detect runtime (enroot/docker) and derive registry from image URI
    CLI->>CLI: check local credential locations (enroot/docker paths)
    alt credentials missing
        CLI->>User: output remediation commands or recommend authenticated image (e.g., NGC)
        User->>CLI: run credential setup locally or choose alternate image
    end
    CLI->>SLURM: submit job only after credentials verified
    SLURM->>Node: schedule job to compute node
    Node->>Registry: attempt image pull
    Registry-->>Node: image response (success or auth error)
    Node->>Node: run container if pull succeeds

sequenceDiagram
    participant User
    participant PTQGuide as PTQ Workflow
    participant Model as Model Config
    participant Scanner as Import Scanner
    participant Launcher as PTQ Launcher Script
    participant Env as Execution Environment

    User->>PTQGuide: start PTQ workflow
    PTQGuide->>Model: inspect config.json for trust_remote_code
    alt trust_remote_code = true
        PTQGuide->>Scanner: scan modeling_*.py for imports
        Scanner->>Scanner: map imports to pip packages (known mappings)
        Scanner-->>User: list required EXTRA_PIP_DEPS
        User->>Launcher: set EXTRA_PIP_DEPS and run launcher
    end
    Launcher->>Env: check EXTRA_PIP_DEPS
    alt EXTRA_PIP_DEPS non-empty
        Env->>Env: unset PIP_CONSTRAINT
        Env->>Env: pip install $EXTRA_PIP_DEPS
    end
    Launcher->>Launcher: proceed with model download and PTQ execution

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: adding dependency checking for PTQ and runtime/registry authentication checks for evaluation and deployment workflows.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns	✅ Passed	PR contains only documentation updates and shell script changes with no Python code modifications, therefore no Python security anti-patterns apply.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kaix/nemo_skills

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-12T23:10:08Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-16 18:23 UTC

codecov · 2026-04-12T23:17:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.10%. Comparing base (f238d93) to head (0b54c33).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1240      +/-   ##
==========================================
- Coverage   75.58%   75.10%   -0.48%     
==========================================
  Files         459      459              
  Lines       48613    48613              
==========================================
- Hits        36745    36512     -233     
- Misses      11868    12101     +233

Flag	Coverage Δ
gpu	`59.96% <ø> (-0.54%)`	⬇️
unit	`52.20% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/slurm-setup.md:
- Around line 235-256: The snippets that directly cat credential files (the
commands referencing ~/.config/enroot/.credentials, ~/.docker/config.json, and
~/.singularity/docker-config.json or ~/.apptainer/docker-config.json) must not
be left as-is; change them to commands that only enumerate registry keys or
"machine <registry>" entries and redact any secrets instead of printing full
files. Locate the three occurrences (the enroot credentials check, the Docker
config json check, and the Singularity/Apptainer check) and replace the raw-dump
approach with parsing that extracts and prints only registry hostnames/machine
names (e.g., use a JSON parser like jq or a simple grep/awk pipeline) and ensure
token/auth fields are omitted or redacted in the output. Ensure the help text
describes that the check shows only registry keys/machine entries, not secrets.
- Around line 287-305: Replace the CLI password flags that expose secrets in the
commands "docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>" and
"singularity remote login --username '$oauthtoken' --password <ngc_api_key>
docker://nvcr.io" with stdin-based or interactive login flows: for Docker use
the --password-stdin flow (pipe the API key into docker login) and for
Singularity/Apptainer omit the --password flag to trigger an interactive prompt
or use the tool's supported stdin/token mechanism or environment variable helper
so secrets are not passed as command-line arguments.

In @.claude/skills/evaluation/SKILL.md:
- Line 31: The checklist was updated to add "Step 7.5: Check container registry
auth (SLURM only)" but the subsequent "Now, copy this checklist" block in
SKILL.md still skips 7.5; open .claude/skills/evaluation/SKILL.md, find the
"Now, copy this checklist" section and add the corresponding checkbox line for
Step 7.5 (matching the exact wording "Step 7.5: Check container registry auth
(SLURM only)") so the copyable checklist and any progress-tracking list include
that step and numbering remains consistent with the original checklist.
- Around line 202-214: The decision flow contradicts the pre-submit auth gate:
update the flow in the "Decision flow" section so authentication is verified and
handled before any job submission rather than submitting and reacting to a 401;
specifically, replace the current steps with: check DockerHub credentials first
(per the pre-submit auth gate), attempt to add DockerHub creds if missing, if
creds cannot be provided then set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and only then submit the job, and keep the "Do
not retry more than once" rule; ensure references to "Default behavior",
"pre-submit auth gate", "deployment.image", and the NGC image string are updated
accordingly.
- Around line 199-200: Replace the unsafe raw credential dump command ssh <host>
"cat ~/.config/enroot/.credentials 2>/dev/null" with a non-leaking check that
only verifies presence of registry/machine entries or keys rather than printing
values; update the SKILL.md example so it inspects the file structure (e.g.,
look for "machine" entries or specific registry keys) or parses JSON/YAML to
report existence/status and masks or omits secret values, ensuring no full
credentials are echoed or logged.
- Line 80: The file has conflicting guidance: one sentence ("DON'T ALLOW FOR ANY
OTHER OPTIONS") hard-restricts accepted categories while another warns that CLI
options "may differ from this list" and says "Use the CLI's current options, not
this list"; reconcile by removing or softening the hard restriction and making
the doc authoritative only as an example, e.g., replace "DON'T ALLOW FOR ANY
OTHER OPTIONS" with a consistent instruction that callers should validate
against the CLI output (nel skills build-config --help) and accept CLI-provided
categories rather than rejecting options not listed in the doc; update the text
around the example categories and any validation guidance to explicitly prefer
the CLI's current options and to allow other valid CLI values.

In `@tools/launcher/common/hf/ptq.sh`:
- Around line 29-33: The pip install invocation uses unquoted $EXTRA_PIP_DEPS
which risks word-splitting/globbing; split the value into a shell array and
install using the array expansion to preserve each dependency token. For
example, parse EXTRA_PIP_DEPS into an array (e.g., read -r -a extras <<<
"$EXTRA_PIP_DEPS" or IFS=' ' read -r -a extras <<< "$EXTRA_PIP_DEPS") and
replace pip install $EXTRA_PIP_DEPS with pip install "${extras[@]}", keeping the
existing unset PIP_CONSTRAINT and condition around EXTRA_PIP_DEPS.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5232f7df-8043-4cd3-85fd-ec40f3a1bfcc

📥 Commits

Reviewing files that changed from the base of the PR and between 14b78ae and b1ac809.

📒 Files selected for processing (5)

.claude/skills/common/slurm-setup.md
.claude/skills/deployment/SKILL.md
.claude/skills/evaluation/SKILL.md
.claude/skills/ptq/SKILL.md
tools/launcher/common/hf/ptq.sh

mxinO · 2026-04-14T08:00:39Z

+
+---
+
+## 6. Container Registry Authentication


This also applies for local container setup, as I suggested in @Edwardf0t1 's PR, maybe we can put this in evn-setup.md for setting HF_TOKEN, docker login token, and ngc token.

This is a structural reorganization and I’m fine with either dirs.

Check the issue again. We already have one environment-setup.md, adding another env-setup.md seems to be a bit confusing. Besides, the auth section is specifically about container registry credentials for SLURM job submission, such as detecting runtimes, checking credentials per runtime, and fixing auth before sbatch. That fits SLURM setup, not general environment setup. So I prefer to keep them in slurm-setup.md. cc @Edwardf0t1

mxinO · 2026-04-14T08:04:49Z

+| Runtime | Typical clusters | SLURM integration |
+| --- | --- | --- |
+| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
+| **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script |


About singularity and apptainer, seems we don't have clusters to test. Our overall workflow didn't consider this, we added only template for pyxis and docker, so maybe we can wait for community contribution about this, wdyt?

Sure, I've removed them.

Edwardf0t1

LGTM in general, left some comments.

Added new comments

Edwardf0t1

Left 1 more minor comment.

Edwardf0t1 · 2026-04-16T08:55:03Z

+
+Determine the registry from the image URI:
+
+| Image pattern | Registry |


Add SGLang docker:lmsysorg/sglang:...

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv requested review from Edwardf0t1 and mxinO April 13, 2026 02:57

kaix-nv force-pushed the kaix/nemo_skills branch from dcc8c9a to b1ac809 Compare April 13, 2026 21:16

kaix-nv changed the title ~~Add dependencies check for ptq~~ Add dependencies check for ptq; Add runtime check for evaluation/deployment Apr 13, 2026

kaix-nv changed the title ~~Add dependencies check for ptq; Add runtime check for evaluation/deployment~~ Add dep check for ptq and runtime check for evaluation/deployment Apr 13, 2026

kaix-nv marked this pull request as ready for review April 14, 2026 03:49

coderabbitai bot reviewed Apr 14, 2026

View reviewed changes

mxinO reviewed Apr 14, 2026

View reviewed changes

Edwardf0t1 previously approved these changes Apr 14, 2026

View reviewed changes

Comment thread tools/launcher/common/hf/ptq.sh

Comment thread .claude/skills/evaluation/SKILL.md

Edwardf0t1 reviewed Apr 15, 2026

View reviewed changes

Comment thread tools/launcher/common/hf/ptq.sh

Comment thread .claude/skills/evaluation/SKILL.md

Edwardf0t1 self-requested a review April 15, 2026 03:11

kaix-nv force-pushed the kaix/nemo_skills branch from b1ac809 to 64d98f7 Compare April 16, 2026 01:03

Edwardf0t1 approved these changes Apr 16, 2026

View reviewed changes

kaix-nv added 3 commits April 16, 2026 10:14

Add dependencies check for ptq

7a51a6f

Signed-off-by: Kai Xu <kaix@nvidia.com>

Add remote authentication check

e6c5e28

Signed-off-by: Kai Xu <kaix@nvidia.com>

Address feedbacks

0b54c33

Signed-off-by: Kai Xu <kaix@nvidia.com>

kaix-nv force-pushed the kaix/nemo_skills branch from 64d98f7 to 0b54c33 Compare April 16, 2026 17:15

kaix-nv enabled auto-merge (squash) April 16, 2026 17:15

kaix-nv merged commit 6ded36b into main Apr 16, 2026
40 checks passed

kaix-nv deleted the kaix/nemo_skills branch April 16, 2026 18:21


		Determine the registry from the image URI:

		\| Image pattern \| Registry \|

Conversation

kaix-nv commented Apr 12, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot bot commented Apr 12, 2026

Uh oh!

coderabbitai bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

github-actions bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mxinO Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

kaix-nv Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

kaix-nv Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mxinO Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

kaix-nv Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Edwardf0t1 left a comment

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kaix-nv commented Apr 12, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 12, 2026 •

edited

Loading

github-actions bot commented Apr 12, 2026 •

edited

Loading

codecov bot commented Apr 12, 2026 •

edited

Loading

kaix-nv Apr 16, 2026 •

edited

Loading