Skip to content

Add dep check for ptq and runtime check for evaluation/deployment#1240

Merged
kaix-nv merged 3 commits intomainfrom
kaix/nemo_skills
Apr 16, 2026
Merged

Add dep check for ptq and runtime check for evaluation/deployment#1240
kaix-nv merged 3 commits intomainfrom
kaix/nemo_skills

Conversation

@kaix-nv
Copy link
Copy Markdown
Contributor

@kaix-nv kaix-nv commented Apr 12, 2026

What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

  • Add EXTRA_PIP_DEPS support to the launcher's ptq.sh so models requiring extra pip packages (e.g., mamba-ssm for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

  • Add new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer. Includes credential locations, how to add them, and common failure modes.
  • Add Step 7.5 with NEL default image table, DockerHub-first strategy with NGC fallback, and build-config CLI note.
  • Add auth check before remote SLURM deployment.

Usage

Set EXTRA_PIP_DEPS in the launcher YAML's environment section:

task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"

Testing

Tested end-to-end: NVFP4 quantization of NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 on a B200 cluster via the launcher. Job succeeded: mamba-ssm installed automatically, calibration completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor shards).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Documentation

    • Added container registry authentication verification for SLURM deployments: credential checks, verification commands, common failure symptoms, remediation, and blocking job submission until auth is confirmed.
    • Added SLURM-only pre-submission verification and image-fallback recommendations.
    • Added model dependency check for trust_remote_code models and tightened build-config guidance.
    • Updated PTQ launcher documentation to reference the new PTQ wrapper.
  • New Features

    • Support for specifying extra pip dependencies via an environment variable before model processing.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7d987757-ea4d-47ea-9e61-4175c280987d

📥 Commits

Reviewing files that changed from the base of the PR and between 64d98f7 and 0b54c33.

📒 Files selected for processing (6)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • .claude/skills/ptq/references/launcher-guide.md
  • tools/launcher/common/hf/ptq.sh
✅ Files skipped from review due to trivial changes (3)
  • .claude/skills/ptq/references/launcher-guide.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

📝 Walkthrough

Walkthrough

Adds pre-submission container registry authentication checks and gating for SLURM jobs, inserts SLURM-auth verification steps in deployment/evaluation guides, adds PTQ model-specific dependency detection, and enables optional extra pip installs via EXTRA_PIP_DEPS in the PTQ launcher script.

Changes

Cohort / File(s) Summary
SLURM container auth docs & gating
.claude/skills/common/slurm-setup.md, .claude/skills/deployment/SKILL.md, .claude/skills/evaluation/SKILL.md
Added "Container Registry Authentication" section: runtime detection (enroot/docker), registry derivation from image URIs, checks for local credential files, non-job-submitting credential setup commands, verification commands (including singularity dry-run), common pull-failure diagnostics, and explicit gating that blocks SLURM submission until auth verified.
PTQ dependency detection & launcher changes
.claude/skills/ptq/SKILL.md, .claude/skills/ptq/references/launcher-guide.md, tools/launcher/common/hf/ptq.sh
Inserted Step 2.5 to detect trust_remote_code models (inspect config.json and scan modeling_*.py), map imports to pip packages, document handling via EXTRA_PIP_DEPS; updated references to use common/hf/ptq.sh; added pre-execution EXTRA_PIP_DEPS pip-install logic in the launcher script.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as Deployment CLI
    participant SLURM
    participant Node as Compute Node
    participant Registry as Container Registry

    User->>CLI: invoke deployment/evaluation that will submit a SLURM job
    CLI->>CLI: detect runtime (enroot/docker) and derive registry from image URI
    CLI->>CLI: check local credential locations (enroot/docker paths)
    alt credentials missing
        CLI->>User: output remediation commands or recommend authenticated image (e.g., NGC)
        User->>CLI: run credential setup locally or choose alternate image
    end
    CLI->>SLURM: submit job only after credentials verified
    SLURM->>Node: schedule job to compute node
    Node->>Registry: attempt image pull
    Registry-->>Node: image response (success or auth error)
    Node->>Node: run container if pull succeeds
Loading
sequenceDiagram
    participant User
    participant PTQGuide as PTQ Workflow
    participant Model as Model Config
    participant Scanner as Import Scanner
    participant Launcher as PTQ Launcher Script
    participant Env as Execution Environment

    User->>PTQGuide: start PTQ workflow
    PTQGuide->>Model: inspect config.json for trust_remote_code
    alt trust_remote_code = true
        PTQGuide->>Scanner: scan modeling_*.py for imports
        Scanner->>Scanner: map imports to pip packages (known mappings)
        Scanner-->>User: list required EXTRA_PIP_DEPS
        User->>Launcher: set EXTRA_PIP_DEPS and run launcher
    end
    Launcher->>Env: check EXTRA_PIP_DEPS
    alt EXTRA_PIP_DEPS non-empty
        Env->>Env: unset PIP_CONSTRAINT
        Env->>Env: pip install $EXTRA_PIP_DEPS
    end
    Launcher->>Launcher: proceed with model download and PTQ execution
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: adding dependency checking for PTQ and runtime/registry authentication checks for evaluation and deployment workflows.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed PR contains only documentation updates and shell script changes with no Python code modifications, therefore no Python security anti-patterns apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kaix/nemo_skills

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-16 18:23 UTC

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.10%. Comparing base (f238d93) to head (0b54c33).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1240      +/-   ##
==========================================
- Coverage   75.58%   75.10%   -0.48%     
==========================================
  Files         459      459              
  Lines       48613    48613              
==========================================
- Hits        36745    36512     -233     
- Misses      11868    12101     +233     
Flag Coverage Δ
gpu 59.96% <ø> (-0.54%) ⬇️
unit 52.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kaix-nv kaix-nv requested review from Edwardf0t1 and mxinO April 13, 2026 02:57
@kaix-nv kaix-nv changed the title Add dependencies check for ptq Add dependencies check for ptq; Add runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv changed the title Add dependencies check for ptq; Add runtime check for evaluation/deployment Add dep check for ptq and runtime check for evaluation/deployment Apr 13, 2026
@kaix-nv kaix-nv marked this pull request as ready for review April 14, 2026 03:49
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/slurm-setup.md:
- Around line 235-256: The snippets that directly cat credential files (the
commands referencing ~/.config/enroot/.credentials, ~/.docker/config.json, and
~/.singularity/docker-config.json or ~/.apptainer/docker-config.json) must not
be left as-is; change them to commands that only enumerate registry keys or
"machine <registry>" entries and redact any secrets instead of printing full
files. Locate the three occurrences (the enroot credentials check, the Docker
config json check, and the Singularity/Apptainer check) and replace the raw-dump
approach with parsing that extracts and prints only registry hostnames/machine
names (e.g., use a JSON parser like jq or a simple grep/awk pipeline) and ensure
token/auth fields are omitted or redacted in the output. Ensure the help text
describes that the check shows only registry keys/machine entries, not secrets.
- Around line 287-305: Replace the CLI password flags that expose secrets in the
commands "docker login nvcr.io -u '$oauthtoken' -p <ngc_api_key>" and
"singularity remote login --username '$oauthtoken' --password <ngc_api_key>
docker://nvcr.io" with stdin-based or interactive login flows: for Docker use
the --password-stdin flow (pipe the API key into docker login) and for
Singularity/Apptainer omit the --password flag to trigger an interactive prompt
or use the tool's supported stdin/token mechanism or environment variable helper
so secrets are not passed as command-line arguments.

In @.claude/skills/evaluation/SKILL.md:
- Line 31: The checklist was updated to add "Step 7.5: Check container registry
auth (SLURM only)" but the subsequent "Now, copy this checklist" block in
SKILL.md still skips 7.5; open .claude/skills/evaluation/SKILL.md, find the
"Now, copy this checklist" section and add the corresponding checkbox line for
Step 7.5 (matching the exact wording "Step 7.5: Check container registry auth
(SLURM only)") so the copyable checklist and any progress-tracking list include
that step and numbering remains consistent with the original checklist.
- Around line 202-214: The decision flow contradicts the pre-submit auth gate:
update the flow in the "Decision flow" section so authentication is verified and
handled before any job submission rather than submitting and reacting to a 401;
specifically, replace the current steps with: check DockerHub credentials first
(per the pre-submit auth gate), attempt to add DockerHub creds if missing, if
creds cannot be provided then set deployment.image to the NGC image
(nvcr.io/nvidia/vllm:<YY.MM>-py3) and only then submit the job, and keep the "Do
not retry more than once" rule; ensure references to "Default behavior",
"pre-submit auth gate", "deployment.image", and the NGC image string are updated
accordingly.
- Around line 199-200: Replace the unsafe raw credential dump command ssh <host>
"cat ~/.config/enroot/.credentials 2>/dev/null" with a non-leaking check that
only verifies presence of registry/machine entries or keys rather than printing
values; update the SKILL.md example so it inspects the file structure (e.g.,
look for "machine" entries or specific registry keys) or parses JSON/YAML to
report existence/status and masks or omits secret values, ensuring no full
credentials are echoed or logged.
- Line 80: The file has conflicting guidance: one sentence ("DON'T ALLOW FOR ANY
OTHER OPTIONS") hard-restricts accepted categories while another warns that CLI
options "may differ from this list" and says "Use the CLI's current options, not
this list"; reconcile by removing or softening the hard restriction and making
the doc authoritative only as an example, e.g., replace "DON'T ALLOW FOR ANY
OTHER OPTIONS" with a consistent instruction that callers should validate
against the CLI output (nel skills build-config --help) and accept CLI-provided
categories rather than rejecting options not listed in the doc; update the text
around the example categories and any validation guidance to explicitly prefer
the CLI's current options and to allow other valid CLI values.

In `@tools/launcher/common/hf/ptq.sh`:
- Around line 29-33: The pip install invocation uses unquoted $EXTRA_PIP_DEPS
which risks word-splitting/globbing; split the value into a shell array and
install using the array expansion to preserve each dependency token. For
example, parse EXTRA_PIP_DEPS into an array (e.g., read -r -a extras <<<
"$EXTRA_PIP_DEPS" or IFS=' ' read -r -a extras <<< "$EXTRA_PIP_DEPS") and
replace pip install $EXTRA_PIP_DEPS with pip install "${extras[@]}", keeping the
existing unset PIP_CONSTRAINT and condition around EXTRA_PIP_DEPS.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5232f7df-8043-4cd3-85fd-ec40f3a1bfcc

📥 Commits

Reviewing files that changed from the base of the PR and between 14b78ae and b1ac809.

📒 Files selected for processing (5)
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/deployment/SKILL.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/ptq/SKILL.md
  • tools/launcher/common/hf/ptq.sh

Comment thread .claude/skills/common/slurm-setup.md Outdated
Comment thread .claude/skills/common/slurm-setup.md
Comment thread .claude/skills/evaluation/SKILL.md
Comment thread .claude/skills/evaluation/SKILL.md Outdated
Comment thread .claude/skills/evaluation/SKILL.md Outdated
Comment thread .claude/skills/evaluation/SKILL.md Outdated
Comment thread tools/launcher/common/hf/ptq.sh

---

## 6. Container Registry Authentication
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies for local container setup, as I suggested in @Edwardf0t1 's PR, maybe we can put this in evn-setup.md for setting HF_TOKEN, docker login token, and ngc token.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a structural reorganization and I’m fine with either dirs.

Copy link
Copy Markdown
Contributor Author

@kaix-nv kaix-nv Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the issue again. We already have one environment-setup.md, adding another env-setup.md seems to be a bit confusing. Besides, the auth section is specifically about container registry credentials for SLURM job submission, such as detecting runtimes, checking credentials per runtime, and fixing auth before sbatch. That fits SLURM setup, not general environment setup. So I prefer to keep them in slurm-setup.md. cc @Edwardf0t1

Comment thread .claude/skills/common/slurm-setup.md Outdated
| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Singularity/Apptainer** | HPC / academic clusters | `singularity exec` inside job script |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About singularity and apptainer, seems we don't have clusters to test. Our overall workflow didn't consider this, we added only template for pyxis and docker, so maybe we can wait for community contribution about this, wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've removed them.

Edwardf0t1
Edwardf0t1 previously approved these changes Apr 14, 2026
Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, left some comments.

Comment thread tools/launcher/common/hf/ptq.sh
Comment thread .claude/skills/evaluation/SKILL.md
Comment thread tools/launcher/common/hf/ptq.sh
Comment thread .claude/skills/evaluation/SKILL.md
@Edwardf0t1 Edwardf0t1 self-requested a review April 15, 2026 03:11
@Edwardf0t1 Edwardf0t1 dismissed their stale review April 15, 2026 03:12

Added new comments

Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left 1 more minor comment.


Determine the registry from the image URI:

| Image pattern | Registry |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add SGLang docker:lmsysorg/sglang:...

kaix-nv added 3 commits April 16, 2026 10:14
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
Signed-off-by: Kai Xu <kaix@nvidia.com>
@kaix-nv kaix-nv enabled auto-merge (squash) April 16, 2026 17:15
@kaix-nv kaix-nv merged commit 6ded36b into main Apr 16, 2026
40 checks passed
@kaix-nv kaix-nv deleted the kaix/nemo_skills branch April 16, 2026 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants