diff --git a/.claude/skills/common/end-to-end-workflow.md b/.claude/skills/common/end-to-end-workflow.md new file mode 100644 index 0000000000..1dae03c2e5 --- /dev/null +++ b/.claude/skills/common/end-to-end-workflow.md @@ -0,0 +1,70 @@ +# End-to-End Workflow: PTQ → Deploy → Eval + +This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy. + +## Pipeline Overview + +```text +PTQ (quantize) → Deployment (serve) → Evaluation (benchmark) +───────────────── ────────────────── ──────────────────────── +hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET) + ↓ ↓ ↓ +NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores + (safetensors) (http://host:8000) (results.yml) +``` + +## Workspace Continuity + +All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside: + +```text +workspaces/model-name-format/ + output/ ← PTQ checkpoint (safetensors + config.json) + eval_results/ ← NEL evaluation artifacts (results.yml per task) + eval_config.yaml ← NEL config for evaluation + scripts/ ← Custom run scripts (if needed) + logs/ ← SLURM job logs +``` + +When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run: + +```bash +ls workspaces/ +``` + +## Unsupported Models + +Models not in the verified support matrices require extra work at each stage: + +| Stage | What can go wrong | Reference | +|-------|-------------------|-----------| +| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` | +| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` | +| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` | + 
+Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too. + +## NEL Evaluation with Custom Deployments + +When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving: + +```yaml +deployment: + command: >- + pip install "transformers>=5.0.0.dev0" --pre -q && + sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py && + ${deployment.base_command} +``` + +This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`). + +## Decision: NEL SLURM Executor vs NEL CI (JET) + +| Factor | NEL SLURM executor | NEL CI (JET) | +|--------|-------------------|--------------| +| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs | +| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage | +| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets | +| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` | +| **MLflow export** | Manual setup | Automatic | +| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` | diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md index 7c99a5c2a9..2e538fa466 100644 --- a/.claude/skills/common/remote-execution.md +++ b/.claude/skills/common/remote-execution.md @@ -28,6 +28,17 @@ clusters: default_cluster: my-cluster ``` +### Checkpoint and storage availability + +Cluster compute nodes may not share the same filesystem as login nodes or other clusters. 
Before running any workload that references a checkpoint path, verify that the path is accessible from compute nodes:
+
+| Cluster type | Compute-node storage | NOT accessible from compute nodes |
+|-------------|---------------------|----------------------------------|
+| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
+| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |
+
+If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
+
 See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---

diff --git a/.claude/skills/common/slurm-setup.md b/.claude/skills/common/slurm-setup.md
index 37b9fbd56a..f26731d883 100644
--- a/.claude/skills/common/slurm-setup.md
+++ b/.claude/skills/common/slurm-setup.md
@@ -51,6 +51,20 @@ srun \
 "
 ```
+### Container registry credentials (pyxis)
+
+If `srun --container-image` pulls an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs registry credentials on the cluster. Check for existing credentials and add them if missing:
+
+```bash
+cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
+# To add NGC credentials (replace <NGC_API_KEY> with your NGC API key):
+mkdir -p ~/.config/enroot
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+chmod 600 ~/.config/enroot/.credentials
+```
+
+Without these credentials, `srun` fails with `401 Unauthorized` when pulling from `nvcr.io`.
+ Submit and capture the job ID: ```bash diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md index bd32916632..5d85e91186 100644 --- a/.claude/skills/common/workspace-management.md +++ b/.claude/skills/common/workspace-management.md @@ -92,6 +92,21 @@ rsync -a --quiet \ "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT//" ``` +## Cross-Skill Workspace Flow + +Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory: + +```text +workspaces/model-name-format/ + output/ ← PTQ: quantized checkpoint + eval_results/ ← Evaluation: NEL artifacts (results.yml per task) + eval_config.yaml ← Evaluation: NEL config + scripts/ ← Deployment/PTQ: custom run scripts + logs/ ← All: SLURM job logs +``` + +See `skills/common/end-to-end-workflow.md` for the full pipeline. + ## Example Flow ```text @@ -104,6 +119,10 @@ User: "deploy the model I just quantized" Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/ +User: "evaluate the quantized model on MMLU and GSM8K" +Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" + → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/ + User: "now quantize Llama-3.1-8B with fp8" Agent: ls workspaces/ → no llama → mkdir workspaces/llama-3.1-8b-fp8 diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index f8eab5561b..5174e7befa 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -12,10 +12,12 @@ license: Apache-2.0 You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below. -### Workspace (multi-user / Slack bot) +### Workspace and Pipeline Integration If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. 
Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications. +This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. See `skills/common/end-to-end-workflow.md` for the full pipeline. + ### Workflow ```text @@ -286,6 +288,19 @@ After job submission, you can monitor progress using: --- +### NEL CI and Cluster-Specific Notes + +For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`. It covers: +- NEL CI GitLab trigger pattern vs NEL SLURM executor +- Cluster-specific GPU counts and storage paths +- Checkpoint availability (compute nodes may not share login node filesystems) +- Environment variable prefixes (`host:`, `lit:`) for SLURM executor +- SGLang must bind `--host 0.0.0.0` for health checks +- Directory setup and `chmod 777` for JET service account access +- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting) + +--- + Direct users with issues to: - **GitHub Issues:** diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md new file mode 100644 index 0000000000..846d0236c8 --- /dev/null +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -0,0 +1,276 @@ +# NEL CI Evaluation Guide + +NEL CI is the recommended entry point for running evaluations on NVIDIA JET infrastructure. This guide covers patterns for evaluating quantized checkpoints using both the NEL SLURM executor (direct) and the NEL CI GitLab pipeline. + +Reference repo: `gitlab-master.nvidia.com/dl/JoC/competitive_evaluation/nemo-evaluator-launcher-ci` + +--- + +## 1. 
Two Execution Paths
+
+| Path | When to use | How it works |
+|------|-------------|--------------|
+| **NEL SLURM executor** | You have SSH access to the cluster and the checkpoint is on cluster storage | `nel run --config config.yaml` from your workstation; NEL SSHes to the cluster and submits sbatch jobs |
+| **NEL CI GitLab pipeline** | You want managed infrastructure, MLflow export, reproducible configs | Trigger via GitLab API or UI; JET orchestrates everything |
+
+### NEL SLURM executor
+
+Best for iterative development and debugging. Run from any machine with SSH access to the cluster:
+
+```bash
+export DUMMY_API_KEY=dummy
+export HF_TOKEN=<your-hf-token>
+
+nel run --config eval_config.yaml \
+  -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10  # test first
+```
+
+### NEL CI trigger
+
+Best for production evaluations with MLflow tracking. See the trigger script pattern in section 4.
+
+---
+
+## 2. Cluster Reference
+
+| Cluster | GPUs/Node | Architecture | Max Walltime | Storage | Notes |
+|---------|-----------|-------------|--------------|---------|-------|
+| oci-hsg | 4 | GB200 | 4 hours | `/lustre/` | Set `tensor_parallel_size=4` |
+| cw | 8 | H100 | — | `/lustre/` | — |
+| oci-nrt | 8 | H100 | — | `/lustre/` | Numerics configs |
+| dlcluster | 4 (B100 partition) | B100 | 8 hours | `/home/omniml_data_*` | No `/lustre/`; use local NFS paths |
+
+**Important**: `deployment.tensor_parallel_size` determines how many GPUs are requested. If it exceeds the cluster's GPUs per node, the job fails with a memory allocation error.
+
+---
+
+## 3. Checkpoint Availability
+
+The checkpoint must be on a filesystem accessible from the cluster's **compute nodes** (not just login nodes).
+
+| Cluster type | Accessible storage | NOT accessible |
+|-------------|-------------------|----------------|
+| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation paths (`/home/scratch.*`), NFS mounts from other clusters |
+| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` (not available) |
+
+If the checkpoint is on a workstation, **copy it to cluster storage first**:
+
+```bash
+rsync -av /path/to/local/checkpoint \
+  <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
+```
+
+**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters cannot SSH to each other directly, pipe through your workstation without staging to disk:
+
+```bash
+ssh user@source-cluster "tar czf - -C /path/to/checkpoint ." | \
+  ssh user@target-cluster "tar xzf - -C /lustre/.../checkpoints/model-name"
+```
+
+After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name`
+
+For dlcluster, checkpoint paths are directly accessible because the NFS mounts are shared between login and compute nodes.
+
+---
+
+## 4. NEL CI Trigger Pattern
+
+For JET clusters, trigger evaluations via the GitLab API.
+
+### Simple deployment (standard models)
+
+For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly:
+
+```bash
+export GITLAB_TOKEN=<your-gitlab-token>
+
+curl -k --request POST \
+  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
+  --header "Content-Type: application/json" \
+  --data '{
+    "ref": "main",
+    "variables": [
+      {"key": "NEL_CONFIG_PATH", "value": "configs/AA/minimax_m2_5_lbd_lax.yaml"},
+      {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"},
+      {"key": "NEL_CLUSTER", "value": "oci-hsg"},
+      {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"},
+      {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"},
+      {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"},
+      {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"},
+      {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"},
+      {"key": "NEL_VLLM_CACHE", "value": "/lustre/.../cache/vllm"},
+      {"key": "NEL_CLUSTER_OUTPUT_DIR", "value": "/lustre/.../nv-eval-rundirs"}
+    ]
+  }' \
+  "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"
+```
+
+### Complex deployment (unsupported models needing runtime patches)
+
+If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc.
+
+Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it.
+
+**Step 1** — Write the wrapper script to the checkpoint directory on the cluster:
+
+```bash
+ssh <cluster-login> 'cat > /lustre/.../checkpoint/serve.sh << '"'"'EOF'"'"'
+#!/bin/bash
+set -e
+pip install "transformers>=5.0.0.dev0" "huggingface_hub>=0.32.0" --pre -q
+# Patch vLLM for ministral3 support (example)
+MISTRAL3_PY=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1)
+sed -i "s/old_pattern/new_pattern/" "$MISTRAL3_PY"
+exec vllm serve /checkpoint --host 0.0.0.0 --port 8000 \
+  --tensor-parallel-size 4 --quantization modelopt_fp4 \
+  --trust-remote-code --served-model-name my-model --gpu-memory-utilization 0.9
+EOF
+chmod 777 /lustre/.../checkpoint/serve.sh'
+```
+
+**Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper:
+
+```json
+{"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}
+```
+
+This works because the checkpoint is mounted at `/checkpoint` inside the container, and the script is Hydra-safe (no special characters in the override value).
+
+### Custom configs with `NEL_CONFIG_BASE64`
+
+When using a custom config (not one from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`. This requires setting `NEL_UNTRUSTED_EVAL=true`:
+
+```python
+import json, base64, subprocess, os
+
+with open("my_config.yaml") as f:
+    config_b64 = base64.b64encode(f.read().encode()).decode()
+
+payload = {
+    "ref": "main",
+    "variables": [
+        {"key": "NEL_CONFIG_BASE64", "value": config_b64},
+        {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"},
+        {"key": "NEL_CLUSTER", "value": "oci-hsg"},
+        {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"},
+        {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"},
+        {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"},
+        {"key": "NEL_UNTRUSTED_EVAL", "value": "true"},
+        # ...
other variables
+    ]
+}
+
+# Use Python to construct JSON (avoids shell escaping issues with curl)
+token = os.environ["GITLAB_TOKEN"]
+subprocess.run(
+    ["curl", "-k", "--request", "POST",
+     "--header", f"PRIVATE-TOKEN: {token}",
+     "--header", "Content-Type: application/json",
+     "--data", json.dumps(payload),
+     "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"],
+)
+```
+
+> **Tip**: Use Python (not bash) to construct the JSON payload for `curl`. Shell escaping of base64 strings and nested quotes is error-prone.
+
+---
+
+## 5. Environment Variables
+
+### SLURM executor format
+
+Env vars in NEL SLURM configs require explicit prefixes:
+
+| Prefix | Meaning | Example |
+|--------|---------|---------|
+| `host:VAR_NAME` | Read from the host environment where `nel run` is executed | `host:HF_TOKEN` |
+| `lit:value` | Literal string value | `lit:dummy` |
+
+```yaml
+evaluation:
+  env_vars:
+    DUMMY_API_KEY: host:DUMMY_API_KEY
+    HF_TOKEN: host:HF_TOKEN
+```
+
+### JET executor format
+
+JET configs reference JET secrets with `$SECRET_NAME`:
+
+```yaml
+execution:
+  env_vars:
+    evaluation:
+      HF_TOKEN: $COMPEVAL_HF_TOKEN
+```
+
+### Gated datasets
+
+Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container.
+
+**NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset.
+
+**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access to the GPQA dataset on Hugging Face).
+
+```yaml
+evaluation:
+  env_vars:
+    HF_TOKEN: host:HF_TOKEN  # SLURM executor — reads from your local env
+  tasks:
+    - name: simple_evals.gpqa_diamond
+      env_vars:
+        HF_TOKEN: host:HF_TOKEN
+```
+
+---
+
+## 6.
Serving Framework Notes + +### vLLM + +- Binds to `0.0.0.0` by default — health checks work out of the box +- For NVFP4: `--quantization modelopt_fp4` +- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`) + +### SGLang + +- **Must include `--host 0.0.0.0`** — SGLang defaults to `127.0.0.1` which blocks health checks from the eval client +- Must include `--port 8000` to match NEL's expected port +- For NVFP4: `--quantization modelopt_fp4` + +--- + +## 7. Common Issues + +| Issue | Cause | Fix | +|-------|-------|-----| +| `401 Unauthorized` pulling eval container | NGC credentials not set on cluster | Set up `~/.config/enroot/.credentials` with NGC API key | +| `PermissionError: /hf-cache/...` | HF cache dir not writable by svc-jet | Set `NEL_HF_HOME` to your own `chmod 777` directory | +| Health check stuck at `000` | Server binding to localhost | Add `--host 0.0.0.0` to deployment command (SGLang) | +| `Memory required by task is not available` | TP size exceeds GPUs/node | Set `tensor_parallel_size` to match cluster (4 for oci-hsg, dlcluster B100) | +| TIMEOUT after eval completes | Walltime too short for eval + MLflow export | Set `execution.walltime=04:00:00` | +| Gated dataset auth failure | `HF_TOKEN` not passed to eval container | Add `env_vars.HF_TOKEN` at evaluation or task level | +| `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead | +| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first | +| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | +| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, 
`$()` | Use the wrapper script pattern (section 4): put the script in the checkpoint dir and set the command to `bash /checkpoint/serve.sh` |
+| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct the JSON (section 4) instead of bash heredocs/string interpolation |
+| `The model does not exist` (404) | Eval client uses the checkpoint path as model_id instead of the served model name | Add `deployment.served_model_name=<name>` to `NEL_OTHER_OVERRIDES`, matching `--served-model-name` in your serve command |
+
+---
+
+## 8. Directory Setup for JET Clusters
+
+Before running evaluations on a JET cluster, create writable directories:
+
+```bash
+ssh <cluster-login>
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
+mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
+chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+```
+
+`chmod 777` is required because `svc-jet` (the JET service account) runs the containers and needs write access.

diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md
index 932f62ec2c..79074dbd6e 100644
--- a/.claude/skills/ptq/SKILL.md
+++ b/.claude/skills/ptq/SKILL.md
@@ -113,6 +113,8 @@ ls -lh /
 Report the path and size to the user.
+**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over — see `skills/common/end-to-end-workflow.md` for the full PTQ → Deploy → Eval pipeline. If the model required patches during PTQ (e.g., a transformers upgrade), the same fixes will likely be needed at deployment and evaluation time.
+
 ## Key API Rules
 - `mtq.register()` classes **must** define `_setup()` and call it from `__init__`
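As a footnote to the common-issues table in the NEL CI guide above, the `NEL_OTHER_OVERRIDES` space-splitting pitfall can be reproduced with a toy parser. This is an assumption for illustration (a naive whitespace splitter mirroring the symptom described in the table), not the actual NEL parsing code:

```python
# Toy model of space-separated override parsing. It shows why a value that
# itself contains spaces (e.g. a multi-flag extra_args string) gets torn apart.
def parse_overrides(raw: str) -> dict:
    pairs = raw.split()  # naive whitespace split, the root of the problem
    out = {}
    for pair in pairs:
        if "=" in pair:
            key, _, value = pair.partition("=")
            out[key] = value
    return out


if __name__ == "__main__":
    # Single-token values survive intact:
    print(parse_overrides("deployment.tensor_parallel_size=4 execution.walltime=04:00:00"))
    # A multi-flag value is torn apart; only the first flag stays attached:
    print(parse_overrides("deployment.extra_args=--trust-remote-code --port 8000"))
```

This is why multi-flag serve invocations belong in `NEL_DEPLOYMENT_COMMAND` (or a wrapper `serve.sh`) rather than in `NEL_OTHER_OVERRIDES`.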