
Tk/infra #2238 (Draft)

terrykong wants to merge 21 commits into main from tk/infra

Conversation

@terrykong
Collaborator

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Local K8s GPU dev environment using nvkind (NVIDIA's kind wrapper):
- nvkind cluster setup scripts (install-nvkind.sh, create-cluster.sh)
- Custom config template with extraMounts for dev code mounting
- Helmfile with kind/prod environments (device plugin vs GPU operator)
- KAI scheduler for gang scheduling, KubeRay for RayCluster management
- Example manifests: gang-scheduled pods, RayClusters, SFT RayJobs
- SETUP.md with prerequisites, quick start, and architecture docs

Tested: SFT RayJob (train/loss 4.06 < 5.9), KAI all-or-nothing
gang scheduling, two simultaneous 1-GPU SFT jobs.
Add optional remote_gym_url to NemoGymConfig. When set, the NemoGym
Ray actor connects to an external Gym HTTP service instead of spawning
local subprocesses. Colocated mode (default) is unchanged.

- nemo_gym.py: split __init__ into remote/colocated paths
- run_grpo_nemo_gym.py: support env.remote_gym_url and env.disagg_job_id
- Gym submodule: standalone_server.py entry point with K8s endpoint
  registry integration, use_absolute_ip for cross-pod communication
- gym_standalone_config.yaml: example config for standalone server

Tested: disaggregated GRPO completed 3 training steps with RL on one
RayCluster (2 GPU) and Gym on a separate RayCluster (CPU only).
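
For illustration only, a minimal sketch of the remote/colocated split described above. The field and method names here (remote_gym_url aside, which comes from the commit message) and the /health probe are assumptions for this sketch, not the actual nemo_gym.py implementation.

```python
from typing import Optional, TypedDict

import requests  # only used on the remote path in this sketch


class NemoGymConfig(TypedDict, total=False):
    # Hypothetical fields; the real config lives in nemo_gym.py.
    remote_gym_url: Optional[str]
    num_local_workers: int


class NemoGym:
    def __init__(self, cfg: NemoGymConfig):
        remote_url = cfg.get("remote_gym_url")
        if remote_url:
            # Remote path: attach to an already-running Gym HTTP service.
            self._init_remote(remote_url)
        else:
            # Colocated path (default): spawn local subprocesses as before.
            self._init_colocated(cfg)

    def _init_remote(self, base_url: str) -> None:
        # Fail fast if the external service is unreachable.
        # The /health endpoint is hypothetical.
        requests.get(f"{base_url}/health", timeout=10).raise_for_status()
        self.base_url = base_url

    def _init_colocated(self, cfg: NemoGymConfig) -> None:
        # Placeholder for the original subprocess-spawning logic.
        self.base_url = "http://127.0.0.1:8000"
```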
…overy

Each (RL, Gym) job pair shares a ConfigMap for dynamic address exchange.
Both sides register their IP:port and poll for the peer's address.
The ConfigMap has an ownerReference to the RL RayCluster for automatic
garbage collection on teardown.

- k8s_endpoint_registry.py: create/set/get/get_nowait methods with race
  condition handling (409 retry) and proper error propagation
- endpoint-registry-rbac.yaml: ServiceAccount + Role + RoleBinding
- disagg_rl_raycluster.yaml: RL cluster with serviceAccountName
- disagg_gym_raycluster.yaml: Gym cluster with serviceAccountName

Tested: ConfigMap CRUD verified in-cluster, bidirectional URL exchange
between RL and Gym clusters confirmed working.
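
A rough sketch of the ConfigMap-backed registry described above, using the official kubernetes Python client. The method names mirror the commit message; the ConfigMap layout, the placement of the 409 handling, and the polling interval are assumptions.

```python
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


class K8sEndpointRegistry:
    """Sketch: ConfigMap-backed address exchange between the RL and Gym clusters."""

    def __init__(self, name: str, namespace: str):
        config.load_incluster_config()  # runs in-pod with the ServiceAccount from the RBAC manifest
        self.api = client.CoreV1Api()
        self.name, self.namespace = name, namespace

    def create(self) -> None:
        cm = client.V1ConfigMap(metadata=client.V1ObjectMeta(name=self.name), data={})
        try:
            self.api.create_namespaced_config_map(self.namespace, cm)
        except ApiException as e:
            if e.status != 409:
                raise  # 409 = the peer created it first, which is fine

    def set(self, key: str, value: str) -> None:
        # Strategic merge patch of a single key; concurrent writers patch disjoint keys.
        body = {"data": {key: value}}
        self.api.patch_namespaced_config_map(self.name, self.namespace, body)

    def get_nowait(self, key: str):
        cm = self.api.read_namespaced_config_map(self.name, self.namespace)
        return (cm.data or {}).get(key)

    def get(self, key: str, poll_s: float = 5.0):
        # Block until the peer has registered its address.
        while True:
            value = self.get_nowait(key)
            if value is not None:
                return value
            time.sleep(poll_s)
```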
When RL and Gym run on separate RayClusters, either cluster failing or
being deleted triggers teardown of both clusters to release resources.

- peer-watcher.py: pure Python sidecar (no deps beyond stdlib), deployed
  as a ConfigMap volume mount on each head pod
- Monitors peer RayCluster status via K8s API (polls every 10s)
- Tears down after MAX_PEER_FAILURES (default 3) consecutive failures
- Also monitors ConfigMap "error" key for application-level error signaling
- Handles transient K8s API errors as failures (not false-healthy)
- Added signal_error() to K8sEndpointRegistry
- Updated disagg manifests with peer-watcher sidecar containers
- Updated RBAC with "delete" verb for rayclusters

Tested: deleting either cluster triggers teardown of both within ~10s.
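
A stdlib-only sketch of the peer-watcher loop, assuming the peer is a ray.io/v1 RayCluster and that any API error counts toward the failure budget rather than as healthy; the exact status fields the real sidecar inspects, and how it performs the teardown, are assumptions.

```python
import json
import ssl
import sys
import time
import urllib.request

# Assumed environment; the real sidecar gets these from its pod spec.
NAMESPACE = "default"
PEER_RAYCLUSTER = "gym-cluster"
MAX_PEER_FAILURES = 3
POLL_INTERVAL_S = 10

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
API = "https://kubernetes.default.svc"


def peer_is_healthy() -> bool:
    url = f"{API}/apis/ray.io/v1/namespaces/{NAMESPACE}/rayclusters/{PEER_RAYCLUSTER}"
    req = urllib.request.Request(url)
    with open(TOKEN_PATH) as f:
        req.add_header("Authorization", f"Bearer {f.read().strip()}")
    ctx = ssl.create_default_context(cafile=CA_PATH)
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=5) as resp:
            obj = json.load(resp)
    except Exception:
        # Transient K8s API errors count as failures, never as "healthy".
        return False
    # Assumed health criteria: peer exists, is not being deleted, and has not failed.
    being_deleted = obj.get("metadata", {}).get("deletionTimestamp") is not None
    failed = (obj.get("status") or {}).get("state", "") == "failed"
    return not (being_deleted or failed)


def main() -> None:
    failures = 0
    while True:
        failures = 0 if peer_is_healthy() else failures + 1
        if failures >= MAX_PEER_FAILURES:
            print("peer unhealthy; tearing down own RayCluster", file=sys.stderr)
            # The real sidecar presumably deletes its own RayCluster here
            # (hence the "delete" verb added to the RBAC Role).
            sys.exit(1)
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    main()
```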
…share configs

- Kyverno policy: RayCluster/RayJob must have kai.scheduler/queue label.
  Validates at CRD level (not pod) since KubeRay operator creates pods.
  Optional Policy 2 for user→queue access control via ConfigMap.
- kube-prometheus-stack: Prometheus + Grafana for fairshare monitoring.
  Pre-built Grafana dashboard showing GPU allocation vs fair share,
  preemption events, and scheduling latency per queue.
- ServiceMonitors for KAI scheduler, binder, and queue-controller.
- Example queue configs:
  - kai-queue.yaml: 2-GPU kind cluster (2 teams, equal quotas)
  - kai-queue-prod.yaml: 256-GPU prod (3 departments, 6 teams)
  - preemptMinRuntime: 4h (protect long training runs from priority preemption)
  - reclaimMinRuntime: 15m (fast fairness reclaim of over-quota resources)
- SETUP.md: fairshare docs, preempt vs reclaim explanation, Grafana access.

Tested: Kyverno rejects a RayCluster without the queue label and accepts one with it.
Team A's 2-GPU job was reclaimed when Team B submitted work against its guaranteed quota.
- Upgrade KAI scheduler v0.13.4 → v0.14.0 (adds Ray topology-aware
  scheduling, segment-size annotation support for PyTorchJob)
- Update chart URL from NVIDIA/KAI-Scheduler to kai-scheduler/KAI-Scheduler
- Fix Grafana dashboard metric names (add kai_ prefix to match actual
  Prometheus metric names). Verified: Grafana queries return live data.
- New: extensions/k8s_cli/ — standalone Python CLI (pip installable):
  - nrl-k8s fairshare — show queue config (quota, limit, weight, priority)
  - nrl-k8s occupancy — show GPU allocation per node and per queue
  - nrl-k8s submit — submit gang-scheduled RayJob with optional
    --segment-size for topology-aware scheduling
  - 6 unit tests (mocked K8s API), all passing
- Add TODO for NVL72 topology testing with links to relevant PRs/issues

Tested: KAI v0.14.0 gang scheduling works, CLI commands verified against
live cluster, Grafana dashboard loads and queries return data.
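
For reference, the occupancy query boils down to something like the following with the kubernetes Python client (the CLI itself is dropped later in this branch); this is an illustrative sketch, not the CLI's actual code.

```python
from collections import defaultdict

from kubernetes import client, config


def gpu_occupancy_per_node() -> dict:
    """Sum requested nvidia.com/gpu per node across running pods."""
    config.load_kube_config()  # or load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()

    used = defaultdict(int)
    pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running")
    for pod in pods.items:
        for c in pod.spec.containers:
            req = (c.resources.requests or {}) if c.resources else {}
            used[pod.spec.node_name] += int(req.get("nvidia.com/gpu", 0))

    capacity = {
        node.metadata.name: int((node.status.allocatable or {}).get("nvidia.com/gpu", 0))
        for node in v1.list_node().items
    }
    return {name: (used.get(name, 0), total) for name, total in capacity.items()}


if __name__ == "__main__":
    for node, (allocated, total) in gpu_occupancy_per_node().items():
        print(f"{node}: {allocated}/{total} GPUs allocated")
```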
@copy-pr-bot

copy-pr-bot bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Merge disagg_rl_raycluster.yaml + disagg_gym_raycluster.yaml into
  single disagg-rayclusters.yaml (always deployed together)
- Inline peer-watcher Python script directly in sidecar container args
  (eliminates ConfigMap setup step, each deployment is self-contained)
- Remove 7 redundant workload YAMLs (sft_rayjob, kai_scheduled_*,
  raycluster-blocker, standalone peer-watcher.py)
- Update SETUP.md: simplified quick start, updated architecture tree,
  removed ConfigMap peer-watcher setup step

15 files → 8 files in examples/. Infrastructure configs unchanged.

Tested: inlined peer-watcher works — deleting either cluster triggers
teardown of both within 10s.
- Remove extensions/k8s_cli/ (not needed for now)
- Rename queues: org → root-org, priority-team → high-prio,
  community → low-prio
- Simplify Kyverno policy comments
- Update SETUP.md to remove CLI section

Signed-off-by: Terry Kong <terryk@nvidia.com>
Not needed right now — queue enforcement can be added back later
if required.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Allows iterating on the server without committing to the Gym repo.
The script imports from nemo_gym at runtime (same container image).

Run with: uv run --extra nemo_gym python -m nemo_rl.distributed.standalone_gym_server
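
For orientation, a hedged sketch of what such an entry point might do before starting the HTTP app: build the externally reachable URL (the use_absolute_ip behavior mentioned earlier) and publish it for the RL side. Every name below is hypothetical.

```python
import os
import socket


def advertised_url(port: int, use_absolute_ip: bool = True) -> str:
    """Build the URL other pods should use to reach this server.

    With use_absolute_ip, advertise the pod IP so pods in a different
    RayCluster (different DNS domain) can still reach us.
    """
    if use_absolute_ip:
        host = os.environ.get("POD_IP") or socket.gethostbyname(socket.gethostname())
    else:
        host = socket.gethostname()
    return f"http://{host}:{port}"


def main() -> None:
    port = int(os.environ.get("GYM_PORT", "8080"))  # hypothetical env var
    url = advertised_url(port)

    # Hypothetical: publish our URL for the RL cluster to discover
    # (see k8s_endpoint_registry.py), then start the Gym HTTP app.
    # registry = K8sEndpointRegistry(name=..., namespace=...)
    # registry.set("gym_url", url)
    print(f"Gym standalone server advertising {url}")


if __name__ == "__main__":
    main()
```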

Signed-off-by: Terry Kong <terryk@nvidia.com>
Three workload deployment patterns:
1. rayjob-monolithic.yaml — single-cluster RayJob (1 GPU, KubeRay)
2. disagg-rayclusters.yaml — two KubeRay RayClusters + peer-watcher
3. disagg-jobset.yaml — single JobSet with native failure/startup policies

Also adds JobSet controller (v0.11.1) to the helmfile.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Shows the full Ray cluster pattern for Gym: separate head and worker
pods within the JobSet, with dependsOn ordering and DNS discovery.

Signed-off-by: Terry Kong <terryk@nvidia.com>
KAI gang-schedules all pods in a JobSet together (one PodGroup with
minMember=total pods). This deadlocks with dependsOn: KAI waits for
all pods to exist, but JobSet won't create dependent pods until the
head is Ready. Fix: drop dependsOn, use init containers that poll
ray health-check (same pattern KubeRay uses).
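
The init container itself is a shell loop in the manifest; purely for illustration, the equivalent wait logic in Python, assuming the head's GCS is reachable at a placeholder address on the usual port 6379.

```python
import subprocess
import sys
import time

# Placeholder head address; in the JobSet manifest this comes from
# the head pod's DNS name.
HEAD_ADDRESS = "rl-head:6379"


def wait_for_head(timeout_s: int = 600) -> None:
    """Poll `ray health-check` until the head's GCS answers, like the init container."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run(
            ["ray", "health-check", "--address", HEAD_ADDRESS],
            capture_output=True,
        )
        if result.returncode == 0:
            return
        time.sleep(5)
    sys.exit(f"Ray head at {HEAD_ADDRESS} not healthy after {timeout_s}s")


if __name__ == "__main__":
    wait_for_head()
```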

Tested: all 6 pods schedule, init containers wait for heads, driver
submits a Ray job successfully, successPolicy triggers on driver
exit 0, failurePolicy tears down everything on gym-head crash.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
- Add git safe.directory for Gym submodule (uv build fails otherwise)
- Add uv pip install kubernetes (needed by endpoint registry)
- Increase readiness probe failureThreshold (uv install takes time)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Replace --working-dir and --runtime-env-json with a simple cd wrapper.
Both --working-dir and runtime_env.working_dir cause Ray to zip and
upload the entire directory to GCS, which is extremely slow for large
repos (1GB+). Since the code is already on all nodes via hostPath,
wrapping the entrypoint with cd avoids the upload entirely.

Before: ray job submit --working-dir /workspace/nemo-rl -- python ...
  → scans entire tree, uploads 1GB+ to GCS, takes minutes

After: ray job submit -- bash -c "cd /workspace/nemo-rl && python ..."
  → instant submission, no upload
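
The same pattern through Ray's Python job submission API, for reference; the dashboard address and script path are placeholders. Note there is no working_dir in the runtime env, so nothing is zipped or uploaded.

```python
import time

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://rl-head:8265")  # placeholder dashboard address

# cd into the hostPath-mounted repo instead of letting Ray upload it as a working_dir.
entrypoint = (
    "bash -c 'cd /workspace/nemo-rl && "
    "uv run python examples/run_grpo_nemo_gym.py'"
)

job_id = client.submit_job(
    entrypoint=entrypoint,
    # A timestamped submission_id avoids GCS collisions across redeploys.
    submission_id=f"grpo-{int(time.time())}",
    # No runtime_env working_dir => no zip-and-upload of the 1GB+ repo.
)
print("submitted", job_id)
```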
Signed-off-by: Terry Kong <terryk@nvidia.com>
…pefail

- Use timestamp-based submission_id to avoid GCS collision across redeploys
- Disable wandb (not configured in kind dev cluster)
- Add set -eo pipefail for proper exit code propagation through tee
- Persist driver logs to hostPath for post-mortem debugging

Tested: GRPO training loads config, connects to Ray cluster, loads
datasets, initializes compute cluster. Fails with "Not enough GPUs"
(expected — kind cluster has 2 GPUs, config expects 8).

Signed-off-by: Terry Kong <terryk@nvidia.com>
- Add active development disclaimer (GitHub admonition)
- Add production guidance (adapt manifests, use Terraform, not helmfile)
- Document colocated vs disaggregated architecture with diagrams
- Compare KubeRay RayClusters vs JobSet for disagg deployment
- Explain why ConfigMap is still needed for JobSet (vLLM URL exchange)
- Document dependsOn + KAI deadlock and init container workaround
- Add local kind testing instructions
- Add comparison table (failure cascading, gang scheduling, discovery)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Add all parallelism overrides needed for Qwen3-0.6B on a 2-GPU cluster:
- tensor_model_parallel_size=1, pipeline=1, expert=1, context=1
- sequence_parallel=false (requires TP>1)
- colocated.enabled=false, gpus_per_node=1
- max_new_tokens=512, max_total_sequence_length=512
- max_num_steps=2 for quick smoke testing

Tested: GRPO initializes vLLM workers, captures CUDA graphs, starts
Megatron LM workers. Fails at k8s_endpoint_registry import because
the container image predates the tk/infra branch — the hostPath mount
has newer code than the baked-in worker venvs. Will work with a
matching container build.

Signed-off-by: Terry Kong <terryk@nvidia.com>