From b3a9f30e6da76210ab86a204c78d94735db79dba Mon Sep 17 00:00:00 2001
From: Robert Young
Date: Mon, 20 Apr 2026 10:45:43 -0400
Subject: [PATCH 1/5] netid fix

---
 docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md b/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
index 7d66d6888a..bf498dbe6f 100644
--- a/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
+++ b/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
@@ -62,7 +62,7 @@ Let's first switch to a compute node, so we don't overly tax our login node:
 This uses the `torchvision` package, so you'll need to run it with our overlay file:
 
 ```bash
-[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/rjy1/pytorch-example/my_pytorch.ext3:ro /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
+[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
 ```
 
 After running the command above you should see that it has created a subdirectory named `data` that contains the data we'll use in this example.

From 1cada6cbb8c89ad1aacb28c5443d52ab1c1a8c60 Mon Sep 17 00:00:00 2001
From: Robert Young
Date: Mon, 20 Apr 2026 10:55:31 -0400
Subject: [PATCH 2/5] updated image paths; fixed netid

---
 docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md   | 8 ++++----
 docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md b/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
index bf498dbe6f..e021706289 100644
--- a/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
+++ b/docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
@@ -31,7 +31,7 @@ We also need to add the profiler `kernprof` (in the line_profiler package) to th
 [NetID@log-1 ~]$ srun --pty -c 2 --mem=5GB /bin/bash
 [NetID@cm001 ~]$ singularity exec \
     --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:rw \
-    /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
+    /share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
     /bin/bash
 Singularity> source /ext3/env.sh
 Singularity> pip install line_profiler
@@ -62,7 +62,7 @@ Let's first switch to a compute node, so we don't overly tax our login node:
 This uses the `torchvision` package, so you'll need to run it with our overlay file:
 
 ```bash
-[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
+[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro /share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
 ```
 
 After running the command above you should see that it has created a subdirectory named `data` that contains the data we'll use in this example.
@@ -243,7 +243,7 @@ module purge
 
 srun singularity exec --nv \
     --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro \
-    /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif\
+    /share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif\
     /bin/bash -c "source /ext3/env.sh; kernprof -o ${SLURM_JOBID}.lprof -l mnist_classify.py --epochs=3"
 ```
 
@@ -279,7 +279,7 @@ We installed [line_profiler](https://researchcomputing.princeton.edu/python-prof
 [NetID@log-1 ~]$ srun --pty -c 2 --mem=5GB /bin/bash
 [NetID@cm001 ~]$ singularity exec \
     --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:rw \
-    /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
+    /share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
     /bin/bash -c "source /ext3/env.sh; python -m line_profiler -rmt *.lprof"
 Timer unit: 1e-06 s
 
diff --git a/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md b/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
index 629b443855..1edc87cfae 100644
--- a/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
+++ b/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
@@ -129,7 +129,7 @@ This section provides a comprehensive overview of all environment-related issues
 
 |Problem|Symptom|Cause|Resolution|
 |---|---|---|---|
-|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /scratch/work/public/overlay-fs-ext3/` to verify the correct file: `overlay-50G-10M.ext3.gz`|
+|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /share/apps/overlay-fs-ext3` to verify the correct file: `overlay-50G-10M.ext3.gz`|
 |Compressed overlay used directly|`FATAL: while loading overlay images...`|Attempted to use `.gz` file directly with Singularity|Run `gunzip overlay-50G-10M.ext3.gz` before using the file|
 |Overlay missing in working directory|sbatch cannot find the overlay file|Overlay not copied to the training directory|Ensure the overlay file is placed in `/scratch//fine-tune/` where sbatch accesses it|
 |Invalid overlay structure|`FATAL: could not create upper dir`|Overlay created via `fallocate` + `mkfs.ext3`, missing necessary internal structure|Always use `singularity overlay create --size 25000` to create overlays|

From c4a2c32f39db26663e353ab8c407f788fa9a9a11 Mon Sep 17 00:00:00 2001
From: Robert Young
Date: Mon, 20 Apr 2026 12:17:37 -0400
Subject: [PATCH 3/5] path fix

---
 docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md b/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
index 1edc87cfae..35c5baa75f 100644
--- a/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
+++ b/docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
@@ -129,7 +129,7 @@ This section provides a comprehensive overview of all environment-related issues
 
 |Problem|Symptom|Cause|Resolution|
 |---|---|---|---|
-|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /share/apps/overlay-fs-ext3` to verify the correct file: `overlay-50G-10M.ext3.gz`|
+|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /share/apps/images` to verify the correct file: `overlay-50G-10M.ext3.gz`|
 |Compressed overlay used directly|`FATAL: while loading overlay images...`|Attempted to use `.gz` file directly with Singularity|Run `gunzip overlay-50G-10M.ext3.gz` before using the file|
 |Overlay missing in working directory|sbatch cannot find the overlay file|Overlay not copied to the training directory|Ensure the overlay file is placed in `/scratch//fine-tune/` where sbatch accesses it|
 |Invalid overlay structure|`FATAL: could not create upper dir`|Overlay created via `fallocate` + `mkfs.ext3`, missing necessary internal structure|Always use `singularity overlay create --size 25000` to create overlays|

From 059ec2cd8de8d7563861421728419c393e47f48d Mon Sep 17 00:00:00 2001
From: Robert Young
Date: Mon, 20 Apr 2026 15:31:59 -0400
Subject: [PATCH 4/5] removed watch tip

---
 docs/hpc/06_tools_and_software/08_utils.mdx | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/docs/hpc/06_tools_and_software/08_utils.mdx b/docs/hpc/06_tools_and_software/08_utils.mdx
index a6a837f93f..8972d16553 100644
--- a/docs/hpc/06_tools_and_software/08_utils.mdx
+++ b/docs/hpc/06_tools_and_software/08_utils.mdx
@@ -65,14 +65,6 @@ It will provide detailed information like:
 - P-States: Performance states from P0 (max performance) to P12 (minimum idle)
 - device details like power consumption and temperature
 
-:::tip
-To monitor running jobs, use:
-```bash
-[NetID@gl001 ~]$ watch -n 1 nvidia-smi
-```
-on the compute node
-:::
-
 :::tip
 You can get very detailed information about the GPU with:
 ```bash

From d1fa2f4768c05a3ca480c4179eb8f5daf56054ac Mon Sep 17 00:00:00 2001
From: Robert Young
Date: Tue, 21 Apr 2026 09:52:55 -0400
Subject: [PATCH 5/5] added new nvidia-smi tip from email

---
 docs/hpc/06_tools_and_software/08_utils.mdx | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/hpc/06_tools_and_software/08_utils.mdx b/docs/hpc/06_tools_and_software/08_utils.mdx
index 8972d16553..bef0a79bf9 100644
--- a/docs/hpc/06_tools_and_software/08_utils.mdx
+++ b/docs/hpc/06_tools_and_software/08_utils.mdx
@@ -65,6 +65,17 @@ It will provide detailed information like:
 - P-States: Performance states from P0 (max performance) to P12 (minimum idle)
 - device details like power consumption and temperature
 
+:::tip
+You can get output refreshed every 5 seconds with:
+```bash
+nvidia-smi -l 5
+```
+Alternatively, you can use:
+```bash
+/share/apps/images/run-nvtop-3.2.0.bash nvtop
+```
+:::
+
 :::tip
 You can get very detailed information about the GPU with:
 ```bash