9 changes: 6 additions & 3 deletions docs/hpc/06_tools_and_software/08_utils.mdx
@@ -66,11 +66,14 @@ It will provide detailed information like:
- device details like power consumption and temperature

:::tip
To monitor running jobs, use:
You can get output refreshed every 5 seconds with:
```bash
[NetID@gl001 ~]$ watch -n 1 nvidia-smi
nvidia-smi -l 5
```
Alternatively, you can use:
```bash
/share/apps/images/run-nvtop-3.2.0.bash nvtop
```
on the compute node
:::

:::tip
8 changes: 4 additions & 4 deletions docs/hpc/08_ml_ai_hpc/02_pytorch_intro.md
@@ -31,7 +31,7 @@ We also need to add the profiler `kernprof` (in the line_profiler package) to th
[NetID@log-1 ~]$ srun --pty -c 2 --mem=5GB /bin/bash
[NetID@cm001 ~]$ singularity exec \
--overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:rw \
/scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
/share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
/bin/bash
Singularity> source /ext3/env.sh
Singularity> pip install line_profiler
@@ -62,7 +62,7 @@ Let's first switch to a compute node, so we don't overly tax our login node:

This uses the `torchvision` package, so you'll need to run it with our overlay file:
```bash
[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/rjy1/pytorch-example/my_pytorch.ext3:ro /scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
[NetID@cm001 pytorch_single_gpu]$ singularity exec --nv --overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro /share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif /bin/bash -c "source /ext3/env.sh; python download_data.py"
```
After running the command above you should see that it has created a subdirectory named `data` that contains the data we'll use in this example.

@@ -243,7 +243,7 @@ module purge

srun singularity exec --nv \
--overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:ro \
/scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif\
/share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif\
/bin/bash -c "source /ext3/env.sh; kernprof -o ${SLURM_JOBID}.lprof -l mnist_classify.py --epochs=3"
```

@@ -279,7 +279,7 @@ We installed [line_profiler](https://researchcomputing.princeton.edu/python-prof
[NetID@log-1 ~]$ srun --pty -c 2 --mem=5GB /bin/bash
[NetID@cm001 ~]$ singularity exec \
--overlay /scratch/NetID/pytorch-example/my_pytorch.ext3:rw \
/scratch/work/public/singularity/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
/share/apps/images/cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif \
/bin/bash -c "source /ext3/env.sh; python -m line_profiler -rmt *.lprof"
Timer unit: 1e-06 s

2 changes: 1 addition & 1 deletion docs/hpc/08_ml_ai_hpc/05_llm_fine_tuning.md
@@ -129,7 +129,7 @@ This section provides a comprehensive overview of all environment-related issues

|Problem|Symptom|Cause|Resolution|
|---|---|---|---|
|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /scratch/work/public/overlay-fs-ext3/` to verify the correct file: `overlay-50G-10M.ext3.gz`|
|Incorrect overlay filename|No such file: `overlay-50GB-500K.ext3.gz`|The filename was incorrectly assumed|Use `ls /share/apps/images` to verify the correct file: `overlay-50G-10M.ext3.gz`|
|Compressed overlay used directly|`FATAL: while loading overlay images...`|Attempted to use `.gz` file directly with Singularity|Run `gunzip overlay-50G-10M.ext3.gz` before using the file|
|Overlay missing in working directory|sbatch cannot find the overlay file|Overlay not copied to the training directory|Ensure the overlay file is placed in `/scratch/<NetID>/fine-tune/` where sbatch accesses it|
|Invalid overlay structure|`FATAL: could not create upper dir`|Overlay created via `fallocate` + `mkfs.ext3`, missing necessary internal structure|Always use `singularity overlay create --size 25000` to create overlays|