HPC Node SSH Access

This repository provides scripts to start a user-level SSH server on an allocated compute node (Slurm, PBS, or LSF). This enables direct SSH access to the node from your local machine, tunneled through the cluster's login node.

Why use this?

  1. Network Access: On many high-performance computing (HPC) clusters, compute nodes are isolated on a private network and cannot be accessed directly from the internet.
  2. Resource Limits: SSH sessions opened through a compute node's system sshd typically land outside the scheduler's cgroups, so your processes can escape the CPU, GPU, and memory limits you requested and potentially disrupt other users on the node.

By running a user-level sshd instance within your job allocation, you ensure that your session respects the scheduler's resource limits while gaining full network access via the login node.
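
A quick way to confirm this once you are connected through the user-level sshd (exact output depends on your cluster and what you requested):

# Run inside the SSH session; values should reflect your allocation, not the whole node
nproc                    # CPUs visible to the job
nvidia-smi -L            # GPUs visible (if you requested any)
cat /proc/self/cgroup    # on cgroup-based setups, shows the session sits inside the job's cgroup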

This setup is particularly useful for:

  • VS Code Remote - SSH: Run the full VS Code server (or variants like VSCodium or Cursor) on the compute node. This gives you a seamless development environment with access to extensions, the debugger, and the node's hardware (e.g., GPUs), without competing for resources on the login node.
  • Port Forwarding: Easily tunnel web UIs like Jupyter, TensorBoard, or Dask dashboards to your local browser.
  • File Transfer: Use standard SFTP tools to move data directly to/from the node's local scratch space.
  • Workflow Flexibility: Use any SSH-compatible tool (JetBrains Gateway, Neovim, Emacs TRAMP, rsync) as if the node were a local server on your network.
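
For example, once the hpc-bridge host is configured (see the Connect from Local Machine section below), standard transfer tools work against the node directly; the paths here are illustrative:

# Illustrative paths; adjust to your cluster's scratch layout
rsync -avz ./dataset/ hpc-bridge:/scratch/myproject/dataset/
sftp hpc-bridge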

Overview

The process involves:

  1. Allocating a compute node via your workload manager (Slurm, PBS, etc.).
  2. Starting a user-level SSH daemon (sshd) on the allocated node using start_ssh.sh.
  3. Connecting from your local machine, typically tunneling through the login node.
  4. Cleaning up with teardown_ssh.sh when finished.

Prerequisites

  • Access to an HPC cluster (Slurm, PBS, or LSF). Note: This project has primarily been tested with Slurm.
  • SSH public key set up on the login node for passwordless access (recommended, see Setting Up SSH Keys).
  • The start_ssh.sh and teardown_ssh.sh scripts available on the cluster (e.g., in your home directory or project space).

A Note on Shared Filesystems: These scripts assume your home directory (~) is shared between the login node and the allocated compute node. If your cluster does not have a shared home directory, please see the Manual Setup for Non-Shared Home Directories section below.

Installation

Clone the repository to your home directory (or project space) on the cluster:

git clone https://github.com/scottworkman/hpc-bridge.git
cd hpc-bridge

Usage

1. Allocate a Node

First, request an interactive allocation on the cluster using your scheduler.

Slurm Example:

salloc --gres=gpu:1 --nodelist=node-01

Once the allocation is granted, ensure you are running a shell on the allocated node. Depending on your cluster's configuration, the allocation command might leave you on the login node. You can verify this by checking the hostname; if you are not on the compute node, shell into it using your scheduler's command (e.g., for Slurm: srun --pty bash).
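
For example, on a Slurm cluster:

hostname            # should print the compute node (e.g., node-01), not the login node
srun --pty bash     # only needed if you are still on the login node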

Other Schedulers

Note: These scripts support Slurm, PBS, and LSF by detecting their respective Job ID environment variables. However, due to a lack of access to PBS or LSF environments, the specific allocation commands below are untested and provided for reference only.

PBS:

qsub -I -l select=1:ngpus=1:host=node-01

LSF:

bsub -Is -m "node-01" -gpu "num=1" /bin/bash

2. Start the SSH Server

Run the start script to initialize the SSH server. This must be run inside the job allocation.

./start_ssh.sh

What this script does:

  • Verifies it is running inside a valid job allocation (checks SLURM_JOB_ID, PBS_JOBID, or LSB_JOBID).
  • Generates a persistent host key at ~/.ssh/hpc_sshd_ed25519_key (if it doesn't exist). This prevents "Remote Host Identification Changed" warnings when connecting to different nodes over time.
  • Creates a session-specific temporary directory /tmp/sshd_${USER}_${JOB_ID}.
  • Starts a private sshd instance listening on port 2222 (default). Can be overridden by passing a port number (e.g., ./start_ssh.sh 2223).
  • Disables password authentication, PAM, and root login for safety.
  • Enables SFTP subsystem for file transfer.
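
Conceptually, the launch looks something like the sketch below. This is not the script's exact invocation (flags, logging, and the SFTP setup may differ); it only illustrates a user-level sshd started with the persistent host key (Slurm variable shown):

# Rough sketch only; see start_ssh.sh for the real invocation
RUNDIR=/tmp/sshd_${USER}_${SLURM_JOB_ID}
mkdir -p "$RUNDIR"
/usr/sbin/sshd -p 2222 \
    -h ~/.ssh/hpc_sshd_ed25519_key \
    -o PasswordAuthentication=no \
    -o UsePAM=no \
    -o PermitRootLogin=no \
    -E "$RUNDIR/sshd.log"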

3. Connect from Local Machine

To connect, configure your local SSH client to proxy the connection through the login node.

Add the following to your local ~/.ssh/config:

Host hpc-login                   # Your standard login node config (likely already exists)
    HostName login.example.com
    User YOUR_USERNAME

Host hpc-bridge
    HostName NODE_NAME           # Replace with the actual hostname (e.g., node-01)
    User YOUR_USERNAME
    Port 2222                    # Or the custom port you specified
    ProxyJump hpc-login          # Jump through the login node

    # Optional: Forward ports (e.g., Jupyter, TensorBoard)
    LocalForward 8888 localhost:8888
    LocalForward 6006 localhost:6006

Note: You will need to update the HostName (i.e., replace NODE_NAME with your actual compute node name) in your config each time you get a different allocation.
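
If you only connect from a terminal, one way to avoid editing the file for every new allocation is to override the hostname on the command line (node-02 is just an example):

ssh -o HostName=node-02 hpc-bridge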

Connecting:

Terminal:

ssh hpc-bridge

VS Code:

  1. Open VS Code.
  2. Open the Command Palette (Ctrl+Shift+P / Cmd+Shift+P).
  3. Select Remote-SSH: Connect to Host....
  4. Choose hpc-bridge.
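
Once connected (either way), the optional LocalForward lines from the config above make web UIs started on the node reachable from your local browser. For example, assuming Jupyter is installed on the compute node:

# On the compute node (inside the SSH session)
jupyter lab --no-browser --port 8888
# Then browse to http://localhost:8888 on your local machine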

4. Teardown

When you are finished with your session, run the teardown script on the compute node to stop the daemon and clean up files.

./teardown_ssh.sh

What this script does:

  • Identifies the sshd process associated with your current Job ID.
  • Kills the process.
  • Removes the temporary configuration files and logs from /tmp.
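
If you ever need to clean up by hand (for example, the script is no longer on the node), the rough equivalent looks like this, assuming the defaults described above (Slurm variable shown; the pkill pattern is only a guess at how the process appears in the process list):

# Manual cleanup sketch; teardown_ssh.sh does this for you
pkill -u "$USER" -f "sshd_${USER}_${SLURM_JOB_ID}"   # pattern is a guess; adjust to match your sshd command line
rm -rf "/tmp/sshd_${USER}_${SLURM_JOB_ID}"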

5. Release Allocation

Finally, release the compute node allocation if you are done with it.

  1. Exit the shell on the compute node:

    exit
  2. Cancel the job from the login node if necessary:

    Slurm:

    scancel <JOB_ID>

    Other Schedulers:

    PBS:

    qdel <JOB_ID>

    LSF:

    bkill <JOB_ID>
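
    If you no longer have the job ID handy, the scheduler can list your jobs (Slurm shown; qstat -u $USER and bjobs are rough PBS and LSF equivalents):

    squeue -u $USER    # lists your running and pending jobs with their job IDs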

Troubleshooting

  • Port Conflicts: The scripts default to port 2222. If you are on a shared node where another user is already using port 2222, you can start the server on a different port (a quick way to list the ports already in use is shown after this list):
    ./start_ssh.sh 2223
  • Permissions: Ensure both scripts are executable:
    chmod +x start_ssh.sh teardown_ssh.sh
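
To list the TCP ports already in use on the compute node before picking one:

# Run on the compute node; netstat -tln is an alternative if ss is unavailable
ss -tln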

Appendix

Setting Up SSH Keys

If you haven't set up SSH keys yet, it is recommended to use the Ed25519 algorithm, which is more secure and faster than older RSA keys.

  1. Generate the key pair on your local machine:

    ssh-keygen -t ed25519

    Press Enter to save to the default location (~/.ssh/id_ed25519). You can choose to add a passphrase for extra security.

  2. Copy the public key to the cluster: Use ssh-copy-id to install your new public key on the login node.

    ssh-copy-id user@login.example.com

    (Or manually append the content of ~/.ssh/id_ed25519.pub to ~/.ssh/authorized_keys on the login node).

Manual Setup for Non-Shared Home Directories

If your cluster does not have a shared home directory, you must manually ensure that both your authorized_keys (for you to log in) and the Host Key (for the server to identify itself) are present in ~/.ssh on the compute node. Note: This workflow has not been explicitly tested, but the process should look like this:

  1. Prepare on Login Node: Run this command on the login node to prepare the host key (if not already present).

    ssh-keygen -t ed25519 -f ~/.ssh/hpc_sshd_ed25519_key -N ""
  2. Broadcast Files (Login Node -> Compute Node): After allocating a job, run these commands from the login node shell to set up the compute node environment and push your keys/scripts. Important: If your allocation command put you directly on the compute node (e.g., qsub -I), you will need to open a new terminal window, SSH into the login node, and run these commands from there.

    Note: srun (Slurm) is used here as an example. For PBS or LSF, use the equivalent file broadcast command provided by your scheduler, or scp the files manually (a sketch of that fallback appears at the end of this section).

    # 1. Create directory on compute node
    srun mkdir -p ~/.ssh
    srun chmod 700 ~/.ssh
    
    # 2. Broadcast credentials
    # Use existing authorized_keys (from login node) so you can ProxyJump from your local machine
    sbcast ~/.ssh/authorized_keys ~/.ssh/authorized_keys
    srun chmod 600 ~/.ssh/authorized_keys
    
    # Copy over the persistent host key
    sbcast ~/.ssh/hpc_sshd_ed25519_key ~/.ssh/hpc_sshd_ed25519_key
    srun chmod 600 ~/.ssh/hpc_sshd_ed25519_key
    
    # 3. Broadcast scripts
    sbcast start_ssh.sh start_ssh.sh
    sbcast teardown_ssh.sh teardown_ssh.sh
    srun chmod +x start_ssh.sh teardown_ssh.sh
  3. Connect and Start: Now you can start your interactive shell on the compute node:

    srun --pty bash
    ./start_ssh.sh

By placing the file at ~/.ssh/hpc_sshd_ed25519_key (the default path in start_ssh.sh), the script will see it exists and use it instead of generating a new random one.
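
If your scheduler has no broadcast tool (the PBS/LSF case noted in step 2), plain scp from the login node is a workable fallback, assuming the compute node's system sshd accepts logins from within the cluster (node-01 is illustrative):

# Run from the login node; replace node-01 with your allocated node
ssh node-01 'mkdir -p ~/.ssh && chmod 700 ~/.ssh'
scp ~/.ssh/authorized_keys ~/.ssh/hpc_sshd_ed25519_key node-01:~/.ssh/
ssh node-01 'chmod 600 ~/.ssh/authorized_keys ~/.ssh/hpc_sshd_ed25519_key'
scp start_ssh.sh teardown_ssh.sh node-01:~/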

License & Attribution

This project is licensed under the MIT License - see the LICENSE file for details.
