This repository provides scripts to start a user-level SSH server on an allocated compute node (Slurm, PBS, or LSF). This enables direct SSH access to the node from your local machine, tunneled through the cluster's login node.
Why use this?
- Network Access: On many high-performance computing (HPC) clusters, compute nodes are isolated on a private network and cannot be accessed directly from the internet.
- Resource Limits: On clusters that do allow direct SSH to compute nodes, those sessions often land outside the scheduler's control groups (cgroups). This means your processes might escape the limits (CPUs, GPUs, memory) you requested, potentially disrupting other users on the node.
By running a user-level sshd instance within your job allocation, you ensure that your session respects the scheduler's resource limits while gaining full network access via the login node.
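You can verify the confinement concretely once connected. A minimal check, assuming Slurm with cgroup enforcement (exact cgroup paths vary by cluster and cgroup version):

```bash
# In a session opened through this bridge, the shell should sit inside the
# job's cgroup (e.g., a slurmstepd scope), not a generic user/system slice.
cat /proc/self/cgroup

# With CPU binding enforced, this typically reports the CPUs you requested,
# not the whole node.
nproc
```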
This setup is particularly useful for:
- VS Code Remote - SSH: Run the full VS Code server (or its variants like VSCodium, Cursor) on the compute node. This gives you a seamless development environment with access to extensions, debugger, and the node's hardware (e.g., GPUs) without fighting for resources on the login node.
- Port Forwarding: Easily tunnel web UIs like Jupyter, TensorBoard, or Dask dashboards to your local browser (see the one-liner after this list).
- File Transfer: Use standard SFTP tools to move data directly to/from the node's local scratch space.
- Workflow Flexibility: Use any SSH-compatible tool (JetBrains Gateway, Neovim, Emacs TRAMP, rsync) as if the node were a local server on your network.
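For the port-forwarding use case, an ad-hoc tunnel needs no config changes once the bridge is up (setup described below). For example, assuming Jupyter listens on port 8888 on the node:

```bash
# Forward local port 8888 to port 8888 on the compute node; -N skips
# opening a remote shell. Then browse to http://localhost:8888 locally.
ssh -N -L 8888:localhost:8888 hpc-bridge
```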
The process involves:
- Allocating a compute node via your workload manager (Slurm, PBS, etc.).
- Starting a user-level SSH daemon (`sshd`) on the allocated node using `start_ssh.sh`.
- Connecting from your local machine, typically tunneling through the login node.
- Cleaning up with `teardown_ssh.sh` when finished.
- Access to an HPC cluster (Slurm, PBS, or LSF). Note: This project has primarily been tested with Slurm.
- SSH public key set up on the login node for passwordless access (recommended, see Setting Up SSH Keys).
- The `start_ssh.sh` and `teardown_ssh.sh` scripts available on the cluster (e.g., in your home directory or project space).

A Note on Shared Filesystems: These scripts assume your home directory (`~`) is shared between the login node and the allocated compute node. If your cluster does not have a shared home directory, please see the Manual Setup for Non-Shared Home Directories section below.
Clone the repository to your home directory (or project space) on the cluster:
```bash
git clone https://github.com/scottworkman/hpc-bridge.git
cd hpc-bridge
```

First, request an interactive allocation on the cluster using your scheduler.
Slurm Example:
```bash
salloc --gres=gpu:1 --nodelist=node-01
```

Once the allocation is granted, ensure you are running a shell on the allocated node. Depending on your cluster's configuration, the allocation command might leave you on the login node. You can verify this by checking the hostname, as shown below. If you are not on the compute node, shell into it using your scheduler's command (e.g., for Slurm: `srun --pty bash`).
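A quick way to confirm where your shell is running:

```bash
hostname           # should print the compute node's name (e.g., node-01)
srun --pty bash    # Slurm: attach a shell to the allocation if you are still on the login node
```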
Note: These scripts support Slurm, PBS, and LSF by detecting their respective Job ID environment variables. However, due to lack of access to PBS or LSF environments, the specific allocation commands below are untested and provided as a reference only.
PBS:
```bash
qsub -I -l select=1:ngpus=1:host=node-01
```

LSF:

```bash
bsub -Is -m "node-01" -gpu "num=1" /bin/bash
```

Run the start script to initialize the SSH server. This must be run inside the job allocation.
```bash
./start_ssh.sh
```

What this script does:
- Verifies it is running inside a valid job allocation (checks `SLURM_JOB_ID`, `PBS_JOBID`, or `LSB_JOBID`).
- Generates a persistent host key at `~/.ssh/hpc_sshd_ed25519_key` (if it doesn't exist). This prevents "Remote Host Identification Changed" warnings when connecting to different nodes over time.
- Creates a session-specific temporary directory `/tmp/sshd_${USER}_${JOB_ID}`.
- Starts a private `sshd` instance listening on port 2222 (default). Can be overridden by passing a port number (e.g., `./start_ssh.sh 2223`).
- Disables password authentication, PAM, and root login for safety.
- Enables the SFTP subsystem for file transfer.
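For orientation, the start step boils down to an invocation along these lines. This is a simplified sketch, not the script's exact command; see `start_ssh.sh` for the real flags and config handling:

```bash
# Detect the job ID the same way the script does
JOB_ID="${SLURM_JOB_ID:-${PBS_JOBID:-$LSB_JOBID}}"
RUN_DIR="/tmp/sshd_${USER}_${JOB_ID}"
mkdir -p "$RUN_DIR"

# Launch a private sshd; it forks into the background by default.
# The sshd binary path varies by distribution.
/usr/sbin/sshd \
    -p 2222 \
    -h "$HOME/.ssh/hpc_sshd_ed25519_key" \
    -o "PidFile=$RUN_DIR/sshd.pid" \
    -o "PasswordAuthentication=no" \
    -o "UsePAM=no" \
    -o "PermitRootLogin=no" \
    -o "Subsystem sftp internal-sftp"
```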
To connect, configure your local SSH client to proxy the connection through the login node.
Add the following to your local `~/.ssh/config`:
```
# Your standard login node config (likely already exists)
Host hpc-login
    HostName login.example.com
    User YOUR_USERNAME

Host hpc-bridge
    # Replace NODE_NAME with the actual hostname (e.g., node-01)
    HostName NODE_NAME
    User YOUR_USERNAME
    # Or the custom port you specified
    Port 2222
    # Jump through the login node
    ProxyJump hpc-login

    # Optional: Forward ports (e.g., Jupyter, TensorBoard)
    LocalForward 8888 localhost:8888
    LocalForward 6006 localhost:6006
```

(Comments are on their own lines because `ssh_config` does not support trailing comments after a directive.)
Note: You will need to update the HostName (i.e., replace NODE_NAME with your actual compute node name) in your config each time you get a different allocation.
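For terminal sessions, you can also override the hostname per-invocation instead of editing the file (VS Code reads the config file directly, so it still needs the update):

```bash
# One-off override of NODE_NAME (node name here is illustrative)
ssh -o "HostName=node-02" hpc-bridge
```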
Connecting:
Terminal:
```bash
ssh hpc-bridge
```

VS Code:
- Open VS Code.
- Open the Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`).
- Select Remote-SSH: Connect to Host....
- Choose `hpc-bridge`.
When you are finished with your session, run the teardown script on the compute node to stop the daemon and clean up files.
```bash
./teardown_ssh.sh
```

What this script does:
- Identifies the `sshd` process associated with your current Job ID.
- Kills the process.
- Removes the temporary configuration files and logs from `/tmp`.
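If you ever need to clean up by hand (say, after losing the shell), the steps are roughly equivalent to the following; the PID-file path assumes the layout sketched earlier and may differ from the script's actual paths:

```bash
JOB_ID="${SLURM_JOB_ID:-${PBS_JOBID:-$LSB_JOBID}}"
RUN_DIR="/tmp/sshd_${USER}_${JOB_ID}"

# Stop the daemon, then remove its session directory
kill "$(cat "$RUN_DIR/sshd.pid")"
rm -rf "$RUN_DIR"
```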
Finally, release the compute node allocation if you are done with it.
- Exit the shell on the compute node: `exit`
- Cancel the job from the login node if necessary:
  - Slurm: `scancel <JOB_ID>`
  - PBS: `qdel <JOB_ID>`
  - LSF: `bkill <JOB_ID>`
- Port Conflicts: The scripts default to port `2222`. If you are on a shared node where another user is already using port 2222, you can start the server on a different port: `./start_ssh.sh 2223` (see the check below).
- Permissions: Ensure both scripts are executable: `chmod +x start_ssh.sh teardown_ssh.sh`
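To find out whether a port is already taken before starting the server, list the node's listening sockets (`ss` ships with iproute2 on most Linux distributions):

```bash
# Any LISTEN entry on :2222 means you should pick another port
ss -tln | grep ':2222'
```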
If you haven't set up SSH keys yet, it is recommended to use the Ed25519 algorithm, which is more secure and faster than older RSA keys.
- Generate the key pair on your local machine:

  ```bash
  ssh-keygen -t ed25519
  ```

  Press Enter to save to the default location (`~/.ssh/id_ed25519`). You can choose to add a passphrase for extra security.

- Copy the public key to the cluster: Use `ssh-copy-id` to install your new public key on the login node.

  ```bash
  ssh-copy-id user@login.example.com
  ```

  (Or manually append the content of `~/.ssh/id_ed25519.pub` to `~/.ssh/authorized_keys` on the login node, as shown below.)
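If `ssh-copy-id` is unavailable, the manual append can be done in one line:

```bash
# Appends your public key to authorized_keys on the login node,
# creating ~/.ssh with the right permissions if needed
cat ~/.ssh/id_ed25519.pub | ssh user@login.example.com \
  'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'
```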
If your cluster does not have a shared home directory, you must manually ensure that both your `authorized_keys` (for you to log in) and the host key (for the server to identify itself) are present in `~/.ssh` on the compute node. Note: This workflow has not been explicitly tested, but the process should look like this:
- Prepare on Login Node: Run this command on the login node to prepare the host key (if not already present).

  ```bash
  ssh-keygen -t ed25519 -f ~/.ssh/hpc_sshd_ed25519_key -N ""
  ```
- Broadcast Files (Login Node -> Compute Node): After allocating a job, run these commands from the login node shell to set up the compute node environment and push your keys/scripts. Important: If your allocation command put you directly on the compute node (e.g., `qsub -I`), you will need to open a new terminal window, SSH into the login node, and run these commands from there.

  Note: `srun` (Slurm) is used here as an example. For PBS or LSF, use the equivalent file broadcast command provided by your scheduler, or `scp` the files manually.

  ```bash
  # 1. Create directory on compute node
  srun mkdir -p ~/.ssh
  srun chmod 700 ~/.ssh

  # 2. Broadcast credentials
  # Use existing authorized_keys (from login node) so you can ProxyJump from your local machine
  sbcast ~/.ssh/authorized_keys ~/.ssh/authorized_keys
  srun chmod 600 ~/.ssh/authorized_keys

  # Copy over the persistent host key
  sbcast ~/.ssh/hpc_sshd_ed25519_key ~/.ssh/hpc_sshd_ed25519_key
  srun chmod 600 ~/.ssh/hpc_sshd_ed25519_key

  # 3. Broadcast scripts
  sbcast start_ssh.sh start_ssh.sh
  sbcast teardown_ssh.sh teardown_ssh.sh
  srun chmod +x start_ssh.sh teardown_ssh.sh
  ```
- Connect and Start: Now you can start your interactive shell on the compute node:

  ```bash
  srun --pty bash
  ./start_ssh.sh
  ```
By placing the file at `~/.ssh/hpc_sshd_ed25519_key` (the default path in `start_ssh.sh`), the script will see that it exists and use it instead of generating a new one.
This project is licensed under the MIT License - see the LICENSE file for details.