Toolchain
How can I interact with my workload globally on the system?
Creating wrapper scripts in your PATH makes it easier to interact with your project from anywhere on the system. For research and development, this lets you work with your workload without navigating to specific directories or hardcoding paths in job scripts. For example, you could run @workload train --help from any directory.
Creating Wrapper Scripts
Store executable scripts in directories like ~/bin or ~/.local/bin that are in your PATH. Consider prefixing them with @ to indicate a wrapper script rather than an official package.
#!/bin/bash
# Wrapper script for project workload
# Runs the specified entrypoint from your project directory
uv run --directory ${HOME}/my-project workload "$@"
The "$@" forwards all arguments to the entrypoint, so you can run @workload train --epochs 100 --batch-size 32 and those arguments will be passed straight through to your training script.
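As a concrete sketch of the setup (the @workload name and the ~/.local/bin location are just the examples from above), you could install the wrapper like this:

```shell
# Install the wrapper script as ~/.local/bin/@workload
mkdir -p "${HOME}/.local/bin"

cat > "${HOME}/.local/bin/@workload" <<'EOF'
#!/bin/bash
# Forward all arguments to the project entrypoint
uv run --directory ${HOME}/my-project workload "$@"
EOF

chmod +x "${HOME}/.local/bin/@workload"

# Make sure ~/.local/bin is on PATH (typically done in ~/.bashrc)
export PATH="${HOME}/.local/bin:${PATH}"
```

After this, @workload resolves from any directory, provided the PATH export is in your shell startup file.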
How do I keep a consistent environment and toolchain configuration across different machines?
A well-defined toolchain makes it a lot easier to have a consistent experience across different machines, which in turn makes HPC systems more engaging and easier to use in general.
A flexible and common way of managing dotfiles is to use a symlink farm manager like stow. The idea is that you keep a repository with all of your dotfiles and use stow to symlink each file to its correct location.
To get started, create a new directory named ~/.dotfiles in your home directory and run git init inside it to initialize the repository. Then begin copying in your original dotfiles and/or directories, such as ~/.bashrc, ~/.ssh/config, ~/.config/wandb, and so on.
From here you can run stow to symlink the files into place.
# From anywhere in the terminal
stow -v -d ~/.dotfiles -t $HOME <package>
# From within the dotfiles repo
cd ~/.dotfiles && stow .
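To make the symlink-farm idea concrete, here is a minimal sketch of a package layout; since stow may not be available everywhere, the link it would create is shown done by hand (the bash package name is an assumption, and stow itself would use a relative symlink):

```shell
# One sub-directory ("package") per tool, mirroring paths relative to $HOME
mkdir -p "${HOME}/.dotfiles/bash"
echo 'export EDITOR=vim' > "${HOME}/.dotfiles/bash/.bashrc"

# `stow -v -d ~/.dotfiles -t $HOME bash` would create roughly this link:
ln -sf "${HOME}/.dotfiles/bash/.bashrc" "${HOME}/.bashrc"

# Editing either path now touches the same file
readlink "${HOME}/.bashrc"
```

The payoff is that the repo stays the single source of truth: pull it on a new machine, run stow, and every tool finds its config where it expects.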
How do I access services on the remote machine?
To access services running on remote HPC nodes, create an SSH tunnel to forward the remote port to your local machine. One example could be running a Jupyter notebook server alongside an Ollama inference service, allowing you to interact with AI models directly from your notebook.
LSF Job Submission Script
#!/bin/bash
#BSUB -J services                   # Job name
#BSUB -n 4                          # Number of CPU cores
#BSUB -W 7:00                       # Walltime (7 hours)
#BSUB -q p1                         # Queue name
#BSUB -R "rusage[mem=8GB]"          # Memory requirement
#BSUB -R "span[hosts=1]"            # Ensure all cores on one host
#BSUB -gpu "num=1:mode=exclusive"   # GPU requirement
#BSUB -u your@email.com             # Email for notifications
#BSUB -B                            # Send email at job begin
#BSUB -N                            # Send email at job end
#BSUB -o %J.out                     # Standard output
#BSUB -e %J.err                     # Standard error

./entrypoint.sh
#!/usr/bin/env bash
set -euo pipefail

: "${NOTEBOOK_APP_IP:=0.0.0.0}"
: "${NOTEBOOK_APP_PORT:=8888}"
: "${NOTEBOOK_APP_OPEN_BROWSER:=False}"

# Start Jupyter service
jupyter notebook --ip="$NOTEBOOK_APP_IP" --port="$NOTEBOOK_APP_PORT" --NotebookApp.open_browser="$NOTEBOOK_APP_OPEN_BROWSER" &
JUPYTER_PID=$!

wait $JUPYTER_PID
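The `: "${VAR:=default}"` lines in the entrypoint are plain shell parameter expansion: `:` is a no-op builtin that still evaluates its arguments, so the assignment happens only when the variable is unset or empty. A quick illustration:

```shell
# Unset variable: the default is assigned
unset NOTEBOOK_APP_PORT
: "${NOTEBOOK_APP_PORT:=8888}"
echo "$NOTEBOOK_APP_PORT"   # → 8888

# Already set: the existing value wins
NOTEBOOK_APP_PORT=9999
: "${NOTEBOOK_APP_PORT:=8888}"
echo "$NOTEBOOK_APP_PORT"   # → 9999
```

This lets callers override the defaults from the environment, e.g. NOTEBOOK_APP_PORT=9000 ./entrypoint.sh.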
SSH Tunnel
Create the tunnel to access services locally:
# Forward the Jupyter port (assumes SSH host 'p1-hpc' in ~/.ssh/config)
ssh -L 8888:localhost:8888 p1-hpc -N -f
# Access the service
# Jupyter: http://localhost:8888
Or add the following to your ~/.ssh/config:
Host p1-hpc
    # ...
    LocalForward 8888 localhost:8888
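If you also run the Ollama service mentioned above, the same entry can carry both forwards. A sketch, where the host alias, hostname, and user are placeholders you would replace with your cluster's details (11434 is Ollama's default port):

```
Host p1-hpc
    HostName login.example-hpc.org
    User your-username
    LocalForward 8888 localhost:8888
    LocalForward 11434 localhost:11434
```

With this in place, a plain ssh -N -f p1-hpc opens both tunnels in the background.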