Introduction to LSU HPC

High-throughput data needs high-throughput analysis

  • Large volumes of data
  • Large number of chains or iterations for model simulations (theoretical or statistical)
  • Large number of repetitions

A commonality is an increase in the number of FLOPs (floating-point operations) required.

HPCs are designed to facilitate such scaling by maximizing FLOPS (FLOPs per second).

[Image: supercomputer computing capacity over time, in FLOPS; from https://ourworldindata.org/grapher/supercomputer-power-flops?time=earliest..2024]

HPCs can be hard; here we are only scratching the surface

When you use an HPC, you are sharing a huge computer with lots of other people

  • Important to know how the computer is set up
  • Important to know what the limitations and conventions are
  • Important to know how to keep a clean space

At LSU, we have access to several HPCs

LSU-HPC

  • SuperMike-III
  • SuperMIC
  • Deep Bayou

LONI (Louisiana Optical Network Infrastructure)

  • Operated by LA Board of Regents, accessible to various institutions within Louisiana

All researchers in the US also have access to ACCESS, an NSF-run HPC resource.

Allocations on the HPC

  • To use the HPC, each user account must have access to at least one computing allocation
    • All of you are on the course allocation, but beyond the semester, ask your PI to create an allocation
    • Each PI by default can have two active “startup” allocations (150,000 CPU hours), and can always submit a proposal for more.

How to log into the HPC?

  • Through an SSH connection in your terminal or through a new web-based interface

  • Connecting through the terminal is ideal for large, production-scale work, but the web interface is fantastic for a lot of prototyping (and maybe more, depending on your needs)

  • Terminal: from your terminal (Mac/Linux) or Git Bash (Windows), you can access your LSU HPC account with: ssh <username>@mike.hpc.lsu.edu (requires 2FA to be set up).

  • Web-based interface: https://ondemand.mike.hpc.lsu.edu
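If you connect often, an entry in your SSH configuration file can shorten the login command; a sketch (the alias `mike` and the username are placeholders you should replace):

```
# ~/.ssh/config -- example entry; "mike" is a hypothetical alias
Host mike
    HostName mike.hpc.lsu.edu
    User your_lsu_username
```

With this in place, `ssh mike` is equivalent to typing the full `ssh <username>@mike.hpc.lsu.edu` command.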

What happens when you log into the HPC?

Logging in gets you onto a “login node” or a “head node”

  • You can think of this as a traffic controller – no ability to do any “real” jobs here, but you can ask for compute access

  • Demo: Terminal and Web-based

What happens when you log into the HPC?

Once you are logged in, you need to request access to a “compute node”

  • How much time do you want on the node?
  • What allocation should these hours be charged to?
  • How many nodes and how many cores do you need? (Depends on what kind of work you want to do!)
  • You can request an interactive node or a job node – pros and cons to each

Requesting compute access

  • Web access: a simple request form on the portal

What happens when you log into a compute node?

Once you are on a compute node, you can run commands from the terminal or from the web interface to run code

  • Working with R, Python, MATLAB, or ParaView through the web interface is straightforward, but you can’t do everything through these interfaces

  • You can also open the terminal through the web browser (though with some limitations)

File organization on the HPC


  • You can connect the HPC to git just as you did with your laptops
    • Run ssh-keygen in the HPC terminal; copy the public key into GitLab
    • Useful if you want to prototype code on your laptop on a subset of the data and then ship it off to the HPC, or vice-versa
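The key setup sketched above might look like this on the HPC login node (the file name and key comment are examples, not a required convention):

```shell
# Generate an ed25519 keypair with no passphrase (file name is an example)
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519_hpc" -N "" -C "hpc-to-gitlab"

# Print the public key; paste this into GitLab under Settings > SSH Keys
cat "$HOME/.ssh/id_ed25519_hpc.pub"
```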

What if you want to do more?

  • Accessing the HPC through the terminal opens up more options

  • Upon logging in, you will be on the head node

  • We need to request access to a compute node to do any work:

salloc -A <name_of_allocation> -t H:MM:SS -p <queue> -N <nodes> -n <cores>

e.g. salloc -A hpc_reprores -t 1:00:00 -p single -N 1 -n 16

What if you want to do more?

  • Once on a compute node, you can load specific modules to achieve tasks

  • e.g. Bowtie, SamTools, RevBayes

  • Once a module is loaded, you can use it directly from the command line, just as you would on your own terminal.
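A typical module session might look roughly like this (the module names and versions are illustrative; run `module avail` on the cluster to see what is actually installed):

```shell
module avail          # list modules available on this cluster
module load bowtie2   # load a module (name/version are examples)
module list           # show currently loaded modules
bowtie2 --version     # the tool is now on your PATH
module purge          # unload all modules when you are done
```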

What if you want to do more?

  • So far, we have covered interactive sessions for working with the HPC
    • This means you are sitting at the keyboard, telling the computer what to do after it completes each step
    • Some limitations to this!
  • Instead, we can submit job batches

Batch job submissions

  • LSU HPC uses a job management program called SLURM (“Simple Linux Utility for Resource Management”) to manage its tasks

  • SLURM intelligently decides how to divide up available computing power among jobs, in the order they were submitted (or according to other priorities)

  • Users can submit a bash script that contains a list of tasks:

 #!/bin/bash
 #SBATCH -N 1               # request one node
 #SBATCH -t 2:00:00         # request two hours
 #SBATCH -p single          # in the single partition (queue)
 #SBATCH -A your_allocation_name
 #SBATCH -o slurm-%j.out-%N # optional: name of stdout, using the job number (%j) and the hostname of the node (%N)
 #SBATCH -e slurm-%j.err-%N # optional: name of stderr, using the same values
 # Below are the job commands
 date

 # Set some handy environment variables.
 export HOME_DIR=/home/$USER/myjob
 export WORK_DIR=/work/$USER/myjob

 # Make sure the WORK_DIR exists:
 mkdir -p "$WORK_DIR"

 # Copy files, jump to WORK_DIR, and execute a program called "mydemo"
 cp "$HOME_DIR/mydemo" "$WORK_DIR"
 cd "$WORK_DIR"
 ./mydemo

 # Mark the time it finishes.
 date

 # Exit the job
 exit 0

Why use jobs?

  • You need not be at the computer – the program will work through its activities one by one
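Submitting and monitoring a batch script from the head node might look like this (the script name is a placeholder for your own file):

```shell
sbatch myjob.sh       # submit the script; SLURM replies with a job ID
squeue -u $USER       # check your queued and running jobs
scancel <jobid>       # cancel a job by ID if something went wrong
```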

General workflow for HPC

  • Decide if your work needs the use of an HPC
  • Decide if you want to work interactively or through job submissions
  • Decide if the web-based portal is enough for your work, or if you need to turn to the terminal
  • Always test code on subsets of the data/iteratively
  • Think about “minimum working examples” or “minimum reproducible examples”

In computing, a minimal reproducible example (MRE) is a collection of source code and other data files that allows a bug or problem to be demonstrated and reproduced. The key feature of an MRE is that it is as small and as simple as possible: just sufficient to demonstrate the problem, without any additional complexity or dependencies that would make resolution harder. Ideally, it also exposes the problem with as little effort and runtime as possible, so that a new software version can be tested efficiently.
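One way to test on a subset, as suggested above, is to slice the input before running the full pipeline; a minimal sketch (the file names are hypothetical, and the "analysis" here is just a line count standing in for your real script):

```shell
# Build a small fake dataset so this sketch is self-contained
printf 'id,value\n' > big_data.csv
for i in $(seq 1 1000); do printf '%d,%d\n' "$i" "$((i * 2))"; done >> big_data.csv

# Keep the header plus the first 100 records as a test subset
head -n 101 big_data.csv > subset.csv

# Run the (stand-in) analysis on the subset before scaling up
wc -l subset.csv
```

Once the subset run behaves as expected, point the same commands at the full dataset inside a batch script.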