HPC Clusters: How to run jobs


SSH into a login node

There is a single login node used to access all cluster nodes. All HPC jobs must be started from this node.

Login procedure

Use SSH to log in to the cluster login node:

ssh epyc.simcenter.utc.edu
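
If your username on the cluster differs from your local username, include it in the SSH command. The username below is a placeholder; substitute your own:

ssh your-username@epyc.simcenter.utc.edu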


Submitting Slurm jobs

The best way to start a job is through a job submission script. This script defines all the parameters needed for the job, including run time, number of CPUs, number of GPUs, partition name, etc. Submitting jobs in this manner allows the resources used to be released automatically and made available to the next user as soon as your code finishes running.

Here is an example submission script:

#!/bin/bash -l
 
#SBATCH --partition=partition_name    # Partition name
#SBATCH --job-name=my_job             # Job name
#SBATCH --output=output.txt           # Output text file
#SBATCH --error=error.txt             # Error text file
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --ntasks-per-node=1           # Number of tasks per node
#SBATCH --cpus-per-task=1             # Number of CPU cores per task
#SBATCH --gpus-per-node=1             # Number of GPUs per node
#SBATCH --time=0-2:00:00              # Maximum runtime (D-HH:MM:SS)
#SBATCH --mail-type=END               # Send email at job completion
#SBATCH --mail-user=email-addr        # Email address for notifications
 

# load environment modules, if needed

source /etc/profile.d/modules.sh
 
module load openmpi
 
# Application execution
# You can run jobs directly, with srun, or with MPI. Below are examples of each; use only one.
 
# direct example
python example.py
 
# srun example
srun <application> <command line arguments>
 
# MPI example for MPI programs
mpiexec <application> <command line arguments>
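
The partition name in the script above is a placeholder. The standard Slurm “sinfo” command lists the partitions available on the cluster, along with their node counts and states:

sinfo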

Submit the job to the cluster scheduler with:

sbatch job_script.sh
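
After submission, the standard Slurm “squeue” command shows the state of your queued and running jobs:

squeue -u $USER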

An exhaustive description of the “sbatch” command can be found in the Official Documentation.


Interactive Slurm Jobs

It is possible to launch a shell on a compute node and then run your code interactively on that node. This method is recommended only if you are actively watching the job's progress, so that you can release the resources by exiting the shell as soon as the work completes. If you walk away, the idle bash shell will continue to hold the resources and keep others from being able to use them.

To launch a shell on a node, use the “srun” command. Example interactive job requests are below.

To run an interactive job using GPUs:

srun --x11 --time=1-00:00:00 --partition=gpu --gres=gpu:1 --ntasks=4 --pty /bin/bash -l

To run an interactive job without GPUs:

srun --x11 --time=1-00:00:00 --partition=general --ntasks=120 --pty /bin/bash -l
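
When your work is complete, exit the shell to end the interactive job and release the allocated resources:

exit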

An exhaustive description of the “srun” command can be found in the Official Documentation.


Open OnDemand

Documentation is in progress; for now, here is a link to it:

https://utc-ondemand.research.utc.edu/


Details

Article ID: 163831
Created: Thu 9/5/24 7:48 PM
Modified: Fri 11/1/24 10:44 AM