To run work on the Sphyrna research cluster, you must create a job and submit it to our job scheduler. The scheduler ensures fair access to the HPC cluster by allocating resources efficiently across simultaneous jobs. CPU- or I/O-intensive work that is not submitted through the job scheduler may be terminated.
The job scheduler we use is called Slurm. This software enables us to share the cluster's large but finite compute resources across the NSU campus research community.
Depending on how you wish to use the cluster, there are two basic categories of jobs: batch jobs, which run unattended from a submission script, and interactive jobs, which give you a shell on a compute node.
Getting ready to submit:
Before submitting your job to the Slurm scheduler, you need to do a bit of planning. This may involve trial and error, for which interactive jobs may be helpful. The three most salient variables are the number of CPU cores (and GPUs) you need, the amount of memory, and the maximum run time; each maps directly onto an #SBATCH directive in the templates below.
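For example, a short interactive session for testing might look like the following; the partition, core count, memory, and time values are placeholders to adjust for your own work:
srun --partition=cpu --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash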
Example Slurm scripts:
1. GPU partition Slurm submission template
#!/bin/bash
#SBATCH --job-name=gpu_job_ # job name
#SBATCH --partition=gpu # partition to submit to
#SBATCH --nodes=1 # node count
#SBATCH --ntasks-per-node=1 # total number of tasks per node
#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=256G # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:2 # number of gpus per node
#SBATCH --time=1-10:00:00 # total run time limit (D-HH:MM:SS)
#SBATCH --error=gpu_job.%J.err # stderr file (%J expands to the job ID)
#SBATCH --output=gpu_job.%J.out # stdout file
2. CPU partition Slurm submission template
#!/bin/bash
#SBATCH --job-name=cpu_job_ # job name
#SBATCH --partition=cpu # partition to submit to
#SBATCH --nodes=4 # node count
#SBATCH --ntasks-per-node=1 # total number of tasks per node
#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=256G # total memory per node (4 GB per cpu-core is default)
#SBATCH --time=1-10:00:00 # total run time limit (D-HH:MM:SS)
#SBATCH --error=cpu_job.%J.err # stderr file (%J expands to the job ID)
#SBATCH --output=cpu_job.%J.out # stdout file
3. CPU+GPU mix partition Slurm submission template
#!/bin/bash
#SBATCH --job-name=mix_job_ # job name
#SBATCH --partition=mix # partition to submit to
#SBATCH --nodes=5 # node count
#SBATCH --ntasks-per-node=1 # total number of tasks per node
#SBATCH --cpus-per-task=16 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=256G # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:2 # number of gpus per node
#SBATCH --time=1-10:00:00 # total run time limit (D-HH:MM:SS)
#SBATCH --error=mix_job.%J.err # stderr file (%J expands to the job ID)
#SBATCH --output=mix_job.%J.out # stdout file
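To use one of these templates, copy it into a file (for example, my_job.sbatch, a placeholder name), append the commands your job should run after the #SBATCH directives, and submit it with sbatch:
sbatch my_job.sbatch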
Slurm Command Reference

Command | Purpose | Example
sinfo | View information about Slurm nodes and partitions | sinfo --partition investor
squeue | View information about jobs | squeue -u myname
sbatch | Submit a batch script to Slurm | sbatch myjob
scancel | Signal or cancel jobs, job arrays, or job steps | scancel jobID
srun | Run an interactive job | srun --ntasks 4 --partition investor --pty bash
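Putting these commands together, a typical session might look like the following; the script name, job ID, and username are placeholders:
sbatch my_job.sbatch # prints "Submitted batch job 12345"
squeue -u myname # check the job's state while it is queued or running
scancel 12345 # cancel the job if something went wrong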