GPUs, Parallel Processing, and Job Arrays
This page describes advanced capabilities of SLURM. For a basic introduction to SLURM, see SLURM: Scheduling and Managing Jobs.
Parallel Job Example Scripts
Below are example SLURM scripts for jobs employing parallel processing. In general, parallel jobs can be separated into four categories:
- Distributed memory programs that include explicit support for message passing between processes (e.g. MPI). These processes execute across multiple CPU cores and/or nodes.
- Multithreaded programs that include explicit support for shared memory processing via multiple threads of execution (e.g. Posix Threads or OpenMP) running across multiple CPU cores.
- Embarrassingly parallel analysis in which multiple instances of the same program execute on multiple data files simultaneously, with each instance running independently from others on its own allocated resources (i.e. CPUs and memory). SLURM job arrays offer a simple mechanism for achieving this.
- GPU (graphics processing unit) programs including explicit support for offloading to the device via languages like CUDA or OpenCL.
It is important to understand the capabilities and limitations of an application in order to fully leverage the parallel processing options available on the ACCRE cluster. For instance, many popular scientific computing languages like Python, R, and Matlab now offer packages that allow for GPU or multithreaded processing, especially for matrix and vector operations.
MPI Jobs
Jobs running MPI (Message Passing Interface) code require special attention within SLURM. SLURM allocates and launches MPI jobs differently depending on the version of MPI used (e.g. OpenMPI or Intel MPI). We recommend using the most recent version of OpenMPI or Intel MPI available through Lmod to compile code and then using SLURM's srun command to launch parallel MPI jobs. The example below runs MPI code compiled with GCC and OpenMPI:
#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8 # 8 MPI processes per node
#SBATCH --time=7-00:00:00
#SBATCH --mem=4G # 4 GB RAM per node
#SBATCH --output=mpi_job_slurm.log
module load GCC OpenMPI
echo $SLURM_JOB_NODELIST
srun ./test # srun is SLURM's version of mpirun/mpiexec
This example requests 3 nodes and 8 tasks (i.e. processes) per node, for a total of 24 MPI tasks. By default, SLURM allocates 1 CPU core per task, so this job will run across 24 CPU cores. Note that srun accepts many of the same arguments as mpirun/mpiexec (e.g. -n <number of tasks>) but also allows increased flexibility for task affinity, memory, and many other features. Type man srun for a full list of options.
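The executable launched by srun must be built against the same MPI stack the job script loads. A minimal compile sketch, assuming (hypothetically) that the source file is named test.c:

```shell
# Hypothetical compile step for the example job above.
# Load the same toolchain the job script will load, then build
# with mpicc, which wraps gcc with the OpenMPI headers and libraries.
module load GCC OpenMPI
mpicc -O2 -o test test.c
```

Compiling and running under mismatched MPI modules is a common source of launch failures, so it is safest to load the identical modules in both the build session and the job script.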
More information about running MPI jobs within SLURM can be found here: http://slurm.schedmd.com/mpi_guide.html
Feel free to open a help desk ticket if you require assistance with your MPI job.
srun
This command is used to launch a parallel job step. Typically, srun is invoked from a SLURM job script to launch an MPI job (much in the same way that mpirun or mpiexec are used). More details about running MPI jobs within SLURM are provided in the GPUs, Parallel Processing, and Job Arrays section. Please note that your application must include MPI code in order to run in parallel across multiple CPU cores using srun. Invoking srun on a non-MPI command or executable will instead run that program independently on every CPU core in the allocation (i.e. N identical copies for N cores).
Alternatively, srun can be run directly from the command line on a gateway, in which case srun will first create a resource allocation for running the parallel job. The -n [CPU_CORES] option specifies the number of CPU cores for launching the parallel job step. For example, running the following command from the command line will obtain an allocation consisting of 16 CPU cores and then run the command hostname across these cores:
srun -n 16 hostname
For more information about srun see: http://www.schedmd.com/slurmdocs/srun.html
Multithreaded Jobs
Multithreaded programs are applications that are able to execute in parallel across multiple CPU cores within a single node using a shared memory execution model. In general, a multithreaded application uses a single process (i.e. "task" in SLURM) which then spawns multiple threads of execution. By default, SLURM allocates 1 CPU core per task. In order to make use of multiple CPU cores in a multithreaded program, one must include the --cpus-per-task option. The ACCRE cluster features 8-core and 12-core nodes, so a user can request up to 12 CPU cores per task. Below is an example of a multithreaded program requesting 4 CPU cores per task. The program itself is responsible for spawning the appropriate number of threads.
#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4 # 4 threads per task
#SBATCH --time=02:00:00 # two hours
#SBATCH --mem=4G
#SBATCH --output=multithread.out
#SBATCH --job-name=multithreaded_example
# Load the most recent version of GCC available through Lmod
module load GCC
# Run multi-threaded application
./hello
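For OpenMP programs, the thread count does not follow the SLURM allocation automatically; a common pattern is to set OMP_NUM_THREADS from SLURM's SLURM_CPUS_PER_TASK variable before launching the application. A sketch (the fallback value of 4 mirrors the --cpus-per-task request above and is only an assumption for running outside a job):

```shell
# Match the OpenMP thread count to the SLURM allocation.
# Outside a SLURM job, SLURM_CPUS_PER_TASK is unset, so fall back to 4
# (the --cpus-per-task value requested in the example script above).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-4}
echo "Running with $OMP_NUM_THREADS threads"
```

Without this, many OpenMP runtimes default to one thread per hardware core on the node, which can oversubscribe a job that was only allocated a subset of the cores.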
Job Arrays
Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. SLURM provides job array environment variables that allow multiple versions of input files to be easily referenced. In the example below, three input files called vectorization_0.py, vectorization_1.py, and vectorization_2.py are used as input for three independent Python jobs:
#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=0-2
#SBATCH --output=python_array_job_slurm_%A_%a.out
echo "SLURM_JOBID: " $SLURM_JOBID
echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
echo "SLURM_ARRAY_JOB_ID: " $SLURM_ARRAY_JOB_ID
# Load Anaconda distribution of Python
module load Anaconda2
python vectorization_${SLURM_ARRAY_TASK_ID}.py
The #SBATCH --array=0-2 line specifies the array size (3) and array indices (0, 1, and 2). These indices are referenced through the SLURM_ARRAY_TASK_ID environment variable in the final line of the SLURM batch script to independently analyze the three input files. Each Python instance will receive its own resource allocation; in this case, each instance is allocated 1 CPU core (and 1 node), 2 hours of wall time, and 2 GB of RAM.
The --array= option is flexible in terms of the index range and stride length. For instance, --array=0-10:2 would give indices of 0, 2, 4, 6, 8, and 10.
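The index set produced by a start-end:stride specification like --array=0-10:2 can be previewed on the command line with seq, which takes the same first, step, and last values:

```shell
# Preview the task indices that --array=0-10:2 would generate
# (seq FIRST STEP LAST prints one index per line)
seq 0 2 10
```

This is handy for double-checking a stride before submitting a large array.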
The %A and %a variables provide a method for directing standard output to separate files: %A references the SLURM_ARRAY_JOB_ID while %a references the SLURM_ARRAY_TASK_ID. SLURM treats job ID information for job arrays in the following way: each task within the array shares the same SLURM_ARRAY_JOB_ID but has its own unique SLURM_JOBID and SLURM_ARRAY_TASK_ID. The JOBID shown by squeue is formatted as the SLURM_ARRAY_JOB_ID followed by an underscore and the SLURM_ARRAY_TASK_ID.
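As an illustration of the expansion, using hypothetical ID values (SLURM sets these variables itself inside a real array task), the --output pattern from the script above would resolve like this:

```shell
# Hypothetical IDs for illustration only; SLURM assigns the real values
SLURM_ARRAY_JOB_ID=1234567   # what %A expands to
SLURM_ARRAY_TASK_ID=2        # what %a expands to

# --output=python_array_job_slurm_%A_%a.out therefore becomes:
echo "python_array_job_slurm_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out"
```

Each of the three tasks in the example array thus writes to its own log file rather than interleaving output in a shared one.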
While the previous example provides a relatively simple method for running analyses in parallel, it can at times be inconvenient to rename files so that they may be easily indexed from within a job array. The following example provides a method for analyzing files with arbitrary file names, provided they are all stored in a sub-directory named data:
#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=1-5 # In this example we have 5 files to analyze
#SBATCH --output=python_array_job_slurm_%A_%a.out
arrayfile=`ls data/ | awk -v line=$SLURM_ARRAY_TASK_ID '{if (NR == line) print $0}'`
module load Anaconda2 # load Anaconda distribution of Python
python data/$arrayfile
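The awk line above selects the file whose position in the directory listing matches the task ID. An equivalent, slightly shorter alternative uses sed's "print line N" form; this is only a sketch of the same idea (the data/ setup below is hypothetical, included so the example is self-contained):

```shell
# Hypothetical setup so this sketch runs standalone
mkdir -p data && touch data/a.py data/b.py data/c.py
SLURM_ARRAY_TASK_ID=2            # set by SLURM inside a real array task

# sed -n "Np" prints only line N of the listing (alternative to the awk line)
arrayfile=$(ls data/ | sed -n "${SLURM_ARRAY_TASK_ID}p")
echo "$arrayfile"
```

Note that both versions depend on ls producing a stable (sorted) order, so adding or removing files between submissions changes which task analyzes which file.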
More information can be found here: http://slurm.schedmd.com/job_array.html
GPU Jobs
ACCRE has 27 compute nodes equipped with Nvidia GPU cards for general-purpose GPU computing. The nodes are divided into two partitions depending on the type of GPU available on the node:
Partition | pascal | maxwell
---|---|---
number of nodes | 15 | 12
GPU | 4 x Nvidia Pascal | 4 x Nvidia Maxwell
CPU cores | 8 (Intel Xeon E5-2623 v4) | 12 (Intel Xeon E5-2620 v3)
CUDA cores (per GPU) | 3584 | 3072
host memory | 128 GB | 128 GB
GPU memory | 12 GB | 12 GB
network | 56 Gbps RoCE | 56 Gbps RoCE
gres | 1 GPU + 2 CPUs | 1 GPU + 3 CPUs
Know which account and partition to use
You must be a member of a GPU account in order to submit GPU jobs. ACCRE provides a command, slurm_groups, that can tell you whether GPU access is enabled for your login as well as which account and partition to use. See slurm_groups.
Running a GPU job
Users can request the desired number of GPUs by using SLURM generic resources, also called gres. Each gres bundles one GPU with multiple CPU cores (see table above) belonging to the same PCI Express root complex, to minimize data transfer latency between host and GPU memory. The number of CPU cores requested cannot be higher than the total number of cores in the requested gres.
Below is an example SLURM script header to request 2 Pascal GPUs and 4 CPU cores on a single node on the pascal partition:
#!/bin/bash
#SBATCH --account=<your_gpu_account>
#SBATCH --partition=pascal
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=20G
#SBATCH --time=2:00:00
#SBATCH --output=gpu-job.log
Note that you must be in one of the GPU groups on the cluster and specify this group in the job script in order to run jobs on the GPU cluster. The #SBATCH --partition=<pascal OR maxwell> line is also required in the job script.
Several versions of the Nvidia CUDA API are available on the cluster and can be selected via Lmod:
[bob@vmps12]$ module spider cuda
[...]
[bob@vmps12]$ module load CUDA/8.0.61
There are currently a handful of applications available that allow you to leverage the low-latency RoCE network available on the maxwell and pascal partitions. Note that both GPU partitions are intended for GPU jobs only, so users are not allowed to run purely CPU programs. The GPU nodes (in both the maxwell and pascal partitions) support serial CPU execution as well as parallel CPU execution using either a multi-threaded, shared memory model (e.g. with OpenMP) or a multi-process, distributed memory model (i.e. with MPI). Two flavors of RoCE-enabled MPI are available on the cluster, as well as Gromacs and HOOMD-Blue.
All jobs making use of a RoCE-enabled MPI distribution should use SLURM's srun command rather than mpirun/mpiexec. Click here for an example of a HOOMD-Blue job.
In order to build an MPI application for the maxwell partition, we recommend launching an interactive job on one of the maxwell nodes via salloc:
salloc --partition=maxwell --account=<group> --gres=gpu:1 --time=4:00:00 --mem=20G
Testing a GPU job interactively
To test your application without submitting a batch job, you can request an interactive job session via salloc as explained on the SLURM page, using your GPU-enabled account and partition. Note that this will not work with multi-GPU applications that require the use of srun.
Checking the status of a GPU job
It is possible to check the status of the GPU compute nodes by using the qSummary and SlurmActive commands.