
GPUs, Parallel Processing, and Job Arrays

This page describes advanced capabilities of SLURM. For a basic introduction to SLURM, see SLURM: Scheduling and Managing Jobs.

Parallel Job Example Scripts

Below are example SLURM scripts for jobs employing parallel processing. In general, parallel jobs can be separated into four categories:

  • Distributed memory programs that include explicit support for message passing between processes (e.g. MPI). These processes execute across multiple CPU cores and/or nodes.
  • Multithreaded programs that include explicit support for shared memory processing via multiple threads of execution (e.g. Posix Threads or OpenMP) running across multiple CPU cores.
  • Embarrassingly parallel analysis in which multiple instances of the same program execute on multiple data files simultaneously, with each instance running independently from others on its own allocated resources (i.e. CPUs and memory). SLURM job arrays offer a simple mechanism for achieving this.
  • GPU (graphics processing unit) programs including explicit support for offloading to the device via languages like CUDA or OpenCL.

It is important to understand the capabilities and limitations of an application in order to fully leverage the parallel processing options available on the ACCRE cluster. For instance, many popular scientific computing languages like Python, R, and Matlab now offer packages that allow for GPU or multithreaded processing, especially for matrix and vector operations.

MPI Jobs

Jobs running MPI (Message Passing Interface) code require special attention within SLURM. SLURM allocates and launches MPI jobs differently depending on the version of MPI used (e.g. OpenMPI or Intel MPI). We recommend using the most recent version of OpenMPI or Intel MPI available through Lmod to compile code and then using SLURM's srun command to launch parallel MPI jobs. The example below runs MPI code compiled with GCC + OpenMPI:

#SBATCH --mail-type=ALL
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8     # 8 MPI processes per node
#SBATCH --time=7-00:00:00
#SBATCH --mem=4G     # 4 GB RAM per node
#SBATCH --output=mpi_job_slurm.log
module load GCC OpenMPI
srun ./test  # srun is SLURM's version of mpirun/mpiexec

This example requests 3 nodes and 8 tasks (i.e. processes) per node, for a total of 24 MPI tasks. By default, SLURM allocates 1 CPU core per process, so this job will run across 24 CPU cores. Note that srun accepts many of the same arguments as mpirun / mpiexec (e.g. -n <number cpus>) but also allows increased flexibility for task affinity, memory, and many other features. Type man srun for a list of options.
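As a quick sanity check of the arithmetic above, the total task count can be reproduced from the environment variables SLURM exports inside a job. The dictionary below stands in for a real job environment; in an actual job, read these from os.environ:

```python
# Placeholder values mirroring the script above; inside a running job,
# SLURM sets these variables in the environment automatically.
env = {"SLURM_JOB_NUM_NODES": "3", "SLURM_NTASKS_PER_NODE": "8"}

nodes = int(env["SLURM_JOB_NUM_NODES"])
tasks_per_node = int(env["SLURM_NTASKS_PER_NODE"])

total_tasks = nodes * tasks_per_node  # 3 nodes x 8 tasks/node = 24 MPI ranks
print(total_tasks)  # 24
```

Since SLURM allocates 1 CPU core per task by default, this is also the number of CPU cores the job occupies.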

More information about running MPI jobs within SLURM can be found in SLURM's MPI Users Guide.

Feel free to open a help desk ticket if you require assistance with your MPI job.


The srun command is used to launch a parallel job step. Typically, srun is invoked from a SLURM job script to launch an MPI job (much in the same way that mpirun or mpiexec is used). Please note that your application must include MPI code in order to run in parallel across multiple CPU cores using srun. Invoking srun on a non-MPI command or executable will simply run that program independently on each of the CPU cores in the allocation.

Alternatively, srun can be run directly from the command line on a gateway, in which case srun will first create a resource allocation for running the parallel job. The -n [CPU_CORES] option is passed to specify the number of CPU cores for launching the parallel job step. For example, running the following command from the command line will obtain an allocation consisting of 16 CPU cores and then run the command hostname across these cores:

srun -n 16 hostname

For more information about srun, see the srun man page (man srun) or the official SLURM documentation.

Multithreaded Jobs

Multithreaded programs are applications that are able to execute in parallel across multiple CPU cores within a single node using a shared memory execution model. In general, a multithreaded application uses a single process (i.e. “task” in SLURM) which then spawns multiple threads of execution. By default, SLURM allocates 1 CPU core per task. In order to make use of multiple CPU cores in a multithreaded program, one must include the --cpus-per-task option. The ACCRE cluster features 8-core and 12-core nodes, so a user can request up to 12 CPU cores per task. Below is an example of a multithreaded program requesting 4 CPU cores per task. The program itself is responsible for spawning the appropriate number of threads.

#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   # 4 threads per task
#SBATCH --time=02:00:00     # two hours
#SBATCH --mem=4G
#SBATCH --output=multithread.out
#SBATCH --job-name=multithreaded_example
# Load the most recent version of GCC available through Lmod
module load GCC
# Run multithreaded application (replace with your own executable)
./your_multithreaded_app
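Inside the application, the allocated core count is visible through the SLURM_CPUS_PER_TASK environment variable, which SLURM exports when --cpus-per-task is set. A minimal Python sketch of sizing a worker pool from it (the squaring workload is just a stand-in for a real computation):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# SLURM_CPUS_PER_TASK is set only when --cpus-per-task is requested;
# fall back to 1 when running outside a SLURM job.
n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

def work(x):
    # Placeholder workload; a real application would do something CPU-bound here.
    return x * x

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(work, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same pattern applies to OpenMP programs, where OMP_NUM_THREADS is typically set to $SLURM_CPUS_PER_TASK in the job script.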

Job Arrays

Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. SLURM provides job array environment variables that allow multiple versions of input files to be easily referenced. In the example below, three input files called vectorization_0.py, vectorization_1.py, and vectorization_2.py are used as input for three independent Python jobs:

#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=0-2
#SBATCH --output=python_array_job_slurm_%A_%a.out


# Load Anaconda distribution of Python
module load Anaconda2
python vectorization_${SLURM_ARRAY_TASK_ID}.py

The #SBATCH --array=0-2 line specifies the array size (3) and array indices (0, 1, and 2). These indices are referenced through the SLURM_ARRAY_TASK_ID environment variable in the final line of the SLURM batch script to independently analyze the three input files. Each Python instance will receive its own resource allocation; in this case, each instance is allocated 1 CPU core (and 1 node), 2 hours of wall time, and 2 GB of RAM.

One implication of allocating resources per task is that the node count will not apply across all tasks, so specifying --nodes=1 will not limit all tasks within an array to a single node. To limit the number of array tasks that run simultaneously, append %[MAX_TASKS] to the --array= option. For example, --array=0-100%4 will allow at most 4 tasks to run at once, so the 101 tasks execute in batches of 4 until all have completed.

The --array= option is flexible in terms of the index range and stride length. For instance, --array=0-10:2 would give indices of 0, 2, 4, 6, 8, and 10.
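The stride syntax behaves like a start:stop:step range. For illustration, the indices SLURM generates for --array=0-10:2 match Python's range with the same step (note that SLURM's upper bound is inclusive, so the Python stop value must be 11):

```python
# SLURM's --array=0-10:2 with an inclusive upper bound
indices = list(range(0, 11, 2))
print(indices)  # [0, 2, 4, 6, 8, 10]
```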

The %A and %a variables provide a method for directing standard output to separate files. %A references the SLURM_ARRAY_JOB_ID while %a references SLURM_ARRAY_TASK_ID. SLURM treats job ID information for job arrays in the following way: each task within the array has the same SLURM_ARRAY_JOB_ID, plus its own unique SLURM_JOBID and SLURM_ARRAY_TASK_ID. The JOBID shown by squeue is formatted as SLURM_ARRAY_JOB_ID followed by an underscore and SLURM_ARRAY_TASK_ID.
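The same identifiers are available to the program itself. A small sketch, using placeholder values for what SLURM would export inside task 1 of a hypothetical job 12345:

```python
import os

# In a real array task these are set by SLURM; the placeholders here are
# purely for illustration.
os.environ.setdefault("SLURM_ARRAY_JOB_ID", "12345")
os.environ.setdefault("SLURM_ARRAY_TASK_ID", "1")

job_id = os.environ["SLURM_ARRAY_JOB_ID"]
task_id = os.environ["SLURM_ARRAY_TASK_ID"]

# Matches the JOBID format shown by squeue, e.g. 12345_1
squeue_id = f"{job_id}_{task_id}"
# Matches the input file naming scheme from the script above
input_file = f"vectorization_{task_id}.py"
print(squeue_id, input_file)
```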

While the previous example provides a relatively simple method for running analyses in parallel, it can at times be inconvenient to rename files so that they may be easily indexed from within a job array. The following example provides a method for analyzing files with arbitrary names, provided they are all stored in a sub-directory named data:

#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=1-5   # In this example we have 5 files to analyze
#SBATCH --output=python_array_job_slurm_%A_%a.out
arrayfile=$(ls data/ | awk -v line="$SLURM_ARRAY_TASK_ID" 'NR == line')
module load Anaconda2 # load Anaconda distribution of Python
python data/$arrayfile
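The same one-file-per-task selection can also be done in Python itself, which avoids parsing ls output. The sorted() call makes the mapping from task ID to file deterministic, and the 1-based indexing matches the --array=1-5 range above; the demo directory below merely stands in for data/:

```python
import os
import tempfile

def pick_input(data_dir, task_id):
    """Return the file this array task should process (task IDs start at 1)."""
    files = sorted(os.listdir(data_dir))
    return os.path.join(data_dir, files[task_id - 1])

# Demonstration with a throwaway directory standing in for data/
demo_dir = tempfile.mkdtemp()
for name in ("alpha.py", "beta.py", "gamma.py"):
    open(os.path.join(demo_dir, name), "w").close()

chosen = pick_input(demo_dir, 2)
print(os.path.basename(chosen))  # beta.py
```

In a real array job, task_id would come from int(os.environ["SLURM_ARRAY_TASK_ID"]).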

More information can be found in SLURM's job array documentation.

GPU Jobs

ACCRE has 27 compute nodes equipped with Nvidia GPU cards for general-purpose GPU computing. The nodes are divided into two partitions depending on the type of GPU available on the node:

                        pascal partition              maxwell partition
number of nodes         15                            12
GPU                     4 x Nvidia Pascal             4 x Nvidia Maxwell
CPU cores               8 (Intel Xeon E5-2623 v4)     12 (Intel Xeon E5-2620 v3)
CUDA cores (per GPU)    3584                          3072
host memory             128 GB                        128 GB
GPU memory              12 GB                         12 GB
network                 56 Gbps RoCE                  56 Gbps RoCE
gres                    1 GPU + 2 CPUs                1 GPU + 3 CPUs

Users can request the desired number of GPUs by using SLURM generic resources, also called gres. Each gres bundles one GPU with multiple CPU cores (see table above) belonging to the same PCI Express root complex, which minimizes data transfer latency between host and GPU memory. The number of CPU cores requested cannot exceed the total number of cores in the requested gres.
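The constraint can be stated concretely. The per-gres CPU counts below come from the table above; the function itself is only an illustration of the rule, not an ACCRE tool:

```python
# CPU cores bundled with each GPU gres, per the partition table above.
CPUS_PER_GRES = {"pascal": 2, "maxwell": 3}

def max_cores(partition, n_gpus):
    """Upper bound on the CPU cores a job may request for a given GPU count."""
    return n_gpus * CPUS_PER_GRES[partition]

# Requesting 2 Pascal GPUs permits at most 4 CPU cores, which is exactly
# what the example job script below asks for.
print(max_cores("pascal", 2))   # 4
print(max_cores("maxwell", 4))  # 12
```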

Below is an example SLURM script header to request 2 Pascal GPUs and 4 CPU cores on a single node on the pascal partition:

#SBATCH --account=<your_gpu_account>
#SBATCH --partition=pascal
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=20G
#SBATCH --time=2:00:00
#SBATCH --output=gpu-job.log

Note that you must belong to one of the GPU groups on the cluster and specify that group via the --account option in the job script in order to run jobs on the GPU cluster. The #SBATCH --partition=<pascal OR maxwell> line is also required in the job script.

Several versions of the Nvidia CUDA API are available on the cluster and can be selected via Lmod:

[bob@vmps12]$ module spider cuda


[bob@vmps12]$ module load CUDA/8.0.61

There are currently a handful of applications available that allow you to leverage the low-latency RoCE network on the maxwell and pascal partitions. While the GPU nodes in both partitions support serial CPU execution as well as parallel CPU execution using either a multithreaded, shared memory model (e.g. with OpenMP) or a multi-process, distributed memory model (i.e. with MPI), both GPU partitions are intended for GPU jobs only, so users are not allowed to run purely CPU programs on them. Two flavors of RoCE-enabled MPI are available on the cluster, as well as Gromacs and HOOMD-Blue.

All jobs making use of a RoCE-enabled MPI distribution should use SLURM's srun command rather than mpirun/mpiexec. An example of a HOOMD-Blue job is available in the ACCRE documentation.

In order to build an MPI application for the maxwell partition, we recommend launching an interactive job on one of the maxwell nodes via salloc:

salloc --partition=maxwell --account=<group> --gres=gpu:1 --time=4:00:00 --mem=20G

To test your application without submitting a batch job, you can request an interactive session via salloc as shown above. Note that this will not work for multi-GPU applications that require the use of srun.

It is possible to check the status of the GPU compute nodes by using the qSummary and SlurmActive commands.