slurm - Queueing System

On the HC systems we use the queueing system slurm for submitting jobs to the cluster. An overview can be found here.

Frequently used commands:

queue overview       squeue -l
partition overview   sinfo -l
submit job           sbatch <jobscript>
cancel job           scancel <job_id>
job info (short)     squeue --job <job_id>
job info             scontrol show job <job_id>
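
As a short illustration of a typical workflow (the script name jobscript.sh and the job id are placeholders), submitting, monitoring and cancelling a job looks like this:

# submit the job script; sbatch prints the assigned job id
sbatch jobscript.sh

# list your own jobs and their state
squeue -u $USER

# show detailed information for one job
scontrol show job <job_id>

# cancel the job if it is no longer needed
scancel <job_id>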

For a comparison of slurm commands with other scheduling systems, see slurm.schedmd.com/rosetta.

magnitUDE

Accessing specific resources

There are three types of compute nodes at magnitUDE, which differ only in the main memory:

  • Normal Compute Nodes: 64 GB main memory - node[073-564]

  • FAT Compute Nodes: 128 GB main memory - node[001-072]

  • SUPERFAT Compute Nodes: 256 GB main memory - node[565-624]

Nodes of type FAT and SUPERFAT are available in all partitions; they are simply held back until no more Normal nodes are available.

The node types are requested via a flag in the slurm script:

...
#SBATCH --constraint="FAT"
...
...
#SBATCH --constraint="SUPERFAT"
...

Currently, FAT and SUPERFAT nodes can be requested by all users. However, if the flag shows up in too many batch scripts as a ‘default setting’, we reserve the right to introduce additional account requirements in the future.
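
For context, a minimal sketch of a complete job script that requests SUPERFAT nodes could look as follows (job name, node count, runtime and executable are placeholders):

#!/bin/bash

#SBATCH -J memtest                  # name of the job (placeholder)
#SBATCH -N 2                        # number of compute nodes
#SBATCH --tasks-per-node=24         # MPI processes per node
#SBATCH --constraint="SUPERFAT"     # request only the 256 GB nodes
#SBATCH -t 01:00:00                 # runtime limit

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

mpirun ./application                # placeholder executable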

OmniPath (OPA) is used as the interconnect. Each of the OPA switches has 48 ports. A two-tier network is implemented on the magnitUDE, consisting of:

  • 16 edge switches

  • 5 spine switches

36 compute nodes are connected to each of the edge switches; the remaining ports provide the connection to the spine switches. The 36 compute nodes of a switch can be requested as a network island via slurm. Excerpt from the man pages:

...
#SBATCH --switches=<count>[@<max-time>]
...

When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired switch count or the time limit expires.
If there is no switch count limit, there is no delay in starting the job.
Acceptable time formats include:

  • “minutes”,

  • “minutes:seconds”

  • “hours:minutes:seconds”

  • “days-hours”

  • “days-hours:minutes”

  • “days-hours:minutes:seconds”.

For example,

...
#SBATCH -N 36
#SBATCH --switches=1@1-12
...

places the job contiguously on a single network island. Slurm will wait up to 1.5 days (1-12 = one day and twelve hours) for this to become possible; after that, the job will be started without the switch restriction.

With the command scontrol show topology, you can also display the topology of the switches/nodes.

Example: Job script with MPI

#!/bin/bash

# name of the job
#SBATCH -J test_20
#SBATCH -N 64
#SBATCH --tasks-per-node=24
#SBATCH --cpus-per-task=2

# runtime after which the job will be killed (estimated)
#SBATCH -t 02:00:00

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# gnu openmp: one OpenMP thread per MPI process (pure MPI run)
export OMP_NUM_THREADS=1

time mpirun  -np 1536 --report-bindings --bind-to core --rank-by core  ./PsiPhi > log 2>&1

#last line

amplitUDE

There are four types of compute nodes at amplitUDE, which differ in main memory and GPU equipment:

  • STD Compute CPU nodes: 512 GB main memory - ndstd[001-176]

  • HIM Compute CPU nodes: 1024 GB main memory - ndhim[001-051]

  • SHM Compute CPU nodes: 2048 GB main memory - ndshm[001-013]

  • GPU Compute GPU-CPU nodes: 1024 GB main memory - ndgpu[001-019]

Note: Currently, only the GPU-CPU nodes are available.

Partitions

node spec    partition   job time limit   hostname         count
GPU-nodes    testgpu*    02:00:00         ndgpu[001-017]   17 (4 GPUs per node)
                                          ndgpu[018-019]   2 (2 GPUs per node)

*: default partition, if none is specified in the submission script
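
A partition can also be requested explicitly in the submission script; a minimal sketch (partition name taken from the table above):

#SBATCH --partition=testgpu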

Usage of GPU in computations

GPUs can be requested in the slurm submit script using #SBATCH --gres=gpu:2, where the number after the colon specifies the amount of GPUs requested per node (here 2).

The environment variable $CUDA_VISIBLE_DEVICES is then set accordingly during the execution of the job.
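
As a sketch (job name, resource numbers and executable are placeholders), a job script requesting GPUs could look like this:

#!/bin/bash

#SBATCH -J gpujob                   # name of the job (placeholder)
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --ntasks-per-node=1         # number of tasks on the node
#SBATCH --gres=gpu:2                # request 2 GPUs on the node
#SBATCH --time=01:00:00             # max. run-time

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# the GPUs assigned by slurm are listed in CUDA_VISIBLE_DEVICES
echo "Assigned GPUs: $CUDA_VISIBLE_DEVICES"

./gpu_application                   # placeholder executable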

Example: Job script with MPI

Define the number of MPI processes that should be started via the number of nodes (--nodes) and the number of MPI processes per node (--ntasks-per-node).

In this example, the executable will be run on 2 nodes with 72 MPI processes per node.

#!/bin/bash

#SBATCH -J jobname                         # name of the job
#SBATCH --nodes=2                          # number of compute nodes
#SBATCH --ntasks-per-node=72               # number of MPI processes per compute node
#SBATCH --time=2:00:00                     # max. run-time
#SBATCH --mail-type=ALL                    # all events are reported via e-mail
#SBATCH --mail-user=your-email@uni-due.de  # user's e-mail address

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# Starting parallel application using MPI
mpirun ./application

Notes on hyperthreading being enabled on the cluster

In the current default setting, SLURM schedules jobs using hyperthreads. Hyperthreading has been activated on the entire cluster, which means that each physical core corresponds to two logical cores (hyperthreads). However, the Open MPI libraries installed on the cluster bind the parallel processes to the physical cores by default, which is currently causing problems.

To ensure that the Open MPI libraries also use the hyperthreads (“logical cores”), this currently has to be specified explicitly in the mpirun command:

mpirun --use-hwthread-cpus ...

If only one process per physical core is to be started, this must be specified in SLURM instead:

sbatch --hint=nomultithread ...
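
To illustrate the two variants, the relevant lines of a job script could look like this (a sketch; job name and executable are placeholders, and 72 physical cores per node are assumed as in the example above):

#!/bin/bash

#SBATCH -J ht_example               # name of the job (placeholder)
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --ntasks-per-node=72        # one task per physical core
#SBATCH --hint=nomultithread        # variant 2: do not schedule on hyperthreads
#SBATCH --time=01:00:00             # max. run-time

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# variant 1 (without --hint=nomultithread above): let Open MPI treat
# hyperthreads as independent cpus
# mpirun --use-hwthread-cpus ./application

# variant 2 (with --hint=nomultithread): one process per physical core
mpirun ./application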