slurm - Queueing System

On the HC systems we use the queueing system slurm for submitting jobs to the cluster. An overview can be found in the official slurm documentation (slurm.schedmd.com).

Frequently used commands:

  • queue overview: squeue -l

  • partition overview: sinfo -l

  • submit job: sbatch <jobscript>

  • cancel job: scancel <job_id>

  • job info (short): squeue --job <job_id>

  • job info (detailed): scontrol show job <job_id>

  • allocate CPU resources interactively: salloc --partition=<partition_name> --time=<time> --mem=<memory> --cpus-per-task=<cpus>

  • allocate GPU resources interactively: salloc --partition=<gpu_partition_name> --time=<time> --gres=gpu:<num_gpus> --mem=<memory>
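For example, an interactive allocation of 4 cores with 4 GB of memory for one hour could be requested as follows (the partition name is a placeholder and has to be replaced by a partition of the target system):

salloc --partition=<partition_name> --time=01:00:00 --mem=4G --cpus-per-task=4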

For a comparison of slurm commands with those of other scheduling systems, see slurm.schedmd.com/rosetta.

magnitUDE

Accessing specific resources

There are three types of compute nodes on magnitUDE, which differ only in their main memory:

  • Normal Compute Nodes: 64 GB main memory - node[073-564]

  • FAT Compute Nodes: 128 GB main memory - node[001-072]

  • SUPERFAT Compute Nodes: 256 GB main memory - node[565-624]

Nodes of type FAT and SUPERFAT are available in all partitions; they are simply held back until no more Normal nodes are available.

The node types are requested via a flag in the slurm script:

...
#SBATCH --constraint="FAT"
...
...
#SBATCH --constraint="SUPERFAT"
...

Currently, FAT and SUPERFAT nodes can be requested by all users. However, if the flag turns up in too many batch scripts as a ‘default setting’, we reserve the right to introduce additional account requirements in the future.
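For illustration, a complete minimal batch script requesting FAT nodes could look like the following sketch (job name, node count, runtime and executable are placeholders, not recommended settings):

#!/bin/bash
# minimal sketch: request 128 GB (FAT) nodes on magnitUDE
#SBATCH -J fat_example
#SBATCH -N 2
#SBATCH --tasks-per-node=24
#SBATCH -t 01:00:00
#SBATCH --constraint="FAT"       # use "SUPERFAT" for the 256 GB nodes

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

mpirun ./application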

OmniPath (OPA) is used as the interconnect. Each of the OPA switches has 48 ports. A two-tier network is implemented on magnitUDE, consisting of:

  • 16 edge switches

  • 5 spine switches

36 compute nodes are connected to each of the edge switches; the remaining ports provide the uplinks to the spine switches. The 36 compute nodes of a switch can be requested as a network island via slurm. Excerpt from the man pages:

...
#SBATCH --switches=<count>[@<max-time>]
...

When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired switch count or the time limit expires.
If there is no switch count limit, there is no delay in starting the job.
Acceptable time formats include:

  • “minutes”

  • “minutes:seconds”

  • “hours:minutes:seconds”

  • “days-hours”

  • “days-hours:minutes”

  • “days-hours:minutes:seconds”

For example,

...
#SBATCH -N 36
#SBATCH --switches=1@1-12
...

places the job contiguously on a single network island. Slurm will wait up to 1.5 days for this to become possible; after that, the job will be started without the switch restriction.

With the command scontrol show topology, you can also display the topology of the switches/nodes.

Example: Job script with MPI

#!/bin/bash

# name of the job
#SBATCH -J test_20
#SBATCH -N 64
#SBATCH --tasks-per-node=24
#SBATCH --cpus-per-task=2

# runtime after which the job will be killed (estimated)
#SBATCH -t 02:00:00

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# gnu openmp: use a single OpenMP thread per MPI process (pure MPI run)
export OMP_NUM_THREADS=1

# 64 nodes x 24 tasks per node = 1536 MPI processes
time mpirun -np 1536 --report-bindings --bind-to core --rank-by core ./PsiPhi > log 2>&1

#last line
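The script is then submitted and monitored with the commands listed at the top of this page, for example (the file name jobscript.sh is a placeholder):

sbatch jobscript.sh
squeue --job <job_id>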

amplitUDE

There are four types of compute nodes on amplitUDE, which differ in their main memory and, in the case of the GPU nodes, in the installed GPUs:

  • STD Compute CPU nodes: 512 GB main memory - ndstd[001-176]

  • HIM Compute CPU nodes: 1024 GB main memory - ndhim[001-051]

  • SHM Compute CPU nodes: 2048 GB main memory - ndshm[001-013]

  • GPU Compute GPU-CPU nodes: 1024 GB main memory - ndgpu[001-019]

Partitions

partition     job time limit    hostname                                          max. RAM allocation per node    num. of allocatable nodes
STD-s-96h*    4-00:00:00        ndstd[001-184], ndhim[001-051], ndshm[001-013]    512 GB                          1 - 45
STD-m-48h     2-00:00:00        ndstd[001-184], ndhim[001-051], ndshm[001-013]    512 GB                          46 - 92
STD-l-12h     0-12:00:00        ndstd[001-184], ndhim[001-051], ndshm[001-013]    512 GB                          93 - 248
HIM-s-96h     4-00:00:00        ndhim[001-051]                                    1024 GB                         1 - 11
HIM-m-48h     2-00:00:00        ndhim[001-051]                                    1024 GB                         12 - 25
HIM-l-12h     0-12:00:00        ndhim[001-051]                                    1024 GB                         26 - 51
SHM-s-96h     4-00:00:00        ndshm[001-013]                                    2048 GB                         1 - 6
SHM-l-24h     1-00:00:00        ndshm[001-013]                                    2048 GB                         7 - 13
GPU-big       2-00:00:00        ndgpu[001-008,011-019]                            1024 GB                         1
GPU-small     2-00:00:00        ndgpu[009-010]                                    1024 GB                         1

*: default partition if none is specified in the submission script
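A partition is selected with the --partition option in the submission script. As a partial example in the style above (SHM-s-96h is one of the partitions from the table; the requested time must stay within the partition's job time limit):

...
#SBATCH --partition=SHM-s-96h
#SBATCH --time=3-00:00:00
...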

Usage of GPU in computations

GPUs can be requested in the slurm submit script using #SBATCH --gres=gpu:2, where the number after the colon specifies how many GPUs are requested (here 2).

The environment variable $CUDA_VISIBLE_DEVICES is then set accordingly during the execution of the job.
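As a minimal sketch (job name, runtime and executable are placeholders; GPU-small is one of the partitions from the table above), a GPU job script could look like this:

#!/bin/bash
# minimal sketch of a GPU job on amplitUDE
#SBATCH -J gpu_example
#SBATCH --partition=GPU-small
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2                 # request two GPUs
#SBATCH --time=0-02:00:00

cd $SLURM_SUBMIT_DIR

# slurm sets $CUDA_VISIBLE_DEVICES to the allocated GPU IDs
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
./gpu_application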

Example: Job script with MPI

The total number of MPI processes is defined via the number of nodes (--nodes) and the number of MPI processes per node (--ntasks-per-node).

In this example, the executable will be run on 2 nodes with 72 MPI processes per node, i.e. 144 MPI processes in total.

#!/bin/bash

#SBATCH -J jobname                         # name of the job
#SBATCH --nodes=2                          # number of compute nodes
#SBATCH --ntasks-per-node=72               # number of MPI processes per compute node
#SBATCH --time=2:00:00                     # max. run-time
#SBATCH --mail-type=ALL                    # all events are reported via e-mail
#SBATCH --mail-user=your-email@uni-due.de  # user's e-mail address

# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR

# Starting parallel application using MPI
mpirun ./application
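If the MPI installation is integrated with slurm (as the example above suggests, since mpirun is started without an explicit -np flag), the number of processes is taken from the allocation, i.e. 2 x 72 = 144 in this case. After submission, the requested resources can be verified, for example with:

scontrol show job <job_id>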