slurm - Queueing System
On the HC systems we use the queueing system slurm for submitting jobs to the cluster.
An overview can be found in the official slurm documentation at slurm.schedmd.com.
Frequently used commands:

| purpose | command |
|---|---|
| queue overview | squeue |
| partition overview | sinfo |
| submit job | sbatch <jobscript> |
| cancel job | scancel <jobid> |
| job info (short) | squeue -j <jobid> |
| job info (detailed) | scontrol show job <jobid> |
| allocate CPU resources interactively | salloc <options> |
| allocate GPU resources interactively | salloc --gres=gpu:<n> <options> |
For a comparison of slurm commands with other scheduling systems see slurm.schedmd.com/rosetta.
magnitUDE
Accessing specific resources
There are three types of compute nodes at magnitUDE, which differ only in the main memory:
Normal Compute Nodes: 64 GB main memory - node[073-564]
FAT Compute Nodes: 128 GB main memory - node[001-072]
SUPERFAT Compute Nodes: 256 GB main memory - node[565-624]
Nodes of type FAT and SUPERFAT are available in all partitions; they are just held back until no more Normal nodes are available.
The node types are requested by a flag in the slurm script:
...
#SBATCH --constraint="FAT"
...
...
#SBATCH --constraint="SUPERFAT"
...
Currently, FAT and SUPERFAT nodes can be requested by all users. However, if the flag appears in too many batch scripts as a ‘default setting’, we reserve the right to introduce additional account requirements in the future.
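For illustration, the constraint is simply added to the other resource requests in the batch script (node count and run time below are placeholder values):
...
#SBATCH -N 4                # number of nodes (placeholder value)
#SBATCH -t 01:00:00         # run time (placeholder value)
#SBATCH --constraint="FAT"  # request nodes with 128 GB main memory
...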
OmniPath (OPA) is used as the interconnect. Each of the OPA switches has 48 ports. A two-tier network is implemented on the magnitUDE, consisting of:
16 edge switches
5 spine switches
36 compute nodes are connected to each of the edge switches; the remaining ports provide the connection to the spine switches. The 36 compute nodes of a switch can be requested as a network island via slurm. Excerpt from the man pages:
...
#SBATCH --switches=<count>[@<max-time>]
...
When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job.
Acceptable time formats include: “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
For example,
...
#SBATCH -N 36
#SBATCH --switches=1@1-12
...
places the job contiguously on a single network island. Slurm will wait up to 1.5 days for this to become possible; after that, the job will be started without the switch restriction.
With the command scontrol show topology, you can also show the topology of the switches/nodes.
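For instance (the node name is only an example):

scontrol show topology           # list all switches and the nodes attached to them
scontrol show topology node073   # show only the switch connections of a single node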
Example: Job script with MPI
#!/bin/bash
# name of the job
#SBATCH -J test_20
#SBATCH -N 64
#SBATCH --tasks-per-node=24
#SBATCH --cpus-per-task=2
# runtime after which the job will be killed (estimated)
#SBATCH -t 02:00:00
# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR
# gnu openmp
export OMP_NUM_THREADS=1
# start 1536 MPI processes (64 nodes x 24 tasks per node), bind ranks to cores, write output to 'log'
time mpirun -np 1536 --report-bindings --bind-to core --rank-by core ./PsiPhi > log 2>&1
#last line
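Such a script, saved for example as jobscript.sh (the file name is arbitrary), is then handled with the usual slurm commands:

sbatch jobscript.sh   # submit the job
squeue -u $USER       # check the status of your jobs
scancel <jobid>       # cancel a job if necessary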
amplitUDE
There are four types of compute nodes at amplitUDE, which differ in their main memory (the GPU nodes additionally contain GPUs):
STD Compute CPU nodes: 512 GB main memory - ndstd[001-176]
HIM Compute CPU nodes: 1024 GB main memory - ndhim[001-051]
SHM Compute CPU nodes: 2048 GB main memory - ndshm[001-013]
GPU Compute GPU-CPU nodes: 1024 GB main memory - ndgpu[001-019]
Partitions
| partition | job time limit | hostname | max. RAM allocation per node | num. of allocatable nodes |
|---|---|---|---|---|
| STD-s-96h* | 4-00:00:00 | ndstd[001-184] ndhim[001-051] ndshm[001-013] | 512 GB | 1 - 45 |
| STD-m-48h | 2-00:00:00 | ndstd[001-184] ndhim[001-051] ndshm[001-013] | 512 GB | 46 - 92 |
| STD-l-12h | 0-12:00:00 | ndstd[001-184] ndhim[001-051] ndshm[001-013] | 512 GB | 93 - 248 |
| HIM-s-96h | 4-00:00:00 | ndhim[001-051] | 1024 GB | 1 - 11 |
| HIM-m-48h | 2-00:00:00 | ndhim[001-051] | 1024 GB | 12 - 25 |
| HIM-l-12h | 0-12:00:00 | ndhim[001-051] | 1024 GB | 26 - 51 |
| SHM-s-96h | 4-00:00:00 | ndshm[001-013] | 2048 GB | 1 - 6 |
| SHM-l-24h | 1-00:00:00 | ndshm[001-013] | 2048 GB | 7 - 13 |
| GPU-big | 2-00:00:00 | ndgpu[001-008,011-019] | 1024 GB | 1 |
| GPU-small | 2-00:00:00 | ndgpu[009-010] | 1024 GB | 1 |
*: default partition, if none is specified in the submission script
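A partition is selected explicitly with the --partition option in the submission script, e.g. (the partition name below is only an example):
...
#SBATCH --partition=STD-m-48h
...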
Usage of GPUs in computations
GPUs can be requested in the slurm submit script using
#SBATCH --gres=gpu:2
where the number after gpu: specifies how many GPUs are requested (here 2).
The environment variable $CUDA_VISIBLE_DEVICES is then set accordingly during the execution of the job.
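As a minimal sketch, a GPU job script could look like the following; the partition GPU-small is taken from the table above, all other values and the executable name are illustrative:
#!/bin/bash
#SBATCH -J gpu_test              # name of the job (illustrative)
#SBATCH --partition=GPU-small    # GPU partition (see partition table above)
#SBATCH --nodes=1                # one compute node
#SBATCH --ntasks-per-node=1      # one task on that node
#SBATCH --gres=gpu:2             # request 2 GPUs
#SBATCH --time=01:00:00          # max. run-time (illustrative)
# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR
# $CUDA_VISIBLE_DEVICES is set by slurm to the allocated GPUs
./gpu_application                # placeholder for the actual GPU executable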
Example: Job script with MPI
Define the number of MPI processes to be started via the number of nodes (--nodes) and the number of MPI processes per node (--ntasks-per-node).
In this example, the executable will be run on 2 nodes with 72 MPI processes per node.
#!/bin/bash
#SBATCH -J jobname # name of the job
#SBATCH --nodes=2 # number of compute nodes
#SBATCH --ntasks-per-node=72 # number of MPI processes per compute node
#SBATCH --time=2:00:00 # max. run-time
#SBATCH --mail-type=ALL # all events are reported via e-mail
#SBATCH --mail-user=your-email@uni-due.de # user's e-mail address
# Change to the directory the job was submitted from
cd $SLURM_SUBMIT_DIR
# Starting parallel application using MPI
mpirun ./application