Introduction to the Slurm HPC cluster

The primary source for documentation on Slurm usage and commands is the Slurm site. Please also consult the man pages of the Slurm commands, e.g. typing man sbatch will give you extensive information on the sbatch command.

Quick Start for the impatient

Log in to one of the head nodes using ssh:
- the login node for sheldon cluster: sheldon.physik.fu-berlin.de
Create a job script file to be run by the queuing system, supply information like:
- how much memory to allocate for your job
- how many CPU cores your job needs to run
- how long you expect your job to run
- where to and when you want the system to send mail
- where output should be written to
Submit your job script using the sbatch command

Example of a very basic job script

Consider the following bash script with #SBATCH comments, which tell Slurm what resources you need:

example_job.sh

#!/bin/bash
 
#SBATCH --job-name=example_job         # Job name, will show up in squeue output
#SBATCH --ntasks=1                     # Number of individual tasks, usually 1 except when using MPI, etc.
#SBATCH --cpus-per-task=1              # Number of CPUs
#SBATCH --nodes=1                      # Number of nodes, usually 1 except when using MPI, etc.
#SBATCH --time=0-00:01:00              # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=100              # Memory per cpu in MB (see also --mem) 
#SBATCH --output=job_%j.out            # File to which standard out will be written
#SBATCH --error=job_%j.err             # File to which standard err will be written
#SBATCH --mail-type=END                # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=j.d@fu-berlin.de   # Email to which notifications will be sent 
 
# store job info in output file, if you want...
scontrol show job $SLURM_JOBID
 
# run your program...
hostname
 
# wait some time...
sleep 50

Please note that your job will be killed by the queueing system if it tries to use more memory than requested or if it runs longer than the time specified in the batch script. So to be on the safe side you can set these values a little bit higher. If you set the values too high, your job might not start because there are not enough resources (e.g. no machine has the amount of memory you are asking for).

Now just submit your job script using sbatch example_job.sh from the command line. Please try to run jobs directly from the cluster-wide /scratch filesystem, where you have a directory under /scratch/<username>, to lower the load on /home. For testing purposes, set the runtime of your job below 1 minute and submit it to the test partition by adding -p test to sbatch:

dreger@sheldon-ng:..dreger/quickstart> pwd
/scratch/dreger/quickstart
dreger@sheldon-ng:..dreger/quickstart> sbatch -p test example_job.sh
Submitted batch job 26495
dreger@sheldon-ng:..dreger/quickstart> squeue -l -u dreger
Sun Jun 29 23:02:50 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
             26495      test     job1   dreger  RUNNING       0:24      1:00      1 x001
dreger@sheldon-ng:..dreger/quickstart> cat job_26495.out
JobId=26495 Name=example_job
   UserId=dreger(4440) GroupId=fbedv(400)
   Priority=10916 Account=fbedv QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2014-06-29T23:02:26 EligibleTime=2014-06-29T23:02:26
   StartTime=2014-06-29T23:02:26 EndTime=2014-06-29T23:03:26
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=test AllocNode:Sid=sheldon-ng:27448
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/clusterfs/scratch/dreger/quickstart/example_job.sh
   WorkDir=/clusterfs/scratch/dreger/quickstart

x001

A less quick start

You want to run something. Some binary (e.g. my_prog) with some parameters (e.g. param1, param2, param3).

#!/bin/bash
#
#SBATCH --time=0-00:01:00                    # Runtime in DAYS-HH:MM:SS format
#
#SBATCH --ntasks=1                           # Number of processes
#SBATCH --cpus-per-task=1                    # Number of cores
#SBATCH --mem-per-cpu=100                    # Memory per cpu in MB (see also --mem)
#
#SBATCH --chdir=/scratch/username
#
#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL
 
my_prog param1 param2 param3

Save that to a file (e.g. my_first_job.sh) on the cluster node and submit it to the cluster manager via sbatch my_first_job.sh. You'll get output similar to this:

Submitted batch job 12345

Now you can wait for the job to run and finish. You'll receive a mail once it does.

Or if you change your mind, you cancel the job with the help of the number returned by sbatch above: scancel 12345.
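If you would rather pick up the job ID in a script than copy it by hand, sbatch has a --parsable option that prints only the ID, optionally followed by a semicolon and the cluster name. The following is a minimal sketch under the assumption that the job script is called my_first_job.sh as above; the helper name strip_cluster is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: sbatch --parsable prints "jobid" or "jobid;cluster",
# so cut everything from the first semicolon on.
strip_cluster() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    # Submit the job and remember its ID for later use
    job_id=$(strip_cluster "$(sbatch --parsable my_first_job.sh)")
    echo "submitted job $job_id, cancel it with: scancel $job_id"
else
    echo "sbatch not found, this is probably not a cluster login node"
fi
```

The stored ID can then be handed to scancel or scontrol without parsing the "Submitted batch job" sentence.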

What's happening?

What's happening in the above example? Let's go line by line:

It's a shell script, specifically a bash script, as evidenced by the first line

#!/bin/bash

This means that you can do in it basically everything you can do in a terminal.

The things prepended by # symbols are comments (which you probably don't use in interactive shell sessions all that much). The cluster manager will read options from the special comments of the form #SBATCH as long as they are before any other line that is executed, i.e. the first line that is not a comment. Let's look at the four sections of comments

#SBATCH --time=0-00:01:00                    # Runtime in DAYS-HH:MM:SS format

The first sets the time after which the job will be killed. In this case this is one minute. Setting this is optional, but strongly recommended. It defaults to the maximum time allowed, which for our cluster is two weeks, which in turn will make your job hard to schedule.
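To double-check a time limit before submitting, it can help to convert it to seconds. The function below is only a sketch: the name slurm_time_to_seconds is made up, and it handles nothing but the DAYS-HH:MM:SS form used on this page, although Slurm accepts several other spellings.

```shell
#!/bin/bash
# Convert a time limit in DAYS-HH:MM:SS format to seconds.
slurm_time_to_seconds() {
    local spec=$1 days hours minutes seconds
    days=${spec%%-*}                                   # part before the dash
    IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    # 10# forces base 10, so that values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

slurm_time_to_seconds 0-00:01:00     # one minute, as in the example
slurm_time_to_seconds 14-00:00:00    # two weeks, the maximum mentioned above
```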

#SBATCH --ntasks=1                           # Number of processes
#SBATCH --cpus-per-task=1                    # Number of cores
#SBATCH --mem-per-cpu=100                    # Memory per cpu in MB (see also --mem)

The second section describes your job. We will say more about this in Getting a spot, but here we say our program will run a single process (e.g. one Python process or a Fortran program we wrote, but not something using MPI). That process will receive a single CPU core (Slurm calls cores CPUs and CPUs sockets, because terminology has never confused anybody) and 100MiB of memory in total.

#SBATCH --chdir=/scratch/username

The third section sets the working directory of the job. By default the working directory of the job, i.e. the directory relative to which all actions in the job script are taken, will be wherever you ran sbatch (not the directory where the job script is situated). This directory has to exist at the time the job starts, otherwise your job will fail. If you want to create a directory at job runtime, you will need to cd in the body of the script. Beware, whatever your job prints, which by default ends up in a file slurm-<JOBID>.out (you can change this via #SBATCH --output=my_output_filename), will end up in the working directory, so to have everything neat and orderly, create directories beforehand.
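Creating a directory at job runtime and changing into it could look like the sketch below. The directory name run_<JOBID> is an assumption made for this example; SLURM_JOB_ID is set by Slurm inside a job, and the fallback "manual" only exists so that the script can be tried outside of one.

```shell
#!/bin/bash
#SBATCH --time=0-00:01:00
#SBATCH --ntasks=1

# Hypothetical layout: one directory per job below the working directory
run_dir="run_${SLURM_JOB_ID:-manual}"
mkdir -p "$run_dir"

# Stop right away if the directory cannot be entered
cd "$run_dir" || exit 1

# Everything the program writes with relative paths now ends up in run_dir
pwd
```

The slurm-<JOBID>.out file itself is still written to the directory the job started in, since it is opened before the script body runs.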

#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL

The last section tells the cluster to send a mail to user@physik.fu-berlin.de (you had better change user to your username) when the job ends.

The order of these comments is not really important, as long as they all come before the first non-comment line.

Some Terminology

You can come back to this, if you are confused by some terminology in Slurm's documentation or if you want to confuse yourself.

Slurm is the cluster manager, a program that runs other programs on behalf of users.
A partition is a group of nodes and a queue. We have three partitions: main, to which jobs are submitted by default; test, which has a much shorter maximum runtime of two hours; and virtual, which uses virtual machines instead of physical nodes.
A node is a host on which Slurm can run jobs.
A queue is a list of jobs, some of them running, some of them waiting to run.
A job is a group of steps. It is usually created via sbatch. It is identified by a Job ID and runs on an allocation.
An allocation is a group of resources (nodes, CPUs and memory on those nodes, GPUs, licenses).
A step is a computation that consists of tasks. It may use less than a job's full allocation. It can be created via srun.
A task is one process, but also a number of CPUs, since CPUs are allocated per task.
A CPU is a core on a physical CPU, which is called a socket (except when it is not; as a rule of thumb, if you are requesting something, go with this definition; if Slurm is supposed to decide something, a core is a core and a CPU is a CPU).

Getting information

Now that you know how to run something, let's give you some tools so that you get to know your environment.

sinfo will show you information about partitions and nodes. The default output is partition centric, whereas sinfo --Node is node-centric. With this command you can learn which nodes are running, which are not, which ones are idle (and therefore free to take your calculations) or only partly used and (using some more options), which

squeue will show you the current queue. One important option here is --user= to show only the jobs of the named user.

sacct can show you accounting information for jobs and users, e.g. which jobs succeeded and which did not, and on which node they ran. The output of sacct might seem empty most of the time, since it defaults to only showing today's jobs; use --starttime and --endtime to specify a timeframe. sacct is particularly helpful for checking how many resources your job used, so you can maybe ask for smaller allocations, allowing you to run more jobs.
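As an illustration of looking beyond today's jobs, the sketch below asks sacct for everything since one week ago and selects a few columns, among them MaxRSS (peak memory of a step) next to ReqMem (what was requested). It assumes GNU date for the "7 days ago" arithmetic; the choice of columns is just one possibility, and --starttime is the full spelling of the option.

```shell
#!/bin/bash
# Start of the reporting window, one week back, in YYYY-MM-DD form (GNU date)
start=$(date -d '7 days ago' +%Y-%m-%d)

if command -v sacct >/dev/null 2>&1; then
    # Compare Elapsed and MaxRSS with what the job asked for
    sacct --starttime="$start" \
          --format=JobID,JobName,State,Elapsed,MaxRSS,ReqMem,NodeList
else
    echo "sacct not found, would have queried jobs since $start"
fi
```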

All these commands can be tweaked to put out what you need or to use them in scripts. Important options common to all of them are

--long / -l for longer output,
--noheader (sometimes -h, sometimes -n), to omit the explanatory header,
--format (usually -o) to format the output yourself.

Have a look at the manpages for all options and specifically for the --format specifiers.

scontrol is mostly an administrative tool, but the show command is also useful for users. It can be combined with the name of an entity to show the information Slurm has about it, e.g.

# show information about Job 12345
scontrol show JobID=12345
# show information on a node z001
scontrol show Node=z001

Getting a spot

In our example we ran a very simple program. It was just a single process on a single node.

Mind you, with all these settings you are only asking for allocations; this will not magically make your program use it all. You can ask for 20 cores and only use a single one, which is not a good use of resources. You can also ask for less than you use, but then the processes and/or threads will fight over the few resources.

A good strategy is to first run a shorter test version of your problem, preferably on the test partition (submit with sbatch --partition=test my_job.sh), to test your assumptions about how many resources your job is using.

Multi-Threading

If your program uses multiple threads, e.g. OpenBLAS when you do linear algebra with NumPy, increase --cpus-per-task, e.g.

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

to give your job four cores (and a total of 400MiB of memory if you keep --mem-per-cpu unchanged from our initial example).
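Asking for four cores does not by itself tell the program how many threads to start. Many threaded libraries, OpenBLAS and OpenMP programs among them, look at the OMP_NUM_THREADS environment variable, so one common approach is to derive it from the allocation. This is a sketch and assumes your program honours that variable:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# SLURM_CPUS_PER_TASK is only set if --cpus-per-task was given, hence the fallback to 1
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "running with $OMP_NUM_THREADS thread(s)"

# my_prog would be started here and pick up OMP_NUM_THREADS
```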

Multiple Processes

If you want to run multiple processes, adjust --ntasks, e.g.

#SBATCH --nodes=1
#SBATCH --ntasks=5
 
srun my_prog

to run five processes. We add --nodes=1 here to ensure that all processes will run on a single machine. If you want to mix and match this with multi-threading, just go right ahead:

#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
 
srun my_prog

which would run five processes, using four cores each, for a total allocation of 20 cores. This is all on a single node, though. If you want to use multiple nodes (and your calculation supports this):

#SBATCH --ntasks=23
#SBATCH --ntasks-per-node=8
#SBATCH --nodes=3
 
srun my_prog

This would run 23 processes distributed on three nodes, with no more than eight processes on a given node.
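The arithmetic behind these requests is simple enough to check before submitting. The function check_request below is a made-up helper, not a Slurm tool: it multiplies nodes by tasks per node to get the number of task slots and tasks by cores per task to get the total number of cores.

```shell
#!/bin/bash
# Arguments: nodes, tasks per node, tasks, cores per task
check_request() {
    local nodes=$1 tasks_per_node=$2 ntasks=$3 cpus_per_task=$4
    local capacity=$(( nodes * tasks_per_node ))
    local cores=$(( ntasks * cpus_per_task ))
    if (( ntasks <= capacity )); then
        echo "$ntasks tasks fit into $capacity task slots, $cores cores in total"
    else
        echo "$ntasks tasks do not fit into $capacity task slots"
    fi
}

check_request 3 8 23 1   # the multi-node example: 23 tasks, at most 8 per node
check_request 1 5 5 4    # five processes with four cores each on one node
                         # (5 tasks per node is assumed, the example sets no limit)
```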

Prefixing your programs with srun will also take care of all necessary MPI-setup (it doesn't on tron, though, since it's too old, there you need to use something along the lines of mpiexec -n $SLURM_NTASKS).

Jobarrays

Some problems are just embarrassingly parallel. Their parts stand on their own and don't need anything else. A good example is farming a large parameter space for some output. We have seen a lot of people generate job files programmatically to solve this problem, submitting them right away and deleting them right afterwards, but Slurm comes with support for this right out of the box: Job Arrays.

#SBATCH --array=1-10

The above example will create a job that is run 10 times and the different instances can be told apart by the value of the environment variable SLURM_ARRAY_TASK_ID, i.e. (to have a short example)

#SBATCH --array=1-10
 
my_program $SLURM_ARRAY_TASK_ID

will be run 10 times, with values 1 to 10 (both inclusive) for SLURM_ARRAY_TASK_ID.

Since most programs will not use integer indices as input, you will have to somehow map the index to your inputs. You can do this e.g. via arrays

#SBATCH --array=0-3
declare -a parameters
parameters=(
    "arg0a arg0b arg0c"
    "arg1a arg1b arg1c"
    "arg2a arg2b arg2c"
    "arg3a arg3b arg3c"
)
 
my_program ${parameters[$SLURM_ARRAY_TASK_ID]}

A (better) alternative would be to have my_program read its parameters from a file, whose name you determine from the value of SLURM_ARRAY_TASK_ID.
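The file-based variant could be sketched as follows. The naming scheme input_<index>.txt is an assumption for this example, and the input files are generated inside the script only to keep it self-contained; in a real setup they would already exist in your working directory.

```shell
#!/bin/bash
#SBATCH --array=0-3

# Hypothetical layout: one input file per array task, named input_<index>.txt
input_dir=$(mktemp -d)
for i in 0 1 2 3; do
    echo "arg${i}a arg${i}b arg${i}c" > "$input_dir/input_$i.txt"
done

# SLURM_ARRAY_TASK_ID is set by Slurm; default to 0 for a dry run outside a job
task_id=${SLURM_ARRAY_TASK_ID:-0}
input_file="$input_dir/input_${task_id}.txt"

# A real job would now run: my_program "$input_file"
echo "task $task_id reads $(basename "$input_file"): $(cat "$input_file")"
```

Keeping the parameters in files means the job script does not have to change when the parameter set does, only the --array range.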

Job arrays can be mixed with all of the above (multi-threading, multiple processes). As many array instances as possible will run concurrently.

The minimum array index is 0. You can also use comma-separated lists (0,3,15,45), ranges with steps (0-15:4 is equal to 0,4,8,12) and limits on concurrency (0-15%4 will run 0 to 15, but no more than four at a time).
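If you are unsure which indices a range with a step produces, seq can list them, since it takes the same three numbers in the order first, step, last. This is only a quick check and assumes GNU seq for the -s option:

```shell
#!/bin/bash
# 0-15:4 means "from 0 to 15 in steps of 4"
indices=$(seq -s, 0 4 15)
echo "--array=0-15:4 is the same as --array=$indices"
```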

Heterogeneous Jobs

So far all examples were homogeneous, one program would use all the resources, either monolithically or possibly via multiple instances of itself (MPI). But you can also do your own sub-allocations, using job-control in the shell

#SBATCH --nodes=2
#SBATCH --ntasks=4
# this will use the whole allocation
srun calcprog
# each of these uses half of the allocation, both run at the same time
srun --nodes=1 --ntasks=2 --exclusive postprocess1 &
srun --nodes=1 --ntasks=2 --exclusive postprocess2 &
wait

This will run calcprog with all resources and once it is done, it will run postprocess1 and postprocess2 at the same time on half of the resources each.

You can also control this with srun's --multi-prog option. Have a look at the man page if you have setups where, e.g., one process is primary and the others are workers.

Jobs using GPUs

If you need a GPU you need to specifically request it:

#SBATCH --gres=gpu:1

Tricks

Changing the name of the job

By default sbatch names the job after the filename of the job script; you can change this via

#SBATCH --job-name=foobar

Less typing

You can also change various things dynamically, e.g.

#SBATCH --output %x_%j.out

will put everything the job prints to foobar_12345.out if your job's name is foobar and its job ID is 12345.

There are more of these specifiers; have a look at the "filename pattern" section in man sbatch.

Splitting stdout and stderr

By default --output= puts both stdout and stderr into the same file. You can split them apart by also using --error=, i.e.

#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err

will put everything sent to stdout to the former file and everything sent to stderr to the latter.

Environment Variables

We already mentioned the SLURM_ARRAY_TASK_ID environment variable in the section on job arrays. Slurm will set quite a few environment variables with information about the job that you can use in your job script and the program(s) it runs. Have a look at the section "OUTPUT ENVIRONMENT VARIABLES" in man sbatch.
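A quick way to see what Slurm provides is to print a handful of these variables at the top of a job script. The selection below is only a sample, the function name show_slurm_env is made up, and some variables (e.g. SLURM_CPUS_PER_TASK) are only set if the matching option was given:

```shell
#!/bin/bash
# Print a few of the variables Slurm sets inside a job
show_slurm_env() {
    local var
    for var in SLURM_JOB_ID SLURM_JOB_NAME SLURM_JOB_NODELIST \
               SLURM_NTASKS SLURM_CPUS_PER_TASK SLURM_SUBMIT_DIR; do
        # ${!var} looks up the variable whose name is stored in var
        printf '%s=%s\n' "$var" "${!var:-<not set>}"
    done
}

show_slurm_env
```

Outside of a job every line will read "<not set>", which makes the same script usable for a dry run on the login node.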

Batch scripts in other languages

You don't need to write your job scripts in Bash or any other shell dialect. As long as the language supports # as comment syntax, you can write #SBATCH directives and Slurm won't care that it's not running a Bash script.

How to game the system

You probably want to run as many jobs as possible. This is prevented by two things:

There are limited resources.
There are other users vying for the same resources.

To make resource allocation fair while maximising resource utilisation, there are two mechanisms:

Fairshare - The Fairshare of people who have recently run jobs will go down and will only replenish over time. Jobs from users with higher Fairshare will be preferred when allocating resources.
Backfilling - If there are unused resources that cannot currently be used, because the requirements of all the highest-priority jobs in the queue cannot be met (e.g. the next job in queue needs 20 cores for four hours, but only 10 are available in that time), lower-priority jobs will be used to fill those holes (e.g. if there's a job wanting 5 cores for three hours, that will be used).
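The backfilling example can be turned into a toy calculation. This is a deliberately simplified model, not how Slurm's scheduler is implemented: it treats the hole as a fixed number of free cores for a fixed number of hours and only checks whether a waiting job fits in both dimensions. The function name fits_in_hole is made up.

```shell
#!/bin/bash
# Arguments: free cores, free hours, cores the job wants, hours the job wants
fits_in_hole() {
    local free_cores=$1 free_hours=$2 job_cores=$3 job_hours=$4
    (( job_cores <= free_cores && job_hours <= free_hours ))
}

# 10 cores are free for the next four hours
if fits_in_hole 10 4 20 4; then
    echo "20-core job starts"
else
    echo "20-core job keeps waiting"
fi

if fits_in_hole 10 4 5 3; then
    echo "5-core job is backfilled"
else
    echo "5-core job keeps waiting"
fi
```

The practical upshot is that a job with a modest core count and a realistic --time has a better chance of slipping into such a hole than one that asks for the maximum runtime.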
