====== Introduction to the Slurm HPC cluster ======

The primary source for documentation on Slurm usage and commands is the [[http://slurm.schedmd.com/documentation.html|Slurm site]]. Please also consult the man pages for the Slurm commands, e.g. typing ''man sbatch'' will give you extensive information on the ''sbatch'' command.

===== Quick Start for the impatient =====

  - Log in to one of the head nodes using ssh:
    * the login node for sheldon cluster: ''sheldon.physik.fu-berlin.de''
  - Create a job script file to be run by the queuing system and supply information like:
    * how much memory to allocate for your job
    * how many CPU cores your job needs to run
    * how long you expect your job to run
    * where and when you want the system to send mail
    * where output should be written to
  - Submit your job script using the ''sbatch'' command

==== Example of a very basic job script ====

Consider the following bash script with ''#SBATCH'' comments, which tell Slurm what resources you need:

<file bash example_job.sh>
#!/bin/bash

#SBATCH --job-name=example_job         # Job name, will show up in squeue output
#SBATCH --ntasks=1                     # Number of individual tasks, usually 1 except when using MPI, etc.
#SBATCH --cpus-per-task=1              # Number of CPUs
#SBATCH --nodes=1                      # Number of nodes, usually 1 except when using MPI, etc.
#SBATCH --time=0-00:01:00              # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=100              # Memory per cpu in MB (see also --mem) 
#SBATCH --output=job_%j.out            # File to which standard out will be written
#SBATCH --error=job_%j.err             # File to which standard err will be written
#SBATCH --mail-type=END                # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=j.d@fu-berlin.de   # Email to which notifications will be sent 

# store job info in output file, if you want...
scontrol show job $SLURM_JOBID

# run your program...
hostname

# wait some time...
sleep 50
</file>

Please note that your job will be killed by the queuing system if it tries to use more memory than requested or if it runs longer than the time specified in the batch script. So to be on the safe side you can set these values a little bit higher. If you set the values too high, your job might not start because there are not enough resources (e.g. no machine has the amount of memory you are asking for).

Now just submit your job script using ''sbatch example_job.sh'' from the command line. Please try to run jobs directly from the cluster-wide ''/scratch'' filesystem, where you have a directory under ''/scratch/<username>'', to lower the load on ''/home''. For testing purposes set the runtime of your job below 1 minute and submit it to the test partition by adding ''-p test'' to sbatch:

<code>
dreger@sheldon-ng:..dreger/quickstart> pwd
/scratch/dreger/quickstart
dreger@sheldon-ng:..dreger/quickstart> sbatch -p test example_job.sh
Submitted batch job 26495
dreger@sheldon-ng:..dreger/quickstart> squeue -l -u dreger
Sun Jun 29 23:02:50 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
             26495      test     job1   dreger  RUNNING       0:24      1:00      1 x001
dreger@sheldon-ng:..dreger/quickstart> cat job_26495.out
JobId=26495 Name=example_job
   UserId=dreger(4440) GroupId=fbedv(400)
   Priority=10916 Account=fbedv QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2014-06-29T23:02:26 EligibleTime=2014-06-29T23:02:26
   StartTime=2014-06-29T23:02:26 EndTime=2014-06-29T23:03:26
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=test AllocNode:Sid=sheldon-ng:27448
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/clusterfs/scratch/dreger/quickstart/example_job.sh
   WorkDir=/clusterfs/scratch/dreger/quickstart

x001
</code>

===== A less quick start =====

You want to run something: some binary (e.g. ''%%my_prog%%'') with some parameters (e.g. ''%%param1%%'', ''%%param2%%'', ''%%param3%%'').

<code bash>
#!/bin/bash
#
#SBATCH --time=0-00:01:00                    # Runtime in DAYS-HH:MM:SS format
#
#SBATCH --ntasks=1                           # Number of processes
#SBATCH --cpus-per-task=1                    # Number of cores
#SBATCH --mem-per-cpu=100                    # Memory per cpu in MB (see also --mem)
#
#SBATCH --chdir=/scratch/username
#
#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL

my_prog param1 param2 param3
</code>

Save that to a file (e.g. ''%%my_first_job.sh%%'') on the cluster node and submit it to the cluster manager via ''%%sbatch my_first_job.sh%%''. You'll get output similar to this:

<code example>
Submitted batch job 12345
</code>

Now you can wait for the job to run and finish. You'll receive a mail once it does.

Or, if you change your mind, you can cancel the job with the help of the number returned by ''%%sbatch%%'' above: ''%%scancel 12345%%''.

==== What's happening? ====

What's happening in the above example? Let's go line by line:

It's a shell script, specifically a bash script, as evidenced by the first line:

<code bash>
#!/bin/bash
</code>

This means that you can do in it basically everything you can do in a terminal.

Lines starting with ''%%#%%'' are comments (which you probably don't use in interactive shell sessions all that much). The cluster manager will read options from the special comments of the form ''%%#SBATCH%%'' as long as they come //before// the first line that is executed, i.e. the first line that is not a comment. Let's look at the four sections of comments:

<code bash>
#SBATCH --time=0-00:01:00                    # Runtime in DAYS-HH:MM:SS format
</code>

The first sets the time after which the job will be killed, in this case one minute. Setting this is optional, but strongly recommended: it defaults to the maximum time allowed, which for our cluster is two weeks, and that in turn will make your job hard to schedule.

<code bash>
#SBATCH --ntasks=1                           # Number of processes
#SBATCH --cpus-per-task=1                    # Number of cores
#SBATCH --mem-per-cpu=100                    # Memory per cpu in MB (see also --mem)
</code>

The second section describes your job. We will say more about this in //Getting a spot//, but here we say our program will run a single process (e.g. one Python process or a Fortran program we wrote, but not something using MPI). That process will receive a single CPU core (Slurm calls cores CPUs and CPUs sockets, because terminology has never confused anybody) and 100 MiB of memory in total.

<code bash>
#SBATCH --chdir=/scratch/username
</code>

The third section sets the working directory of the job. By default the working directory of the job, i.e. the directory relative to which all actions in the job script are taken, will be wherever you ran ''%%sbatch%%'' (not the directory where the job script is situated). The directory given to ''%%--chdir%%'' has to exist at the time the job starts, otherwise your job will fail. If you want to create a directory at job runtime, you will need to ''%%cd%%'' into it in the body of the script. Beware: whatever your job prints, which by default ends up in a file ''%%slurm-<JOBID>.out%%'' (you can change this via ''%%#SBATCH --output=my_output_filename%%''), will end up in the working directory, so to keep everything neat and orderly, create directories beforehand.
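
Creating a directory at job runtime and changing into it could look like the following minimal sketch. The names ''%%base_dir%%'' and ''%%run_dir%%'' and the ''%%run_<JOBID>%%'' naming scheme are assumptions for illustration; ''%%SLURM_JOB_ID%%'' is set by Slurm inside a job, so the sketch falls back to ''%%manual%%'' when run outside of one.

<code bash>
#!/bin/bash
# Hypothetical base directory; on the cluster this would be e.g. /scratch/<username>.
base_dir="${BASE_DIR:-${TMPDIR:-/tmp}}"

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "manual" elsewhere.
run_dir="$base_dir/run_${SLURM_JOB_ID:-manual}"

# Create the directory (no error if it already exists) and enter it.
mkdir -p "$run_dir"
cd "$run_dir" || exit 1

echo "working in $(pwd)"
</code>

Everything the script does after the ''%%cd%%'' then happens relative to the new directory.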

<code bash>
#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL
</code>

The last section tells the cluster to send a mail to ''%%user@physik.fu-berlin.de%%'' (you had better change ''%%user%%'' to your username) at the successful end of the job.

The order of these comments is not really important, as long as they all come before the first //non-comment line//.

===== Some Terminology =====

You can come back to this, if you are confused by some terminology in Slurm's documentation or if you want to confuse yourself.

  * //Slurm// is the cluster manager, a program that runs other programs on behalf of users.
  * A //partition// is a group of nodes and a queue. We have three partitions: main, to which jobs are submitted by default; test, which has a much shorter maximum runtime of two hours; and virtual, which uses virtual machines instead of physical nodes.
  * A //node// is a host on which Slurm can run jobs.
  * A //queue// is a list of jobs, some of them running, some of them waiting to run.
  * A //job// is a group of //steps//. It is usually created via ''%%sbatch%%''. It is identified by a Job ID and runs on an allocation.
  * An //allocation// is a group of resources (nodes, CPUs and memory on those nodes, GPUs, licenses).
  * A //step// is a computation that consists of tasks; it may use less than a job's full allocation. It can be created via ''%%srun%%''.
  * A //task// is one process, but also a number of CPUs, since CPUs are allocated per task.
  * A //CPU// is a core on a physical CPU, which is called a //socket// (except when they are not; as a rule of thumb, if you are requesting something, go with this definition; if Slurm is supposed to decide something, a core is a core and a CPU is a CPU).

===== Getting information =====

Now that you know how to run something, let's put some tools in your hands, so that you can get to know your environment.

''%%sinfo%%'' will show you information about partitions and nodes. The default output is partition centric, whereas ''%%sinfo --Node%%'' is node-centric. With this command you can learn which nodes are running, which are not, which ones are idle (and therefore free to take your calculations) or only partly used and (using some more options), which

''%%squeue%%'' will show you the current queue. One important option here is ''%%--user=%%'' to show only the jobs of the named user.

''%%sacct%%'' can show you accounting information for jobs and users, e.g. which jobs succeeded and which did not, and on which node they ran. The output of ''%%sacct%%'' might seem empty most of the time, since it defaults to only showing today's jobs; use ''%%--starttime%%'' and ''%%--endtime%%'' to specify a timeframe. ''%%sacct%%'' is particularly helpful for checking how many resources your job used, so you can maybe ask for smaller allocations, allowing you to run more jobs.

All these commands can be tweaked to put out what you need or to use them in scripts. Important options common to all of them are

  * ''%%--long%%'' / ''%%-l%%'' for longer output,
  * ''%%--noheader%%'' (sometimes ''%%-h%%'', sometimes ''%%-n%%''), to omit the explanatory header,
  * ''%%--format%%'' (usually ''%%-o%%'') to format the output yourself, and sometimes

Have a look at the manpages for all options and specifically for the ''%%--format%%'' specifiers.
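
As a rough sketch of using these options in a script, the following counts your own jobs per state. The helper name ''%%count_states%%'' is an assumption made for this example; ''%%--noheader%%'', ''%%--user%%'' and the ''%%%T%%'' format specifier (the job state in long form) are taken from ''%%man squeue%%''. The query is skipped on machines where ''%%squeue%%'' is not installed.

<code bash>
#!/bin/bash
# Count how often each job state occurs in the lines read from stdin.
count_states() {
    awk '{ count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Only query the queue where squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
    # %T prints the job state in long form, e.g. RUNNING or PENDING.
    squeue --noheader --user="${USER:-$(id -un)}" --format="%T" | count_states
fi
</code>

Omitting the header is what makes the output safe to feed into ''%%awk%%'' without special-casing the first line.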

''%%scontrol%%'' is mostly an administrative tool, but its ''%%show%%'' command is also useful for users. It can be combined with an entity to show the information Slurm has about it, e.g.

<code bash>
# show information about Job 12345
scontrol show JobID=12345
# show information on a node z001
scontrol show Node=z001
</code>

===== Getting a spot =====

In our example we ran a very simple program. It was just a single process on a single node.

Mind you, all these settings only ask for allocations; they will not magically make your program use them. You can ask for 20 cores and only use a single one, which is not a good use of resources. You can also ask for less than you use, but then the processes and/or threads will fight over the few resources.

A good strategy is to first run a shorter test version of your problem, preferably on the test partition (submit with ''%%sbatch --partition=test my_job.sh%%''), to test your assumptions about how many resources your job is using.

==== Multi-Threading ====

If your program uses multiple threads, e.g. OpenBLAS when you do linear algebra with NumPy, increase ''%%--cpus-per-task%%'', e.g.

<code bash>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
</code>

to give your job four cores (and a total of 400 MiB of memory if you keep ''%%--mem-per-cpu%%'' unchanged from our initial example).
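
Since asking for cores does not by itself make a program use them, it can help to pass the allocation on to the threading library. The following is a minimal sketch, assuming your program honours ''%%OMP_NUM_THREADS%%'' (OpenMP-based code) or ''%%OPENBLAS_NUM_THREADS%%'' (OpenBLAS); other libraries may read different variables. Slurm only sets ''%%SLURM_CPUS_PER_TASK%%'' when ''%%--cpus-per-task%%'' was requested, so the sketch defaults to 1.

<code bash>
#!/bin/bash
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was given; default to 1.
threads="${SLURM_CPUS_PER_TASK:-1}"

# OpenMP-based libraries read OMP_NUM_THREADS; OpenBLAS also reads OPENBLAS_NUM_THREADS.
export OMP_NUM_THREADS="$threads"
export OPENBLAS_NUM_THREADS="$threads"

echo "using $OMP_NUM_THREADS thread(s)"
</code>

Put these lines before the call to your program, so that the number of threads follows the allocation instead of being hard-coded.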

==== Multiple Processes ====

If you want to run multiple processes, adjust ''%%--ntasks=%%'', e.g.

<code bash>
#SBATCH --nodes=1
#SBATCH --ntasks=5

srun my_prog
</code>

to run five processes. We add ''%%--nodes=1%%'' here to ensure that all processes will run on a single machine. If you want to mix and match this with multi-threading, just go right ahead:

<code bash>
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4

srun my_prog
</code>

which would run five processes, using four cores each, for a total allocation of 20 cores. This is all on a single node, though. If you want to use multiple nodes (and your calculation supports this), you can ask for them as well:

<code bash>
#SBATCH --ntasks=23
#SBATCH --ntasks-per-node=8
#SBATCH --nodes=3

srun my_prog
</code>

This would run 23 processes distributed on three nodes, with no more than eight processes on a given node.

Prefixing your programs with ''%%srun%%'' will also take care of all necessary MPI setup (it doesn't on ''%%tron%%'', though, since it's too old; there you need to use something along the lines of ''%%mpiexec -n $SLURM_NTASKS%%'').

==== Jobarrays ====

Some problems are just embarrassingly parallel: the individual calculations stand on their own and don't need anything else. A good example is farming a large parameter space for some output. We have seen a lot of people generate job files programmatically to solve this problem, submitting them right away and deleting them right afterwards, but Slurm comes with support for this right out of the box: job arrays.

<code bash>
#SBATCH --array=1-10
</code>

The above example will create a job that is run 10 times. The different instances can be told apart by the value of the environment variable ''%%SLURM_ARRAY_TASK_ID%%'', i.e. (to have a short example)

<code bash>
#SBATCH --array=1-10

my_program $SLURM_ARRAY_TASK_ID
</code>

will be run 10 times, with values 1 to 10 (both inclusive) for ''%%SLURM_ARRAY_TASK_ID%%''.

Since most programs will not use integer indices as input, you will have to somehow map the index to your inputs. You can do this e.g. via bash arrays:

<code bash>
#SBATCH --array=0-3
declare -a parameters
parameters=(
    "arg0a arg0b arg0c"
    "arg1a arg1b arg1c"
    "arg2a arg2b arg2c"
    "arg3a arg3b arg3c"
)

my_program ${parameters[$SLURM_ARRAY_TASK_ID]}
</code>

A (better) alternative would be to have ''%%my_program%%'' read its parameters from a file, whose name you determine from the value of ''%%SLURM_ARRAY_TASK_ID%%''.
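
A minimal sketch of that alternative follows. The naming scheme ''%%input_<index>.txt%%'' and the helper ''%%input_for_index%%'' are assumptions for illustration; ''%%my_program%%'' is the placeholder from above and is only echoed here, not run. Outside of an array job ''%%SLURM_ARRAY_TASK_ID%%'' is unset, so the sketch defaults to index 0.

<code bash>
#!/bin/bash
# Hypothetical naming scheme: input_0.txt, input_1.txt, ...
input_for_index() {
    printf 'input_%s.txt\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID in array jobs; default to 0 for a dry run.
input_file="$(input_for_index "${SLURM_ARRAY_TASK_ID:-0}")"

if [ -r "$input_file" ]; then
    # Replace echo with the real call once my_program exists.
    echo "would run: my_program $input_file"
else
    echo "missing input file: $input_file" >&2
fi
</code>

Keeping the parameters in files means the job script stays the same when the parameter sets change.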

Job arrays can be mixed with all of the above (multi-threading, multiple processes). As many array instances as possible will run concurrently.

The minimum array index is 0. You can also use comma-separated lists (''%%0,3,15,45%%''), ranges with steps (''%%0-15:4%%'' is equal to ''%%0,4,8,12%%'') and limits on concurrency (''%%0-15%4%%'' will run indices 0 to 15, but no more than four at a time).

==== Heterogeneous Jobs ====

So far all examples were homogeneous: one program would use all the resources, either monolithically or possibly via multiple instances of itself (MPI). But you can also do your own sub-allocations, using job control in the shell:

<code bash>
#SBATCH --nodes=2
#SBATCH --ntasks=4
# this will use the whole allocation
srun calcprog
# each of these uses half of the allocation (one node, two tasks)
srun --nodes=1 --ntasks=2 --exclusive postprocess1 &
srun --nodes=1 --ntasks=2 --exclusive postprocess2 &
wait
</code>

This will run ''%%calcprog%%'' with all resources and, once it is done, it will run ''%%postprocess1%%'' and ''%%postprocess2%%'' at the same time on half of the resources each.
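
The shell job control used here (''%%&%%'' and ''%%wait%%'') can be tried out without Slurm. In the following sketch the function ''%%step%%'' is a hypothetical stand-in for an ''%%srun%%'' call; it just sleeps for a second.

<code bash>
#!/bin/bash
# Stand-in for a real job step; in a job script this would be an srun call.
step() {
    sleep 1
    echo "$1 done"
}

start=$SECONDS
step postprocess1 &   # '&' starts the step in the background ...
step postprocess2 &   # ... so the second one starts immediately
wait                  # block until all background steps have finished
elapsed=$((SECONDS - start))
echo "both steps finished after about ${elapsed}s"
</code>

The final ''%%wait%%'' matters in a job script: without it the script would reach its end while the background steps are still running, and the job would finish before they do.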

You can also control this with ''%%srun%%'''s ''%%--multi-prog%%'' option. Have a look at the man page if you have setups where, e.g., one process is primary and the others are workers.

==== Jobs using GPUs ====

If you need a GPU you need to specifically request it:

<code bash>
#SBATCH --gres=gpu:1
</code>

===== Tricks =====

==== Changing the name of the job ====

By default ''%%sbatch%%'' will name the job after the filename of the job script. You can change this via

<code bash>
#SBATCH --job-name=my_job_name
</code>

==== Less typing ====

You can also change various things dynamically, e.g.

<code bash>
#SBATCH --output %x_%j.out
</code>

will put everything the job prints to ''%%foobar_12345.out%%'' if your job's name is ''%%foobar%%'' and its job ID is 12345.

There are more of these specifiers; have a look at the "filename pattern" section in ''%%man sbatch%%''.

==== Splitting stdout and stderr ====

By default ''%%--output=%%'' puts both stdout and stderr into the same file. You can split them apart by also using ''%%--error=%%'', i.e.

<code bash>
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
</code>

will put everything sent to stdout to the former file and everything sent to stderr to the latter.

==== Environment Variables ====

We already mentioned the ''%%SLURM_ARRAY_TASK_ID%%'' environment variable in the section on job arrays. Slurm will set quite a few environment variables with information about the job that you can use in your job script and the program(s) it runs. Have a look at the section "OUTPUT ENVIRONMENT VARIABLES" in ''%%man sbatch%%''.
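
To get a feeling for these variables, a job script can simply print a few of them. In this sketch the selection of variables and the function name ''%%show_slurm_env%%'' are choices made for illustration; some variables (e.g. ''%%SLURM_NTASKS%%'' and ''%%SLURM_CPUS_PER_TASK%%'') are only set when the corresponding option was requested, so they may show up as ''%%unset%%''.

<code bash>
#!/bin/bash
# Print a few variables Slurm sets inside a job; "unset" is shown outside of Slurm
# or when the corresponding option was not requested.
show_slurm_env() {
    local name
    for name in SLURM_JOB_ID SLURM_JOB_NAME SLURM_SUBMIT_DIR \
                SLURM_JOB_NODELIST SLURM_NTASKS SLURM_CPUS_PER_TASK; do
        # ${!name} is bash indirect expansion: the value of the variable named in $name.
        echo "$name=${!name:-unset}"
    done
}

show_slurm_env
</code>

The output ends up in the job's output file, next to whatever your program prints.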

==== Batch scripts in other languages ====

You don't need to write your job scripts in Bash or any other shell dialect. As long as the language supports ''%%#%%'' as comment syntax, you can write ''%%#SBATCH%%'' directives and Slurm won't care that it's not running a Bash script.

==== How to game the system ====

You probably want to run as many jobs as possible. This is prevented by two things:

  - There are limited resources.
  - There are other users vying for the same resources.

To make resource allocation fair while maximising resource utilisation, there are two mechanisms:

  - Fairshare - The Fairshare of people who have recently run jobs will go down and will only replenish over time. Jobs from users with higher Fairshare will be preferred when allocating resources.
  - Backfilling - If there are unused resources that cannot currently be used, because the requirements of the highest-priority jobs in the queue cannot be met (e.g. the next job in the queue needs 20 cores for four hours, but only 10 are available in that time), lower-priority jobs will be used to fill those holes (e.g. if there's a job wanting 5 cores for three hours, that will be used).