The primary source for documentation on Slurm usage and commands is the Slurm site. Please also consult the man pages for the Slurm commands, e.g. typing man sbatch will give you extensive information on the sbatch command.
Consider the following bash script with #SBATCH comments, which tell Slurm what resources you need:
#!/bin/bash
#SBATCH --job-name=example_job       # Job name, will show up in squeue output
#SBATCH --ntasks=1                   # Number of individual tasks, usually 1 except when using MPI, etc.
#SBATCH --cpus-per-task=1            # Number of CPUs
#SBATCH --nodes=1                    # Number of nodes, usually 1 except when using MPI, etc.
#SBATCH --time=0-00:01:00            # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=100            # Memory per cpu in MB (see also --mem)
#SBATCH --output=job_%j.out          # File to which standard out will be written
#SBATCH --error=job_%j.err           # File to which standard err will be written
#SBATCH --mail-type=END              # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=j.d@fu-berlin.de # Email to which notifications will be sent

# store job info in output file, if you want...
scontrol show job $SLURM_JOBID

# run your program...
hostname

# wait some time...
sleep 50
Please note that your job will be killed by the queueing system if it tries to use more memory than requested or if it runs longer than the time specified in the batch script. So to be on the safe side you can set these values a little bit higher. If you set the values too high, your job might not start because there are not enough resources (e.g. no machine has the amount of memory you are asking for).
Now just submit your job script using sbatch job1.sh from the command line. Please try to run jobs directly from the /scratch cluster-wide filesystem, where you have a directory under /scratch/<username>, to lower the load on /home. For testing purposes set the runtime of your job below 1 minute and submit it to the test partition by adding -p test to sbatch:
dreger@sheldon-ng:..dreger/quickstart> pwd
/scratch/dreger/quickstart
dreger@sheldon-ng:..dreger/quickstart> sbatch -p test example_job.sh
Submitted batch job 26495
dreger@sheldon-ng:..dreger/quickstart> squeue -l -u dreger
Sun Jun 29 23:02:50 2014
JOBID PARTITION NAME  USER   STATE   TIME TIMELIMIT NODES NODELIST(REASON)
26495      test job1  dreger RUNNING 0:24      1:00     1 x001
dreger@sheldon-ng:..dreger/quickstart> cat job_26495.out
JobId=26495 Name=example_job
   UserId=dreger(4440) GroupId=fbedv(400)
   Priority=10916 Account=fbedv QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2014-06-29T23:02:26 EligibleTime=2014-06-29T23:02:26
   StartTime=2014-06-29T23:02:26 EndTime=2014-06-29T23:03:26
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=test AllocNode:Sid=sheldon-ng:27448
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=x001
   BatchHost=x001
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/clusterfs/scratch/dreger/quickstart/example_job.sh
   WorkDir=/clusterfs/scratch/dreger/quickstart
x001
You want to run something. Some binary (e.g. my_prog) with some parameters (e.g. param1, param2, param3).
#!/bin/bash
#
#SBATCH --time=0-00:01:00                    # Runtime in DAYS-HH:MM:SS format
#
#SBATCH --ntasks=1                           # Number of processes
#SBATCH --cpus-per-task=1                    # Number of cores
#SBATCH --mem-per-cpu=100                    # Memory per cpu in MB (see also --mem)
#
#SBATCH --chdir=/scratch/username
#
#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL

my_prog param1 param2 param3
Save that to a file (e.g. my_first_job.sh) on the cluster node and submit it to the cluster manager via sbatch my_first_job.sh. You'll get output similar to this:
Submitted batch job 12345
Now you can wait for the job to run and finish. You'll receive a mail once it does. Or, if you change your mind, you can cancel the job with the help of the number returned by sbatch above: scancel 12345.
What's happening in the above example? Let's go line by line:
It's a shell script, specifically a bash script, as evidenced by the first line:

#!/bin/bash

This means that you can do basically everything in it that you can do in a terminal.
The lines prepended by # symbols are comments (which you probably don't use in interactive shell sessions all that much). The cluster manager will read options from the special comments of the form #SBATCH as long as they come before any other line that is executed, i.e. before the first line that is not a comment. Let's look at the three sections of comments.
#SBATCH --time=0-00:01:00 # Runtime in DAYS-HH:MM:SS format
The first sets the time after which the job will be killed. In this case this is one minute. Setting this is optional, but strongly recommended. It defaults to the maximum time allowed, which for our cluster is two weeks, and that in turn will make your job hard to schedule.
#SBATCH --ntasks=1          # Number of processes
#SBATCH --cpus-per-task=1   # Number of cores
#SBATCH --mem-per-cpu=100M  # Memory per cpu in MB (see also --mem)
The second section describes your job. We will say more about this in Getting a spot, but here we say our program will run a single process (e.g. one Python process or a Fortran program we wrote, but not something using MPI). That process will receive a single CPU core (Slurm calls cores CPUs and CPUs sockets, because terminology has never confused anybody) and 100 MiB of memory in total.
#SBATCH --chdir=/scratch/username
The third section sets the working directory of the job. By default the working directory of the job, i.e. the directory relative to which all actions in the job script are taken, will be wherever you ran sbatch (not the directory where the job script is situated). This directory has to exist at the time the job starts, otherwise your job will fail. If you want to create a directory at job runtime, you will need to cd in the body of the script. Beware: whatever your job prints, which by default ends up in a file slurm-<JOBID>.out (you can change this via #SBATCH --output=my_output_filename), will end up in the working directory, so to keep everything neat and orderly, create directories beforehand.
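For example, one possible workflow (the directory name run42 is just an example; use your own username and naming scheme) is to create the directory first and point --chdir at it when submitting:

mkdir -p /scratch/username/run42
sbatch --chdir=/scratch/username/run42 my_first_job.sh

Options given on the sbatch command line override the corresponding #SBATCH directives in the script.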
#SBATCH --mail-user=user@physik.fu-berlin.de # Email to which notifications will be sent
#SBATCH --mail-type=END                      # Type of email notification: BEGIN,END,FAIL,ALL
The last section tells the cluster to send a mail to user@physik.fu-berlin.de (you had better change user to your username) at the successful end of the job.
The order of these comments is not really important, as long as they all come before the first non-comment line.
You can come back to this if you are confused by some terminology in Slurm's documentation or if you want to confuse yourself.

Job: what you submit via sbatch. It is identified by a Job ID and runs on an allocation.
Job step: what you start inside a job's allocation via srun.

Now that you know how to run something, let's give you some tools, so that you know your environment.
sinfo will show you information about partitions and nodes. The default output is partition-centric, whereas sinfo --Node is node-centric. With this command you can learn which nodes are running, which are not, which ones are idle (and therefore free to take your calculations) or only partly used, and (using some more options) which resources they provide.
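For example, you might start with something like:

# partition-centric overview
sinfo
# node-centric view with more columns
sinfo --Node --long
# only look at the test partition
sinfo --partition=test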
squeue will show you the current queue. One important option here is --user= to show only the jobs of the named user.
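For example:

# the whole queue
squeue
# only your own jobs, in the longer format
squeue --user=$USER --long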
sacct can show you accounting information for jobs and users, e.g. which jobs succeeded and which did not, and on which node they ran. The output of sacct might seem empty most of the time, since it defaults to only showing today's jobs; use --starttime and --endtime to specify a timeframe. sacct is particularly helpful to have a look at how many resources your job used, so you can maybe ask for smaller allocations, allowing you to run more jobs.
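For example, something along these lines shows the jobs of a given week together with some resource-usage columns (the dates are placeholders; sacct --helpformat lists all available fields):

sacct --starttime=2014-06-01 --endtime=2014-06-08 \
      --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode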
All these commands can be tweaked to put out what you need or to use them in scripts. Important options common to all of them are --long / -l for longer output, --noheader (sometimes -h, sometimes -n) to omit the explanatory header, and --format (usually -o) to format the output yourself. Have a look at the man pages for all options and specifically for the --format specifiers.
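As a sketch of what that can look like (the format strings here are just examples; the man pages list all specifiers):

# job ID, name and state of your own jobs, without the header
squeue --user=$USER --noheader --format="%i %j %T"
# node name, partition and state, one line per node
sinfo --Node --noheader --format="%N %P %t"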
scontrol is mostly an administrative tool, but the show command is also useful for users. It can be used to show the information Slurm has about something, e.g.
# show information about Job 12345
scontrol show JobID=12345

# show information on node z001
scontrol show Node=z001
In our example we ran a very simple program. It was just a single process on a single node.
Mind you, all these settings are requests for an allocation; they will not magically make your program use it all. You can ask for 20 cores and only use a single one, which is not a good usage of resources. You can also ask for less than you use, but then the processes and/or threads will fight over the few resources.
A good strategy is to first run a shorter test version of your problem, preferably on the test partition (submit with sbatch --partition=test my_job.sh), to test your assumptions about how many resources your job is using.
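One possible way to do this, assuming your job script is called my_job.sh and reusing the job ID 12345 from above as a stand-in:

# short test run on the test partition
sbatch --partition=test --time=0-00:01:00 my_job.sh
# after it has finished, check what it actually used
sacct --jobs=12345 --format=JobID,Elapsed,MaxRSS,MaxVMSize,State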
If your program uses multiple threads, e.g. OpenBLAS when you do linear algebra with NumPy, increase --cpus-per-task, e.g.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
to give your job four cores (and a total of 400 MiB of memory if you keep --mem-per-cpu unchanged from our initial example).
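A sketch of a complete multi-threaded job script; my_prog is the same placeholder as before, and the OMP_NUM_THREADS line assumes your program (or the libraries it uses, such as OpenMP or OpenBLAS) respects that variable:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=100
#SBATCH --time=0-00:10:00

# tell threaded libraries how many cores they were actually given
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

my_prog param1 param2 param3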
If you want to run multiple processes, adjust --ntasks=, e.g.
#SBATCH --nodes=1
#SBATCH --ntasks=5

srun my_prog
to run five processes. We add --nodes=1 here to ensure that all processes will run on a single machine. If you want to mix and match this with multi-threading, just go right ahead:
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4

srun my_prog
which would run five processes, using four cores each, for a total allocation of 20 cores. This is all on a single node, though. If you want to use multiple nodes (and your calculation supports this):
#SBATCH --ntasks=23
#SBATCH --ntasks-per-node=8
#SBATCH --nodes=3

srun my_prog
This would run 23 processes distributed on three nodes, with no more than eight processes on a given node.
Prefixing your programs with srun will also take care of all necessary MPI setup (it doesn't on tron, though, since it's too old; there you need to use something along the lines of mpiexec -n $SLURM_NTASKS).
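A sketch of an MPI job script; my_mpi_prog is a placeholder for your MPI binary:

#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --nodes=2
#SBATCH --time=0-01:00:00

# on most nodes srun does the MPI start-up
srun my_mpi_prog
# on tron use this instead:
# mpiexec -n $SLURM_NTASKS my_mpi_prog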
Some problems are just embarrassingly parallel. They just stand on their own and don't need anything else. A good example is farming a large parameter space for some output. We have seen a lot of people generate job files programmatically to solve this problem, submitting them right away and deleting them right afterwards, but Slurm comes with support for this right out of the box: job arrays.
#SBATCH --array=1-10
The above example will create a job that is run 10 times, and the different instances can be told apart by the value of the environment variable SLURM_ARRAY_TASK_ID, i.e. (to have a short example)
#SBATCH --array=1-10

my_program $SLURM_ARRAY_TASK_ID
will run my_program 10 times, with values 1 to 10 (both inclusive) for SLURM_ARRAY_TASK_ID.
Since most programs will not use integer indices as input, you will have to somehow map the index to your inputs. You can do this e.g. via arrays
#SBATCH --array=0-3

declare -a parameters
parameters=(
    "arg0a arg0b arg0c"
    "arg1a arg1b arg1c"
    "arg2a arg2b arg2c"
    "arg3a arg3b arg3c"
)

my_program ${parameters[$SLURM_ARRAY_TASK_ID]}
A (better) alternative would be to have my_program read its parameters from a file whose name you determine from the value of SLURM_ARRAY_TASK_ID.
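A sketch of that approach, assuming you have prepared parameter files named params_1.txt to params_10.txt next to the job script:

#!/bin/bash
#SBATCH --array=1-10

# array task N processes the file params_N.txt
my_program params_${SLURM_ARRAY_TASK_ID}.txt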
Job arrays can be mixed with all of the above (multi-threading, multiple processes). As many array instances as possible will run concurrently.
The minimum array index is 0. You can also use comma-separated lists (0,3,15,45), ranges with steps (0-15:4, equal to 0,4,8,12) and limits on concurrency (0-15%4 will run indices 0 to 15, but no more than four at a time).
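Side by side (these are alternatives; use exactly one --array line per job script):

#SBATCH --array=0,3,15,45   # explicit list of indices
#SBATCH --array=0-15:4      # range with step: 0,4,8,12
#SBATCH --array=0-15%4      # indices 0 to 15, at most four running at once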
So far all examples were homogeneous: one program would use all the resources, either monolithically or possibly via multiple instances of itself (MPI). But you can also do your own sub-allocations, using job control in the shell:
#SBATCH --nodes=2
#SBATCH --ntasks=4

# this will use the whole allocation
srun calcprog

srun --nodes=1 --ntasks=2 --exclusive postprocess1 &
srun --nodes=1 --ntasks=2 --exclusive postprocess2 &
wait
This will run calcprog with all resources, and once it is done, it will run postprocess1 and postprocess2 at the same time on half of the resources each.
You can also control this with srun's --multi-prog option. Have a look at the man page if you have setups where, e.g., one process is primary and the others are workers.
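A sketch of how that looks: a plain-text configuration file (here called multi.conf; the file name and the programs primary and worker are just placeholders) maps task ranks to programs,

0    ./primary
1-4  ./worker

and the job script starts all of them in one step:

#SBATCH --ntasks=5

srun --multi-prog multi.conf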
If you need a GPU you need to specifically request it:
#SBATCH --gres=gpu:1
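A minimal sketch of a GPU job script; my_gpu_prog is a placeholder, and nvidia-smi is only there to record which GPU the job was given:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000
#SBATCH --time=0-00:10:00

# show the assigned GPU in the job output
nvidia-smi
my_gpu_prog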
By default sbatch will name the job after the filename of the job script; you can change this via
#SBATCH --job-name
You can also change various things dynamically, e.g.
#SBATCH --output %x_%j.out
will put everything the job prints into foobar_12345.out if your job's name is foobar and its job ID is 12345.
There are more of these specifiers out there; have a look at the "filename pattern" section in man sbatch.
By default --output= puts both stdout and stderr into the same file; you can split them apart by also using --error=, i.e.
#SBATCH --output %x_%j.out
#SBATCH --error %x_%j.err
will put everything sent to stdout to the former file and everything sent to stderr to the latter.
We already mentioned the SLURM_ARRAY_TASK_ID environment variable in the section on job arrays. Slurm will set quite a few environment variables with information about the job that you can use in your job script and the program(s) it runs. Have a look at the section "OUTPUT ENVIRONMENT VARIABLES" in man sbatch.
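For example, you could record some of them at the top of your job script (variables like SLURM_CPUS_PER_TASK are only set if the corresponding option was given):

echo "job id:        $SLURM_JOB_ID"
echo "job name:      $SLURM_JOB_NAME"
echo "node list:     $SLURM_JOB_NODELIST"
echo "tasks:         $SLURM_NTASKS"
echo "cpus per task: $SLURM_CPUS_PER_TASK"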
You don't need to write your job scripts in Bash or any other shell dialect. As long as the language supports # as comment syntax, you can write #SBATCH directives and Slurm won't care that it's not running a Bash script.
You probably want to run as many jobs as possible. This is prevented by two things:
To make resource allocation fair while maximising resource utilisation, there are two mechanisms: