Introduction to the Slurm HPC cluster
The primary documentation on Slurm usage and commands can be found at the Slurm site. Please also consult the man pages for the Slurm commands, e.g. typing man sbatch will give you extensive information on the sbatch command.
Quick Start for the impatient
- Log in to one of the head nodes using ssh:
- login node for sheldon cluster: sheldon.physik.fu-berlin.de
- login node for yoshi cluster: yoshi.physik.fu-berlin.de
- login node for tron cluster: tron.physik.fu-berlin.de (currently offline)
- Create a job script file to be run by the queuing system, supply information like:
- how much memory to allocate for your job
- how many CPU cores your job needs to run
- how long you expect your job to run
- where and when you want the system to send mail
- where output should be written to
- Submit your job script using the sbatch command (see the example session after this list)
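A complete quick-start session could then look like the following sketch; the user name and the choice of cluster are placeholders and have to be replaced with your own:
<xterm> # log in to the head node of the sheldon cluster (replace "username" with your account)
ssh username@sheldon.physik.fu-berlin.de

# work on the cluster-wide scratch filesystem to keep load off the /home server
cd /scratch/username

# submit the job script described in the next section and check its status
sbatch job1.sh
squeue -u username </xterm>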
Example of a very basic job script
Consider the following bash script with #SBATCH comments, which tell Slurm what resources you need:
<xterm> #!/bin/bash
#SBATCH --job-name=job1           # Job name, will show up in squeue output
#SBATCH --ntasks=1                # Number of cores
#SBATCH --nodes=1                 # Ensure that all cores are on one machine
#SBATCH --time=0-00:01:00         # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=100         # Memory per cpu in MB (see also --mem)
#SBATCH --output=job1_%j.out      # File to which standard out will be written
#SBATCH --error=job1_%j.err       # File to which standard err will be written
#SBATCH --mail-type=END           # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=j.d@fu-berlin.de  # Email to which notifications will be sent

# store job info in output file, if you want...
scontrol show job $SLURM_JOBID

# run your program...
hostname

# wait some time...
sleep 50 </xterm>
Please note that your job will be killed by the queueing system if it tries to use more memory than requested or if it runs longer than the time specified in the batch script. To be on the safe side, you can set these values a little bit higher. If you set them too high, however, your job might not start because there are not enough resources (e.g. no machine has the amount of memory you are asking for).
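Once a job has finished, you can check how much time and memory it actually used and adjust the requested values accordingly. A minimal sketch using the Slurm accounting tool sacct, assuming job accounting is enabled on the cluster (the job ID is a placeholder):
<xterm> # show runtime and peak memory usage of a finished job
sacct -j 26495 --format=JobID,JobName,Elapsed,MaxRSS,State </xterm>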
Now just submit your job script from the command line using sbatch job1.sh. Please try to run jobs directly from the cluster-wide /scratch/username filesystem to lower the load on the /home server. For testing purposes, set the runtime of your job to below 1 minute and submit it to the test partition by adding -p test to sbatch:
<xterm> dreger@sheldon-ng:..dreger/quickstart> pwd
/scratch/dreger/quickstart
dreger@sheldon-ng:..dreger/quickstart> sbatch -p test job1.sh
Submitted batch job 26495
dreger@sheldon-ng:..dreger/quickstart> squeue -l -u dreger
Sun Jun 29 23:02:50 2014
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  26495      test     job1   dreger  RUNNING       0:24      1:00      1 x001
dreger@sheldon-ng:..dreger/quickstart> cat job1_26495.out
JobId=26495 Name=job1
   UserId=dreger(4440) GroupId=fbedv(400) Priority=10916 Account=fbedv QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2014-06-29T23:02:26 EligibleTime=2014-06-29T23:02:26
   StartTime=2014-06-29T23:02:26 EndTime=2014-06-29T23:03:26
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=test AllocNode:Sid=sheldon-ng:27448
   ReqNodeList=(null) ExcNodeList=(null) NodeList=x001 BatchHost=x001
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/clusterfs/scratch/dreger/quickstart/job1.sh
   WorkDir=/clusterfs/scratch/dreger/quickstart
x001 </xterm>
Example of a GROMACS job script for one node using multithreading
TBD
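Until this section is written, here is a minimal sketch of what such a job script could look like; the module name, the gmx binary name and the input file topol.tpr are assumptions and have to be adapted to the GROMACS installation on the cluster:
<xterm> #!/bin/bash
#SBATCH --job-name=gmx-smp        # Job name, will show up in squeue output
#SBATCH --nodes=1                 # Single node, multithreading only
#SBATCH --ntasks=1                # One task ...
#SBATCH --cpus-per-task=8         # ... running on 8 cores via OpenMP threads
#SBATCH --time=0-02:00:00         # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=1000        # Memory per cpu in MB
#SBATCH --output=gmx_%j.out       # File to which standard out will be written

# load the GROMACS environment (module name is an assumption, check "module avail")
module load gromacs

# run mdrun with as many OpenMP threads as cores were allocated to the job
gmx mdrun -ntomp $SLURM_CPUS_PER_TASK -deffnm topol </xterm>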
Example of a GROMACS job script for multiple nodes using MPI
TBD
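Likewise, a minimal sketch for a run spread over several nodes with MPI; the module name, the MPI-enabled binary gmx_mpi and the input file are again assumptions:
<xterm> #!/bin/bash
#SBATCH --job-name=gmx-mpi        # Job name, will show up in squeue output
#SBATCH --nodes=4                 # Number of nodes
#SBATCH --ntasks-per-node=16      # MPI ranks per node, match the cores per node
#SBATCH --time=0-12:00:00         # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=1000        # Memory per cpu in MB
#SBATCH --output=gmx_%j.out       # File to which standard out will be written

# load the GROMACS environment (module name is an assumption, check "module avail")
module load gromacs

# let Slurm start one MPI rank per allocated task
srun gmx_mpi mdrun -deffnm topol </xterm>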