===== Quick Start for the impatient =====

  - Log in to one of the head nodes using ssh (see the login example below):
    * login node for tron cluster: tron.physik.fu-berlin.de
  - Create a job script file to be run by the queuing system, supply information like:
    * how much memory to allocate for your job
  - Submit your job script using the ''sbatch'' command
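
A minimal login example (''username'' is a placeholder for your own account name):

<code>
ssh username@tron.physik.fu-berlin.de
</code>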

==== Example of a very basic job script ====

Consider the following bash script with ''#SBATCH'' comments, which tell Slurm what resources you need:

<file bash example_job.sh>
#!/bin/bash

#SBATCH --job-name=example_job        # Job name, will show up in squeue output
#SBATCH --ntasks=1                    # Number of individual tasks, usually 1 except when using MPI, etc.
#SBATCH --cpus-per-task=1             # Number of CPUs per task
#SBATCH --nodes=1                     # Number of nodes, usually 1 except when using MPI, etc.
#SBATCH --time=0-00:01:00             # Runtime in DAYS-HH:MM:SS format
#SBATCH --mem-per-cpu=100             # Memory per CPU in MB (see also --mem)
#SBATCH --output=job_%j.out           # File to which standard out will be written (%j expands to the job ID)
#SBATCH --error=job_%j.err            # File to which standard err will be written
#SBATCH --mail-type=END               # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=j.d@fu-berlin.de  # Email to which notifications will be sent

# print some information about the job; this is what ends up in the
# job_<jobid>.out file shown further below
scontrol show job $SLURM_JOB_ID

# wait some time...
sleep 50

# print the name of the node the job ran on
hostname
</file>

Please note that your job will be killed by the queuing system if it tries to use more memory than requested or if it runs longer than the time specified in the batch script. So to be on the safe side you can set these values a little bit higher. If you set them too high, however, your job might not start because there are not enough resources (e.g. no machine has the amount of memory you are asking for).
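
To find reasonable values, you can check what a finished job actually used with ''sacct'' (the job ID below is just the one from the example session that follows):

<code>
# show state, exit code, peak memory usage and elapsed time of a finished job
sacct -j 26495 --format=JobID,State,ExitCode,MaxRSS,Elapsed
</code>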

Now just submit your job script using ''sbatch example_job.sh'' from the command line. Please try to run jobs directly from the ''/scratch'' cluster-wide filesystem, where you have a directory under ''/scratch/<username>'', to lower the load on ''/home''. For testing purposes, set the runtime of your job below 1 minute and submit it to the test partition by adding ''-p test'' to sbatch:

<code>
dreger@sheldon-ng:..dreger/quickstart> pwd
/scratch/dreger/quickstart
dreger@sheldon-ng:..dreger/quickstart> sbatch -p test example_job.sh
Submitted batch job 26495
dreger@sheldon-ng:..dreger/quickstart> squeue -l -u dreger
Sun Jun 29 23:02:50 2014
 JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
 26495      test example_   dreger  RUNNING       0:24      1:00      1 x001
dreger@sheldon-ng:..dreger/quickstart> cat job_26495.out
JobId=26495 Name=example_job
UserId=dreger(4440) GroupId=fbedv(400)
Priority=10916 Account=fbedv QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2014-06-29T23:02:26 EligibleTime=2014-06-29T23:02:26
StartTime=2014-06-29T23:02:26 EndTime=2014-06-29T23:03:26
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=test AllocNode:Sid=sheldon-ng:27448
ReqNodeList=(null) ExcNodeList=(null)
NodeList=x001
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/clusterfs/scratch/dreger/quickstart/example_job.sh
WorkDir=/clusterfs/scratch/dreger/quickstart

x001
</code>
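
If a job was submitted with wrong values by mistake, it can be removed from the queue with ''scancel'':

<code>
# cancel a single job by its ID (as reported by sbatch or shown by squeue)
scancel 26495

# cancel all of your own jobs at once
scancel -u dreger
</code>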