====== Introduction to the Torque HPC cluster of the physics department ======

The login node of the HPC cluster is ''sheldon.physik.fu-berlin.de''. You can connect to it from anywhere using SSH, e.g. by running ''ssh sheldon.physik.fu-berlin.de'' on the command line or by using PuTTY on Windows.

Please send all questions regarding the HPC cluster to ''hpc@physik.fu-berlin.de''.

===== Getting information about cluster utilization =====
 
You can list the current workload on the HPC cluster using the command ''qf'': 

<code>
hpcuser@sheldon:~> qf

    NODE      OCCUPATION     AV.LOAD     STATE
========================================================
     042      [****    ]      4.00       free
     043      [****    ]      4.00       free
     044      [XXXXXXXX]      ----       down
     045      [****    ]      4.00       free
     046      [********]      8.00       job-exclusive
...
</code>

Each star ''*'' corresponds to one used core on the given node. A node in the state ''job-exclusive'' is full and cannot be used for new jobs. A node in the ''down'' or ''offline'' state has crashed or is under maintenance.
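
To get a quick overview of how many cores are currently idle, the ''qf'' output can be summarized with a small filter. The following is only a sketch: the helper name ''count_free_cores'' is made up for this example, and it assumes that the column layout matches the sample above (occupation bar in square brackets, one blank per idle core).

```shell
# Sum the idle cores over all nodes that qf reports as "free".
# Reads qf-style output on stdin; the layout is assumed from the sample above.
count_free_cores() {
    awk '
        /free/ && match($0, /\[[^]]*\]/) {
            bar = substr($0, RSTART + 1, RLENGTH - 2)   # text between [ and ]
            total += gsub(/ /, "", bar)                 # each blank is an idle core
        }
        END { print total + 0 }
    '
}

# Demonstration with three lines copied from the sample output:
count_free_cores <<'EOF'
     042      [****    ]      4.00       free
     044      [XXXXXXXX]      ----       down
     046      [********]      8.00       job-exclusive
EOF
```

On the login node the same function could be used as ''qf | count_free_cores''.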

===== Submitting a job to the HPC cluster =====

In order to do any calculations on the HPC cluster, **you have to submit your jobs to the queuing system** using the ''qsub'' command. Logging in to a compute node directly to start calculations there is not allowed at any time.

You submit jobs to the queuing system by writing a job-script which tells the queuing system about the resources your job needs and about the programs which are to be run. Basically the job-script is a shell script with some magic comments at the top (lines starting with ''#PBS'') which are parsed by the queuing system.

Once you have written your job-script you simply submit it with ''qsub job-script-filename''.
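
If you submit jobs from your own scripts, it is often handy to keep the job id that ''qsub'' prints. A minimal sketch, assuming the id has the form ''<number>.<server>''; the function name ''submit_job'' is hypothetical and not part of Torque:

```shell
# Submit a job script and print only the numeric part of the job id.
# Assumes qsub prints an id of the form <number>.<server> on standard output.
submit_job() {
    jobid=$(qsub "$1") || return 1
    echo "${jobid%%.*}"
}

# Usage on the login node (not executed here):
#   id=$(submit_job job-script-filename) && qstat "$id"
```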

==== The job-script ====

Let's first look at a very basic example of a job-script. It requests one core on one node for 1 hour and runs the "program" ''env'' and waits for a minute, so you can directly use it for your first test:

<code>
#!/bin/bash
#PBS -N some-good-name
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1
#PBS -m bea -M hpcuser@physik.fu-berlin.de

## go to the directory the user typed 'qsub' in
cd $PBS_O_WORKDIR

env > outputfile.txt
sleep 60
</code>

Save this job-script in a subdirectory and submit it from there. Make sure you change the email address given after ''-M'' in the ''#PBS -m'' line.

<xterm>
hpcuser@sheldon:~/test1> **qsub jobfile1**
26103.torque.physik.fu-berlin.de
hpcuser@sheldon:~/test1> **qstat 26103**
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
26103.torque              some-good-name   hpcuser                0 R batch          
</xterm>

You will receive an email message at the specified address when the job starts and a second one when it has finished (''-m bea'' requests mail when the job begins, ends or is aborted):

<code>
From: hpc-torque@physik.fu-berlin.de
Subject: PBS JOB 26103.torque.physik.fu-berlin.de
Date: Fri, 02 Mar 2012 18:38:38 +0100
To: hpcuser@physik.fu-berlin.de
Message-Id: <E1S3WRC-0005Qe-89@torque.physik.fu-berlin.de>

PBS Job Id: 26103.torque.physik.fu-berlin.de
Job Name:   some-good-name
Exec host:  n109/4
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=8688kb
resources_used.vmem=28128kb
resources_used.walltime=00:01:02
</code>

In your test directory you should find some files that were produced by your job:

<xterm>
hpcuser@sheldon:~/test1> **ls -l**
-rw-r--r-- 1 hpcuser fbedv  222 Mar  2 18:37 jobfile1
-rw-r--r-- 1 hpcuser fbedv 1663 Mar  2 18:37 outputfile.txt
-rw------- 1 hpcuser fbedv    0 Mar  2 18:37 some-good-name.e26103
-rw------- 1 hpcuser fbedv  256 Mar  2 18:37 some-good-name.o26103
</xterm>

''some-good-name.o26103'' contains the standard output of your job-script, while ''some-good-name.e26103'' contains the standard error output. ''outputfile.txt'' in this case contains the output of the ''env'' command, so you can list all PBS environment variables available to your job-scripts with ''grep ^PBS outputfile.txt''. Type ''man qsub'' for more information on ''#PBS'' options and ''$PBS_*'' environment variables.
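
As an illustration of how these variables might be used, the following sketch prints a one-line summary at the start of a job. The function name ''job_summary'' is hypothetical; ''PBS_JOBID'', ''PBS_JOBNAME'' and ''PBS_O_WORKDIR'' are set by Torque inside a job, and the ''none'' placeholders are only there so the snippet also runs outside of the queuing system.

```shell
# Print a one-line summary of the job from the PBS_* environment variables.
# Outside of a Torque job the variables are unset, so placeholders are shown.
job_summary() {
    echo "job=${PBS_JOBID:-none} name=${PBS_JOBNAME:-none} submitted_from=${PBS_O_WORKDIR:-none}"
}

job_summary
```

Placed near the top of a job-script, the line ends up in the ''.o'' file and makes it easier to match output files to jobs later.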

=== Requesting resources ===

The most important resources that can (and should!) be specified by your job-script using the ''#PBS -l'' option are listed in the following table:

^ Resource ^ Format ^ Description ^ Example ^ Default ^
| nodes    | {<node_count> %%|%% <hostname>} [:ppn=<ppn>] | Number of nodes to be reserved for exclusive use by the job. ppn=# specifies the number of cores on each node. | **nodes=10:ppn=12** -> request 10 nodes with 12 cores each\\ **nodes=n100:ppn=8+n101:ppn=8** -> request two explicit nodes by hostname (possible, but not recommended) | nodes=1:ppn=1 |
| walltime | seconds, or [[HH:]MM:]SS | Maximum amount of real time during which the job can be in the running state. The job will be terminated once this limit is reached. | **walltime=100:00:00** -> request 100 hours for this job | walltime=1:00:00 (1 hour) |
| pmem | size* | Maximum amount of physical memory used by any single process of the job. In our case this means per core. | **pmem=8gb** -> request 8gb RAM per core | pmem=2gb |
| file | size* | The amount of **local disk space per core per node** requested for the job. The space can be accessed at /local_scratch/$PBS_JOBID | **file=10gb** -> request 10 gigabytes of local disk space on each compute node for each core used on a node | none |

**size* format** = integer, optionally followed by a multiplier {b,kb,mb,gb,tb} meaning {bytes,kilobytes,megabytes,gigabytes,terabytes}. No suffix means bytes.
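
To make the size format concrete, here is a small sketch that converts a size string into bytes. The function name ''size_to_bytes'' is made up for this example, and it assumes binary multipliers (1kb = 1024 bytes), which should be checked against the Torque documentation before relying on exact numbers.

```shell
# Convert a size string such as 8gb or 1000mb into a number of bytes.
# Assumption: multipliers are powers of 1024 (1kb = 1024 bytes).
size_to_bytes() {
    num=${1%%[!0-9]*}      # leading digits
    unit=${1#"$num"}       # whatever follows them
    [ -n "$num" ] || { echo "not a size: $1" >&2; return 1; }
    case $unit in
        ''|b) factor=1 ;;
        kb)   factor=1024 ;;
        mb)   factor=$((1024 * 1024)) ;;
        gb)   factor=$((1024 * 1024 * 1024)) ;;
        tb)   factor=$((1024 * 1024 * 1024 * 1024)) ;;
        *)    echo "unknown multiplier: $unit" >&2; return 1 ;;
    esac
    echo $((num * factor))
}

size_to_bytes 8gb
```

Since ''pmem'' and ''file'' are per-core values, multiplying the result by ''ppn'' gives the amount requested on each node.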

=== Recommendations on resource usage ===

Note that in general it is a bad idea to specify far too large values for pmem or walltime //just to be on the safe side//, since this will very likely delay execution of your jobs. An explanation for this behaviour will be given in an upcoming section on backfill strategy of the queuing system.
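
One possible way to arrive at a realistic ''walltime'' is to take the ''resources_used.walltime'' value of a comparable finished job (it is reported in the notification email) and add a moderate safety margin. The helper names below are hypothetical and the margin is only an example value, not a recommendation of the cluster administrators.

```shell
# Convert [[HH:]MM:]SS into seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Convert seconds into H:MM:SS.
seconds_to_hms() {
    awk -v s="$1" 'BEGIN { printf "%d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60 }'
}

# padded_walltime MEASURED PERCENT: measured runtime plus PERCENT safety margin.
padded_walltime() {
    secs=$(hms_to_seconds "$1")
    seconds_to_hms $(( secs + secs * $2 / 100 ))
}

# A job that took 10 hours, requested again with a 20% margin:
padded_walltime 10:00:00 20
```

The printed value can be used directly in a ''#PBS -l walltime=...'' line.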

Please try to use local disk space on the compute nodes whenever possible. Since access to local storage is faster than access to your $PBS_O_WORKDIR, this will most likely speed up your compute jobs. At the same time it reduces the load on the central home-server. However, do not forget to copy your data back from the compute nodes to $PBS_O_WORKDIR after the job has finished, since the local disk space will be cleared once your job-script has finished. The following advanced job-script uses local disk space.

=== Advanced job-script example running CP2K using MPI on 12 nodes with 8 cores each ===

<code>
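#!/bin/bash
#PBS -N some_good_name
#PBS -l nodes=12:ppn=8
#PBS -l walltime=100:00:00
#PBS -l file=1000mb
#PBS -m ea -M hpcuser@physik.fu-berlin.de

## go to the directory the user typed 'qsub' in
cd "$PBS_O_WORKDIR"

## base name of the input file (some_good_name.inp)
flag=some_good_name

## number of cores assigned to the job (one line per core in $PBS_NODEFILE)
nprocs=$(wc -l < "$PBS_NODEFILE")

## two-digit run counter kept in the file 'seq': read it, then increment it
if [ ! -f ./seq ]; then echo "01" > seq; fi
export seq=$(cat seq)
awk 'BEGIN{printf "%2.2d\n",ENVIRON["seq"]+1}' > seq

infile=${flag}.inp
outfile="${flag}.$seq.${nprocs}-pe.out"

## record the assigned nodes and the job environment for later debugging
mkdir -p ~/traceback-torque
cat "$PBS_NODEFILE" >> ~/traceback-torque/nodes.${outfile}
export              >> ~/traceback-torque/nodes.${outfile}

## copy all input data to the local scratch directory and run there
workdir=/local_scratch/$PBS_JOBID
cp -r "$PBS_O_WORKDIR"/* "$workdir"
cd "$workdir"

mpirun cp2k.popt "$infile" > "$outfile"

## copy the results back before the local scratch space is cleared
cp -r * "$PBS_O_WORKDIR"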
</code>
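
In the script above the results are only copied back if the last line is reached. A possible variation, shown here as a sketch under stated assumptions, registers the copy-back step with ''trap'' so that it also runs when the script exits early. It assumes that the queuing system sends ''SIGTERM'' before it kills a job, which has not been verified for this cluster. The function name ''stage_out'' and the fallback to temporary directories outside of Torque are only there for illustration.

```shell
# Inside a Torque job use the PBS variables; otherwise fall back to temporary
# directories so the sketch can be tried out on any machine.
submitdir=${PBS_O_WORKDIR:-$(mktemp -d)}
if [ -n "${PBS_JOBID:-}" ]; then
    workdir=/local_scratch/$PBS_JOBID
else
    workdir=$(mktemp -d)
fi

# Copy everything from local scratch (including dotfiles) back to the submit directory.
stage_out() {
    cp -r "$workdir"/. "$submitdir"/
}

trap stage_out EXIT        # runs when the script ends, also after an error exit
trap 'exit 1' TERM INT     # turn these signals into an exit so the EXIT trap fires

cd "$workdir" || exit 1
echo "placeholder result" > result.txt   # stands in for the real computation
```

Note that the copy itself takes time, so for large result sets it may not complete if the job is killed shortly after the signal.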

===== Viewing the Queue Status =====
<xterm>
qstat
</xterm>
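
The default ''qstat'' listing can be condensed into a count of jobs per state (''R'' for running, ''Q'' for queued). This is a sketch only: the function name ''count_jobs_by_state'' is hypothetical, and it assumes the six-column default layout shown in the ''qstat'' example earlier on this page, with two header lines.

```shell
# Count jobs per state from the default qstat listing read on stdin.
# Assumes two header lines and the state in the fifth column.
count_jobs_by_state() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the login node (not executed here):
#   qstat | count_jobs_by_state
```

Other output formats of ''qstat'' use different columns and would need a different field number.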

===== Run a job interactively =====
If you want to run a job on a compute node interactively (e.g. for debugging purposes), simply add the **''-I''** option to the ''qsub'' command line together with the Torque resource options you would normally put in the ''#PBS'' lines of your job-script:

<xterm>
hpcuser@sheldon:~> qsub **-I** -l cput=20:00:00 -l nodes=1:ppn=1 -N jobname
</xterm>

''qsub'' does not return in this case; instead, as soon as your job is scheduled, you get an interactive shell on a compute node.

===== Requesting resources =====
The following resources can be requested from the queuing system:
==== Number of Nodes and CPUs per node ====
Default value: 1 node with 1 cpu.

<xterm>
glaweh@n041:~> qsub -lnodes=1:ppn=2
</xterm>
Requests one node with two cpus.

==== Infiniband Interconnect ====
If your MPI job runs on more than one node and benefits from the InfiniBand interconnect, add '':infiniband'' to your node resource list:
<xterm>
glaweh@n041:~> qsub -lnodes=2:ppn=2:infiniband
</xterm>
Requests two nodes with two cpus each, interconnected via InfiniBand.

==== Memory per processor ====
<xterm>
glaweh@n041:~> qsub -lpmem=10g
</xterm>
Requests 10 gigabytes of memory per processor.

Default value: 2 gigabytes
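
The individual resource options can be combined on one command line. The following sketch assembles them from a few parameters. The function name ''resource_args'' is hypothetical, and the optional fourth argument is simply appended to the node specification as shown for '':infiniband'' above.

```shell
# resource_args NODES PPN PMEM [FEATURE]
# Print the -l options for qsub; FEATURE (e.g. infiniband) is optional.
resource_args() {
    spec="nodes=$1:ppn=$2"
    if [ -n "${4:-}" ]; then
        spec="$spec:$4"
    fi
    echo "-l $spec -l pmem=$3"
}

resource_args 2 2 10gb infiniband
```

On the login node this could be used as ''qsub $(resource_args 2 2 10gb infiniband) job-script-filename''.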

===== When does my job start =====
You can look up the expected start time of a job with ''showstart JOBID'':
<xterm>
pneuser@sheldon:~> showstart 9970.n022
job 9970 requires 1 proc for 12:12:00:00
Earliest start in        00:03:03 on Thu Dec 15 12:13:20
Earliest completion in 12:11:56:57 on Wed Dec 28 00:13:20
Best Partition: Gross
</xterm>