Introduction to the Torque HPC cluster of the physics department

The login node of the HPC cluster is sheldon.physik.fu-berlin.de. You can connect to it from anywhere using SSH, e.g. by issuing ssh sheldon.physik.fu-berlin.de on the command line or by using PuTTY on Windows.

Please send all questions regarding the HPC cluster to hpc@physik.fu-berlin.de.

Getting information about cluster utilization

You can list the current workload on the HPC cluster using the command qf:

hpcuser@sheldon:~> qf

    NODE      OCCUPATION     AV.LOAD     STATE
========================================================
     042      [****    ]      4.00       `free`
     043      [****    ]      4.00       `free`
     044      [XXXXXXXX]      ----       `down`           
     045      [****    ]      4.00       `free`
     046      [********]      8.00       `job-exclusive`
...

A star * corresponds to one used core on the given node. A node in the state job-exclusive is full and cannot be used for new jobs. A node in the state down or offline has crashed or is under maintenance.
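To get a quick number instead of reading the bars by eye, the occupation column can be counted with a small filter. The following is only a sketch: free_cores is a hypothetical helper, and it assumes that qf keeps the layout shown above (one bracketed bar per node, one blank per unused core, and the word free in the state column).

```shell
#!/bin/bash
# Hypothetical helper: sum the unused cores of all nodes in state "free".
# Assumes qf-style lines such as:  042  [****    ]  4.00  free
free_cores() {
    awk '
        # Only lines that contain an occupation bar and the state "free".
        /\[/ && /free/ {
            start = index($0, "[")
            end = index($0, "]")
            bar = substr($0, start + 1, end - start - 1)
            # Each blank inside the bar stands for one unused core.
            total += gsub(/ /, "", bar)
        }
        END { print total + 0 }
    '
}

# Usage on the login node:
#   qf | free_cores
```

Nodes in the states job-exclusive, down or offline are ignored, since they cannot take new jobs.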

Submitting a job to the HPC cluster

To run any calculation on the HPC cluster you have to submit it as a job to the queuing system using the qsub command. You must never log in to a compute node directly and start calculations there.

You submit jobs to the queuing system by writing a job-script which tells the queuing system about the resources your job needs and about the programs which are to be run. Basically the job-script is a shell script with some magic comments at the top (lines starting with #PBS) which are parsed by the queuing system.

Once you have written your job-script you simply submit it with qsub job-script-filename.

The job-script

Let's first look at a very basic example of a job-script. It requests one core on one node for 1 hour and runs the "program" env and waits for a minute, so you can directly use it for your first test:

#!/bin/bash
#PBS -N some-good-name
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1
#PBS -m bea -M hpcuser@physik.fu-berlin.de

## go to the directory the user typed 'qsub' in
cd $PBS_O_WORKDIR

env > outputfile.txt
sleep 60

Save this job-script in some subdir and submit it from there. Make sure you change the email-address given in the #PBS -m line.

<xterm>
hpcuser@sheldon:~/test1> qsub jobfile1
26103.torque.physik.fu-berlin.de
hpcuser@sheldon:~/test1> qstat 26103
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
26103.torque              some-good-name   hpcuser                0 R batch
</xterm>

You will receive an email message at the specified address when the job starts and a second one when it's finished:

From: hpc-torque@physik.fu-berlin.de
Subject: PBS JOB 26103.torque.physik.fu-berlin.de
Date: Fri, 02 Mar 2012 18:38:38 +0100
To: hpcuser@physik.fu-berlin.de
Message-Id: <E1S3WRC-0005Qe-89@torque.physik.fu-berlin.de>

PBS Job Id: 26103.torque.physik.fu-berlin.de
Job Name:   some-good-name
Exec host:  n109/4
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=8688kb
resources_used.vmem=28128kb
resources_used.walltime=00:01:02

In your test directory you should find some files that were produced by your job:

<xterm>
hpcuser@sheldon:~/test1> ls -l
-rw-r--r-- 1 hpcuser fbedv  222 Mar  2 18:37 jobfile1
-rw-r--r-- 1 hpcuser fbedv 1663 Mar  2 18:37 outputfile.txt
-rw------- 1 hpcuser fbedv    0 Mar  2 18:37 some-good-name.e26103
-rw------- 1 hpcuser fbedv  256 Mar  2 18:37 some-good-name.o26103
</xterm>

some-good-name.o26103 contains the standard output of your job-script, while some-good-name.e26103 contains its standard error output. outputfile.txt contains the output of the env command in this case, so you can list all PBS environment variables available to your job-scripts by running grep ^PBS outputfile.txt. Type man qsub for more information on #PBS options and $PBS_* environment variables.
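The naming scheme of these two files can be written down as a small helper, which is handy when a wrapper script needs to find the output of a job it has just submitted. This is a sketch based only on the example above (job name, then .o or .e, then the numeric part of the job ID); job_output_files is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the expected stdout and stderr file names of a job.
#   $1: job name as given with "#PBS -N"
#   $2: job ID as printed by qsub, e.g. 26103.torque.physik.fu-berlin.de
job_output_files() {
    local name="$1" full_id="$2"
    # Keep only the numeric part in front of the first dot.
    local number="${full_id%%.*}"
    printf '%s.o%s\n%s.e%s\n' "$name" "$number" "$name" "$number"
}

# Usage on the login node (qsub prints the job ID on standard output):
#   jobid=$(qsub jobfile1)
#   job_output_files some-good-name "$jobid"
```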

Requesting resources

The most important resources that can (and should!) be specified by your job-script using the #PBS -l option are listed in the following table:

Resource	Format	Description	Example	Default
nodes	{<node_count> | <hostname>}[:ppn=<ppn>]	Number of nodes to be reserved for exclusive use by the job. ppn=<ppn> specifies the number of cores on each node.	nodes=10:ppn=12 → request 10 nodes with 12 cores each; nodes=n100:ppn=8+n101:ppn=8 → request two explicit nodes by hostname (possible, but not recommended)	nodes=1:ppn=1
walltime	seconds, or [[HH:]MM:]SS	Maximum amount of real time during which the job can be in the running state. The job will be terminated once this limit is reached.	walltime=100:00:00 → request 100 hours for this job	walltime=1:00:00 (1 hour)
pmem	size*	Maximum amount of physical memory used by any single process of the job. In our case this means per core.	pmem=8gb → request 8gb RAM per core	pmem=2gb
file	size*	The amount of local disk space per core per node requested for the job. The space can be accessed at /local_scratch/$PBS_JOBID	file=10gb → request 10 gigabytes of local disk space on each compute node for each core used on a node	none

size* format = integer, optionally followed by a multiplier {b,kb,mb,gb,tb} meaning {bytes,kilobytes,megabytes,gigabytes,terabytes}. No suffix means bytes.
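To compare requests written with different multipliers, it can help to convert them to bytes. The sketch below implements the size format described above; size_to_bytes is a hypothetical helper, and the use of powers of 1024 for the multipliers is an assumption that should be checked against the Torque documentation.

```shell
#!/bin/bash
# Hypothetical helper: convert a size such as 8gb or 1000 into bytes.
# Assumption: kb, mb, gb and tb are powers of 1024.
size_to_bytes() {
    local number suffix factor
    number=$(printf '%s' "$1" | sed 's/[A-Za-z]*$//')
    suffix=$(printf '%s' "$1" | sed 's/^[0-9]*//')
    # Accept only plain decimal integers without leading zeros.
    case "$number" in
        ''|*[!0-9]*|0[0-9]*) echo "invalid size: $1" >&2; return 1 ;;
    esac
    case "$suffix" in
        ''|b) factor=1 ;;
        kb)   factor=1024 ;;
        mb)   factor=$((1024 * 1024)) ;;
        gb)   factor=$((1024 * 1024 * 1024)) ;;
        tb)   factor=$((1024 * 1024 * 1024 * 1024)) ;;
        *)    echo "unknown size suffix: $suffix" >&2; return 1 ;;
    esac
    echo $((number * factor))
}

# Usage:
#   size_to_bytes 8gb
#   size_to_bytes 1000
```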

Recommendations on resource usage

Note that in general it is a bad idea to specify far too large values for pmem or walltime just to be on the safe side, since this will very likely delay execution of your jobs. An explanation for this behaviour will be given in an upcoming section on backfill strategy of the queuing system.
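One way to arrive at a realistic walltime request is to take the resources_used.walltime value from the notification mail of a comparable finished job and add a moderate safety margin. The sketch below does this arithmetic. Both function names are hypothetical, and the default margin of 20 percent is an arbitrary assumption, not a cluster policy.

```shell
#!/bin/bash
# Convert a walltime in the format [[HH:]MM:]SS into seconds.
walltime_to_seconds() {
    printf '%s\n' "$1" |
        awk -F: '{ t = 0; for (i = 1; i <= NF; i++) t = t * 60 + $i; print t }'
}

# Hypothetical helper: add a safety margin to a measured walltime.
#   $1: measured walltime, e.g. 10:00:00
#   $2: margin in percent (assumed default: 20)
suggest_walltime() {
    local used padded margin="${2:-20}"
    used=$(walltime_to_seconds "$1")
    padded=$(( used * (100 + margin) / 100 ))
    printf '%d:%02d:%02d\n' \
        $(( padded / 3600 )) $(( padded % 3600 / 60 )) $(( padded % 60 ))
}

# Usage:
#   suggest_walltime 10:00:00       # measured 10 hours, default margin
#   suggest_walltime 00:01:02 50    # measured 62 seconds, 50 percent margin
```

The result can be used directly as the value of #PBS -l walltime=... in the next job-script.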

Please try to use local disk space on the compute nodes whenever possible. Since access to local storage is faster than access to your $PBS_O_WORKDIR, this will most likely speed up your compute jobs. At the same time it reduces the load on the central home-server. However, do not forget to copy your data back from the compute nodes to $PBS_O_WORKDIR at the end of your job-script, since the local disk space will be cleared once your job-script has finished. The advanced job-script example below uses local disk space.
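The copy-in, run, copy-back pattern can be wrapped in a function so that the results are copied back even if the program fails. This is a minimal sketch: run_in_scratch is a hypothetical helper, and my_program in the usage note is a placeholder for your own executable.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a scratch directory.
#   $1: directory holding the input data (e.g. $PBS_O_WORKDIR)
#   $2: scratch directory (e.g. /local_scratch/$PBS_JOBID)
#   remaining arguments: the command to run inside the scratch directory
run_in_scratch() {
    local origin="$1" scratch="$2" status=0
    shift 2
    mkdir -p "$scratch" || return 1
    # Stage in: copy the contents of the origin, including dot files.
    cp -r "$origin"/. "$scratch"/ || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$scratch" && "$@" ) || status=$?
    # Stage out, even if the command failed, so partial results survive.
    cp -r "$scratch"/. "$origin"/
    return "$status"
}

# Usage inside a job-script:
#   run_in_scratch "$PBS_O_WORKDIR" "/local_scratch/$PBS_JOBID" ./my_program input.dat
```

The function returns the exit status of the command, so the job-script can still react to a failed run after the data has been copied back.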

Advanced job-script example running CP2K using MPI on 12 nodes with 8 cores each

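#!/bin/bash
#PBS -N some_good_name
#PBS -l nodes=12:ppn=8
#PBS -l walltime=100:00:00
#PBS -l file=1000mb
#PBS -m ea -M hpcuser@physik.fu-berlin.de

## go to the directory the user typed 'qsub' in
cd "$PBS_O_WORKDIR"

flag=some_good_name

## number of MPI processes = number of lines in the node file (12 * 8 = 96)
nprocs=$(wc -l < "$PBS_NODEFILE")

## keep a two-digit run counter in the file 'seq' to number the output files
if [ ! -f ./seq ]; then echo "01" > seq; fi
export seq=$(cat seq)
awk 'BEGIN{printf "%2.2d\n",ENVIRON["seq"]+1}' > seq

infile=${flag}.inp
outfile="${flag}.$seq.${nprocs}-pe.out"

## record the allocated nodes and the job environment for later debugging
mkdir -p ~/traceback-torque
cat "$PBS_NODEFILE" >> ~/traceback-torque/nodes.${outfile}
export              >> ~/traceback-torque/nodes.${outfile}

## copy the input data to the local scratch space and run there
workdir=/local_scratch/$PBS_JOBID
cp -r "$PBS_O_WORKDIR"/* "$workdir"
cd "$workdir"

mpirun cp2k.popt "$infile" > "$outfile"

## copy the results back before the local scratch space is cleared
cp -r * "$PBS_O_WORKDIR"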
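The run counter kept in the file seq is the least obvious part of this job-script: each submission reads the current two-digit number, uses it in the name of the output file, and stores the next number for the following run. The same logic can be tried outside the cluster with the following sketch, where next_run_number is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the current run number stored in a counter file
# and store the incremented number (two digits, zero padded) for the next run.
next_run_number() {
    local counter_file="$1" current
    # The first run creates the counter file.
    [ -f "$counter_file" ] || echo "01" > "$counter_file"
    current=$(cat "$counter_file")
    # awk treats "08" and "09" as decimal numbers, unlike shell arithmetic.
    awk -v n="$current" 'BEGIN { printf "%02d\n", n + 1 }' > "$counter_file"
    echo "$current"
}

# Usage: the first call prints 01, the second call prints 02, and so on.
#   seq=$(next_run_number ./seq)
#   outfile="some_good_name.$seq.out"
```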

Viewing the Queue Status

<xterm> qstat </xterm>
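In scripts it is often enough to know the state letter of a single job, which qstat prints in the column S (for example Q for a queued and R for a running job). The following sketch extracts that letter. job_state is a hypothetical helper, and it assumes the default qstat layout shown earlier, in which the state is the second to last column.

```shell
#!/bin/bash
# Hypothetical helper: print the state letter of one job.
#   $1: numeric job ID, e.g. 26103
#   stdin: output of qstat
job_state() {
    awk -v id="$1" '
        # The first column starts with the job number followed by a dot.
        index($1, id ".") == 1 { print $(NF - 1); found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage on the login node:
#   qstat 26103 | job_state 26103
```

The function returns a non-zero exit status if the job is not listed at all.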

Run a job interactively

If you want to run a job on a compute node interactively (e.g. for debugging purposes), simply add the -I option to the qsub command line, together with the Torque resource options you normally use in the #PBS lines of your job-script:

<xterm> hpcuser@sheldon:~> qsub -I -l cput=20:00:00 -l nodes=1:ppn=1 -N jobname </xterm>

qsub does not return in this case; instead, as soon as you get scheduled, you get an interactive shell on a node.

Requesting resources

The following resources can be requested from the queuing system directly on the qsub command line:

Number of Nodes and CPUs per node

Default value: 1 node with 1 CPU.

<xterm>
glaweh@n041:~> qsub -l nodes=1:ppn=2
</xterm>
Requests one node with two CPUs.

Infiniband Interconnect

If your MPI job runs on more than one node and profits from the Infiniband interconnect, add ":infiniband" to your node resource list:
<xterm>
glaweh@n041:~> qsub -l nodes=2:ppn=2:infiniband
</xterm>
Requests two nodes with two CPUs each, interconnected with Infiniband.

Memory per processor

<xterm>
glaweh@n041:~> qsub -l pmem=10gb
</xterm>
Requests 10 gigabytes of memory per processor.

Default value: 2 gigabytes
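When jobs are submitted from a wrapper script, the node resource string can be assembled from variables instead of being typed by hand. This is a small sketch of that idea; nodes_spec is a hypothetical helper, and it only covers the node count, ppn and one optional property such as infiniband.

```shell
#!/bin/bash
# Hypothetical helper: build the value for "qsub -l" from its parts.
#   $1: number of nodes
#   $2: cores per node (ppn)
#   $3: optional node property, e.g. infiniband
nodes_spec() {
    local nodes="$1" ppn="$2" feature="${3:-}"
    if [ -n "$feature" ]; then
        printf 'nodes=%s:ppn=%s:%s\n' "$nodes" "$ppn" "$feature"
    else
        printf 'nodes=%s:ppn=%s\n' "$nodes" "$ppn"
    fi
}

# Usage on the login node (jobfile1 is the job-script to submit):
#   qsub -l "$(nodes_spec 2 2 infiniband)" -l pmem=10gb jobfile1
```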

When does my job start

You can look up the expected start time with "showstart JOBID":
<xterm>
pneuser@sheldon:~> showstart 9970.n022
job 9970 requires 1 proc for 12:12:00:00
Earliest start in 00:03:03 on Thu Dec 15 12:13:20
Earliest completion in 12:11:56:57 on Wed Dec 28 00:13:20
Best Partition: Gross
</xterm>
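If only the estimated start date is of interest, it can be cut out of the showstart output with a short filter. This sketch assumes that the output contains a line beginning with "Earliest start in", as in the example above; earliest_start is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print only the date of the earliest possible start.
#   stdin: output of showstart JOBID
earliest_start() {
    # Drop "Earliest start in <duration> on " and keep the rest of the line.
    sed -n 's/^Earliest start in *[^ ]* *on *//p'
}

# Usage on the login node:
#   showstart 9970.n022 | earliest_start
```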