Introduction to the Torque HPC cluster of the physics department
The login node of the HPC cluster is sheldon.physik.fu-berlin.de. You can connect to it from anywhere using ssh, e.g. by issuing ssh sheldon.physik.fu-berlin.de on the command line, or by using PuTTY from Windows.
Please send all questions regarding the HPC cluster to hpc@physik.fu-berlin.de.
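For example, connecting from a Linux machine might look like this (hpcuser stands for your own account name):
<xterm>
user@laptop:~> ssh hpcuser@sheldon.physik.fu-berlin.de
hpcuser@sheldon:~>
</xterm>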
Getting information about cluster utilization
You can list the current workload on the HPC cluster using the command qf:
<xterm>
hpcuser@sheldon:~> qf
NODE  OCCUPATION  AV.LOAD  STATE
========================================================
042   [****    ]  4.00     free
043   [****    ]  4.00     free
044   [XXXXXXXX]  ----     down
045   [****    ]  4.00     free
046   [********]  8.00     job-exclusive
...
</xterm>
A star (*) corresponds to one used core on the given node. A node in the state job-exclusive is full and cannot be used for new jobs. A node in down or offline state has crashed or is under maintenance.
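If you only want to see which nodes are currently unavailable, the standard Torque command pbsnodes can list them; a minimal sketch with illustrative output:
<xterm>
hpcuser@sheldon:~> pbsnodes -l
044   down
</xterm>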
Submitting a job to the HPC cluster
In order to do any calculations on the HPC cluster you have to submit your jobs to the queuing system using the qsub command. You may not log in to a compute node and start interactive calculations at any time.
You submit jobs to the queuing system by writing a job-script which tells the queuing system about the resources your job needs and about the programs to be run. Basically, the job-script is a shell script with some magic comments at the top (lines starting with #PBS) which are parsed by the queuing system.
Once you have written your job-script, you simply submit it with qsub job-script-filename.
The job-script
Let's first look at a very basic example of a job-script. It requests one core on one node for 1 hour, runs the "program" env, and waits for a minute, so you can directly use it for your first test:
#!/bin/bash
#PBS -N some-good-name
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1
#PBS -m bea -M hpcuser@physik.fu-berlin.de

## go to the directory the user typed 'qsub' in
cd $PBS_O_WORKDIR

env > outputfile.txt
sleep 60
Save this job-script in some subdirectory and submit it from there. Make sure you change the email address given after the -M option in the #PBS -m line.
<xterm>
hpcuser@sheldon:~/test1> qsub jobfile1
26103.torque.physik.fu-berlin.de
hpcuser@sheldon:~/test1> qstat 26103
Job id           Name             User        Time Use S Queue
---------------- ---------------- ----------- -------- - -----
26103.torque     some-good-name   hpcuser            0 R batch
</xterm>
You will receive an email message at the specified address when the job starts and a second one when it's finished:
From: hpc-torque@physik.fu-berlin.de
Subject: PBS JOB 26103.torque.physik.fu-berlin.de
Date: Fri, 02 Mar 2012 18:38:38 +0100
To: hpcuser@physik.fu-berlin.de
Message-Id: <E1S3WRC-0005Qe-89@torque.physik.fu-berlin.de>

PBS Job Id: 26103.torque.physik.fu-berlin.de
Job Name:   some-good-name
Exec host:  n109/4
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=8688kb
resources_used.vmem=28128kb
resources_used.walltime=00:01:02
In your test directory you should find some files that were produced by your job:
<xterm>
hpcuser@sheldon:~/test1> ls -l
-rw-r--r-- 1 hpcuser fbedv  222 Mar  2 18:37 jobfile1
-rw-r--r-- 1 hpcuser fbedv 1663 Mar  2 18:37 outputfile.txt
-rw------- 1 hpcuser fbedv    0 Mar  2 18:37 some-good-name.e26103
-rw------- 1 hpcuser fbedv  256 Mar  2 18:37 some-good-name.o26103
</xterm>
some-good-name.o26103 contains the standard output of your job-script, while some-good-name.e26103 contains the standard error output. outputfile.txt in this case contains the output of the env command, so you can use it to list all PBS environment variables usable by your job-scripts by running grep ^PBS outputfile.txt. Type man qsub for more information on #PBS options and $PBS_* environment variables.
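The output of that grep will look roughly like the following; the exact set of variables and the paths depend on the Torque installation, so treat these values as illustrative:
<xterm>
hpcuser@sheldon:~/test1> grep ^PBS outputfile.txt
PBS_ENVIRONMENT=PBS_BATCH
PBS_JOBID=26103.torque.physik.fu-berlin.de
PBS_JOBNAME=some-good-name
PBS_NODEFILE=/var/spool/torque/aux/26103.torque.physik.fu-berlin.de
PBS_O_WORKDIR=/home/hpcuser/test1
PBS_QUEUE=batch
</xterm>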
Requesting resources
The most important resources that can (and should!) be specified by your job-script using the #PBS -l option are listed in the following table:
Resource | Format | Description | Example | Default |
---|---|---|---|---|
nodes | {<node_count>\|<hostname>}[:ppn=<ppn>] | Number of nodes to be reserved for exclusive use by the job; ppn=<ppn> specifies the number of cores to use on each node. | nodes=10:ppn=12 → request 10 nodes with 12 cores each; nodes=n100:ppn=8+n101:ppn=8 → request two explicit nodes by hostname (possible, but not recommended) | nodes=1:ppn=1 |
walltime | seconds, or [[HH:]MM:]SS | Maximum amount of real time during which the job can be in the running state; the job will be terminated once this limit is reached. | walltime=100:00:00 → request 100 hours for this job | walltime=1:00:00 (1 hour) |
pmem | size* | Maximum amount of physical memory used by any single process of the job; in our case this means per core. | pmem=8gb → request 8 gigabytes of RAM per core | pmem=2gb |
file | size* | Amount of local disk space requested for the job, per core and per node; the space can be accessed at /local_scratch/$PBS_JOBID. | file=10gb → request 10 gigabytes of local disk space on each compute node for each core used on that node | none |
size* format = an integer, optionally followed by a multiplier {b,kb,mb,gb,tb} meaning {bytes, kilobytes, megabytes, gigabytes, terabytes}; no suffix means bytes.
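Putting several of these together, the header of a job-script requesting two 8-core nodes, 4 gigabytes of RAM per core, and 20 gigabytes of local scratch space could look like this (a sketch; adapt the values and the job name to your own job):

#!/bin/bash
#PBS -N my-job
#PBS -l walltime=24:00:00
#PBS -l nodes=2:ppn=8
#PBS -l pmem=4gb
#PBS -l file=20gb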
Recommendations on resource usage
Note that in general it is a bad idea to specify far too large values for pmem or walltime just to be on the safe side, since this will very likely delay the execution of your jobs. An explanation of this behaviour will be given in an upcoming section on the backfill strategy of the queuing system. The resources_used lines in the job-completion email shown above are a good starting point for choosing realistic values.
Please try to use local disk space on the compute nodes whenever possible. Since access to local storage is faster than access to your $PBS_O_WORKDIR, this will most likely speed up your compute jobs. At the same time it reduces the load on the central home-server. However, do not forget to copy data back from the compute nodes to $PBS_O_WORKDIR after the job has finished, since the local disk space will be cleared once your job-script has finished. The following advanced job-script uses local disk space.
Advanced job-script example running CP2K using MPI on 12 nodes with 8 cores each
#!/bin/bash
#PBS -N some_good_name
#PBS -l nodes=12:ppn=8
#PBS -l walltime=100:00:00
#PBS -l file=1000M
#PBS -m ea -M hpcuser@physik.fu-berlin.de

cd $PBS_O_WORKDIR
flag=some_good_name

## total number of cores = number of lines in the node file
nprocs=`cat $PBS_NODEFILE | wc -l`

## maintain a two-digit run counter in the file 'seq'
if [ ! -f ./seq ]; then echo "01" > seq ; fi
export seq=`cat seq`
awk 'BEGIN{printf "%2.2d\n",ENVIRON["seq"]+1}' > seq

infile=${flag}.inp
outfile="${flag}.$seq.${nprocs}-pe.out"

## record the allocated nodes and the environment for later reference
## (the directory ~/traceback-torque is assumed to exist)
cat $PBS_NODEFILE >> ~/traceback-torque/nodes.${outfile}
export >> ~/traceback-torque/nodes.${outfile}

## copy the input data to the fast local scratch space and run there
workdir=/local_scratch/$PBS_JOBID
cp -r $PBS_O_WORKDIR/* $workdir
cd $workdir
mpirun cp2k.popt $infile > $outfile

## copy the results back before the scratch space is cleared
cp -r * $PBS_O_WORKDIR
Viewing the Queue Status
You can view the current status of the queue, including all running and waiting jobs, with qstat:
<xterm>
hpcuser@sheldon:~> qstat
</xterm>
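qstat accepts the usual Torque options; for example, qstat -u restricts the listing to a single user's jobs and qstat -f prints the full status of one job (see man qstat for the complete list):
<xterm>
hpcuser@sheldon:~> qstat -u hpcuser
hpcuser@sheldon:~> qstat -f 26103
</xterm>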
Run a job interactively
If you want to run a job on a compute node interactively (e.g. for debugging purposes), simply put a -I option on the qsub command line, together with the Torque resource options you normally use in the #PBS lines of your job-script:
<xterm>
hpcuser@sheldon:~> qsub -I -l cput=20:00:00 -l nodes=1:ppn=1 -N jobname
</xterm>
qsub does not return in this case; instead, as soon as your job gets scheduled, you get an interactive shell on a compute node.
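A typical interactive session starts like the following sketch; the job id and node name are illustrative:
<xterm>
hpcuser@sheldon:~> qsub -I -l nodes=1:ppn=1 -N jobname
qsub: waiting for job 26104.torque.physik.fu-berlin.de to start
qsub: job 26104.torque.physik.fu-berlin.de ready

hpcuser@n042:~>
</xterm>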
Further examples of requesting resources
The following resources can be requested from the queuing system:
Number of Nodes and CPUs per node
Default value: 1 node with 1 CPU.
<xterm>
glaweh@n041:~> qsub -lnodes=1:ppn=2
</xterm>
Requests one node with two CPUs.
Infiniband Interconnect
In case your MPI job runs on more than one node and profits from the Infiniband interconnect, add ":infiniband" to your node resource list:
<xterm>
glaweh@n041:~> qsub -lnodes=2:ppn=2:infiniband
</xterm>
Requests two nodes interconnected with Infiniband, with two CPUs each.
Memory per processor
<xterm>
glaweh@n041:~> qsub -lpmem=10gb
</xterm>
Requests 10 gigabytes of memory per processor.
Default value: 2 gigabytes.
When does my job start?
You can look up the expected start time with showstart JOBID:
<xterm>
pneuser@sheldon:~> showstart 9970.n022
job 9970 requires 1 proc for 12:12:00:00

Earliest start in         00:03:03 on Thu Dec 15 12:13:20
Earliest completion in 12:11:56:57 on Wed Dec 28 00:13:20

Best Partition: Gross
</xterm>
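If a job stays queued longer than showstart predicts, the Maui scheduler's checkjob command, where it is installed alongside showstart, reports the job state and the reasons it has not started yet; whether it is available here is an assumption about the local setup, so fall back to mailing hpc@physik.fu-berlin.de if the command is missing:
<xterm>
pneuser@sheldon:~> checkjob 9970
</xterm>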