Introduction to GPU accelerated jobs

Currently we have 31 nodes in the yoshi cluster (ygpu01-ygpu31) equipped with GPU boards. The exact hardware configuration of each node is:

2x NVidia Tesla M2070
2x Xeon X5570
24GB RAM
QDR Infiniband between all GPU nodes

In order to use the GPU cards, you need to allocate them through the queuing system with the --gres=gpu:2 option. To use only one card, submit with --gres=gpu:1. You also have to state the partition explicitly with --partition=gpu-main (or --partition=gpu-test for the GPU test queue).
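
As a minimal sketch, the options above could appear in a job script header like this, here requesting a single card on the test partition. The time value is only an illustrative choice that stays within the 2-hour limit of gpu-test.

```shell
#!/bin/bash
# Sketch of a job header: one GPU card on the GPU test partition.
#SBATCH --partition=gpu-test
#SBATCH --gres=gpu:1
# Illustrative run time, below the 2-hour limit of gpu-test.
#SBATCH --time=00:30:00
```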

GROMACS example using GPU acceleration

Here is a simple example using GROMACS. First I'll use an interactive session to explore the GPU features; at the end I'll supply a complete batch script for use with sbatch.

dreger@yoshi:~/gpu> sinfo | grep gpu
gpu-test     up    2:00:00      1   idle ygpu01
gpu-main     up   infinite     30   idle ygpu[02-31]

The test partition gpu-test, which consists of the single node ygpu01, will most likely be free, since it has a time limit of 2 hours. So we'll use that for testing:

dreger@yoshi:~/gpu> srun --time=02:00:00 --nodes=1 --tasks=8 --gres=gpu:2 --partition=gpu-test --mem=1G --pty /bin/bash
dreger@ygpu01:~/gpu> env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1
dreger@ygpu01:~/gpu> nvidia-smi
Thu Jun 18 14:16:19 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.65     Driver Version: 340.65         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2070         Off  | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2070         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
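
As the session above shows, CUDA_VISIBLE_DEVICES lists the cards allocated to the job. A job script could use it to confirm how many GPUs it was given before starting a long run. The following is a minimal sketch; the function name count_visible_gpus is made up for illustration and is not part of Slurm or CUDA.

```shell
# Hypothetical helper: count the entries in a comma-separated device list
# such as the CUDA_VISIBLE_DEVICES value shown above ("0,1").
# Uses the first argument if given, otherwise the environment variable.
count_visible_gpus() {
    devices="${1-${CUDA_VISIBLE_DEVICES-}}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split on commas and print the number of fields.
    echo "$devices" | awk -F, '{ print NF }'
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Inside a job submitted with --gres=gpu:2 this should report 2; outside an allocation, where the variable is unset, it reports 0.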

The nvidia-smi command gives some information on the GPUs. Currently no process is running on the GPUs. We'll start a simple GROMACS computation:

dreger@ygpu01:~/gpu> module load gromacs/non-mpi/4.6.7-cuda
dreger@ygpu01:~/gpu> genbox -box 9 9 9 -p -cs spc216 -o waterbox.gro
dreger@ygpu01:~/gpu> grompp -f run.mdp -c waterbox.gro -p topol.top
dreger@ygpu01:~/gpu> mdrun
[...]
Using 2 MPI threads
Using 4 OpenMP threads per tMPI thread

2 GPUs detected:
  #0: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible
  #1: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible

2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
[...]
               Core t (s)   Wall t (s)        (%)
       Time:      262.880       34.401      764.2
                 (ns/day)    (hour/ns)
Performance:       25.121        0.955

While your job runs, you can log in to the node and call nvidia-smi to check whether the GPUs are actually being used:

dreger@ygpu01:~> nvidia-smi
Thu Jun 18 14:25:21 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.65     Driver Version: 340.65         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2070         Off  | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     67MiB /  5375MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2070         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     67MiB /  5375MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     11481  mdrun                                                 55MiB |
|    1     11481  mdrun                                                 55MiB |
+-----------------------------------------------------------------------------+

Please check your job log files to see whether your program had problems using the GPUs. In the case of GROMACS this might look like:

NOTE: GPU(s) found, but the current simulation can not use GPUs
      To use a GPU, set the mdp option: cutoff-scheme = Verlet
      (for quick performance testing you can use the -testverlet option)

Using 8 MPI threads

2 GPUs detected:
  #0: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible
  #1: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible

2 compatible GPUs detected in the system, but none will be used.
Consider trying GPU acceleration with the Verlet scheme!

In this case the cutoff-scheme in effect cannot be used with GPU acceleration. As the note says, set cutoff-scheme = Verlet in the mdp file to enable it.
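
This check can be automated at the end of a job. The sketch below searches a log file for the message shown above; the function name gpus_left_unused is hypothetical, and md.log is only assumed as the log file name (with -deffnm the name differs).

```shell
# Hypothetical helper: succeed if a GROMACS log reports that GPUs were
# detected but not used (message text taken from the output above).
gpus_left_unused() {
    grep -q "but none will be used" "$1"
}

# Usage sketch; md.log is an assumed log file name.
if [ -f md.log ] && gpus_left_unused md.log; then
    echo "WARNING: mdrun found GPUs but did not use them" >&2
fi
```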

Compare the timings with a test run on the same node that does not use the GPUs. In some cases the GPUs will not help at all, even though nvidia-smi shows high utilization. For this example without GPUs (note the missing -cuda in the module load command) we get:

dreger@ygpu01:~/gpu> module load gromacs/non-mpi/4.6.7
dreger@ygpu01:~/gpu> grompp -f run.mdp -c waterbox.gro -p topol.top
dreger@ygpu01:~/gpu> mdrun

               Core t (s)   Wall t (s)        (%)
       Time:      844.970      106.315      794.8
                 (ns/day)    (hour/ns)
Performance:        8.128        2.953

So in this case the calculation runs about three times faster with two GPU cards (34.4 s versus 106.3 s wall time).
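
The speedup can be computed directly from the two timing summaries. This is a small sketch with made-up function names; wall_seconds assumes the "Time:" line format shown above (core time, wall time, percentage).

```shell
# Extract the wall-clock seconds from an mdrun timing summary on stdin.
# Assumes a line of the form: "Time:  <core t>  <wall t>  <percent>".
wall_seconds() {
    awk '$1 == "Time:" { print $3 }'
}

# Print the ratio of two timings, rounded to two decimals.
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.2f\n", cpu / gpu }'
}

# CPU-only wall time versus two-GPU wall time from the runs above.
speedup 106.315 34.401   # prints 3.09
```

The ns/day figures give the same ratio: 25.121 / 8.128 is also about 3.09.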

Example batch file

A job script for the example given above could look like:

#!/bin/bash

#SBATCH --mail-user=dreger@physik.fu-berlin.de
#SBATCH --mail-type=end

#SBATCH --output=job%j.out
#SBATCH --error=job%j.err
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=1024
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --partition=gpu-main

module load gromacs/non-mpi/4.6.7-cuda

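# Tag the output files with the job ID and the host name
TAG="${SLURM_JOB_ID}-$(hostname -s)-cuda"

grompp -f run.mdp -c waterbox.gro -p topol.top -o "output-$TAG"
# -testverlet switches to the Verlet cutoff scheme for quick GPU testing
mdrun -nt "${SLURM_CPUS_ON_NODE}" -testverlet -v -deffnm "output-$TAG"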

Please make sure you change the e-mail address if you use this for your own tests ;)
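
A job that fails on a missing input file still waits in the queue first, so it may be worth checking the inputs at the top of the script. The following is a minimal sketch; require_files is a hypothetical helper, and the file names are the ones used in the example above.

```shell
# Hypothetical pre-flight check: fail early if any input file is missing.
require_files() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
}

# Usage sketch, to be placed before the grompp call in the job script:
# require_files run.mdp waterbox.gro topol.top || exit 1
```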
