====== Introduction to GPU accelerated jobs ======

Currently we have 31 nodes in the yoshi cluster (ygpu01-ygpu31) equipped with GPU boards. The exact hardware configuration is:

  * 2x NVIDIA {{:services:cluster:tesla-m2070-board-specification.pdf|Tesla M2070}}
  * 2x Xeon X5570
  * 24GB RAM
  * QDR InfiniBand between all GPU nodes

In order to use the GPU cards, you need to allocate them through the queuing system using the ''--gres=gpu:2'' option. To use only one card, submit with ''--gres=gpu:1'' instead. You also have to explicitly state the partition to run in using ''--partition=gpu-main'' (or ''--partition=gpu-test'' for the GPU test queue).

===== GROMACS example using GPU acceleration =====

Here I give a simple example using GROMACS. First I'll use an [[interactivesessions|interactive session]] to explore the GPU feature; at the end I'll supply a complete batch script for use with ''sbatch''.

<code>
dreger@yoshi:~/gpu> sinfo | grep gpu
gpu-test     up    2:00:00      1   idle ygpu01
gpu-main     up   infinite     30   idle ygpu[02-31]
</code>

The test partition gpu-test, which consists of the single node ygpu01, will most likely be free, since it has a time limit of 2 hours. So we'll use that for testing:

<code>
dreger@yoshi:~/gpu> srun --time=02:00:00 --nodes=1 --tasks=8 --gres=gpu:2 --partition=gpu-test --mem=1G --pty /bin/bash
dreger@ygpu01:~/gpu> env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1
dreger@ygpu01:~/gpu> nvidia-smi
Thu Jun 18 14:16:19 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2070         Off  | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2070         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      9MiB /  5375MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
</code>

The ''nvidia-smi'' command gives some information on the GPUs. Currently no process is running on the GPUs. We'll start a simple GROMACS computation, using the input files {{:services:cluster:run.mdp|}} and {{:services:cluster:topol.top|}}:

<code>
dreger@ygpu01:~/gpu> module load gromacs/non-mpi/4.6.7-cuda
dreger@ygpu01:~/gpu> genbox -box 9 9 9 -p -cs spc216 -o waterbox.gro
dreger@ygpu01:~/gpu> grompp -f run.mdp -c waterbox.gro -p topol.top
dreger@ygpu01:~/gpu> mdrun
[...]
Using 2 MPI threads
Using 4 OpenMP threads per tMPI thread

2 GPUs detected:
  #0: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible
  #1: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible

2 GPUs auto-selected for this run.
Mapping of GPUs to the 2 PP ranks in this node: #0, #1
[...]
               Core t (s)   Wall t (s)        (%)
       Time:      262.880       34.401      764.2
                 (ns/day)    (hour/ns)
Performance:       25.121        0.955
</code>

While your job is running, you can log in to the node and run ''nvidia-smi'' to check whether the GPUs are actually being used:

<code>
dreger@ygpu01:~> nvidia-smi
Thu Jun 18 14:25:21 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.65     Driver Version: 340.65         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2070         Off  | 0000:14:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     67MiB /  5375MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2070         Off  | 0000:15:00.0     Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |     67MiB /  5375MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     11481  mdrun                                                 55MiB |
|    1     11481  mdrun                                                 55MiB |
+-----------------------------------------------------------------------------+
</code>

Please check your job log files to see whether your program has problems using the GPUs. In the case of GROMACS this might look like the following, where the ''NOTE'' at the top and the last two lines are the important part:

<code>
NOTE: GPU(s) found, but the current simulation can not use GPUs
      To use a GPU, set the mdp option: cutoff-scheme = Verlet
      (for quick performance testing you can use the -testverlet option)

Using 8 MPI threads

2 GPUs detected:
  #0: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible
  #1: NVIDIA Tesla M2070, compute cap.: 2.0, ECC: yes, stat: compatible

2 compatible GPUs detected in the system, but none will be used.
Consider trying GPU acceleration with the Verlet scheme!
</code>

In this case a ''cutoff-scheme'' was specified that cannot be used with GPU acceleration.

Compare the timings with a test run on the same node that does not use the GPUs. In some cases the GPUs will not help at all, even though ''nvidia-smi'' shows high utilization. For this example without GPU (note the missing ''-cuda'' in the ''module load'' command) we get:

<code>
dreger@ygpu01:~/gpu> module load gromacs/non-mpi/4.6.7
dreger@ygpu01:~/gpu> grompp -f run.mdp -c waterbox.gro -p topol.top
dreger@ygpu01:~/gpu> mdrun
               Core t (s)   Wall t (s)        (%)
       Time:      844.970      106.315      794.8
                 (ns/day)    (hour/ns)
Performance:        8.128        2.953
</code>

So in this case the calculation runs about three times faster with two GPU cards (34.4 s versus 106.3 s wall time, or 25.1 versus 8.1 ns/day).

===== Example batch file =====

A job script for the example given above could look like:

<code bash>
#!/bin/bash

#SBATCH --mail-user=dreger@physik.fu-berlin.de
#SBATCH --mail-type=end

#SBATCH --output=job%j.out
#SBATCH --error=job%j.err

#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=1024
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --partition=gpu-main

# Load the CUDA-enabled GROMACS build
module load gromacs/non-mpi/4.6.7-cuda

# Unique tag per job and node, used as prefix for all output files
TAG="${SLURM_JOB_ID}-$(hostname -s)-cuda"

# Preprocess the input, then run with one thread per allocated CPU
grompp -f run.mdp -c waterbox.gro -p topol.top -o "output-$TAG"
mdrun -nt "${SLURM_CPUS_ON_NODE}" -testverlet -v -deffnm "output-$TAG"
</code>

Please make sure you change the email address if you use this for your own tests ;)
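
The interactive session above showed that a job submitted with ''--gres=gpu:2'' sees ''CUDA_VISIBLE_DEVICES=0,1''. A job script could use this to stop early when it was not given the GPUs it expects, instead of silently running on CPUs only. The following is only a minimal sketch: ''count_gpus'' and ''require_gpus'' are hypothetical helper names, not part of the cluster setup, and the code assumes that ''CUDA_VISIBLE_DEVICES'' holds a comma-separated list of device indices as in the session above.

```shell
# Hypothetical helpers (assumption: CUDA_VISIBLE_DEVICES is a comma-separated
# list of device indices such as "0,1", as seen in the interactive session).

# Print the number of entries in CUDA_VISIBLE_DEVICES (0 if unset or empty).
count_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
        return 0
    fi
    # awk splits the list on commas; NF is the number of fields.
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | awk -F',' '{ print NF }'
}

# Return non-zero (and print a message to stderr) if fewer than $1 GPUs
# are visible to the job.
require_gpus() {
    expected="$1"
    found="$(count_gpus)"
    if [ "$found" -lt "$expected" ]; then
        echo "Expected $expected GPU(s), but CUDA_VISIBLE_DEVICES lists $found" >&2
        return 1
    fi
}
```

In the batch file this might be called as ''require_gpus 2 || exit 1'' just before ''mdrun'', with the number kept in sync with the ''--gres'' line by hand.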