Table of Contents

Python on the Cluster

This document describes how to use python for numerical and/or batch jobs on the cluster with a heavy focus on using IPython for management of the multiprocessing. Other solutions are possible, but not the focus of this introductory document. Furthermore, this document focuses on how to setup a parallel IPython environment, but will not discuss its usage.

This is version 0.2(1390214905) of this document. If you have any suggestions, send them to behrmann@physik.fu-berlin.de.

Setting up the Environment

A ready-made python environment has been built for the cluster and is located at /opt/python2.7. This path holds python 2.7.6 and the following packages

behrmj87@sheldon:~> pip freeze
Cython==0.19.2
cffi==0.8.1
ipython==1.1.0
matplotlib==1.1.1
mpi4py==0.6.0
numpy==1.8.0
pandas==0.13.0
pip==1.5.0
pycparser==2.10
pyparsing==2.0.1
python-dateutil==2.2
pytz==2013.9
pyzmq==14.0.1
scipy==0.13.2
six==1.5.2
sympy==0.7.4.1
toolz==0.5.1
tornado==3.1.1
wsgiref==0.1.2

To use these newer python either evaluate the following to commands in your shell or add them to your .bashrc (or other respective shell configuration file)

export PATH=/opt/python2.7/bin:${PATH}
export PYTHONPATH=/opt/python2.7/lib/python2.7/site-packages

The first line adds all binaries from the python installation in /opt/ to your path and the second one sets the path in which this python installation looks for its libraries.

The python package pip is also provided, which allows you to install packages you may need on your own. To install python packages loacally in your $HOME use

pip --user --ignore-installed packagename

which will install the package packagename into $HOME/.local/lib/python2.7/site-packages. If you install packages locally to your home folder, than you need to change the PYTHONPATH variable mentioned above as follows PYTHONPATH=${HOME}/.local/lib/python2.7/site-packages:/opt/python2.7/lib/python2.7/site-packages.

The subset of already installed packages can also be found packaged as wheel-packages [1] in /opt/python2.7/share/wheel and can be installed with the command

pip install --user --ignore-installed --use-wheel --no-index --find-links=/opt/python2.7/share/wheel packagename

This is for particular use in virtualenvs, for which, for one reason or the other you do not want to use the provided system packages.

Setting up IPython

IPython can be easily set up for cluster usage. The IPython parallel documentation can be found here, with the setup explained here.

IPython does provide support for the cluster's queueing interface pbs, but the rationale behind their configuration remains unclear to the author, so we provide a somewhat more simple configuration.

To create an IPython profile you can run the below bash script

ipythonsetup.sh
#!/bin/bash
 
ipython profile create --parallel --profile=mpi
cd %{HOME}/.ipython/profile_mpi
patch < /opt/python2.7/share/configs/ipcluster/ipcluster_config.patch
patch < /opt/python2.7/share/configs/ipcluster/ipcontroller_config.patch
patch < /opt/python2.7/share/configs/ipcluster/ipengine_config.patch

which will immediately patch the the configuration to start the engines using mpiexec and the controller to listen not only on the local host, which is important if you are using multiple nodes.

Alternatively you can change the lines yourself. In $HOME/.ipython/ipcluster_config.py you need to set

c.IPClusterEngines.engine_launcher_class = 'MPIEngineSetLauncher'

and in $HOME/.ipython/ipcontroller_config.py you need to set

c.HubFactory.ip = '*'

Both files are a recommended reading for further options that can be changed.

Furthermore, you may want to set c.IPEngineApp.startup_script = u'/path/to/my/startup.py' to a startup script for importing all needed packages into the engines.

Using IPython on the Cluster

The usage of IPython in a parallel environment is explained here (for direct access to the IPython engines) and here (for task farming).

...in batch mode

Having set up our environment, we will now describe how to write job files using IPythons cluster capabilities. Below you can find a sample jobfile. To understand the header with the information for PBS, please read the wiki page on the queueing system.

ipython_testjob
#!/bin/bash
#PBS -N ipython_testjob
#PBS -l walltime=0:02:00
#PBS -l nodes=2:ppn=4
#PBS -m bea -M behrmann@physik.fu-berlin.de
 
## set environment
export PATH=/opt/python2.7/bin/:${PATH}
export PYTHONPATH=/opt/python2.7/lib/python2.7/site-packages/
 
## go to the directory the user typed 'qsub' in
cd $PBS_O_WORKDIR
 
## start IPython cluster with eight engines using the mpi profile
## remember 8 (engines) = 2 (nodes) x 4 (processors)
ipcluster start -n 8 --profile=mpi &
## the engines are ready after 30s, we will give them an
## extra 10s as grace period
sleep 40
## run the script you want to run
## python can be used instead of ipython, but in case of errors
## ipython offers more valuable tracebacks
ipython ./test.py
## stop the IPython cluster again
ipcluster stop --profile=mpi

The below sample python script prints the PIDs of the engines and information on the nodes on which they are running. Furthermore it tests for all numbers below 5 million for primality once sequentially and once in parallel, printing times for both runs and whether their results agree.

test.py
from __future__ import print_function, division, absolute_import
from IPython.parallel import Client
import time
 
# connect to cluster running the mpi profile
# connection will fail without explicit mentioning of
# which profile to connect to
rc = Client(profile='mpi')
dview = rc[:]
 
@dview.remote(block=True)
def getpid():
    import os
    return os.getpid()
 
print(getpid())
 
@dview.remote(block=True)
def gethostname():
    import os
    return os.uname()
 
print(set(gethostname()))
 
# external imports on engines
# if local=True is not set the serial calculation below will fail
with dview.sync_imports(local=True):
    import numpy as np
    import scipy as sp
    import sympy
 
starttime = time.time()
serial_result = map(lambda n: sympy.ntheory.isprime(n), xrange(5000000))
serialtime = time.time() - starttime
starttime = time.time()
parallel_result = dview.map_sync(lambda n: sympy.ntheory.isprime(n), xrange(5000000))
paralleltime = time.time() - starttime
correctness = (serial_result == parallel_result)
 
print('Serial and parallel result are equal: {0}.'.format(correctness))
print('Serial calculation took {0} seconds, parallel took {1} seconds'.format(serialtime, paralleltime))

...in an interactive way

As described on the page of the queueing system, the queue can be used interactively. This can be used for debugging purposes and massively parallel interactive work. The IPython cluster has then to be started manually using the above command from the job script.

Notes

MPI can be used with IPython, as described here. An advanced example of using IPython in a parallel workflow, both with and without MPI can be found here.

Sources

[1] Documentation for the wheel package