Programming on the WestGrid Cortex System

Introduction

Documentation

This page deals with compilation, debugging and optimization of serial and parallel programs on the WestGrid Cortex system, including the synapse and dendrite machines. If you are new to programming in a UNIX/Linux HPC environment, please start at the main WestGrid programming page for a more general introduction. That page also links to details about programming on other WestGrid machines.

More advanced programmers may want to refer to vendor-supplied documentation:

For compiler options not presented here, details are available through the UNIX man command: man xlc, man xlC, man g++, man xlf, etc.

Hardware Considerations

The Cortex system consists of a 4-processor login node (cortex) and two 64-processor nodes known as synapse and dendrite. The system is intended for parallel jobs or large-memory serial jobs, with current policy specifying a minimum of 4 processors for jobs on dendrite and the full 64 processors on synapse. OpenMP or MPI parallel programming techniques may be used.

Compiler Recommendation

See the programming table on the WestGrid software page for a comparison of the compilers available on the various WestGrid computers. The table also lists the specific version numbers of the compilers on the Cortex system.

Both IBM and GCC compilers are available on the Cortex system. Our expectation is that the IBM compilers will produce faster code, but feedback to support@westgrid.ca would be appreciated if you experiment with both compilers.

Note the use of the -q64 flag in the examples with the IBM XL compilers below. This flag requests the 64-bit object format. It is recommended because the (default) 32-bit format is subject to more restrictive memory allocation limits.

Also, note the _r in the IBM XL compiler names. These versions of the compilers generate reentrant, thread-safe code, which is required for parallel programming but is also appropriate for serial compilation.

When using the GCC compilers, recommended options are -maix64 -O3, with -maix64 being required to generate 64-bit executables.

Compiling Serial Code

Introduction

In the compilation discussion that follows, two examples are shown for each language. The first illustrates compiler flags to use when developing new code or debugging. The second shows optimization options that should be used for production code. It is advisable to test that the non-optimized and production code give similar numerical results. If the answers are sensitive to the changes introduced by the optimization flags, this may indicate a problem with the stability of the algorithm you are using.
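
The comparison between debug and optimized builds can be scripted. The following is a minimal sketch, not a WestGrid tool: compare_results is a hypothetical helper that assumes both runs write whitespace-separated numbers with the same layout, and the default relative tolerance of 1e-6 is an arbitrary choice. Non-numeric fields are treated as zero, so text labels are not checked.

```shell
# compare_results REFERENCE CANDIDATE [REL_TOL]
# Hypothetical helper: exits 0 when all numeric fields agree within REL_TOL.
compare_results() {
    awk -v tol="${3:-1e-6}" '
        # First file: remember every field and the field count of each line.
        NR == FNR {
            for (i = 1; i <= NF; i++) ref[FNR, i] = $i
            nf[FNR] = NF; n1 = FNR
            next
        }
        # Second file: compare field by field against the stored values.
        {
            n2 = FNR
            if (NF != nf[FNR] + 0) {
                print "line " FNR ": field count differs"; bad = 1; next
            }
            for (i = 1; i <= NF; i++) {
                a = ref[FNR, i] + 0; b = $i + 0
                d = a - b; if (d < 0) d = -d
                s = (a < 0 ? -a : a); t = (b < 0 ? -b : b); if (t > s) s = t
                # Relative difference, scaled by the larger magnitude.
                if (d > tol * s) {
                    printf "line %d, field %d: %s vs %s\n", FNR, i, ref[FNR, i], $i
                    bad = 1
                }
            }
        }
        # A differing number of lines also counts as a mismatch.
        END { if (n1 != n2) bad = 1; exit (bad ? 1 : 0) }
    ' "$1" "$2"
}
```

For instance, with output from the two builds saved as debug.out and opt.out (file names assumed here), compare_results debug.out opt.out 1e-8 would list each field that differs by more than the tolerance.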

Fortran

Although g77 is available, better results are generally expected with the IBM XL Fortran compiler, which can be called as xlf_r (FORTRAN 77), xlf90_r (Fortran 90) or xlf95_r (Fortran 95) depending on the desired default language level. If the compiler is called with xlf_r, the source code will be assumed to be in fixed format, whereas for xlf90_r the default is free format.

The compiler will accept source code files ending in .f or .F directly. To compile files ending in .f90, use -qsuffix=f=f90.

Example with debugging options (-C for array bounds checking):

xlf95_r -g -C -qflttrap=overflow:underflow:zerodivide:enable -q64 diffuse.f writeppm.f -o diffuse

Example with an optimization option:

xlf95_r -O5 -q64 diffuse.f writeppm.f -o diffuse

C

Although gcc is available, better results are expected with the IBM XL compiler (xlc_r).

C language files are expected to have a .c suffix.

Example with debugging options:

xlc_r -g -qflttrap=overflow:underflow:zerodivide:enable -q64 diffuse.c writeppm.c -lm -o diffuse

Example with an optimization option:

xlc_r -O5 -q64 diffuse.c writeppm.c -lm -o diffuse

C++

Although g++ is available, better results are expected with the IBM XL compiler (xlC_r).

The compiler accepts C++ source code files ending in .C, .cc, .cp, .cpp, .cxx and .c++. Files with a .c suffix will be treated as C source code.

Example with debugging options:

xlC_r -g -qflttrap=overflow:underflow:zerodivide:enable -q64 diffuse.C writeppm.C -lm -o diffuse

Example with an optimization option:

xlC_r -O5 -q64 diffuse.C writeppm.C -lm -o diffuse

Running Serial Code

Interactive Runs

The cortex login machine may be used for short interactive runs during program development.

For longer runs the regular production batch queue should be used, as described in the section on batch jobs below.

To run a compiled program interactively through an ssh window on the login node, just type its name with any required arguments at the UNIX shell prompt. File redirection commands can be added if desired. For example, to run a program named diffuse, with input taken from diffuse.in and output (which would normally go to the screen) sent to a file diffuse.out, type:

diffuse < diffuse.in > diffuse.out

Batch Runs

Production runs or long test jobs are submitted to a batch queue, as described elsewhere.

The TORQUE mem resource parameter is not supported on cortex. Instead, for serial jobs requiring a large amount of memory, one processor per 4 GB of RAM should be requested with the ncpus resource parameter. So, for example, if a program, diffuse, requires 16 GB of RAM, it can be run by requesting 4 processors, using a job script similar to that shown below.

#!/usr/bin/bash
#PBS -S /usr/bin/bash
#PBS -l ncpus=4

# Script for running large memory serial job, diffuse.
# Memory equivalence should be (ncpus*4GB), so, 16GB in this example.
# 2005-12-07 DSP

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

echo "Starting run at: `date`"
./diffuse
echo "Job finished at: `date`"
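
When the memory requirement is not an exact multiple of 4 GB, the processor count has to be rounded up. The following is a minimal sketch of that arithmetic; ncpus_for_mem is a hypothetical helper (not a TORQUE or WestGrid command), and it assumes the memory requirement is given as a whole number of gigabytes.

```shell
# ncpus_for_mem MEM_GB
# Hypothetical helper: print the ncpus value to request for a serial job
# that needs MEM_GB of RAM, at one processor per 4 GB, rounded up.
ncpus_for_mem() {
    mem_gb=$1
    gb_per_cpu=4
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (mem_gb + gb_per_cpu - 1) / gb_per_cpu ))
}

ncpus_for_mem 16   # prints 4, matching the example script
ncpus_for_mem 10   # prints 3, since 10 GB does not fit in 2 x 4 GB
```

The printed value is what would go on the "#PBS -l ncpus=" line of the job script.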

It is recommended that you record the performance characteristics of your code for a series of test runs so that you can estimate the run time (walltime) of a long job more accurately. Similarly, you will need to know how your program's memory requirements scale as you increase the problem size. This kind of information is used during the batch job submission to ensure that your program is run on a node with appropriate hardware and runtime limits.
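
One simple way to keep such a record is to wrap each test run in a small timing function. The following is a sketch under stated assumptions: timed_run and the log file format are hypothetical, and it assumes the local date command supports the +%s format (seconds since the epoch), which is common but not guaranteed on every UNIX system.

```shell
# timed_run LOGFILE LABEL COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, then append a line of the form
# "LABEL elapsed_seconds exit_status" to LOGFILE.
timed_run() {
    log=$1
    label=$2
    shift 2
    start=$(date +%s)      # assumes date understands +%s
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label $(( end - start )) $status" >> "$log"
    return $status
}
```

Inside a job script this could be used as timed_run timings.log n=1000 ./diffuse, where timings.log and the label n=1000 are placeholders for your own log file and problem size. Collecting several such lines for increasing problem sizes gives a basis for estimating the walltime of a long job.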

Parallel Programming

Introduction

The Cortex environment can be used for interactive development of parallel programs by running them directly after setting a couple of environment variables, MP_PROCS and MP_HOSTFILE as shown below.

Basic commands for compiling MPI or OpenMP-based parallel programs are given in the following sections.

Message Passing Interface (MPI)

Compiling

See the serial code section for examples of compiler options for development and production code. For Fortran parallel MPI code, just prefix the compiler name with mp and compile as you would for serial code, so xlf_r becomes mpxlf_r. For C and C++ the naming convention is different: use mpcc_r and mpCC_r, respectively.

Some examples:

mpxlf_r -O5 -q64 diffuse.f writeppm.f -o diffuse
mpcc_r -O5 -q64 pi.c -lm -o pi
mpCC_r -O5 -q64 pi.C -lm -o pi

Running

If your program allows, compare the results with a single processor to those from a two-processor run. Gradually increase the number of processors to see how performance scales. After you have learned the characteristics of your code, please do not run with more processors than can be efficiently used, as the system is typically very busy.
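
To judge how many processors can be used efficiently, it helps to turn measured run times into speedup and parallel efficiency figures. The following is a minimal sketch: scaling_table is a hypothetical helper that assumes you have collected one "processors walltime" pair per line, with the smallest run (normally one processor) on the first line as the baseline.

```shell
# scaling_table
# Hypothetical helper: read "nprocs walltime_seconds" pairs on standard input
# and print "nprocs speedup efficiency", relative to the first line.
scaling_table() {
    awk '
        NR == 1 { base_p = $1; base_t = $2 }    # baseline run
        {
            speedup = base_t / $2               # how much faster than baseline
            eff = speedup * base_p / $1         # speedup per processor added
            printf "%d %.2f %.2f\n", $1, speedup, eff
        }
    '
}

# Made-up timings for illustration only.
printf '1 100\n2 62.5\n4 50\n' | scaling_table
```

In this made-up case the efficiency falls from 0.80 on two processors to 0.50 on four, which would suggest that requesting four processors wastes half of them.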

MPI jobs are run by submitting a script to the TORQUE batch job handling system with the qsub command. Here is an example of a script to run an MPI program, pn, using 2 processors. If the script file is named pn.pbs, submit the job with qsub pn.pbs.

#!/usr/bin/bash
#PBS -S /usr/bin/bash
#PBS -l ncpus=2

# Script for running MPI sample program pn on cortex
# 2007-01-19 DSP

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

# Note: MP_PROCS should be set to the number of processors required.
# This should never exceed the TORQUE ncpus request above, but,
# in the case of large memory jobs, may be less than ncpus.
export MP_PROCS=2

echo "Starting run at: `date`"
./pn
echo "Job finished at: `date`"

The form "./pn" is used to ensure that the program can be run even if "." (the current directory) is not in your PATH.

Source code for the pn program itself is pn.f.

There are special considerations for large memory jobs, that is, those requiring more than 4GB of RAM per process. MP_PROCS should be set to the number of processors actually used, but the TORQUE ncpus parameter is chosen on the basis of the total memory requirement. For this purpose, each CPU is equivalent to 4GB of RAM. To calculate ncpus, divide the total memory requirement by 4GB. For example, if each process requires 8GB of RAM, then ncpus should be twice MP_PROCS.
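
The relationship between MP_PROCS and ncpus for large memory MPI jobs can be sketched as follows. mpi_ncpus is a hypothetical helper, not part of TORQUE or the IBM parallel environment; it assumes whole-gigabyte memory figures and rounds the result up. It also never returns less than the process count, since MP_PROCS should not exceed ncpus.

```shell
# mpi_ncpus MP_PROCS MEM_PER_PROCESS_GB
# Hypothetical helper: print the ncpus value to request for an MPI job,
# counting each CPU as equivalent to 4 GB of RAM.
mpi_ncpus() {
    procs=$1
    mem_per_proc_gb=$2
    gb_per_cpu=4
    total_gb=$(( procs * mem_per_proc_gb ))
    # Total memory divided by 4 GB, rounded up.
    ncpus=$(( (total_gb + gb_per_cpu - 1) / gb_per_cpu ))
    # MP_PROCS must not exceed ncpus, so never go below the process count.
    if [ "$ncpus" -lt "$procs" ]; then
        ncpus=$procs
    fi
    echo "$ncpus"
}

mpi_ncpus 2 8   # prints 4: two 8 GB processes need ncpus = 2 * MP_PROCS
```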

OpenMP

Compiling

To compile a program containing OpenMP directives, add a -qsmp=omp flag to the compilation.

Some examples:

xlf_r -qsmp=omp -qnosave -O5 -q64 diffuse.f writeppm.f -o diffuse
xlc_r -qsmp=omp -O5 -q64 pi.c -lm -o pi
xlC_r -qsmp=omp -O5 -q64 pi.C -lm -o pi

Note that -qnosave is used for the Fortran compilation to override the IBM default of treating local variables as if they have the SAVE attribute (as is used in Fortran 77). This default could be disastrous for "local" variables referenced from multiple parallel regions. For example, arrays declared inside a subroutine that is called in a parallel region would not be private to each thread, contrary to what one would expect.

Running

See the documentation on job submission for details on queues and the syntax for requesting nodes.

For OpenMP jobs, the environment variable OMP_NUM_THREADS should be set to the number of processors assigned to your job by TORQUE when submitting batch jobs with qsub. This is shown in the following script:

#!/usr/bin/bash
#PBS -S /usr/bin/bash
#PBS -l ncpus=2

# Script for running OpenMP sample program pi on cortex
# 2007-01-19 DSP

cd $PBS_O_WORKDIR

echo "Current working directory is `pwd`"

# Note: MP_PROCS should be set to the number of processors required.
# This should never exceed the TORQUE ncpus request above, but,
# in the case of large memory jobs, may be less than ncpus.
export MP_PROCS=2

export OMP_NUM_THREADS=$MP_PROCS

echo "Starting run at: `date`"
./pi
echo "Job finished at: `date`"

Debugging

Introduction

The dbx debugger is available on the Cortex system for use from character-based terminals.

Regardless of the system being used, adding a -g flag to the compilation is a minimum prerequisite for using a debugger.

Please write to support@westgrid.ca for help with debugging.

Linking with Installed Libraries

Introduction

See the Mathematical Libraries and Applications section of the WestGrid Software page for a description of some of the optimized linear algebra and Fourier transform libraries that can be linked with your code.

C++ Libraries

An implementation of the C++ Standard Template Library called STLport is available. C++ programmers may also be interested in Boost, an eclectic collection of C++ libraries. Both these libraries have been installed in /usr/local/lib, with relevant include files in subdirectories under /usr/local/include.

Improving Performance

Introduction

We encourage you to have your code reviewed by a WestGrid analyst. Please write to support@westgrid.ca.

Basic profiling techniques specific to the IBM environment on cortex, in the context of program optimization, are outlined in these course notes.


Updated 2009-03-31.