Programming on the WestGrid Cortex System
Table of Contents
- Introduction - Scope of documentation and links to other programming references.
- Compiling Serial Code - Basic compilation instructions.
- Running Serial Code - Running interactive and batch jobs.
- Parallel Programming - Programming with MPI and OpenMP.
- Debugging - Overview of debuggers available on the Cortex system.
- Linking with Installed Libraries - Using optimized linear algebra, FFT or other libraries.
- Optimization - Tips and tools for improving performance of your code.
Introduction
Documentation
This page deals with compilation, debugging and optimization of serial and parallel programs on the WestGrid Cortex system, including the synapse and dendrite machines. If you are new to programming in a UNIX/Linux HPC environment, please start at the main WestGrid programming page for a more general introduction. That page also links to details about programming on other WestGrid machines.
More advanced programmers may want to refer to vendor supplied documentation:
- For IBM compiler and debugger documentation, start at the IBM AIX compiler information center.
- For GCC (GNU Compiler Collection) documentation see gcc.gnu.org/onlinedocs/.
For compiler options not presented here, details are available through the UNIX man command: man xlc, man xlC, man g++, man xlf, etc.
Hardware Considerations
The Cortex system consists of a 4-processor login node (cortex) and two 64-processor nodes known as synapse and dendrite. The system is intended for parallel jobs or large-memory serial jobs, with current policy specifying a minimum of 4 processors for jobs on dendrite and the full 64 processors on synapse. OpenMP or MPI parallel programming techniques may be used.
Compiler Recommendation
See the programming table on the WestGrid software page for a comparison of the compilers available on the various WestGrid computers. The table also lists the specific version numbers of the compilers on the Cortex system.
Both IBM and GCC compilers are available on the Cortex system. Our expectation is that the IBM compilers will produce faster code, but feedback to support@westgrid.ca would be appreciated if you experiment with both compilers.
Note the use of the -q64 flag in the examples with the IBM XL compilers below. This requests that a 64-bit object format be used. This is recommended, as the (default) 32-bit format is subject to more restrictive memory allocation limits.
Also, note the _r in the IBM XL compiler names. These versions of the compilers generate reentrant, thread-safe code, which is required for parallel programming but is also appropriate for serial compilation.
When using the GCC compilers, recommended options are -maix64 -O3, with -maix64 being required to generate 64-bit executables.
Compiling Serial Code
Introduction
In the compilation discussion that follows, two examples are shown for each language. One example illustrates compiler flags to use when developing new code or debugging. A second example shows optimization options that should be used for production code. It is advisable to test that the non-optimized and production code give similar numerical results. If the answers are sensitive to the changes introduced by the optimization flags, that may indicate a problem with the stability of the algorithm you are using.
Fortran
Although g77 is available, better results are generally expected with the IBM XL Fortran compiler, which can be called as xlf_r (FORTRAN 77), xlf90_r (Fortran 90) or xlf95_r (Fortran 95) depending on the desired default language level. If the compiler is called with xlf_r, the source code will be assumed to be in fixed format, whereas for xlf90_r the default is free format.
The compiler will accept source code files ending in .f or .F directly, but to compile files ending in .f90, use -qsuffix=f=f90.
Example with debugging options (-C for array bounds checking):
Example with an optimization option:
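As a hedged sketch of both builds: the commands below assume a free-format source file named diffuse.f90 (a hypothetical name), and they borrow the -O5 -q64 optimization flags from the parallel examples elsewhere on this page rather than from a confirmed recommendation for serial Fortran. The script only assembles and prints the two command lines, so it can be tried on any machine; run a printed line on cortex to compile.

```shell
# Hypothetical file names; substitute your own source and program names.
src=diffuse.f90
exe=diffuse

# Flags used in both builds: 64-bit objects, and accept the .f90 suffix.
common="-q64 -qsuffix=f=f90"

# Development build: -g adds debugger symbols, -C checks array bounds.
debug_cmd="xlf90_r -g -C $common $src -o $exe"

# Production build: -O5 is an assumption, copied from the parallel examples.
opt_cmd="xlf90_r -O5 $common $src -o $exe"

echo "$debug_cmd"
echo "$opt_cmd"
```

Array bounds checking slows execution, so -C belongs in the development build only.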
C
Although gcc is available, better results are expected with the IBM XL compiler (xlc_r).
C language files are expected to have a .c suffix.
Example with debugging options:
Example with an optimization option:
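A possible pair of command lines, sketched under assumptions: the source file diffuse.c is a hypothetical name, -O5 is taken from the parallel examples on this page, and -lm is needed only if the program calls math library functions. A GCC line using the -maix64 -O3 options recommended earlier is included for comparison. The script prints the commands rather than running them.

```shell
# Hypothetical file names; substitute your own.
src=diffuse.c
exe=diffuse

# IBM XL development build: -g adds debugger symbols.
xl_debug_cmd="xlc_r -g -q64 $src -lm -o $exe"

# IBM XL production build (optimization level assumed, see the note above).
xl_opt_cmd="xlc_r -O5 -q64 $src -lm -o $exe"

# GCC production build: -maix64 is required for a 64-bit executable.
gcc_opt_cmd="gcc -maix64 -O3 $src -lm -o $exe"

echo "$xl_debug_cmd"
echo "$xl_opt_cmd"
echo "$gcc_opt_cmd"
```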
C++
Although g++ is available, better results are expected with the IBM XL compiler (xlC_r).
The compiler accepts C++ source code files ending in .C, .cc, .cp, .cpp, .cxx and .c++. Files with a .c suffix will be treated as C source code.
Example with debugging options:
Example with an optimization option:
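Because the suffix decides whether a file is compiled as C or C++, it can help to check file names before building. The sketch below encodes the suffix list given above in a hypothetical helper, xl_source_language, and then prints a development and a production command for an assumed source file diffuse.C. The -O5 flag is borrowed from the parallel examples on this page.

```shell
# Report how the suffix rules above classify a file name.
# The match is case-sensitive: .C is C++, .c is C.
xl_source_language() {
  case "$1" in
    *.C|*.cc|*.cp|*.cpp|*.cxx|*.c++) echo "C++" ;;
    *.c) echo "C" ;;
    *) echo "not a listed C or C++ suffix" ;;
  esac
}

for f in diffuse.C diffuse.cpp diffuse.c; do
  echo "$f: $(xl_source_language "$f")"
done

# Hypothetical build commands for diffuse.C.
debug_cmd="xlC_r -g -q64 diffuse.C -o diffuse"
opt_cmd="xlC_r -O5 -q64 diffuse.C -o diffuse"
echo "$debug_cmd"
echo "$opt_cmd"
```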
Running Serial Code
Interactive Runs
The cortex login machine may be used for short interactive runs during program development.
For longer runs the regular production batch queue should be used, as described in the section on batch jobs below.
To run a compiled program interactively through an ssh window on the login node, type its name with any required arguments at the UNIX shell prompt. File redirection can be added if desired. For example, to run a program named diffuse, with input taken from diffuse.in and output (that would normally go to the screen) sent to a file diffuse.out, type:
./diffuse < diffuse.in > diffuse.out
Batch Runs
Production runs or long test jobs are submitted to a batch queue, as described elsewhere.
The TORQUE mem resource parameter is not supported on cortex. Instead, for serial jobs requiring a large amount of memory, one processor per 4 GB of RAM should be requested with the ncpus resource parameter. So, for example, if a program, diffuse, requires 16 GB of RAM, it can be run by requesting 4 processors, using a job script similar to that shown below.
#PBS -S /usr/bin/bash
#PBS -l ncpus=4
# Script for running large memory serial job, diffuse.
# Memory equivalence should be (ncpus*4GB), so, 16GB in this example.
# 2005-12-07 DSP
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
./diffuse
echo "Job finished at: `date`"
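The rule of one processor per 4 GB amounts to dividing the memory requirement by 4 and rounding up. A minimal sketch, in which ncpus_for_mem is a hypothetical helper name and the memory requirement is given in whole gigabytes:

```shell
# Print the ncpus value to request for a serial job needing $1 GB of RAM.
# Integer division with +3 rounds up to the next multiple of 4 GB.
ncpus_for_mem() {
  echo $(( ($1 + 3) / 4 ))
}

ncpus_for_mem 16   # the 16 GB example above
ncpus_for_mem 17   # slightly more memory needs one more processor
```

The result is the number to place in the "#PBS -l ncpus=" line of the job script.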
It is recommended that you record the performance characteristics of your code for a series of test runs so that you can estimate the run time (walltime) of a long job more accurately. Similarly, you will need to know how your program's memory requirements scale as you increase the problem size. This kind of information is used during the batch job submission to ensure that your program is run on a node with appropriate hardware and runtime limits.
Parallel Programming
Introduction
The Cortex environment can be used for interactive development of parallel programs by running them directly after setting two environment variables, MP_PROCS and MP_HOSTFILE, as shown below.
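One way the two variables might be set is sketched here. It assumes the usual IBM Parallel Environment convention that the host file lists one host name per line, one line per MPI task. The file name host.list and the task count are hypothetical choices, and pn is the sample program used later on this page.

```shell
# Number of MPI tasks for the interactive test (keep it small on the login node).
nprocs=2

# Hypothetical host file: the current machine listed once per task.
hostfile="$PWD/host.list"
: > "$hostfile"
i=0
while [ "$i" -lt "$nprocs" ]; do
  uname -n >> "$hostfile"
  i=$((i + 1))
done

export MP_PROCS="$nprocs"
export MP_HOSTFILE="$hostfile"

# With both variables set, the program can be started directly, for example:
# ./pn
```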
Basic commands for compiling MPI or OpenMP-based parallel programs are given in the following sections.
Message Passing Interface (MPI)
Compiling
See the serial code section for examples of compiler options for development and production code. For Fortran parallel MPI code, prefix the compiler name with mp and compile as you would for serial code; for example, use mpxlf_r. For C and C++ the naming convention is different: use mpcc_r and mpCC_r, respectively.
Some examples:
mpcc_r -O5 -q64 pi.c -lm -o pi
mpCC_r -O5 -q64 pi.C -lm -o pi
Running
If your program allows, compare the results with a single processor to those from a two-processor run. Gradually increase the number of processors to see how performance scales. After you have learned the characteristics of your code, please do not run with more processors than can be efficiently used, as the system is typically very busy.
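Scaling can be summarized as a speedup (single-processor wall time divided by the N-processor wall time) and a parallel efficiency (speedup divided by N). The sketch below computes both with awk. The helper name scaling and the timings in the example call are made up for illustration, not measurements from cortex.

```shell
# Usage: scaling T1 TN N
#   T1 = wall time on 1 processor, TN = wall time on N processors.
scaling() {
  awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
    s = t1 / tn                      # speedup relative to one processor
    printf "speedup=%.2f efficiency=%.2f\n", s, s / n
  }'
}

# Illustrative numbers only: 120 s on 1 processor, 40 s on 4 processors.
scaling 120 40 4
```

An efficiency that drops well below 1 as processors are added suggests the job is using more processors than it can use efficiently.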
MPI jobs are run by submitting a script to the TORQUE batch job handling system with the qsub command. Here is an example of a script to run an MPI program, pn, using 2 processors. If the script file is named pn.pbs, submit the job with qsub pn.pbs.
#PBS -S /usr/bin/bash
#PBS -l ncpus=2
# Script for running MPI sample program pn on cortex
# 2007-01-19 DSP
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
# Note: MP_PROCS should be set to the number of processors required.
# This should never exceed the TORQUE ncpus request above, but,
# in the case of large memory jobs, may be less than ncpus.
export MP_PROCS=2
echo "Starting run at: `date`"
./pn
echo "Job finished at: `date`"
The form "./pn" is used to ensure that the program can be run even if "." (the current directory) is not in your PATH.
Source code for the pn program itself is pn.f.
There are special considerations for large memory jobs, that is, those requiring more than 4GB of RAM per process. MP_PROCS should be set to the number of processors actually used, but the TORQUE ncpus parameter is chosen on the basis of the total memory requirement. For this purpose, each CPU is equivalent to 4GB of RAM. To calculate ncpus, divide the total memory requirement by 4GB. For example, if each process requires 8GB of RAM, then ncpus should be twice MP_PROCS.
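The calculation can be sketched as follows, with memory given in whole gigabytes per process. The helper name mpi_ncpus is hypothetical. The result is never allowed to fall below the process count, since MP_PROCS must not exceed ncpus.

```shell
# Usage: mpi_ncpus MP_PROCS GB_PER_PROCESS
# Prints the ncpus value to request from TORQUE.
mpi_ncpus() {
  mpi_procs=$1
  mpi_gb_each=$2
  mpi_total_gb=$(( mpi_procs * mpi_gb_each ))
  # Each CPU counts as 4 GB; round the division up.
  mpi_n=$(( (mpi_total_gb + 3) / 4 ))
  # Small-memory jobs still need one CPU per process.
  if [ "$mpi_n" -lt "$mpi_procs" ]; then
    mpi_n=$mpi_procs
  fi
  echo "$mpi_n"
}

mpi_ncpus 2 8   # 2 processes at 8 GB each: ncpus is twice MP_PROCS
mpi_ncpus 4 2   # 4 processes at 2 GB each: ncpus equals MP_PROCS
```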
OpenMP
Compiling
To compile a program containing OpenMP directives, add a -qsmp=omp flag to the compilation.
Some examples:
xlc_r -qsmp=omp -O5 -q64 pi.c -lm -o pi
xlC_r -qsmp=omp -O5 -q64 pi.C -lm -o pi
Note that -qnosave is used for Fortran compilation to override the IBM default of treating local variables as if they have the SAVE attribute (as is used in Fortran 77). That default could be disastrous for "local" variables referenced from multiple parallel regions. For example, arrays declared inside a subroutine that is called in a parallel region would not be private to each thread, contrary to what one would expect.
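For reference, a Fortran OpenMP compile line with -qnosave might look like the one assembled below. This is a sketch by analogy with the C and C++ examples above; the source file name pi.f is an assumption. The script prints the command rather than running it.

```shell
# Hypothetical Fortran source, named by analogy with pi.c and pi.C.
src=pi.f

# -qsmp=omp enables the OpenMP directives; -qnosave keeps local
# variables private to each call instead of giving them the SAVE attribute.
omp_cmd="xlf_r -qsmp=omp -qnosave -O5 -q64 $src -o pi"

echo "$omp_cmd"
```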
Running
See the documentation on job submission for details on queues and the syntax for requesting nodes.
For OpenMP jobs, the environment variable OMP_NUM_THREADS should be set to the number of processors assigned to your job by TORQUE when submitting batch jobs with qsub. This is shown in the following script:
#PBS -S /usr/bin/bash
#PBS -l ncpus=2
# Script for running OpenMP sample program pi on cortex
# 2007-01-19 DSP
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
# Note: MP_PROCS should be set to the number of processors required.
# This should never exceed the TORQUE ncpus request above, but,
# in the case of large memory jobs, may be less than ncpus.
export MP_PROCS=2
export OMP_NUM_THREADS=$MP_PROCS
echo "Starting run at: `date`"
./pi
echo "Job finished at: `date`"
Debugging
Introduction
The dbx debugger is available on the Cortex system for use from character-based terminals.
Regardless of the system being used, adding a -g flag to the compilation is a minimum prerequisite for using a debugger.
Please write to support@westgrid.ca for help with debugging.
Linking with Installed Libraries
Introduction
See the Mathematical Libraries and Applications section of the WestGrid Software page for a description of some of the optimized linear algebra and Fourier transform libraries that can be linked with your code.
C++ Libraries
An implementation of the C++ Standard Template Library called STLport is available. C++ programmers may also be interested in Boost, an eclectic collection of C++ libraries. Both these libraries have been installed in /usr/local/lib, with relevant include files in subdirectories under /usr/local/include.
Improving Performance
Introduction
We encourage you to have your code reviewed by a WestGrid analyst. Please write to support@westgrid.ca.
Basic profiling techniques specific to the IBM environment on cortex, in the context of program optimization, are outlined in these course notes.
Updated 2009-03-31.
