Programming on the WestGrid Checkers System
Table of Contents
- Introduction - Scope of documentation and links to other programming references.
- Compiling Serial Code - Basic compilation instructions.
- Running Serial Code - Running interactive and batch jobs.
- Parallel Programming - Programming with MPI and OpenMP.
- Debugging - Overview of debuggers available on the Checkers system.
- Linking with Installed Libraries - Using optimized linear algebra, FFT or other libraries.
- Optimization - Tips and tools for improving performance of your code.
Introduction
Documentation
This page deals with compilation, debugging and optimization of serial and parallel programs on the WestGrid Checkers system. If you are new to programming in a UNIX/Linux HPC environment, please start at the main WestGrid programming page for a more general introduction. That page also links to details about programming on other WestGrid machines.
More advanced programmers may want to refer to vendor supplied documentation:
- For Intel compiler, debugger and mathematical library documentation, start at the Intel Software Development page and follow the link according to the language of interest. Choose the Linux version when there is a choice. Once on the language-specific compiler page, scroll down to the Product Documentation section for Getting Started and User's Guides.
- For GCC (GNU Compiler Collection) documentation see gcc.gnu.org/onlinedocs/.
For compiler options not presented here, details are available through the UNIX man command: man ifort, man icc, man gcc, etc.
Hardware Considerations
Checkers is an SGI Altix XE320-based cluster with 160 8-core nodes (1280 cores total) connected with a high-bandwidth, low-latency InfiniBand network. This makes the system suitable for distributed memory parallel jobs, typically programmed with MPI.
Hybrid OpenMP/MPI programs may also run effectively, but OpenMP parallelization is limited to the eight cores within a node. Breezy is probably more suitable for pure OpenMP programs, as it has 24 cores and 256 GB of memory per node.
Each 8-core node has 16 GB of memory (an average of 2 GB per core), which limits the size of jobs that can be run on Checkers.
More details about the Checkers hardware are available in the Checkers QuickStart Guide.
Compiler Recommendation
See the programming table on the WestGrid software page for a comparison of the compilers available on the various WestGrid computers. The table also lists the specific version numbers of the compilers on Checkers.
Both Intel and GCC compilers are available on the Checkers cluster. Our expectation is that the Intel compilers will produce faster code, but feedback to support@westgrid.ca would be appreciated if you experiment with both compilers.
Compiling Serial Code
Introduction
In the compilation discussion that follows, two examples are shown for each language. The first illustrates compiler flags to use when developing new code or debugging. The second shows optimization options that could be tried for production code. It is advisable to check that the non-optimized and production builds give similar numerical results. If the answers are sensitive to the changes introduced by the optimization flags, this may indicate a problem with the stability of the algorithm you are using.
Note that the examples shown here are for the Intel compilers.
Fortran
Although g77 and gfortran are available, better results are generally expected with the Intel Fortran compiler, which is called ifort.
By default, the Intel compiler will interpret your source code as fixed-form or free-form according to the file suffix. Source code files ending in .f, .for or .ftn are treated as the older fixed-form Fortran style, whereas files with names ending in .f90 are treated as free-form. Source code ending in .F, .FOR, .FTN or .FPP (all fixed-form) or .F90 (free-form) is also accepted, but will be preprocessed by fpp before compilation.
Example with debugging options (-CB for array bounds checking):
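A minimal sketch of such a command, assuming a fixed-form source file named diffuse.f and an executable named diffuse (both names are illustrative). The -g flag adds the symbol information a debugger needs and -O0 switches optimization off:

```shell
# Debugging build; diffuse.f and the output name are assumed for illustration.
DEBUG_CMD="ifort -g -O0 -CB diffuse.f -o diffuse"
echo "$DEBUG_CMD"
# Run the build only where the compiler and the source file actually exist.
if command -v ifort >/dev/null 2>&1 && [ -f diffuse.f ]; then
    $DEBUG_CMD
fi
```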
Note that O0 in the above is the letter "oh" followed by the number "zero".
Examples with optimization options:
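Two sketches of optimized builds follow, again assuming a source file named diffuse.f; the output names are hypothetical. The first uses -fast for a single-step compile and link, and the second uses -O3 with -axW:

```shell
# First example: -fast, suitable when the whole program is built in one step.
FAST_CMD="ifort -fast diffuse.f -o diffuse_fast"
# Second example: -O3 with -axW, which also works for separately compiled routines.
O3_CMD="ifort -O3 -axW diffuse.f -o diffuse_O3"
echo "$FAST_CMD"
echo "$O3_CMD"
# Run the builds only where the compiler and the source file actually exist.
if command -v ifort >/dev/null 2>&1 && [ -f diffuse.f ]; then
    $FAST_CMD
    $O3_CMD
fi
```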
Caution regarding use of -fast in makefiles: The -fast option in the above example is equivalent to -O3, -ipo and -static. The -ipo option calls for interprocedural optimization. This leads to an error if -fast is used to link routines that have been compiled individually with the -c flag (as is often done in makefiles). This problem can be avoided by compiling two or more routines together, or by using -O3 instead of -fast in your makefile, as shown in the second example above. The -axW option in that example will turn on vectorization.
C
The C compilers available on Checkers are those from Intel (icc) and the GNU Compiler Collection (cc, gcc). Faster code is expected from icc.
Example with a debugging option:
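A minimal sketch, assuming a hypothetical source file named myprog.c:

```shell
# Debugging build with symbol information; myprog.c is a placeholder name.
C_DEBUG_CMD="icc -g myprog.c -o myprog"
echo "$C_DEBUG_CMD"
# Run the build only where the compiler and the source file actually exist.
if command -v icc >/dev/null 2>&1 && [ -f myprog.c ]; then
    $C_DEBUG_CMD
fi
```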
Example with an optimization option:
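A minimal sketch, assuming the same hypothetical myprog.c and taking -O3 as the optimization level (an assumption; -fast, described in the Fortran section, is an alternative for single-step builds):

```shell
# Optimized build; the choice of -O3 and the file name are assumptions.
C_OPT_CMD="icc -O3 myprog.c -o myprog"
echo "$C_OPT_CMD"
# Run the build only where the compiler and the source file actually exist.
if command -v icc >/dev/null 2>&1 && [ -f myprog.c ]; then
    $C_OPT_CMD
fi
```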
C++
The C++ compilers available on Checkers are those from Intel (icc, icpc) and the GNU Compiler Collection (g++). Code generated by the Intel compiler is expected to be faster than that from g++, but you might like to try both.
The Intel compiler accepts C++ source code files ending in .C, .cc, .cp, .cpp, .cxx and .c++. Files with a .c suffix will be treated as C source code.
Example with debugging options:
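A minimal sketch, assuming a hypothetical source file named myprog.cpp and using -g together with -O0 (the particular combination of flags is an assumption):

```shell
# Debugging build: symbol information and no optimization; myprog.cpp is a placeholder name.
CXX_DEBUG_CMD="icpc -g -O0 myprog.cpp -o myprog"
echo "$CXX_DEBUG_CMD"
# Run the build only where the compiler and the source file actually exist.
if command -v icpc >/dev/null 2>&1 && [ -f myprog.cpp ]; then
    $CXX_DEBUG_CMD
fi
```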
Example with an optimization option:
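A minimal sketch, assuming the same hypothetical myprog.cpp and taking -O3 as the optimization level (an assumption):

```shell
# Optimized build; the choice of -O3 and the file name are assumptions.
CXX_OPT_CMD="icpc -O3 myprog.cpp -o myprog"
echo "$CXX_OPT_CMD"
# Run the build only where the compiler and the source file actually exist.
if command -v icpc >/dev/null 2>&1 && [ -f myprog.cpp ]; then
    $CXX_OPT_CMD
fi
```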
Running Serial Code
Interactive Runs
The Checkers login node may be used for short interactive runs during program development and porting. For longer runs, the regular production batch queue should be used, as described in the section on batch jobs below.
To run a compiled program interactively through an ssh window on the login node, just type its name with any required arguments at the UNIX shell prompt. File redirection commands can be added if desired. For example, to run a program named diffuse, with input taken from diffuse.in and output (that normally goes to the screen) sent to a file diffuse.out, type:
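The redirection can be sketched as follows. So that the sketch runs anywhere, it first creates a stand-in diffuse script that simply copies its input to its output; on Checkers you would use your own compiled executable and only the last command applies.

```shell
# Stand-in for a real executable (assumption for illustration only).
printf '#!/bin/sh\ncat\n' > diffuse
chmod +x diffuse
echo "sample input" > diffuse.in

# Take input from diffuse.in and send screen output to diffuse.out.
./diffuse < diffuse.in > diffuse.out
```

The ./ prefix lets the shell find the program even if the current directory is not in your command PATH.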
Batch Runs
Production runs should be submitted as a batch job script to a TORQUE queue with the qsub command as described on the Running Jobs pages.
For serial jobs, an example job script is shown below. Replace the program name, diffuse, with the name of your executable.
#PBS -S /bin/bash
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
./diffuse
It is recommended that you record the performance characteristics of your code for a series of test runs so that you can estimate the run time (walltime) of a long job more accurately. Similarly, you will need to know how your program's memory requirements scale as you increase the problem size. This kind of information is used during the batch job submission to ensure that your program is run on a node with appropriate hardware and runtime limits.
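One simple way to collect such timings is to wrap a test run in a pair of timestamps. The sketch below is only an illustration: sleep 1 is a stand-in for your own executable, and the variable names are hypothetical.

```shell
# Record the wall-clock time of one test run, in whole seconds.
START=$(date +%s)
sleep 1                      # stand-in for your program, e.g. ./diffuse
END=$(date +%s)
ELAPSED=$((END - START))
echo "Elapsed wall time: ${ELAPSED} s"
```

Repeating this for a few problem sizes gives a basis for the walltime you request from TORQUE (for example with a #PBS -l walltime=hh:mm:ss line), with some margin added.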
Parallel Programming
Introduction
The Checkers environment can be used for interactive development of parallel programs by running them directly on the login server. However, testing should be limited to one hour using a maximum of two CPUs.
Basic commands for compiling and running MPI or OpenMP-based parallel programs are given in the following sections.
Message Passing Interface (MPI)
Compiling
To compile parallel Fortran MPI code with the Intel compilers, use the wrapper script mpiifort. For C and C++ use mpiicc and mpiicpc, respectively.
Note that the commands mpif77, mpif90, mpicc and mpicxx will invoke the GNU compilers.
Add debugging or optimization options, as appropriate, similar to what was shown for serial compilation in the previous section.
To check exactly what commands are executed by these scripts, add a -show argument. For example,
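a minimal sketch, assuming the diffuse.f source file used in the next example:

```shell
# Print the underlying compiler command line without compiling anything.
SHOW_CMD="mpiifort -show diffuse.f -o diffuse"
echo "$SHOW_CMD"
# Run it only where the Intel MPI wrapper is installed.
if command -v mpiifort >/dev/null 2>&1; then
    $SHOW_CMD
fi
```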
To compile an MPI Fortran program, diffuse.f, with the Intel compiler, type:
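A minimal sketch of this command; the executable name diffuse is an assumption:

```shell
# Build the MPI Fortran program with the Intel wrapper.
MPI_F_CMD="mpiifort diffuse.f -o diffuse"
echo "$MPI_F_CMD"
# Run the build only where the wrapper and the source file actually exist.
if command -v mpiifort >/dev/null 2>&1 && [ -f diffuse.f ]; then
    $MPI_F_CMD
fi
```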
Similarly, to compile an MPI C program, pi.c, linking with the standard math library, type:
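A minimal sketch of this command; the executable name pi is an assumption, and -lm is the usual way to request the standard math library:

```shell
# Build the MPI C program and link the math library.
MPI_C_CMD="mpiicc pi.c -lm -o pi"
echo "$MPI_C_CMD"
# Run the build only where the wrapper and the source file actually exist.
if command -v mpiicc >/dev/null 2>&1 && [ -f pi.c ]; then
    $MPI_C_CMD
fi
```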
For a C++ program, the command line would look like:
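A minimal sketch, assuming a hypothetical source file named myprog.cpp:

```shell
# Build an MPI C++ program with the Intel wrapper; myprog.cpp is a placeholder name.
MPI_CXX_CMD="mpiicpc myprog.cpp -o myprog"
echo "$MPI_CXX_CMD"
# Run the build only where the wrapper and the source file actually exist.
if command -v mpiicpc >/dev/null 2>&1 && [ -f myprog.cpp ]; then
    $MPI_CXX_CMD
fi
```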
Running
If your program allows, compare the results with a single processor to those from a two-processor run. Gradually increase the number of processors to see how performance scales. After you have learned the characteristics of your code, please do not run with more processors than can be efficiently used, as the system is typically very busy.
Long tests or production jobs should be submitted to a TORQUE queue with the qsub command as described on the Running Jobs pages. Options for specifying the number and distribution of processors, memory and run time are mentioned there.
Here is an example of a script to run an MPI program, pn, using 2 processors. If the script file is named pn.pbs, submit the job with qsub pn.pbs.
#PBS -S /bin/bash
#PBS -l procs=2
# Script for running a parallel MPI job, pn, on Checkers
# 2010-01-12 DSP
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."
echo "Starting run at: `date`"
mpiexec ./pn
In the above script, the form "./pn" is used to ensure that the program can be run even if "." (the current directory) is not in your command PATH.
Source code for the pn sample program itself, pn.f, is available here.
Please note that if you are running MPI programs interactively, you will need to run mpdboot before running mpiexec. You also need to specify the number of processes with the mpiexec "-n 2" option. Finally, you should run mpdallexit at the end of your session to terminate the mpd daemon that was started with mpdboot.
mpdboot
mpiexec -n 2 ./pn
mpdallexit
Another alternative for interactive work is to use an "interactive" batch job, initiated with "qsub -I -l procs=2", for example.
OpenMP
Compiling
To compile a program containing OpenMP directives with Intel compilers, add a -openmp flag to the compilation. Here are some examples:
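The sketches below assume the Intel flag is spelled -openmp and use hypothetical source files pi.f and pi.c for the pi sample program; the output names are also assumptions.

```shell
# Fortran and C builds with OpenMP directives enabled.
OMP_F_CMD="ifort -openmp pi.f -o pi_f"
OMP_C_CMD="icc -openmp pi.c -o pi_c"
echo "$OMP_F_CMD"
echo "$OMP_C_CMD"
# Run each build only where its compiler and source file actually exist.
if command -v ifort >/dev/null 2>&1 && [ -f pi.f ]; then
    $OMP_F_CMD
fi
if command -v icc >/dev/null 2>&1 && [ -f pi.c ]; then
    $OMP_C_CMD
fi
```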
Running
Long tests or production jobs should be submitted to a TORQUE queue with the qsub command as described on the Running Jobs pages. Options for specifying the number of processors, memory and run time are mentioned there.
For OpenMP jobs, the environment variable OMP_NUM_THREADS should be set to the number of processors assigned to your job by TORQUE when submitting batch jobs with qsub. This is shown in the following script:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=2
# Script for running an OpenMP sample program, pi, on two processors on Checkers
# 2010-01-13 DSP
cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."
# Note: The OMP_NUM_THREADS should match the number of processors requested.
export OMP_NUM_THREADS=$NUM_PROCS
echo "Starting run at: `date`"
./pi
Debugging
Introduction
The Intel idb graphical debugger is available on Checkers. The gdb debugger is also available for use from character-based terminals.
Regardless of the debugger being used, add a -g flag when compiling your code as a minimum prerequisite for using a debugger.
See the general comments on debugging on the main WestGrid programming page.
The following shows an example of debugging an MPI program using gdb.
First compile the program.
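A minimal sketch of the compile and launch steps, assuming the source file is hello.f (the name that appears in the session below). The launch command is an assumption: it uses the -gdb option of the mpd-based mpiexec to start two ranks under gdb, and because it is interactive it is only printed here.

```shell
# Compile with -g so gdb can map addresses back to source lines.
COMPILE_CMD="mpiifort -g hello.f -o hello"
# Assumed launch command for a two-process gdb session (type it at the prompt).
LAUNCH_CMD="mpiexec -gdb -n 2 ./hello"
echo "$COMPILE_CMD"
echo "$LAUNCH_CMD"
# Run the build only where the wrapper and the source file actually exist.
if command -v mpiifort >/dev/null 2>&1 && [ -f hello.f ]; then
    $COMPILE_CMD
fi
```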
0-1: (gdb) break 9
0-1: Breakpoint 2 at 0x401026: file hello.f, line 9.
0-1: (gdb) run
1: Continuing.
0: Continuing.
0-1:
0-1: Breakpoint 2, MAIN__ () at hello.f:9
0-1: 9 PRINT *, "Hello world from ",rank,hostname
0-1: Current language: auto; currently fortran
0-1: (gdb)
0-1: (gdb) print rank
0: $1 = 0
1: $1 = 1
0-1: (gdb) quit
rank 0 in job 1 checkers.westgrid.ca_52503 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Please write to support@westgrid.ca for help with debugging.
Linking with Installed Libraries
Introduction
See the Mathematical Libraries and Applications section of the WestGrid Software page for a description of some of the optimized linear algebra and Fourier transform libraries that can be linked with your code.
Improving Performance
Introduction
We encourage you to have your code reviewed by a WestGrid analyst. Please write to support@westgrid.ca.
Basic optimization techniques, some of which are applicable to the environment on Checkers, are outlined in these course notes.
Here is an example of profiling an MPI program to look for communication bottlenecks.
#PBS -S /bin/bash
#PBS -l procs=4
cd $PBS_O_WORKDIR
unset TMPDIR
export -n TMPDIR
export PATH=/global/scratch/software/intel/impi/3.2.1.009/bin64:$PATH
mpirun -r ssh -trace -n 4 ./sample1
Updated 2011-11-09.
