Running Jobs

Introduction

A great majority of the computational work on WestGrid systems is carried out through non-interactive batch processing. Job scripts containing commands to be executed are submitted from a login server to a batch job handling system, which queues the requests, allocates processors, and starts and manages the jobs. It is usually not necessary (and in some cases not permitted) to log on to the compute nodes directly. Users can (and should) monitor their jobs periodically as they run, but they do not have to remain logged in the entire time. The batch job system can also be used to reserve processors for interactive work, as explained later. Through the use of grid-oriented software it is possible to submit jobs from remote workstations, but that capability is not discussed here as it is not yet widely supported.

The system software that handles your batch jobs consists of two pieces: a resource manager (TORQUE) and a scheduler (Moab). Documentation for these packages is available through Adaptive Computing. However, typical users will not need to study those details. Together, TORQUE and Moab provide a suite of commands for submitting jobs, altering some of the properties of waiting jobs (such as reordering or deleting them), monitoring their progress and killing ones that are having problems or are no longer needed. Only the most commonly used commands are mentioned here.

The documentation in this Running Jobs section includes a description of the general features of job scripts, how to submit them for execution and how to monitor their progress. As some of the commands, sample scripts and usage policies vary from one WestGrid system to another, please consult the QuickStart Guide for the particular system on which you are working for additional details.

File Systems

An important aspect of running jobs is choosing appropriate directories: the directory from which to start a job, a directory for any temporary files that the program may need to read or write while it is running, and a directory for the final output that needs to be saved for post-processing and analysis. In addition to the home directory, which is the default working directory when you log in, most WestGrid systems provide large-capacity, high-performance file systems that are intended to be used for temporary storage by running programs. Please consult the QuickStart Guide for the particular system on which you are working for specifics.

For long-term storage of output results, see the Silo QuickStart Guide and the main storage page.

Submitting Batch Jobs with qsub

A batch job script is a text file of commands for a specified UNIX shell (typically bash or tcsh) to interpret, similar to what you could execute by typing directly at a keyboard. The job is submitted to an input queue using the qsub command. A job will wait in the input queue for a longer or shorter time, depending on factors such as system load and the priority assigned to the job. When appropriate resources become available to run a job, it is removed from the input queue and started on one or more assigned processors. A job will be terminated if it exceeds its allotted time limit, or, on some systems, if it exceeds memory limits. By default, the standard output and error streams from the job are directed to files in the directory from which the job was submitted.

In the simplest case, a batch job file called batch.pbs could be submitted with:

qsub batch.pbs

The batch job system needs a description of the job requirements. These can be specified using qsub command-line options or by using special directive lines in the batch job script itself. The most commonly used qsub flag, -l (letter ell), is used to specify various resources required by the job, such as the maximum run time, the number of processors needed and the amount of memory required. For a full list of qsub options, use man qsub. However, the qsub manual page gives only a generic description that does not take into account some of the specific hardware and policies at the various WestGrid sites. By choosing combinations of resource requests that cannot be satisfied, it is quite possible to submit jobs that will sit in the input queue "forever", without a warning message. This can happen, for example, if you try to use the ncpus parameter, instead of nodes, to request processors on one of the distributed clusters, or in some situations if you request more memory than is physically available.
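
For example, a job that should run for at most 24 hours and use at most 2 GB of memory could be submitted with (the values shown are purely illustrative):

qsub -l walltime=24:00:00,mem=2gb batch.pbs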

Explanations for the resource-related parameters used above and other qsub options are given in a later section.

If the qsub command successfully processes the script and has queued a job for execution, it will return a JOBID in the form of a number followed by the name of a server. Generally, the server portion of the JOBID can be ignored. The JOBID number can be used in other commands involved in monitoring or deleting the job.
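
For example, submitting a job might return a JOBID like the following (the server name here is hypothetical and varies by system):

12134.moab01.example.ca

In this case 12134 is the JOBID number to quote in monitoring commands such as qstat 12134, or when deleting the job with qdel 12134.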

If no UNIX shell is explicitly specified in the job script, the user's login shell will be used to interpret the script when it is run on a compute node assigned by the batch system. When the script is executed, the program runs as a child process of the script. When the program finishes, execution returns to the next line of the script. When the last line of the script has been completed, the job terminates. Jobs can be aborted using the qdel command, which is discussed later in these notes.

Summary of Job Commands

A handout of common PBS script commands, environment variables, job monitoring commands, and cluster and group information is available on this site.

Sample Batch Job Scripts

This section presents several examples of batch job scripts. The simplest case is that of running a serial program. Describing the resource requirements of a parallel program is somewhat more complicated, and there are differences depending on whether one is running a distributed-memory (MPI-based) parallel program or one that needs to share memory on a single compute node, such as an OpenMP-based parallel program.

Script Example for a Serial Program

Here is an example of a script for a job to run a serial program named diffuse.

#!/bin/bash
#PBS -S /bin/bash

# Script for running serial program, diffuse.

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Running on hostname `hostname`"

echo "Starting run at: `date`"
./diffuse
echo "Program diffuse finished with exit code $? at: `date`"

Note the back ticks (`) used in some of the informational lines in the script.
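
The back ticks perform shell command substitution. In bash scripts, the equivalent $(command) form may also be used; for example:

echo "Starting run at: $(date)"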

Next we discuss some additional considerations for parallel jobs.

Script Example for an MPI-based Parallel Program

Here is an example of a script for a job to run an MPI-based parallel program named mpi_diffuse.

#!/bin/bash
#PBS -S /bin/bash

# Sample script for running an MPI-based parallel program, mpi_diffuse.

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"

echo "Node file: $PBS_NODEFILE :"
echo "---------------------"
cat $PBS_NODEFILE
echo "---------------------"

# On many WestGrid systems a variable PBS_NP is automatically
# assigned the number of cores requested of the batch system
# and one could use
# echo "Running on $PBS_NP cores."
# On systems where $PBS_NP is not available, one could use:

CORES=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $CORES cores."

echo "Starting run at: `date`"

# On most WestGrid systems, mpiexec will automatically start
# a number of MPI processes equal to the number of cores
# requested. The -n argument can be used to explicitly
# use a specific number of cores.

mpiexec -n ${CORES} ./mpi_diffuse < mpi_diffuse.in
echo "Program mpi_diffuse finished with exit code $? at: `date`"

For a parallel job on one of the distributed-memory clusters, the qsub command could take the form:

qsub -l procs=16,pmem=2gb,walltime=72:00:00 mpi_diffuse.pbs

in which 16 processors are requested (procs parameter), using at most 2 GB of memory per process (pmem parameter) and running for at most 72 hours (walltime parameter).  A table in a following section gives more information about the various resource parameters.  Note (2011-08-16): there appears to be a bug in qsub such that the procs parameter must appear first in the list of resource parameters.

For most WestGrid systems, using the procs parameter is the recommended way to specify the number of processors to use for MPI-based parallel jobs. However, on the Lattice and Parallel clusters, which are intended for large parallel jobs, complete nodes should be used unless memory contention dictates otherwise. Since there are 8 cores per node on Lattice, a ppn (processors per node) parameter of 8 will request that all the processors on a node be used. Also, it is recommended that you ask for 10-11 GB of memory per node requested, using the mem parameter. So, a typical job submission on Lattice would look like:

qsub -l nodes=4:ppn=8,mem=40gb,walltime=72:00:00 mpi_diffuse.pbs

On the Parallel cluster, the nodes have 12 cores per node and there is more memory available per node.  A typical job submission command on Parallel is:

qsub -l nodes=4:ppn=12,mem=88gb,walltime=72:00:00 mpi_diffuse.pbs

Another special case is the Hungabee system, for which the command line for running a parallel MPI program has a special parameter (dplace) to distribute MPI tasks properly. See the Hungabee QuickStart Guide for more information.

Script Example for an OpenMP-based Parallel Program

Here is an example of a script for a job to run an OpenMP-based parallel program named openmp_diffuse. This differs from the distributed-memory MPI-based parallel case in that the job must be restricted to a single node and the number of OpenMP threads must be matched to the number of cores requested by setting the OMP_NUM_THREADS environment variable.

As with the MPI-based parallel case discussed above, there is a special command for running OpenMP-based programs on Hungabee. So, the Hungabee QuickStart Guide should be consulted for details if you are working on that system.

#!/bin/bash
#PBS -S /bin/bash

# Sample script for running an OpenMP-based parallel program, openmp_diffuse.

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"

echo "Running on `hostname`"

# On many WestGrid systems a variable PBS_NUM_PPN is automatically
# assigned the number of cores requested of the batch system
# on a single node (as is appropriate for OpenMP-based programs) and one could use

# echo "Running on $PBS_NUM_PPN cores."
# export OMP_NUM_THREADS=$PBS_NUM_PPN

# On systems where $PBS_NUM_PPN is not available, one could use:

CORES=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $CORES cores."

echo "Starting run at: `date`"

# Set the number of threads that OpenMP can use to the number of cores requested
export OMP_NUM_THREADS=$CORES

./openmp_diffuse < openmp_diffuse.in
echo "Program openmp_diffuse finished with exit code $? at: `date`"

An OpenMP job must run on a single compute node. Consequently, the TORQUE procs parameter should not be used, as this might result in the batch scheduler assigning more than one node. Instead, use nodes=1 along with a ppn parameter to specify the number of cores required on that single node.  For example:

qsub -l nodes=1:ppn=12,mem=20gb,walltime=72:00:00 openmp_diffuse.pbs

Here 12 processor cores are requested (ppn parameter), using a total of at most 20 GB of memory (mem parameter) and running for at most 72 hours (walltime parameter).  The table in the next section gives more information about the various resource parameters.

TORQUE Directives in Batch Job Scripts and qsub Options

Although qsub arguments can be used to define job characteristics, an alternative is to include these in the batch job scripts themselves. Special comment lines near the beginning of a script are used to specify TORQUE directives that are to be processed by qsub. It is primarily a personal preference as to whether they are specified in the batch script or on the command line, although debugging job submission problems is generally easier if they are in the batch job file.

TORQUE evolved from software called PBS (Portable Batch System). Consequences of that history are that the TORQUE directive lines begin with #PBS, some environment variables contain "PBS" (such as $PBS_O_WORKDIR in the example script above) and the script files themselves typically have a .pbs suffix (although that is not required).

Note: All the #PBS lines must occur before the first executable statement (non-comment or blank line) in the script.

Here is an example script containing some TORQUE directives.

#!/bin/sh
# Script for running serial program, diffuse.

#PBS -l walltime=72:00:00
#PBS -l mem=2000mb
#PBS -r n

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"

echo "Starting run at: `date`"
./diffuse
echo "Program diffuse finished with exit code $? at: `date`"

For each of the #PBS lines there is a corresponding qsub command line argument (just leave off the #PBS). If both command line arguments and #PBS lines in the script request the same type of resource, the command line argument is used. So, for example, if one wanted to do a short test run of the above script, diffuse.pbs, one could override the walltime in the script by submitting it with:

qsub -l walltime=00:30:00 diffuse.pbs

Some of the available TORQUE directives are described below, with those related to resource requests (#PBS -l resource) being the most important.

#PBS -S shell

The -S flag is used to specify the UNIX shell that will be used to execute the script on the compute node to which the job is assigned. If this directive is omitted, a user's login shell will be used (typically bash or tcsh). Example:

#PBS -S /bin/sh

Note that the path to a given shell may vary from system to system.

#PBS -l resource

The -l option is used to specify various resources that are required by your job, such as the number of processors and memory required, or to set resource limits, such as the maximum time the job should run. Multiple options may be combined in one line, separated by commas. Examples:

#PBS -l walltime=72:00:00
#PBS -l mem=8gb
#PBS -l walltime=72:00:00,mem=8gb

Here is a description of some of the possible resources that can be specified. Not all resource types are applicable to all WestGrid systems, so, please consult the site-specific Running Jobs pages for more information.

walltime

#PBS -l walltime=72:00:00

This is the maximum elapsed time that the job will be allowed to run, specified as hours, minutes and seconds in the form HH:MM:SS. If a job exceeds its walltime limit, it is killed by the system. It is best to overestimate the walltime to avoid a run being spoiled by early termination; however, an accurate walltime estimate will allow your job to be scheduled more effectively. It is also best to design your code with a capability to write checkpoint data to a file periodically and to restart from the most recent checkpoint by reading that data. That way, if the run reaches its walltime limit (or fails for some other reason), only a small fraction of the total computation will have to be redone in a subsequent run.
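
A minimal sketch of this restart pattern in a job script (the checkpoint file name and the --restart flag are hypothetical; substitute whatever restart mechanism your program provides):

# Resume from the checkpoint if one exists; otherwise start fresh.
if [ -f checkpoint.dat ]; then
    ./diffuse --restart checkpoint.dat
else
    ./diffuse
fi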

mem

#PBS -l mem=2000mb

The mem parameter should be an estimate of the total amount of memory required by the job. For parallel jobs, multiply the memory per process by the number of processes. Append units of MB or GB as appropriate. The value given must be an integer, so, for example, use mem=3584MB instead of mem=3.5GB (1 GB = 1024 MB).

Note: the mem parameter was not used on the large shared-memory machines; on those systems an equivalent number of processors was requested instead, based on the memory per processor in the machine (see the ncpus entry below, which is now retired).

Also see pmem below.

nodes and ppn

#PBS -l nodes=4:ppn=2

Use a combination of nodes and processors per node (ppn) to request the total number of processors needed. This applies to the distributed-memory clusters only. (The large shared-memory machines formerly used ncpus instead; see the ncpus entry below.)

The range of reasonable values for the nodes parameter depends not only on how many physical nodes there are, but also on how effectively the program being run scales as the number of processors is increased. Also, there are site-specific policies about how many processors may be used. See the Running Jobs page for the particular machine you are using for information about these policies.

The maximum value of ppn that you can use depends on the hardware on which the job is run. See the hardware section of the QuickStart Guide for the system you are using for more information. If it will not cause memory contention problems for your application, it is generally best to use the maximum value of ppn so that your program uses all the processors on a node. Communication between processors on a node is faster than between nodes. Also, if you use a value of ppn less than the maximum, another user's job may be assigned to the same node as yours. In that case, it is important to use an accurate value for the mem parameter so that the scheduling system can avoid placing two jobs on a node if together they would exhaust its memory.

ncpus

The ncpus parameter was formerly used to request a specific number of cores on large shared-memory machines that were each effectively a single compute node. However, all those machines (Cortex, Nexus and associated systems) have been retired and we no longer support the use of ncpus on any WestGrid systems.

If you are migrating from one of those older systems and currently have ncpus in your job scripts, you will need to replace ncpus with an alternative resource request.  For example, if you are moving a large-memory serial code or an OpenMP-based parallel program to the Breezy cluster, instead of ncpus=xx please request a single node using -l nodes=1:ppn=xx, where xx is the number of processors (cores) required.
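
For example (with illustrative memory and walltime values), a job that formerly requested ncpus=8 could be submitted to Breezy with:

qsub -l nodes=1:ppn=8,mem=16gb,walltime=24:00:00 openmp_diffuse.pbs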

If you have been using ncpus for a distributed-memory parallel program based on MPI, please see the section on procs in this table for an alternative.

pmem

#PBS -l pmem=2000mb

Instead of specifying the total memory requirement of your job with the mem parameter (described above), you can specify a per-process memory limit with pmem. Note, however, that mem and pmem are independent parameters, so on some systems it may be necessary to specify both.
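
For example, a 4-process job needing 2 GB per process might specify both limits (illustrative values; note the remark under procs below about that parameter appearing first):

#PBS -l procs=4,pmem=2gb
#PBS -l mem=8gb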

procs

#PBS -l procs=8

The procs resource request allows the scheduler to distribute the requested number of processors among any available nodes on the cluster.  This can reduce the waiting time in the input queue.  The procs format is supported on most WestGrid systems and is generally preferred to nodes=nn:ppn=mm on clusters with more than 2 processors per node. So, for example, it can be used on Bugaboo, Grex, Hermes/Nestor and Orcinus. However, we advise against its use on Lattice (where nodes=nn:ppn=8 is used), Parallel (where nodes=nn:ppn=12 is used) and on Breezy (where nodes=1:ppn=mm is recommended, with mm less than or equal to 24).

In using this resource request format, you are not guaranteed any specific number of processors per node.  As such, it is not appropriate for OpenMP programs or other multi-threaded programs, with rare exception.  For example, if you have a mixed MPI-OpenMP program in which you limit the number of threads per process to just one (OMP_NUM_THREADS=1) you could use procs.   In other cases, you may also have to combine procs with pmem to make sure that the scheduler does not assign too many of your processes to the same node.

Note (2011-08-16): there appears to be a bug in qsub such that the procs parameter must appear first in the list of resource parameters.

file

#PBS -l file=10gb

On some WestGrid systems, you can specify a file resource parameter. Your job will only be scheduled to nodes having the indicated scratch file space per core. Note that the file request is per core and not the total disk required by the job.

#PBS -r [y/n]

The -r flag, followed by a y or n for yes or no, respectively, is used to indicate that a job is restartable or not. 

#PBS -r n
#PBS -N job_name

Use -N followed by up to 15 characters (starting with a letter and containing no spaces) to assign a name to your job. If no -N option is given, a default job name will be constructed from the name of the job script (truncated, if necessary, to 15 characters).

#PBS -N diffuse_run_1
#PBS -q queue

On some systems there is more than one queue that can be selected using the -q argument to qsub, but on most WestGrid systems the default is appropriate and the queue argument is not needed. See the Running Jobs documentation for the individual systems for information about the queues that are available.

#PBS -q pre
#PBS -o file
#PBS -e file
#PBS -j [eo/oe]

These options control how the standard output (-o option) and standard error (-e option) streams from the job are combined (-j option) and directed to files. The -j option requests that the standard output and standard error be joined into a single intermixed stream: with -j eo, the two streams become the standard error stream, and with -j oe, they become the standard output stream for the job. The combined stream can then be directed to a single file using the -e or -o option, respectively.

If neither the -o nor the -e option is used, the standard error and output streams are directed to two separate files, with the file names being of the form job_name.[eo]job_number.

#PBS -o diffuse_job.out
#PBS -e diffuse_job.err
#PBS -j eo
#PBS -e diffuse_job.err
#PBS -M email
#PBS -m event

If you would like email notification about events related to your job, use the -M option to specify an email address and the -m option to choose which events trigger an email. The event parameter can be one or more of the following:

n - Do not send mail.
b - Send mail when execution has begun.
e - Send mail when the job terminates.
a - Send mail if the job has been aborted.

The options can be combined, such as in -m ea.

#PBS -M unknown@university.ca
#PBS -m bea

Environment Variables in Batch Job Scripts

When a batch job is started, a number of environment variables are created that can be used in the batch job script. A few of the most commonly used variables are described here.

PBS_O_WORKDIR

This is the directory from which the job was submitted. If submitting multiple jobs from different directories, it may be convenient to use

cd $PBS_O_WORKDIR

as shown in the example script above, so that input files can be referenced with respect to the job submission directory, rather than using absolute paths.

PBS_JOBID

This is a unique job ID assigned to the job. By using this variable in the name of an output file, one can keep output from different jobs in the same directory without conflicts. When monitoring jobs, or diagnosing problems after a failed run, it is also convenient to have files tagged with the JOBID. For example:

JOBINFO=diffuse.${PBS_JOBID}
echo "Starting run at: `date`" > $JOBINFO
echo "Current working directory is `pwd`" >> $JOBINFO
...
PBS_NODEFILE

The value of this variable is the name of a temporary file containing a list of the compute nodes to which the job has been assigned (with nodes repeated if more than one processor per node is used on multi-node systems). This variable is not useful on the shared-memory computers, since these are seen by the batch job system as single nodes and the PBS_NODEFILE file contains just a single line (even if multiple processors have been requested). However, it has been used for MPI-based parallel jobs on the distributed clusters, where counting the number of lines in the PBS_NODEFILE file gives the number of processors available to the job. That number can then be specified on the mpiexec command to make sure that the MPI processes are started on the nodes that the batch system has reserved for the job. However, on most WestGrid systems, mpiexec will use the correct number of processors by default. Echoing the contents of PBS_NODEFILE in a job script is still useful for debugging.

Monitoring Jobs

There are a variety of commands available for monitoring jobs to see if they are waiting in an input queue or have started running. The main command for this purpose is showq (or showq -u your_user_name to see just your jobs). qstat -f job_id is a useful command for seeing the CPU time and memory usage of a particular job. Obtaining accurate information about running jobs, such as memory usage and load balance of parallel jobs is not always straightforward and may be system specific. 
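
For example, using a hypothetical JOBID of 12134:

showq -u your_user_name
qstat -f 12134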

Visit the Job Monitoring Page for more detailed information.

Deleting Jobs

The qdel command can be used to delete unwanted jobs. Specify the JOBID of one or more jobs as the argument(s) to qdel. The JOBID can be obtained using one of the monitoring commands described in the previous section, such as showq -u your_username. Example:

qdel 12134
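
Several jobs can be deleted at once by listing multiple JOBIDs (the numbers shown are illustrative):

qdel 12134 12135 12136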

Job Priorities

A large number of factors determine the priority of a given job waiting to run in an input queue. The main principle for determining priority is called Fairshare, in which each user is assigned a target percentage of the resources. Priorities are adjusted to guide the scheduler toward these targets so that over time on a busy system, each user would get a "fair" share of the available computing resources. Additional details of the priority calculation will be given as time permits.

Job Accounting

A summary of the resources used by a job will be sent to you by email if you request email notifications:

#PBS -M unknown@university.ca
#PBS -m bea

Alternatively, an epilogue script can be run to include the information about the job in the standard output. To do this, create a file (for example, epilogue.script) in your home directory, with the content,

#!/bin/sh
echo "Epilogue Args:"
echo "Job ID: $1"
echo "User ID: $2"
echo "Group ID: $3"
echo "Job Name: $4"
echo "Session ID: $5"
echo "Resource List: $6"
echo "Resources Used: $7"
echo "Queue Name: $8"
echo "Account String: $9"
echo ""
exit 0

Make sure that the script is executable.

chmod u+x epilogue.script

We have found that the script will not execute if the file permissions include group write permission, so we suggest you also use:

chmod go-rwx epilogue.script

Include in your batch options,

#PBS -l epilogue=/home/myUserID/epilogue.script

The name of the script and its location are arbitrary. They just need to be specified.

To see statistics regarding your usage of WestGrid systems, log in to the Compute Canada database site.

Working Interactively

The qsub command is usually used for submitting batch jobs, but it can also be used to reserve processors for interactive work. Use the -I option to request an interactive job:

qsub -I

In this case, qsub will not exit as it usually does, but will indicate that it is waiting for a processor to be assigned to your job. When a processor on a compute node becomes available, a message is printed to indicate that the job is "available" and a shell prompt will appear. You may then run your program and other commands interactively for a duration up to the default walltime limit. This limit varies from system to system, but is usually at least three hours.

If you decide it is taking too long for a processor to become available, you can abort the job by typing Control-C (hold down the Control key while typing the letter C). You might then like to try again with a shorter walltime in order to improve the chances that the scheduler will be able to start your job quickly. For example, to request 30 minutes, type:

qsub -I -l walltime=00:30:00

It is important to kill any processes you have started during your interactive session before the walltime limit is reached. Type exit at the prompt to end your interactive session and the corresponding job.

If you need to run graphical applications during your interactive session and you have turned on X Window tunnelling in your ssh session, then on most WestGrid systems the X Window support can be extended to your interactive job by adding a -X argument to the qsub command line:

qsub -I -X -l walltime=00:30:00

It is possible to work interactively with more than one processor, although the procedures for doing this vary from system to system and it is easy to accidentally start processes on nodes to which your job has not been assigned. Please contact support@westgrid.ca for more information before trying this.

On all of the WestGrid systems there is some provision for doing interactive work on the login servers or through nodes assigned with qsub -I. Refer to the login banner for guidelines, or look at the system-specific QuickStart Guides for more information.


Updated:
2014-01-16 - Corrected an error in the description of the qsub procs=16 example.
2015-03-03 - Added note about file permissions for epilogue script.
2015-09-30 - Added file resource parameter note.