QuickStart Guide to Orcinus

About this QuickStart Guide

This QuickStart guide gives a brief overview of the WestGrid Orcinus facility, highlighting some of the features that distinguish it from other WestGrid resources. It is intended to be read by new WestGrid account holders and by current users considering whether to move to the Orcinus system. For more detailed information about the Orcinus hardware and performance characteristics, available software, usage policies and how to log in and run jobs, follow the links given below.

Introduction

Orcinus is an HP blade-based cluster with 3072 cores.

Hardware

Processors

The Orcinus cluster consists of 12 chassis, each containing 16 blades.  There are two compute servers on each blade.  Each server has two sockets, each containing an Intel Xeon E5450 quad-core processor running at 3.0 GHz.  Multiplying these out (12*16*2*2*4) gives 3072 cores for computation. The 8 cores in each server share 16 GB of RAM.
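The core count can be reproduced with a few lines of shell arithmetic. This is only an illustrative sketch: the variable names are made up for the example, and the server count and total RAM figures are derived here from the numbers quoted above rather than taken from an official specification.

```shell
# Hardware figures as quoted in the Processors paragraph
chassis=12
blades_per_chassis=16
servers_per_blade=2
sockets_per_server=2
cores_per_socket=4
ram_per_server_gb=16

# Derived totals (computed from the figures above, not quoted from the source)
servers=$(( chassis * blades_per_chassis * servers_per_blade ))
total_cores=$(( servers * sockets_per_server * cores_per_socket ))
total_ram_gb=$(( servers * ram_per_server_gb ))

echo "servers=$servers cores=$total_cores ram_gb=$total_ram_gb"
```

Since 8 cores share 16 GB, a job that uses every core on a server has about 2 GB of RAM per core to work with.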

Interconnect

The compute nodes are connected by InfiniBand with 2:1 blocking.

Storage

The /global/scratch and /global/home file systems are temporarily being served from Glacier over a 10 Gb/s Ethernet link.

There is local storage of 81 GB associated with each node.

Software

See the WestGrid software page for a list of software installed on Orcinus.

Please write to support@westgrid.ca if there is additional open-source (or free) software that you would like installed.  In the case of commercial software, please fill in the request form.

Using Orcinus

To log in to Orcinus, connect to orcinus.westgrid.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, see the QuickStart Guide for New Users.

The general WestGrid Programming and Running Jobs pages also apply to the Orcinus cluster.  Orcinus-specific pages may be developed as time permits.

Some guidelines for running and monitoring batch jobs on Orcinus are given here.

Batch jobs

On most WestGrid systems, the TORQUE qsub command is used to submit jobs, as explained on the Running Jobs page.  TORQUE is also used on Orcinus, but batch jobs can also be submitted with the Moab msub command.

The Moab showq command can be used to monitor jobs. By default, showq shows information about all the jobs on the system.  For a large cluster such as Orcinus, this can be a long list, so you may prefer to use showq in the form

showq -u username

to limit the output to information about the jobs belonging to the given username.  As on the Glacier cluster, an alternative is to use the qsort utility, which defaults to showing just your own jobs.

The default queue limits are:

  • mem = 1000mb
  • walltime = 03:00:00  (that is, 3 hours)

These are the values used for these resource limits if neither your batch job submission script nor your qsub/msub command-line arguments specify a memory or elapsed time (walltime) limit.
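To override the defaults, the limits can be requested explicitly in the job script. The following is a minimal sketch, not an official template: the job name, the 2000mb and 12:00:00 values, and the commented-out program name are all hypothetical placeholders, and the values shown are not a statement of what Orcinus actually permits.

```shell
#!/bin/bash
#PBS -N example_job
#PBS -l mem=2000mb
#PBS -l walltime=12:00:00

# TORQUE sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory if the script is run by hand.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job running on $(uname -n) in $workdir"
# ./my_program    (hypothetical executable; replace with your own command)
```

If the script is saved as, say, job.pbs (a hypothetical file name), it would be submitted with qsub job.pbs or msub job.pbs. The same limits can instead be given on the command line, for example qsub -l mem=2000mb,walltime=12:00:00 job.pbs.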

Working interactively

Some interactive work is allowed on the Orcinus login machines for editing files, compiling programs, limited debugging, etc.

In addition, there are two 8-core compute nodes reserved for short debugging jobs.  To access these nodes, use the -l qos=debug resource request on the qsub/msub command line or in directives in your batch job submission script:

#PBS -l qos=debug
#PBS -l walltime=mm:ss

Note: for a serial job (using a single core), the maximum walltime limit is 45:00 (45 minutes). Jobs using up to 2 compute nodes, for a total of 16 cores, are allowed through the debug quality of service request, but there is a limit of 14400 processor-seconds in total (the number of processors times the walltime limit, in seconds, must not exceed 14400). So, for example, the maximum allowed walltime specification for a 16-core job is 15 minutes (14400 / 16 = 900 seconds).
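The debug limits in the note above can be worked out with a little shell arithmetic. The sketch below is illustrative and not a WestGrid tool: the function name debug_max_walltime is hypothetical, and it assumes that the 45-minute cap stated for serial jobs also applies to every other core count, which the note does not confirm.

```shell
# Print the largest debug walltime (mm:ss) for a given number of cores.
# ASSUMPTION: the 45:00 cap quoted for serial jobs applies to all core counts.
debug_max_walltime() {
    cores=$1
    budget=14400    # processor-seconds allowed for a debug job
    cap=2700        # 45 minutes, in seconds

    # Debug jobs may use at most 2 nodes of 8 cores each
    if [ "$cores" -lt 1 ] || [ "$cores" -gt 16 ]; then
        echo "debug jobs may use 1 to 16 cores" >&2
        return 1
    fi

    # Integer division rounds down, so the result never exceeds the budget
    secs=$(( budget / cores ))
    if [ "$secs" -gt "$cap" ]; then
        secs=$cap
    fi
    printf '%02d:%02d\n' $(( secs / 60 )) $(( secs % 60 ))
}

debug_max_walltime 1     # serial job
debug_max_walltime 8     # one full node
debug_max_walltime 16    # two full nodes
```

The value printed for a given core count can be used directly as the mm:ss part of the walltime request.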

Another possibility for interactive work is to use an interactive batch job, in which the batch job system is used to reserve processors for interactive work.  At the moment this interactive job feature is enabled only on the Seawolf1 login node.

To start an interactive batch job session, use a qsub command of the form:

qsub -I -l walltime=mm:ss,qos=debug

where, as above, the processors*walltime limit must not exceed 14400 processor-seconds.


Updated 2009-12-06.