Lattice QuickStart Guide
About this QuickStart Guide
This QuickStart guide gives a brief overview of the WestGrid Lattice facility, highlighting some of the features that distinguish it from other WestGrid resources. It is intended to be read by new WestGrid account holders and by current users considering whether to move to the Lattice system. For more detailed information about the Lattice hardware and performance characteristics, available software, usage policies and how to log in and run jobs, follow the links given below.
Introduction
The Lattice cluster is intended for large-scale parallel applications that can take advantage of its InfiniBand interconnect.
The original Lattice was an HP high-density cluster with 4096 cores. A major upgrade arrived in December 2011, providing an additional 7248 cores. The new system also includes 180 general-purpose graphics processing units (GPGPUs) and additional storage. See the hardware section below for details.
The new hardware will be phased in during the early part of 2012. This page will be updated as changes are made.
Picture: Lattice during installation in July 2010.
Request for access
Unlike most WestGrid systems, a separate request is required to obtain a WestGrid account on Lattice. If you think the software you would like to run is appropriate for the Lattice cluster, please write to accounts@westgrid.ca, using a subject line of the form "Lattice account request (your_username)", to request an account and mention the software you propose to use.
Hardware
Processors
Along with a login node and other machines for job management and file serving, the main part of the Lattice cluster consists of multicore compute nodes.
After the upgrades in early 2012, the Lattice cluster will comprise three types of compute nodes: 512 8-core nodes, 544 12-core nodes and 60 12-core nodes with 3 GPGPUs each.
From the original Lattice cluster, there are 512 8-core computational nodes, each having 2 sockets. Each socket has an Intel Xeon L5520 (Nehalem) quad-core processor, running at 2.27 GHz. The 8 cores associated with one of the individual nodes share 12 GB of RAM.
The new nodes are based on the HP ProLiant SL390 server architecture, with each node having 2 sockets. Each socket has an Intel X5649 (Westmere) processor, running at 2.53 GHz. Unlike the older processors, however, the new ones have 6 cores each. The 12 cores associated with one compute node share 24 GB of RAM.
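The core and GPGPU totals quoted in the introduction can be checked against the node counts given in this section. The following is only a small arithmetic sketch; the variable names are illustrative and not part of any Lattice tooling.

```shell
# Node counts and cores per node, taken from the hardware description above.
old_nodes=512; old_cores_per_node=8
new_nodes=544; new_cores_per_node=12
gpu_nodes=60;  gpu_cores_per_node=12; gpus_per_node=3

# Cores in the original cluster, cores added by the upgrade, and GPGPU count.
old_cores=$(( old_nodes * old_cores_per_node ))
added_cores=$(( new_nodes * new_cores_per_node + gpu_nodes * gpu_cores_per_node ))
total_gpus=$(( gpu_nodes * gpus_per_node ))

echo "original cores: $old_cores"      # 512 * 8
echo "added cores:    $added_cores"    # 544 * 12 + 60 * 12
echo "GPGPUs:         $total_gpus"     # 60 * 3
```

The results agree with the 4096 original cores, 7248 additional cores and 180 GPGPUs mentioned in the introduction.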
Interconnect
As mentioned in the introduction, the Lattice cluster is intended for multi-node jobs that make use of the low-latency interconnect. Lattice uses an InfiniBand 4X QDR (Quad Data Rate) 40 Gbit/s switched fabric, with a two to one blocking factor (to reduce purchase costs).
Storage
There is approximately 5 TB of disk space allocated for home directories and 70 TB of global scratch space (although that will be increased substantially during an upgrade in December 2011). A subdirectory for each user is available in /global/scratch.
There is a storage quota of 50 GB (with a 100,000-file limit) for your home directory and 450 GB for global scratch (also with a 100,000-file limit). If you need an increased quota, please write to WestGrid support.
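To get a rough idea of how close a directory is to these limits, something like the sketch below could be used. The function name check_usage is hypothetical, and the sketch assumes that the quota system counts space and files in roughly the same way as du and find do, which may not be exactly the case on Lattice.

```shell
# check_usage DIR LIMIT_GB LIMIT_FILES
# Prints the disk usage and file count of DIR next to the given limits.
# Returns 0 if both are below the limits, non-zero otherwise.
check_usage() {
    dir=$1; limit_gb=$2; limit_files=$3
    # du -sk reports the total size in kilobytes; convert to whole GB.
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    used_gb=$(( used_kb / 1024 / 1024 ))
    # Count regular files only (an assumption about what the quota counts).
    n_files=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "$dir: ${used_gb} GB of ${limit_gb} GB, ${n_files} of ${limit_files} files"
    [ "$used_gb" -lt "$limit_gb" ] && [ "$n_files" -lt "$limit_files" ]
}
```

For example, check_usage "$HOME" 50 100000 would report on the home directory, and check_usage "/global/scratch/$USER" 450 100000 on global scratch, assuming your scratch subdirectory is named after your username.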
In addition, each compute node has a local scratch directory, /tmp, with approximately 120 GB of storage space. For performance reasons, it is very important to use /tmp for I/O-intensive programs. Note that these local scratch partitions are shared among all users of a given node, so you are not guaranteed that all the space will be available for any given run.
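One common way to use node-local scratch from a job script is to run the program in a private directory under /tmp and copy the results back afterwards. The sketch below illustrates that pattern; the function name run_in_local_scratch and its arguments are hypothetical, and it only handles the node on which the script itself runs.

```shell
# run_in_local_scratch SCRATCH_BASE RESULT_DIR COMMAND [ARGS...]
# Runs COMMAND inside a private directory under SCRATCH_BASE (e.g. /tmp),
# copies whatever it wrote there to RESULT_DIR, then removes the directory.
run_in_local_scratch() {
    scratch_base=$1; result_dir=$2; shift 2
    # A unique directory avoids clashes with other users of the same node.
    work_dir=$(mktemp -d "$scratch_base/job.XXXXXX") || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$work_dir" && "$@" )
    status=$?
    # Save the output, then free the shared local scratch space.
    mkdir -p "$result_dir" && cp -R "$work_dir"/. "$result_dir"/
    rm -rf "$work_dir"
    return "$status"
}
```

In a job script this might be called as run_in_local_scratch /tmp "/global/scratch/$USER/run1" "$HOME/bin/my_program" (both paths are placeholders). Because the command runs in a different directory, the program should be given by absolute path. Since /tmp is shared, checking the free space first with df -k /tmp is a sensible precaution.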
Software
See the main WestGrid software page for tables showing the installed application software on Lattice and other WestGrid systems, as well as information about the operating system, compilers, and mathematical and graphical libraries.
Please write to WestGrid support if there is additional software that you would like installed.
Using Lattice
Getting started
To log in to Lattice, connect to lattice.westgrid.ca using an ssh (secure shell) client. For more information about connecting and setting up your environment, see the QuickStart Guide for New Users.
Batch job policies
As on other WestGrid systems, batch jobs are handled by a combination of TORQUE and Moab software. For more information about submitting jobs, see Running Jobs.
Unlike on most other WestGrid systems, we prefer that the syntax "-l nodes=xx:ppn=8" be used rather than "-l procs=yyy" when requesting processor resources on Lattice. Lattice is used almost exclusively for large parallel jobs, and requesting whole nodes has the potential to improve the performance of some jobs and minimizes the impact of a misbehaving job or hardware failure. Since there are 8 cores per node on Lattice, a ppn (processors per node) parameter of 8 requests all the processors on a node. It is also recommended that you ask for 10-11 GB of memory per requested node, using the mem parameter. A typical job submission on Lattice would look like:
qsub -l nodes=4:ppn=8,mem=40gb,walltime=72:00:00 parallel_diffuse.pbs
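The resource string in this example follows directly from the number of nodes: ppn is fixed at 8, and mem is the per-node figure multiplied by the node count (4 nodes at 10 GB each gives 40gb). The helper below is a hypothetical sketch of that arithmetic, assuming mem is the total for the job as the example suggests; it only prints the command rather than submitting anything.

```shell
# lattice_resources NODES [MEM_PER_NODE_GB] [PPN]
# Builds a whole-node resource request. The defaults (10 GB per node,
# 8 processors per node) follow the recommendations in the text above.
lattice_resources() {
    nodes=$1
    mem_per_node_gb=${2:-10}
    ppn=${3:-8}
    echo "nodes=${nodes}:ppn=${ppn},mem=$(( nodes * mem_per_node_gb ))gb"
}

# Reproduce the example submission; echo shows the command without running it.
resources=$(lattice_resources 4)
echo "qsub -l ${resources},walltime=72:00:00 parallel_diffuse.pbs"
```

For a 16-node job at 11 GB per node, lattice_resources 16 11 gives nodes=16:ppn=8,mem=176gb.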
The following limits are in place for batch jobs submitted to the default queue (that is, if no queue is specified on the qsub command line):
| Resource | Policy or limit |
| Maximum walltime (see below for other comments related to walltime) | 168 hours |
| Suggested maximum memory resource request, mem. | 11 GB |
| Maximum number of running jobs for a single user | 64 |
| Maximum cores (sum for all jobs) for a single user | 1024 |
| Maximum jobs in Idle queue | 8 |
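A request can be checked against the walltime and core limits in this table before it is submitted. The function below is a hypothetical sketch, not a WestGrid tool. It looks at one job only, whereas the 1024-core limit applies to the sum over all of a user's jobs, so passing this check does not guarantee that the scheduler will start the job.

```shell
# check_request NODES PPN WALLTIME_HOURS
# Compares a single job request with the default-queue limits listed above.
# Returns 0 if the request is within both limits, 1 otherwise.
check_request() {
    nodes=$1; ppn=$2; hours=$3
    cores=$(( nodes * ppn ))
    over=0
    if [ "$hours" -gt 168 ]; then
        echo "walltime of ${hours} h exceeds the 168 h maximum"
        over=1
    fi
    if [ "$cores" -gt 1024 ]; then
        echo "${cores} cores exceeds the 1024-core per-user limit"
        over=1
    fi
    if [ "$over" -eq 0 ]; then
        echo "request is within the listed limits"
    fi
    return "$over"
}
```

For instance, check_request 4 8 72 corresponds to the example submission shown earlier.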
Six nodes (cn004 to cn009) are reserved for short batch jobs. If you submit a job with a walltime resource of up to 3 hours, it will be considered for allocation to cn004 and cn005. Nodes cn006 through cn009 accept jobs with a walltime of up to 24 hours.
Interactive jobs
Two nodes (cn002 and cn003) are reserved for interactive use. These can be accessed by specifying the interactive queue and a walltime of less than or equal to three hours on your qsub command line. If you require exclusive access to a node, you can add naccesspolicy=singlejob to the requested resources.
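A request along the following lines would combine the interactive queue, the three-hour walltime limit and exclusive node access. This is a sketch rather than a verified Lattice command: it assumes the queue is named interactive, that the interactive nodes are 8-core nodes, and that the standard qsub options -q (queue) and -I (interactive session) are used.

```shell
# Resource list for one whole node with exclusive access, at the 3-hour limit.
# ppn=8 assumes one of the original 8-core nodes.
interactive_resources="nodes=1:ppn=8,naccesspolicy=singlejob,walltime=03:00:00"

# -q selects the queue; -I asks qsub for an interactive session.
# Remove the leading "echo" on Lattice to actually submit the request.
echo qsub -q interactive -I -l "$interactive_resources"
```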
See the Working Interactively section of the Running Jobs page for an alternate method to reserve processors for interactive use, which can be used in cases where you need more than two nodes or a walltime longer than 3 hours.
The login node can be used for short testing and debugging sessions, but a virtual memory limit of 6 GB per process has been imposed to reduce the chance of a user making the login node unusable for others. If you need to test a program that has a process requiring more virtual memory than that, you could use one of the compute nodes.
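If the limit is enforced through the shell's resource limits (an assumption; the mechanism used on Lattice is not stated here), it can be inspected with the ulimit builtin, which reports virtual memory in kilobytes. A minimal sketch, with a hypothetical function name:

```shell
# Print the per-process virtual memory limit of the current shell,
# converted from kilobytes to whole GB, or "unlimited" if none is set.
vmem_limit_gb() {
    limit_kb=$(ulimit -v)
    if [ "$limit_kb" = "unlimited" ]; then
        echo "unlimited"
    else
        echo "$(( limit_kb / 1024 / 1024 )) GB"
    fi
}

vmem_limit_gb
```

Comparing the output on the login node with the output inside an interactive job on a compute node shows whether a program's memory needs fit within the login node's limit.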
Updated 2011-12-12.

