WestGrid Computing Facilities

Introduction

The WestGrid computing facilities are distributed among several resource provider sites, with some specialization at each site. WestGrid is connected by high-performance networks so that users can access the system which best fits their needs, regardless of where it is physically located.

WestGrid provides several types of computing systems, since different users' programs run best on different kinds of hardware. The systems are for high-performance computing, so they go well beyond what you would find on a desktop. We have clusters, clusters with fast interconnects, and shared-memory systems. Use the system that best fits your needs, not necessarily the one closest to you; running on a poorly matched system is less than optimal and wastes valuable resources.

See the QuickStart Guide for New Users for an introduction to choosing the most appropriate system. For more detailed information about the differences between the WestGrid systems, read through the pages in this section.

Serial programs can run on one CPU or core of a compute cluster. Some researchers have serial programs that need to be run many times; in that case, multiple copies can be run simultaneously on a cluster.
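
As a rough illustration (hypothetical code, not a WestGrid-specific recipe), a serial program can be written to take a task index on the command line; many independent copies can then be started with different indices, each working on its own input and never communicating with the others:

    /* Hypothetical serial worker: each running copy is given a different
       task index and processes its own input file, so many copies can
       run side by side on a cluster without communicating. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <task-index>\n", argv[0]);
            return 1;
        }
        int task = atoi(argv[1]);

        char path[256];
        /* "input_%d.dat" is an assumed naming scheme for this sketch */
        snprintf(path, sizeof path, "input_%d.dat", task);
        printf("task %d: processing %s\n", task, path);
        /* ... the actual serial computation on 'path' would go here ... */
        return 0;
    }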

Parallel programs have multiple processes or threads running at the same time that need to communicate with each other to some degree. The important distinctions are how much they need to communicate and how quickly they need to do it.

In order of increasing communication demands, such programs can run on a regular cluster, a cluster with a fast interconnect, or a shared-memory machine; the appropriate choice also depends on how the program is written (MPI, OpenMP, threads, etc.). How well a parallel program scales determines how many nodes of a cluster (or cores of a machine) it should be run on.
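
The distinction between message passing and shared-memory threading can be made concrete with two minimal sketches (hypothetical code, assuming only standard MPI and OpenMP toolchains, e.g. mpicc and a compiler flag such as -fopenmp). The first is an MPI program in which each process passes a token to its neighbour; on real workloads, the latency and bandwidth of many such exchanges are what make a fast interconnect worthwhile:

    /* Minimal MPI sketch: pass a token around a ring of processes.
       Assumes it is launched with at least two MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, token;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        if (rank == 0) {
            token = 42;  /* start the ring */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
        printf("rank %d of %d passed the token\n", rank, size);
        MPI_Finalize();
        return 0;
    }

The second is a threaded (OpenMP) program: all of its threads share one address space, so it must stay within a single node, which is why programs with large thread counts suit a shared-memory machine:

    /* Minimal OpenMP sketch: threads share the program's memory and
       therefore all run on one node. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }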

Other factors also affect the choice of system: for example, the amount of memory available (particularly the amount of memory needed per processor), the software that is installed, and restrictions due to software licensing.

WestGrid also has specialized systems, such as those with visualization capabilities or GPUs. See the QuickStart Guides for more information about each system.

WestGrid's computing facilities are part of Compute Canada's national platform of HPC resources. A list of compute servers within this platform can be found here, and a list of storage resources can be found here.   

 

System(s) Site Cores Type Details
Breezy University of Calgary 384 Shared memory
  • Appro
  • 24 nodes: quad-socket, 6-core AMD 2.4 GHz processors
  • 256 GB per node
  • Infiniband 4X QDR
  • IBRIX filesystem
Bugaboo Simon Fraser University 4584 Storage, Cluster with fast interconnect
  • Dell
  • 160 nodes: 8 cores, Xeon X5430, 16 GB/node = 1,280 cores (Infiniband, 2:1 blocking)
  • 254 nodes: 12 cores, Xeon X5650, 24 GB/node = 3,048 cores (Infiniband, 2:1 blocking)
  • 32 nodes: 8 cores, Xeon X5355, 16 GB/node = 256 cores (Gigabit Ethernet)
Glacier University of British Columbia Cluster
  • Special Request Only
  • Legacy system maintained on a best-effort basis: 32-bit processors and a slow GigE interconnect.
  • About 700 active nodes (faulty nodes are retired over time)
  • Some nodes with 2 GB and some with 4 GB of memory
  • Intel Xeon 32-bit processors
  • Gigabit Ethernet (GigE) interconnect
  • GPFS: 14 TB in a SAN
Grex University of Manitoba 3792 Storage, Cluster with fast interconnect
  • SGI Altix XE 1300
  • 316 compute nodes
  • 2 x 6-core Intel Xeon X5650 2.66 GHz processors per node
  • 24 nodes have 96 GB, 292 nodes have 48 GB
  • Infiniband 4X QDR
Hermes/Nestor University of Victoria 4416 Storage, Cluster with fast interconnect

Hermes

  • Original nodes: 84 x 8 core, IBM iDataplex X5550 2.67 GHz, 24 GB/node, 2 x GigE interconnects
  • Newer nodes: 120 x 12 core, Dell C6100 servers, 2.66 GHz X5650 cores with 24 GB/node, QDR IB 10:1 blocking
  • GPFS 1.2 PB for home, scratch (shared with nestor)

Nestor

  • 288 x 8 core/node, IBM iDataplex X5550 2.67 GHz, 24 GB/Node
  • QDR IB nonblocking
  • GPFS 1.2 PB for home, scratch (shared with hermes)
Hungabee University of Alberta 2048 Shared memory
  • Special Request Only
  • SGI UV1000, NUMA Shared-memory
  • 2048 Intel Xeon E7 cores
  • 16 TB total (shared) memory
  • NFS: 2 x SGI IS5000 storage arrays
    • 8 x Fibre Channel connections direct to the UV1000 (short-term storage)
    • 50 TB
  • Lustre: 1 x SGI IS16000 array, 355 TB (medium-term storage)
    • Available to both Hungabee and Jasper through QDR IB
Jasper University of Alberta 4160 Cluster with fast interconnect
  • SGI Altix XE, 400 nodes, 4160 cores and 8320 GB of memory
    • 240 Xeon X5675 nodes - 12 cores (2 x 6), 24 GB, 40 Gbit/sec 1:1 Infiniband interconnect
    • 160 Xeon L5420 nodes - 8 cores (2 x 4), 16 GB, 20 Gbit/sec 2:1 Infiniband interconnect
  • Lustre parallel distributed filesystem, 356 TB - shared with all nodes via Infiniband
Lattice University of Calgary 4096 Storage, Cluster with fast interconnect
  • 512 x 8-core nodes.
    • Intel Xeon L5520 quad-core, 2.27 GHz
    • 12 GB/node
  • QDR IB (2:1 blocking factor)
Orcinus University of British Columbia 9600 Storage, Cluster with fast interconnect
  • Phase 1: 384 nodes, 3072 cores
    • 8 cores/node
    • Xeon E5450 3.0GHz
    • 16 GB RAM
    • DDR IB
  • Phase 2: 544 nodes, 6528 cores
    • 12 cores/node
    • Xeon X5650 2.66 GHz
    • QDR IB
  • IB with 2:1 blocking factor
  • Phase 1 and Phase 2 share filesystems but otherwise run as separate systems
Parallel University of Calgary 7056 Storage, Cluster with fast interconnect, Visualization
  • HP ProLiant SL390
  • 528 x 12 core nodes
    • Intel E5649 (6 core) 2.53 GHz
  • 60 special 12 core nodes with GPU
    • NVIDIA Tesla M2070s (5.5 GB RAM, compute capability 2.0)
  • IB QDR (2:1 blocking factor to reduce costs)
  • Global scratch shared between Breezy, Lattice, and Parallel
Silo University of Saskatchewan Storage
  • 3.15 PB usable
  • /home is backed up on IBM tape system
  • silo.westgrid.ca for file and data transfers
  • hopper.westgrid.ca for data post-processing

List of facilities by general type

  • Storage
    • USask Storage Facility -- the primary storage site
    • UVic Storage Facility and SFU Storage facility -- for use in special cases where there is a need for large storage close to the compute nodes
  • Shared memory
    • Hungabee
  • Cluster
    • Glacier, Hermes, Breezy (large memory)
  • Cluster with fast interconnect
    • Bugaboo, Checkers, Grex, Jasper, Lattice, Nestor, Orcinus, Parallel
  • Visualization
    • Checkers and Parallel both have special nodes with Graphics Processing Units (GPUs).

Retired Systems

Some older WestGrid systems have been removed from general service, typically replaced with more capable and more energy-efficient machines.

Machine name Period of Service Description

Snowpatch

Mar. 2009 - ?


The Snowpatch cluster of 32 compute nodes, each with 16 GB of memory and 8 cores, was incorporated into the Bugaboo cluster in Apr. 2012. This system was therefore not actually retired: the hardware is still in production, but it no longer exists as a separate cluster.

Gridstore / Blackhole 

Gridstore:
Jul. 2003 - Aug. 2011

Blackhole:
Jul. 2003 - Apr. 2010

The Gridstore/Blackhole facility provided the primary storage services for WestGrid until that function was moved to Silo.  See the WestGrid Data Storage page for details about Silo and other WestGrid storage facilities.

Dendrite / Synapse

Synapse:
Apr. 2005 - Oct. 2011

Dendrite:
Apr. 2005 - June 2010

Dendrite, one of a pair of IBM Power5-based systems used for large shared-memory parallel programs, was decommissioned after hardware problems. Synapse, with 256 GB of RAM, was available through the Cortex front end until the end of October 2011, when it was decommissioned as well. Breezy and the soon-to-be-available Hungabee are other machines appropriate for large-memory serial or single-node threaded parallel programs (such as those based on OpenMP).

Hydra

Dec. 2003 - Jan. 2011 


This SGI visualization server was a testbed for remote visualization applications for several years. Visualization services now focus on several GPU-equipped nodes of the Checkers cluster.

Lattice

(not to be confused with a current machine with the same name!)

2003 - Oct. 2009 


Lattice was a cluster consisting of 36 HP ES45 nodes, 19 HP ES40 nodes, and one additional HP ES45 node dedicated to interactive jobs. Each node had 4 Alpha CPUs and 2-8 GB of memory. The CPUs were clocked at 0.67-1.25 GHz and provided good floating-point performance for their time. The ES45 nodes used a Quadrics interconnect, which provided much lower latency and higher bandwidth than commodity networks, making Lattice suitable for demanding parallel jobs. The current Lattice cluster is targeted at similar jobs, but has an Infiniband interconnect.

Lattice was also the home to the commercial Gaussian license for WestGrid.  This service is now provided on Grex.


Tantalus

Apr. 2005 - Oct. 2010


This Cray XD1 Linux system was unique for WestGrid at the time because it had programmable hardware, in the form of on-board field-programmable gate arrays (FPGAs), and an extremely fast interconnect with very low MPI latency. Each of its six nodes consisted of two single-core Opteron CPUs (2.2 GHz), one Xilinx Virtex II Pro FPGA (XC2VP50 device, FF1152 package), and 4 GB of memory.

Matrix Jul. 2005 - Mar. 2011
Matrix was a 256-core HP cluster (128 dual-core AMD Opteron-based nodes, running at 2.4 GHz, with 2 GB of RAM per node).  It used an Infiniband interconnect.  Its intended use was MPI-based parallel processing.  This kind of processing is now provided by Lattice and several other WestGrid clusters.

Nexus Sept. 2003 - Feb. 2011 Nexus and related SGI servers provided the main large-memory capability for WestGrid for many years.  Current alternatives for large-memory programs include Breezy and Hungabee.
Robson Oct. 2004 - Aug. 2011
Robson was a small (56-core) cluster based on 1.6 GHz PowerPC 970 processors in an IBM JS20 BladeCentre configuration. Each 4-core compute node (blade) shared 4 GB of RAM. The system was used for a wide range of jobs, including serial jobs and parallel jobs with low interprocess communication requirements. Unique features of Robson included direct access to a large storage facility and a batch environment with a queue for preemptible jobs.


Cortex

2005 - Oct. 2011
Cortex was the head node for a collection of IBM pSeries SMP machines ranging from 4 up to 64 processor cores, with different speeds and memory sizes. All of the systems ran the IBM AIX operating system. Cortex was suitable for parallel applications that required a large shared memory and/or fast communication between processes, or for serial applications that required a large shared memory.

Terminus Jul. 2008 - Dec. 2012
Terminus will continue to operate in 2013 at the University of Calgary for researchers there but will no longer be available for WestGrid use.