Computing Facilities
Introduction
WestGrid's computing facilities are distributed among several resource provider sites, each with some specialization. The sites are connected by high-speed networks, so users can work on whichever system best fits their needs, regardless of where it is physically located.
WestGrid provides several types of computing systems, because different programs run best on different kinds of hardware. These are high performance computing systems, well beyond the capability of a desktop machine: we have clusters, clusters with a fast interconnect, and shared-memory systems. Use the system that best fits your needs, not necessarily the one closest to you. Running on a poorly matched system gives suboptimal performance and wastes valuable shared resources.
See the QuickStart Guide for New Users for an introduction to choosing the most appropriate system. For more detailed information about the differences between the WestGrid systems, see the pages in this section.
Serial programs run on a single CPU core of a compute cluster. Researchers who need to run the same serial program many times can run multiple copies simultaneously on a cluster.
A parallel program consists of multiple processes or threads running at the same time that need to communicate with each other. The important distinction is how much they need to communicate and how quickly.
In order of increasing communication demands, such programs can run on a regular cluster, a cluster with a fast interconnect, or a shared-memory machine. The choice also depends on how the program is written (MPI, OpenMP, threads, etc.). How well a parallel program scales determines how many nodes of a cluster or machine it should use.
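One common first estimate of how well a program scales is Amdahl's law, which bounds speedup by the fraction of the work that must run serially. The sketch below is illustrative only and is not a WestGrid tool: it assumes Amdahl's law as the scaling model, and the `serial_fraction` values and the 50% efficiency threshold are hypothetical. Amdahl's law ignores communication cost, so real programs usually scale worse than this; test runs at a few core counts remain the reliable way to choose a job size.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores under Amdahl's law (ignores communication)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def parallel_efficiency(serial_fraction: float, cores: int) -> float:
    """Speedup per core: 1.0 means every added core is fully used."""
    return amdahl_speedup(serial_fraction, cores) / cores


def max_useful_cores(serial_fraction: float, min_efficiency: float = 0.5,
                     limit: int = 4096) -> int:
    """Largest power-of-two core count whose efficiency stays >= min_efficiency.

    The 0.5 default threshold is an illustrative assumption, not a site policy.
    Under Amdahl's law efficiency only falls as cores grow, so we stop early.
    """
    best = 1
    cores = 1
    while cores <= limit:
        if parallel_efficiency(serial_fraction, cores) < min_efficiency:
            break
        best = cores
        cores *= 2
    return best


if __name__ == "__main__":
    for s in (0.01, 0.05, 0.2):
        print(f"serial fraction {s}: up to {max_useful_cores(s)} cores")
    # serial fraction 0.01: up to 64 cores
    # serial fraction 0.05: up to 16 cores
    # serial fraction 0.2: up to 4 cores
```

A program with even 5% serial work gains little beyond about 16 cores under this model, which is why a small job on a regular cluster can be a better fit than a large allocation on a fast-interconnect machine.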
Other factors also affect the choice of system, such as the amount of memory available, the software installed, and restrictions due to software licensing.
WestGrid also has specialized systems, for example ones with special visualization capabilities or GPUs. See the QuickStart Guides for more information about each system.
List of facilities by location
- Simon Fraser University
- Snowpatch, Bugaboo, Storage facility
- University of Alberta
- Checkers
- University of British Columbia
- Glacier, Orcinus
- University of Calgary
- Breezy, Terminus, Lattice
- University of Saskatchewan
- Storage facility (Silo and Hopper)
- University of Victoria
- Hermes, Nestor, Storage facility
List of facilities by general type
- Storage
- USask Storage Facility -- the primary storage site
- UVic Storage Facility and SFU Storage Facility -- for use in special cases where large storage is needed close to the compute nodes
- Shared memory
- Hungabee - in production beginning 2012
- Cluster
- Glacier, Snowpatch, Hermes, Breezy (large memory)
- Cluster with fast interconnect
- Terminus, Bugaboo, Checkers, Orcinus, Nestor, Lattice, Grex (large number of cores/node)
- Visualization
- Checkers [special nodes with GPUs]
Future Plans
These systems are being purchased, installed, or configured now:
- UBC: Orcinus expansion plus disk. The RFP process is complete and a vendor has been selected for additional compute and storage; delivery is expected March to April 2011.
More to come in 2011:
- SFU: parts 2 and 3 of compute and storage. The RFP evaluation is complete.
- UofA: part 2 of compute and storage, plus shared memory, in production beginning in 2012.
- UVic: parts 2 and 3 of storage, part 2 of serial compute.
- UofCalgary: part 2 of storage and compute, plus a "new architecture".
- UofS: storage part 3.
Retired Systems
Some older WestGrid systems have been removed from general service, typically replaced by more capable and more energy-efficient machines.
| Machine name | Period of service | Description |
|---|---|---|
| Gridstore/Blackhole | Gridstore: Jul. 2003-; Blackhole: | The Gridstore/Blackhole facility provided the primary storage services for WestGrid until that function was moved to Silo. See the WestGrid Data Storage page for details about Silo and other WestGrid storage facilities. |
| Dendrite/Synapse | Synapse: April 2005 - Oct. 2011; Dendrite: April 2005 - June 2010 | Dendrite, one of a pair of IBM Power5-based systems used for large shared-memory parallel programs, was decommissioned after hardware problems. Synapse, with 256 GB of RAM, was available through the Cortex front end until the end of October 2011, when it was also decommissioned. Breezy and the soon-to-be-available Hungabee are alternative machines for large-memory serial or single-node threaded parallel programs (such as those based on OpenMP). |
| Hydra | Dec. 2003 - Jan. 2011 | This SGI visualization server was a testbed for remote visualization applications for several years. Visualization services now focus on several GPU-equipped nodes of the Checkers cluster. |
| Lattice (not to be confused with the current machine of the same name) | 2003 - Oct. 2009 | Lattice was a cluster of 36 HP ES45 nodes, 19 HP ES40 nodes, and one additional HP ES45 node dedicated to interactive jobs. Each node had 4 Alpha CPUs, clocked at 0.67-1.25 GHz, and 2-8 GB of memory. The CPUs provided good floating-point performance for their time. The ES45 nodes used a Quadrics interconnect, which provided much lower latency and higher bandwidth than commodity networks, making Lattice suitable for demanding parallel jobs. The current Lattice cluster targets similar jobs but has an InfiniBand interconnect. Lattice was also home to WestGrid's commercial Gaussian license; this service is now provided on Grex. |
| Matrix | July 2005 - Mar. 2011 | Matrix was a 256-core HP cluster (128 dual-core AMD Opteron-based nodes running at 2.4 GHz, with 2 GB of RAM per node) with an InfiniBand interconnect. It was intended for MPI-based parallel processing, which is now provided by Lattice and several other WestGrid clusters. |
| Nexus | Sept. 2003 - Feb. 28, 2011 | Nexus and related SGI servers provided the main large-memory capability for WestGrid for many years. Current alternatives for large-memory programs include Breezy and, coming in 2012, Hungabee. |
| Robson | Oct. 2004 - Aug. 2011 | Robson was a small (56-core) cluster based on 1.6 GHz PowerPC 970 processors in an IBM JS20 BladeCentre configuration. Each 4-core compute node (blade) shared 4 GB of RAM. The system was used for a wide range of jobs, including serial jobs and parallel jobs with low interprocess communication requirements. Unique features of Robson included direct access to a large storage facility and a batch environment with a queue for preemptible jobs. |
Updated 2011-11-09.
