System Notices

Tue Dec 17 Nexus Cluster Outage - Wed Dec 18 Back in production

 Tue Dec 17 Nexus Cluster Outage:

Nexus and all SGI machines at UofA are inaccessible due to hardware
problems on a disk array serving the global file systems. This equipment
is no longer covered by any warranty or support agreement with the vendor.
We are trying to revive the faulty hardware and will announce machine
availability to users as soon as the problem is fixed.

 

UPDATE 16:29 PST 17 DEC

The file system has been fixed.

"We think that machines will be back up tomorrow around noon. We are
taking this opportunity to clean up our CXFS settings."

 UPDATE 11:15 PST 18 DEC

 Nexus, all SGI machines, and filesystems are back in production.

 

 

Tue Dec 22 09:00:00 MST: Scheduled outage on Checkers cluster. (Back in production)

Update Tue Dec 22 15:30:00 MST:

       Checkers cluster is back in production.

Tue Dec 22 09:00:00 MST:                                                   
      There will be a scheduled outage on the Checkers cluster in order to
      effect changes to the network file system.
      Jobs that cannot finish before this outage will not be started
      until after the outage is complete.

 

Nov 26 10:30 - 15:00 MST, Cortex complex compute nodes Guanine and Adenine outage due to UPS hardware failure

Nov 26 10:30AM MST

A UPS powering Guanine and Adenine experienced a spectacular hardware failure.

Jobs running on Adenine and Guanine during the UPS failure were affected.

Guanine and Adenine were powered off as a result.

Update 15:00 MST:

Guanine and Adenine are back in production running on utility power.

Orcinus is back online

All maintenance to the Chemistry server room heat exchangers and main chiller has been completed.  Orcinus is online and scheduling jobs once again.  Sorry for any delays and inconveniences.

Orcinus is Offline for Scheduled Maintenance

Orcinus is currently offline and will be down for maintenance until November 19.

Nov 14, 12:05 - Bugaboo available again

Bugaboo is available again. Hardware and software for the Lustre file systems (/home and /global/scratch) have been updated. Please report problems to support@westgrid.ca.

Bugaboo downtime extended until Sat, Nov 14, 12:00

Because of complications during the hardware and software upgrade, we are forced to extend the Bugaboo downtime until noon (12:00 Pacific) on Saturday, Nov. 14.

Tue, Dec 1, 2009 through Sun, Dec 6, 2009 - Matrix cluster unavailable due to major system upgrade

The Matrix cluster will be unavailable from Tuesday, December 1, 2009 through the following weekend, in order to undertake a major system upgrade.  The system should be available again on Monday, December 7, 2009.

A scheduling reservation will be put in place that will prevent jobs from starting if the specified walltime would extend into the maintenance period.  If there are jobs you can fit in during the days leading up to the shutdown by successively reducing the walltime to fit the shrinking window of opportunity, please do so.
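
As a rough illustration of the reservation rule described above, the following Python sketch checks whether a job's requested walltime would run into the maintenance period, and computes the largest walltime that still fits. The function names and the 09:00 start time are assumptions made for this example; the notice gives only the date, and the actual batch scheduler may apply the rule differently.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance window. The notice states only the date
# (Tuesday, December 1, 2009); the 09:00 time is a placeholder.
MAINTENANCE_START = datetime(2009, 12, 1, 9, 0)


def fits_before_maintenance(start_time, walltime, window_start=MAINTENANCE_START):
    """Return True if a job starting at start_time ends by window_start."""
    return start_time + walltime <= window_start


def max_walltime(start_time, window_start=MAINTENANCE_START):
    """Largest walltime that still ends before the window (never negative)."""
    return max(window_start - start_time, timedelta(0))
```

Under these assumptions, a job that could start at 09:00 on Nov 30 would be eligible with a 24-hour walltime but not with a 25-hour one, which is why reducing the requested walltime as the shutdown approaches can let a job run.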

If there are any jobs still running on the morning of Dec 1, they will be killed.

Jobs that are in the input queue at the time of the system shutdown should be handled normally by the batch scheduling software when the system is brought back up after the upgrade.

Fri, Nov 6, 2009, 18:40 (Pacific): Bugaboo available again

The Bugaboo file servers have decided to serve files again. Thus, Bugaboo is available again.

Fri, Nov 6, 2009, 16:00 (Pacific): Bugaboo unavailable

The file servers for the /home and /global/scratch filesystems crashed again.