Parallel System Status

Post date System Status: Update Notes
2014-07-18 - 08:41 PDT Conditions

breezy/parallel/lattice datacentre brownout

The UofC datacentres experienced an electrical brownout this evening.  Compute nodes on clusters hosted at the UofC rebooted, and a number of jobs were lost on most systems as a result.  Sorry for the inconvenience.  We will restart, test, and return the affected compute nodes to service ASAP.

2014-07-09 - 15:48 PDT Online

Lattice/Parallel datacentre overheating - Resolved

Scheduling has resumed.  Thanks for your patience.

2014-07-08 - 21:00 PDT Conditions

Lattice/Parallel datacentre overheating

2200 Tuesday, July 8: Parallel and Lattice have experienced a high temperature condition in the datacentre.  A significant number of nodes have auto-powered off to prevent thermal damage.  Once the high-temperature condition has been resolved, scheduling will be resumed.  Sorry for the inconvenience.

2014-04-14 - 03:35 PDT Online

Scheduling restarted after disk failure

2014-04-14:

Multiple failing disks on two segments of the IBRIX file system on 2014-03-25 caused data loss in the /global/scratch file system (estimated at 2-3% of the files).

Job scheduling has been restarted after an extensive period of file system rebuilding and checking.  Please note that jobs that refer to damaged or missing files due to this disk crash will fail.  Unfortunately, we are unable to provide a list of the corrupted or missing files.

Please note that WestGrid infrastructure is aging.  Disks from the same production batch can fail at similar times.  Please save files that are important to you by transferring them to the WestGrid backup and archive site or your own system.

 

2014-04-04 - 16:55 PDT Conditions

Breezy, Lattice, Parallel file system issue

2014-03-25:

A failing disk drive on the IBRIX file system is causing some files on the Breezy/Lattice/Parallel complex to be inaccessible.
System staff are working with the IBRIX vendor to resolve the issue.
Please check output from currently running or recently run jobs carefully. Scheduling will be paused to prevent new jobs from failing.
Sorry for the inconvenience.

Update 2014-04-04:

Faulty disk drives have been replaced, but an extensive period of file system rebuilding and checking is still in progress.

2014-03-25 - 15:11 PDT Conditions

Breezy, Lattice, Parallel file system issue

A failing disk drive on the IBRIX file system is causing some files on the Breezy/Lattice/Parallel complex to be inaccessible.

System staff are working with the IBRIX vendor to resolve the issue.

Please check output from currently running or recently run jobs carefully. Scheduling will be paused to prevent new jobs from failing.

Sorry for the inconvenience.

2013-12-06 - 15:11 PST Online

Parallel back in production

If you are using Open MPI-based code that has not been recompiled since the October 2013 upgrade on Parallel, add

module load openmpi/old

to your job script.  If you recompile your code with the default Open MPI, that module command is not required (and should not be used).

Similarly, GPU-based code will need

module load cuda/4.1

to work without recompilation.
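
As a rough illustration of the two cases above, the following job-script preamble loads the older modules only for binaries that have not been rebuilt since the upgrade. It is a sketch under stated assumptions: the variable names OLD_MPI_BINARY, OLD_GPU_BINARY and MODULES are hypothetical, and only the two module names come from this notice.

```shell
#!/bin/bash
# Sketch of a job-script preamble (hypothetical variable names).
# Set a flag to "yes" only if the binary has NOT been recompiled
# since the October 2013 upgrade on Parallel.
OLD_MPI_BINARY=yes
OLD_GPU_BINARY=no

# Collect the legacy modules this job needs.
MODULES=""
if [ "$OLD_MPI_BINARY" = yes ]; then
  MODULES="${MODULES:+$MODULES }openmpi/old"
fi
if [ "$OLD_GPU_BINARY" = yes ]; then
  MODULES="${MODULES:+$MODULES }cuda/4.1"
fi

# Load each one. The check lets the script also run where the
# 'module' command is not available (for example, a workstation).
for m in $MODULES; do
  echo "loading $m"
  if type module >/dev/null 2>&1; then
    module load "$m"
  fi
done
```

Once the code has been recompiled against the default Open MPI, set both flags to "no" so that neither legacy module is loaded.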
