Parallel System Status

Post date System Status: Update Notes
2013-10-31 - 17:04 PDT Conditions

Parallel available for testing

Parallel logins have been re-enabled after the long outage for file system and other system upgrades. The system is still being tested, so it should not be regarded as being in full production yet. Most old Open MPI-based code will work without recompilation if module load openmpi/old is added to job scripts. Old GPU-based code will need module load cuda/4.1 to work without recompilation. We encourage you to log in to test and recompile your code as necessary, reporting problems to support@westgrid.ca.

For some additional background on the upgrades and current issues, please click here.

2013-10-22 - 05:30 PDT Offline

Maintenance continues

Maintenance on Parallel continues due to unresolved file system issues.  Click here for additional details. Sorry for the inconvenience.

2013-10-13 - 22:18 PDT Offline

Maintenance continues

Ongoing file system checks have prolonged last week's maintenance window.  Sorry for the inconvenience. It is not known when Parallel will be opened to users.  This message will be updated when more information becomes available.

2013-09-27 - 14:29 PDT Downtime Scheduled

Maintenance scheduled October 7 to October 14

Parallel will be down for a major system upgrade starting on Monday, October 7 at 8:30 AM and will be out of service until Monday, October 14.

The TORQUE and Moab batch scheduling system will be upgraded during that period. Please ensure that jobs submitted in the last few days before the maintenance have a walltime parameter that is sufficiently small to allow the jobs to finish before the system goes down. Any jobs waiting to run on the morning of October 7 will be deleted from the queue and will have to be resubmitted when the system comes back up.
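As a rough aid for choosing a walltime, the following Python sketch computes the largest value that still ends before the maintenance window opens. The function name max_walltime and the one-hour safety margin are illustrative assumptions, not part of any WestGrid tool, and the calculation assumes the job starts running the moment it is submitted. Time spent waiting in the queue shortens the usable walltime further.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the notice (local cluster time).
MAINTENANCE_START = datetime(2013, 10, 7, 8, 30)


def max_walltime(now, safety_margin=timedelta(hours=1)):
    """Return the largest walltime (HH:MM:SS) that ends before maintenance.

    Hypothetical helper: assumes the job starts at `now` with no queue wait.
    The safety margin is an assumed buffer, not a documented requirement.
    Returns None when no time is left before the window opens.
    """
    remaining = MAINTENANCE_START - now - safety_margin
    total_seconds = int(remaining.total_seconds())
    if total_seconds <= 0:
        return None
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Example: a job submitted at noon on Saturday, October 5.
    print(max_walltime(datetime(2013, 10, 5, 12, 0)))
```

The resulting string can be used as the value of the walltime resource request in a job script. If the function returns None, the job cannot finish before the system goes down and should be held for resubmission after the upgrade.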

Logins will not be available during the maintenance window, so please make sure to retrieve any files you need prior to October 7.

Due to changes in shared libraries, you may have to recompile and test software after the upgrade. Major packages in /global/software will be run on a test cluster in the week before the upgrade and recompiled if necessary. For some of the less frequently used software, however, we will have to rely on researchers to run their own tests and report any problems encountered.

Sorry for the inconvenience.

2013-09-23 - 15:56 PDT Online

2013-09-23 Job scheduling resumed on Parallel

Scheduling has been resumed on Lattice, Parallel and Breezy after addressing a file system problem that affected quotas.  Resolution of the problem required rebooting file servers and resulted in failure of some jobs to write output files.  Please check output carefully and resubmit any suspect jobs. Sorry for the inconvenience.

2013-09-19 - 20:01 PDT Conditions

2013-09-18 Job scheduling paused on Parallel

Scheduling has been paused on Lattice, Parallel and Breezy pending a solution to a problem affecting IBRIX file system administrative operations. This does not yet appear to have impacted running jobs, except in an isolated case in which a storage quota was incorrectly reported as being exceeded.  Normal service will be resumed as soon as possible.  Please accept our apologies for the service interruption.

2013-08-12 - 21:15 PDT Online

System functioning normally

File system problems reported July 31 have been resolved.
