
Grex System Status

Each entry below lists the post date and time, the system status (Online, Offline, or Conditions), and update notes.
2017-06-19 - 07:44 PDT Conditions

Grex Lustre filesystem is running on half of its OSS nodes

User access and Lustre have been restored; however, one of the Lustre servers has been taken out of service for further investigation. Most running jobs should not have been affected.

2017-05-30 - 12:15 PDT Online

Grex is fully operational

Grex is fully operational.

2017-05-29 - 14:15 PDT Conditions

Intermittent problems with Grex internal Ethernet switch

Grex's internal GigE switch is intermittently unavailable due to what appears to be a firmware issue. This affects the availability of internal Grex services such as the Torque and Moab servers and LDAP authentication. Running and queued jobs and data should not be affected, but logins to the system and commands like showq, qstat and qsub might intermittently fail. We are working on resolving the issue.

2017-05-21 - 17:24 PDT Conditions

Grex is open to test access

Our work on the Grex storage update is almost complete. The system is open for access and running jobs, for now in test mode. More updates on status and documentation are to follow. Please contact support@westgrid.ca if you experience problems using the system or accessing your data! Please CLICK HERE for more details on the new filesystem.

2017-05-19 - 21:14 PDT Offline

Grex is down for the planned storage outage.

Grex is down: user login access is not available and batch queues are stopped while we migrate to our new Lustre storage.

The outage has been extended because of stability issues with the new Lustre filesystem.

2017-05-17 - 06:22 PDT Offline

Grex is down for the planned storage outage.

Grex is down: user login access is not available and batch queues are stopped while we migrate to our new Lustre storage. The ETA for putting Grex back online is Friday, May 19.

2017-05-09 - 08:38 PDT Conditions

Grex storage outage planned for Wed. May 17.

We will have a Grex outage at 8:30 AM on Wednesday, May 17, to bring the new Lustre storage online. All compute nodes will be reprovisioned with the new Lustre filesystem client and rebooted, so running jobs will be lost. A reservation is in place to prevent new longer jobs from starting. Users can adjust the walltime of their jobs so that they end before the outage, for better throughput; an example submission is shown below.
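As a rough illustration only (the script name myjob.pbs and the 24-hour limit are hypothetical values, not part of this announcement), a shorter walltime can be requested at submission time with Torque's qsub so the job finishes before 8:30 AM on May 17:

    qsub -l walltime=24:00:00 myjob.pbs

The same limit can also be set inside the job script itself with a directive line:

    #PBS -l walltime=24:00:00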
