Bugaboo System Status

Post date, System Status, Update Notes
2017-02-22 - 14:46 PST Downtime Scheduled

Bugaboo downtime Feb. 27

On Feb. 27 a high-voltage relay will be installed in the data centre that houses the Bugaboo facility. The Bugaboo system will be shut down at 5am (Pacific) and the power outage will last all day. This work had been scheduled for January but had to be postponed because an incorrect part was delivered. This is the last part of the upgrades to the data centre in preparation for the new Cedar facility.

2017-02-15 - 13:39 PST Online

Bugaboo filesystem problems, Jan. 19

The filesystem problems on the Bugaboo storage facility have reappeared. The supporting vendor is working on a solution. Until the problem is resolved, the Bugaboo system is unavailable.

Update, Jan. 27

The Bugaboo storage system has suffered a series of hardware failures. During the last week several elements of the storage system have been replaced, and another controller was replaced today. Nevertheless, the storage system is still not stable. We plan to replace a disk shelf on Monday and hope that this will stabilize the system.

By now we know that the repeated hardware failures have caused filesystem corruption that cannot be repaired. That is, we will not be able to restore about 3400 to 7000 files in the /global/scratch filesystem (detailed explanation below). These files will be lost permanently. We have pursued almost all possibilities to recover as many files as possible and have now decided to bring the system back up as soon as possible, without further time-consuming attempts to recover more files.

Detailed (technical) description of the filesystem corruption: The filesystem checks revealed about 3400 blocks that are each claimed by more than one file, which should never happen. This is a very rare kind of filesystem corruption. It means that, among the files claiming a given block, one is the original owner of the block and the others are "imposters". One way to repair this is to run a filesystem check that copies the block for every file that claims it. This guarantees that the proper owner of the block is restored; the "imposters" cannot be restored in any case. The problem is that such a filesystem check is very slow: we estimate that it would run for more than one month, and we have decided that we cannot keep the Bugaboo system offline for that long. We are therefore planning to bring the system online without attempting to restore the 3400 blocks to their proper owners, which means that between 3400 and 7000 files will be lost. We are still waiting for instructions from our supporting vendor on how to bring the filesystem back online. The detailed procedure will determine how many files are lost in the end.
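As a rough illustration of the condition described above (one block claimed by more than one file), here is a minimal Python sketch. It is not the vendor's filesystem check; the file names, block numbers, and function names are all made up for this example.

```python
from collections import defaultdict


def find_shared_blocks(file_blocks):
    """Return {block: [files]} for every block claimed by more than one file."""
    claims = defaultdict(set)
    for name, blocks in file_blocks.items():
        for block in blocks:
            claims[block].add(name)
    return {block: sorted(names) for block, names in claims.items() if len(names) > 1}


def files_at_risk(shared):
    """Return every file that claims at least one shared block."""
    return sorted({name for names in shared.values() for name in names})


# Hypothetical toy data: file name -> block numbers it claims.
file_blocks = {
    "a.dat": [10, 11, 12],
    "b.dat": [12, 13],   # also claims block 12
    "c.dat": [20, 21],
    "d.dat": [21],       # also claims block 21
}

shared = find_shared_blocks(file_blocks)
at_risk = files_at_risk(shared)
print(shared)
print(at_risk)
```

In this toy data, two blocks are shared and four files are involved. The sketch cannot tell which claimant is the original owner of a block, which is exactly why the repair described above has to copy the block for every claimant.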

We apologize for the long downtime and the loss of files, and hope that the system will become available again during the week of Jan. 30 - Feb. 3.

Update, Feb. 3

Unfortunately, the replacement disk shelf did not arrive before the weekend. This means that we are losing another weekend without even having the chance to bring the system back up. Work will resume as soon as the disk shelf is delivered (not before Monday morning).

Update, Feb. 7

The new disk shelf has arrived on site and has been installed. All disks in this shelf are now in a "verify" state. After the verify scans are completed, a few of the disks need to be rebuilt. All of this may still take a day or two. However, the system now appears to be stable, so there is a good chance that it can be made available again after the rebuilds are completed.

Update, Feb. 15

The hardware problems that have caused the extended downtime of the Bugaboo system have been resolved. However, these hardware problems have caused data corruption on one part of the storage system that supplies the /global/scratch filesystem. Files that are affected cannot be restored.

Technical details: the /global/scratch filesystem consists of 54 building blocks called "object storage targets" (OSTs). One of these OSTs was corrupted in a way that could not be repaired. Any file that was at least partially stored on that OST is corrupted and cannot be restored.
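The rule above, that a file is lost if any part of it was stored on the corrupted OST, can be sketched in a few lines of Python. The file names, the OST numbers, and the choice of 17 as the corrupted OST are invented for illustration; the real layout of files across the 54 OSTs is not given here.

```python
def affected_files(file_osts, bad_ost):
    """Return the files that have at least one piece on the corrupted OST."""
    return sorted(name for name, osts in file_osts.items() if bad_ost in osts)


# Hypothetical layout: file name -> set of OST numbers holding parts of it.
file_osts = {
    "run1.out": {3, 17},
    "run2.out": {5},
    "big.h5": {0, 5, 17, 42},
}

lost = affected_files(file_osts, bad_ost=17)
print(lost)
```

Note that a large file spread over many OSTs is more likely to touch the corrupted one than a small file stored on a single OST.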

We will place a list of affected files in your home directory, in the file "corruptedfiles.20170215". We are still generating these lists; expect to find yours either today (Feb. 15) or tomorrow. Remnants of these files still exist: the "ls -l" command lists them with "?" in the columns for permissions, owner, etc., and "ls -l filename" shows the error message "filename: Cannot allocate memory" for corrupted files. In almost all cases these files need to be removed with the "unlink filename" command. No command other than possibly "head" can be used with these files; for example, the "rm" command does not work. Please send email to support@westgrid.ca if you need help dealing with these files.

All running jobs had to be terminated. Please resubmit them.
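If your list is long, removing the files one at a time may be tedious. The following Python sketch reads a list file and calls os.unlink on each entry. It rests on two assumptions that are not confirmed above: that the list file contains one path per line, and that os.unlink (which issues the unlink system call directly, as the "unlink" command does) succeeds on these corrupted entries. The function name and the dry_run option are made up for this example. Run it with dry_run=True first and check the output.

```python
import os


def unlink_listed(list_path, dry_run=True):
    """Unlink every path named in list_path (assumed: one path per line).

    Returns (removed, failed), where failed holds (path, exception) pairs.
    With dry_run=True nothing is removed; the paths are only printed.
    """
    removed, failed = [], []
    with open(list_path) as fh:
        for line in fh:
            path = line.rstrip("\n")
            if not path:
                continue  # skip blank lines
            if dry_run:
                print("would unlink:", path)
                continue
            try:
                os.unlink(path)
                removed.append(path)
            except OSError as exc:
                failed.append((path, exc))
    return removed, failed
```

A possible call would be unlink_listed(os.path.expanduser("~/corruptedfiles.20170215"), dry_run=False). Any path that could not be removed is returned in the second list, so you can send those to support.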

If you have any questions about this, please send email to support@westgrid.ca. We apologize for the problems that these hardware failures and the resulting corruption have caused you.

2017-02-07 - 16:22 PST Offline

2017-02-03 - 16:44 PST Offline

2017-01-27 - 18:46 PST Offline

2017-01-19 - 19:05 PST Offline

2017-01-12 - 21:59 PST Online

System fully operational

Finished on January 13, 2017 - 5:59 GMT
