Bugaboo System Status

Post date | System Status | Update Notes
2017-02-07 - 16:22 PST Offline

Bugaboo filesystem problems, Jan 19

The filesystem problems on the Bugaboo storage facility have reappeared. The supporting vendor is working on a solution. Until the problem is resolved, the Bugaboo system is unavailable.

Update, Jan. 27

The Bugaboo storage system has suffered a series of hardware failures. Over the last week, several components of the storage system were replaced, and another controller was swapped out today. Nevertheless, the storage system is still not stable. We plan to replace a disk shelf on Monday and hope that this will stabilize the system.

We now know that the repeated hardware failures have caused filesystem corruption that cannot be repaired. That is, we will not be able to restore between 3400 and 7000 files in the /global/scratch filesystem (detailed explanation below). These files will be lost permanently. We have pursued almost every possibility to recover as many files as possible, and have now decided to bring the system back up as soon as possible without further time-consuming attempts to recover more files.

Detailed (technical) description of the filesystem corruption: The filesystem checks revealed about 3400 blocks that are each claimed by more than one file. This should never happen, since a block can belong to only one file. It is a very rare form of filesystem corruption. For each such block, one of the files claiming it is the original owner, and the other files are "imposters".

One way to repair the situation is to run a filesystem check that copies the block for every file that claims it. This guarantees that the file that properly owns the block is restored. The "imposters" cannot be restored no matter what. The problem is that such a filesystem check is very slow: we estimate that it would run for more than one month.

We have decided that we cannot keep the Bugaboo system offline for that long. We are therefore planning to bring the system online without attempting to restore the 3400 blocks to their proper owners. This means that between 3400 and 7000 files will be lost. We are still waiting for instructions from our supporting vendor on how to proceed so that we can get the filesystem back online. The exact procedure will determine how many files are lost in the end.
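To illustrate the idea of multiply-claimed blocks, here is a minimal Python sketch on toy data. It is not the vendor's tool or the check that was run on Bugaboo. The file names, block numbers, and both helper functions are hypothetical. It only shows how blocks claimed by more than one file can be detected, and how a copy-based repair gives every claimant its own private block.

```python
from collections import defaultdict


def find_multiply_claimed(file_blocks):
    """Return {block: sorted list of files} for blocks claimed by more than one file."""
    claimants = defaultdict(set)
    for name, blocks in file_blocks.items():
        for block in blocks:
            claimants[block].add(name)
    return {b: sorted(names) for b, names in claimants.items() if len(names) > 1}


def clone_shared_blocks(file_blocks, shared, next_free_block):
    """Give every claimant its own copy of each shared block.

    The first claimant keeps the original block number; every other claimant
    is pointed at a fresh block. Only the true owner ends up with meaningful
    data, but the check cannot tell which claimant that is, so all get a copy.
    """
    repaired = {name: list(blocks) for name, blocks in file_blocks.items()}
    for block in sorted(shared):
        for name in shared[block][1:]:
            idx = repaired[name].index(block)
            repaired[name][idx] = next_free_block
            next_free_block += 1
    return repaired


# Toy block map (hypothetical): file name -> list of block numbers.
file_blocks = {
    "a.dat": [10, 11, 12],
    "b.dat": [12, 13],
    "c.dat": [20, 21],
    "d.dat": [21, 11],
}

shared = find_multiply_claimed(file_blocks)
affected = sorted({name for names in shared.values() for name in names})
repaired = clone_shared_blocks(file_blocks, shared, next_free_block=100)

print("shared blocks:", shared)
print("files touching a shared block:", affected)
print("still shared after cloning:", find_multiply_claimed(repaired))
```

In this toy example three shared blocks involve four files, so the number of files touching a shared block need not equal the number of shared blocks. A real check has to build this map for every block in the filesystem and copy data on disk, which is consistent with the long runtime estimated above.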

We apologize for the long downtime and the loss of files and hope that the system will become available again within the week of Jan. 30 - Feb. 3.

Update, Feb. 3

Unfortunately, the replacement disk shelf did not arrive before the weekend. This means that we lose another weekend without any chance of bringing the system back up. Work will resume as soon as the disk shelf is delivered (not before Monday morning).

Update, Feb. 7

The new disk shelf has arrived on site and has been installed. All disks in this shelf are now in a "verify" state. Once the verify scans are complete, a few of the disks will need to be rebuilt. All of this may still take a day or two. However, the system now appears to be stable, so there is a good chance that it can be made available again once the rebuilds are complete.

2017-02-03 - 16:44 PST Offline


2017-01-27 - 18:46 PST Offline


2017-01-19 - 19:05 PST Offline


2017-01-12 - 21:59 PST Online

System fully operational

Finished on January 13, 2017 - 5:59 GMT

2017-01-12 - 09:47 PST Testing

Bugaboo file system problem

Due to the failure of a storage controller, the Bugaboo Lustre filesystem is unavailable until further notice. The vendor is reviewing the issue.

Update: The filesystem problems have been resolved for now and the system is available. However, the vendor is scheduled to come onsite to replace one of the controllers.

2017-01-10 - 21:29 PST Online

