July 10, Bugaboo /home filesystem problems

We continue to have problems with the /home filesystem on the Bugaboo cluster. They show up as spurious "Cannot write: No space left on device" error messages.

This is due to a bug in the filesystem software. We are in contact with the vendor, but no fix is available so far. The vendor suspects that the bug is triggered by certain jobs running on the cluster, but we have not been able to isolate those jobs. In the meantime, we know how to clear the problem in the short term (i.e., until the system runs into the bug again). Please continue to email support@westgrid.ca whenever you run into this problem.
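
If it helps when reporting, the following is a minimal Python sketch of one way to tell a spurious "No space left on device" error from a genuinely full filesystem: it attempts a small write and, if that fails with ENOSPC, compares against the free space the filesystem reports. The function names and the 1 MiB threshold are illustrative assumptions, not part of any WestGrid tool, and the check is only a heuristic.

```python
import errno
import os
import shutil
import tempfile


def free_bytes(directory):
    """Free space (in bytes) that the filesystem reports for `directory`."""
    return shutil.disk_usage(directory).free


def try_write(directory, payload=b"x" * 4096):
    """Attempt a small test write in `directory`.

    Returns None on success, or the errno of the failure.
    """
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        return None
    except OSError as exc:
        return exc.errno


def looks_spurious(directory, min_free=1 << 20):
    """True if a write fails with ENOSPC although free space is reported.

    `min_free` (1 MiB here) is an arbitrary, assumed threshold.
    """
    error = try_write(directory)
    return error == errno.ENOSPC and free_bytes(directory) >= min_free
```

For example, looks_spurious(os.path.expanduser("~")) checks your home directory. A True result suggests the error is the spurious kind described above and is worth including in your email to support.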

The /global/scratch filesystem has not been affected by this bug so far, so using /global/scratch instead of /home is a workaround for this problem. A word of caution, though: there is no reason to assume that the same bug cannot be triggered on /global/scratch as well. The fact that we have not seen the problem on /global/scratch probably just indicates that the jobs that trigger the bug use the /home filesystem and not /global/scratch.
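
As a rough illustration of the workaround, a job script could pick its output directory by preferring a location under /global/scratch and falling back to /home only if that location is unavailable. The sketch below is hypothetical: the per-user directory /global/scratch/<username> is an assumed layout, so substitute the directory you actually use.

```python
import os


def pick_output_dir(candidates):
    """Return the first candidate that is an existing, writable directory.

    Returns None if no candidate qualifies.
    """
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None


# Assumption: each user has a directory named after them under /global/scratch.
username = os.environ.get("USER", "unknown")
candidates = [
    os.path.join("/global/scratch", username),  # preferred: not affected so far
    os.path.expanduser("~"),                     # fallback: /home
]

output_dir = pick_output_dir(candidates)
print(output_dir)
```

Note that os.access only checks permissions. It does not detect the bug itself, so a write to the chosen directory can still fail.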