[Resolved] Full ACCRE cluster downtime on Friday, April 20 at 6pm, continuing through weekend

Posted on Friday, April 6, 2018 in Cluster Status Notice.

Update, 4/22/2018: We’re happy to report that the cluster is now back online and available for general use.

The filesystem check and repair went very smoothly. The scan identified only one additional corrupt file, also noncritical, bringing the total to three, and the subsequent repair corrected all three files.
We still have ~50 compute nodes that are offline (out of ~600), but we will be taking care of these this afternoon and expect to be at full computing capacity by tomorrow morning.

Original post:

Earlier this week we discovered two corrupt files on the GPFS22 filesystem (which contains ACCRE /home directories). The files themselves were noncritical. However, it is possible there are others that have not yet been identified.

We immediately reached out to IBM. Unfortunately, the only way to assess the full extent of the damage and repair it is to run a full filesystem check, which requires the entire cluster to be offline.

We are declaring a full cluster downtime beginning on Friday, April 20 at 6PM and lasting through the weekend. During this time, the cluster (including normal and custom gateways, file access, SLURM, and SAMBA access) will be inaccessible.

To retain access to important files (including those accessed via SAMBA) you should copy these files off of the cluster prior to April 20.

A SLURM reservation has been put in place to ensure that all compute nodes are drained of jobs prior to 6PM on April 20. As a result, users will lose the ability to run longer jobs (i.e. jobs requesting a wall time that would have them ending after 6PM on the 20th) as April 20 approaches.
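
For anyone unsure how long a job can still be, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not an ACCRE tool: the function name max_walltime and the 10-minute safety margin are assumptions made for the example, and the result only applies if the job starts right away, since time spent waiting in the queue shortens the window further.

```python
from datetime import datetime, timedelta

# Start of the downtime as stated in this notice (cluster local time).
DOWNTIME_START = datetime(2018, 4, 20, 18, 0)


def max_walltime(now, safety_margin=timedelta(minutes=10)):
    """Return the longest wall time, as a SLURM D-HH:MM:SS string, for a job
    starting at `now` that still ends before the downtime.

    Returns None when no time is left. The safety margin is an assumption
    for this sketch, not an ACCRE policy.
    """
    remaining = DOWNTIME_START - now - safety_margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    limit = max_walltime(datetime.now())
    if limit is None:
        print("No time left before the downtime; wait until access is restored.")
    else:
        print(f"Request at most: --time={limit}")
```

The printed value can be passed to sbatch through its --time option. The reservation itself can be inspected on the cluster with `scontrol show reservation`.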

DORS shares mounted on your local machine will not be affected, but DORS will not be available from the ACCRE cluster.

We apologize for the interruption this may cause, but this is critical maintenance work that must be performed to ensure the integrity of all data stored in ACCRE home directories. In the event that additional corrupt files are discovered and cannot be repaired, we will restore these files from backups on tape.

We will send an update once cluster access has been restored. We cannot predict exactly how long the maintenance will take. It is possible that access will be restored sometime on Sunday (4/22), but we have a reservation in place through Monday (4/23) morning.

Please reach out to us if you have any questions or concerns.