May/June 2021 storage issues resolved; /scratch and Slurm are back to normal use

We have verified that /scratch is performing at the same level as before the downtime. You may resume using it as you normally would and submit jobs to Slurm without the special AtRiskUnstableEnvironment reservation.

Prior to the downtime, /scratch was operating with only 1 of the 2 controllers in one of the main servers. Although we replaced the failed controller during the downtime, the system would sporadically lose connectivity to several of its LUNs, at which point some files would be reported as inaccessible or corrupted. After much troubleshooting, the vendor determined that there was a bug in the controller and, as of Tuesday, did not have a fix. We remounted /scratch with a single controller that same day and it has since passed all of our tests.

We continue to work on a redesign that will allow us to move off the buggy hardware and benefit from component redundancy at all levels: disks, controllers, and servers.

Update, 6/1/2021: We are not satisfied with the results of the remediation work performed on the /scratch GPFS storage sub-system. After in-depth troubleshooting from our team and analysis from the vendor, /scratch is operational but we still see some inconsistencies that prevent a full validation of the system’s health.

/scratch has been remounted across the cluster as of this morning. We have two ongoing efforts to improve the situation. The first is that we are initiating a sequence of operations that will run over the next 3 days to determine our level of confidence in /scratch for long-term usage. Please continue to use the special reservation for job submission. The second is that we are taking advantage of the situation to redesign both architectural and performance aspects of /scratch. We expect to have additional details about that by the end of next week.

Please consider moving critical files to /data. If your group’s quota limit prevents you from doing so, have your PI or lab manager submit a helpdesk ticket outlining the need.

Update, 5/27/2021 5pm: We have finished our troubleshooting for the part of GPFS that handles /data and /home, and those resources are now remounted. All diagnostics indicate that those GPFS subsystems are healthy, and we will continue to monitor them closely over the break. You may continue to use the cluster and submit jobs with the Slurm directive #SBATCH --reservation=AtRiskUnstableEnvironment. /scratch continues to be unavailable pending further analysis.

Update, 5/27/2021: Our next steps for troubleshooting GPFS will require us to take /home and /data offline. We plan to do that at 1pm and hope to have those back online by 5pm. /scratch will continue to be unavailable.

In the meantime, please make sure to copy any files you may need or gracefully halt any cluster jobs.

Update, 5/26/2021: We are working with our vendor to resolve the stability issue affecting the /home and /data filesystems and expect that troubleshooting may take additional days. During this time, we cannot guarantee the stability of these filesystems but understand that there may be an urgent need to perform computation on the ACCRE cluster. Additionally, because of a completely separate hardware failure that a different vendor is troubleshooting, /scratch will likely be unavailable.

During this time you may submit jobs with the Slurm directive #SBATCH --reservation=AtRiskUnstableEnvironment, and we will keep nodes available to run these jobs. Please note that jobs may halt unexpectedly due to filesystem issues, and we cannot make any guarantees about data written during this time. Additionally, we may not be able to troubleshoot issues with jobs submitted under this reservation, as we are focused on the primary task of restoring service. These jobs must request a duration that lets them complete by June 6th, or they will not be scheduled. /dors and /data continue to be backed up.

In summary:

  • if you have to submit jobs, use the Slurm directive #SBATCH --reservation=AtRiskUnstableEnvironment
  • /dors is available
  • /home and /data are available but may be impacted
  • /scratch is expected to have outages
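
For reference, a batch script using the reservation might look like the minimal sketch below. Only the reservation name comes from this announcement. The job name, time limit, resource requests, output file name, and workload are hypothetical placeholders to replace with your own values.

```shell
#!/bin/bash
# Hypothetical example job: everything except the reservation name is a placeholder.
#SBATCH --job-name=at_risk_example
#SBATCH --reservation=AtRiskUnstableEnvironment
# Request a time limit short enough that the job completes by June 6th.
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=at_risk_example_%j.out

# SLURM_JOB_ID is set by Slurm at run time; fall back to a marker when run by hand.
job_id="${SLURM_JOB_ID:-not-under-slurm}"

# Placeholder workload: replace with your own commands.
echo "Job ${job_id} started on $(uname -n)"
```

Saved under a name of your choosing (for example at_risk_example.slurm, also hypothetical), the script would be submitted with sbatch at_risk_example.slurm. Because /scratch is expected to have outages, consider writing output to a filesystem listed above as available.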

Update, 5/25/2021: The cluster has been operating without Slurm scheduling since last Friday while we worked with the vendors to schedule active troubleshooting. Our hardware vendor has indicated that they are ready to proceed, and we have begun to unmount /scratch for troubleshooting. The gateways, DORS, and /data continue to be available.


For earlier updates, see: May 2021 Downtime