Storage issues: cluster available for normal use

Posted on Sunday, October 3, 2021 in Cluster Status Notice.

Update, 10/25/2021: The cluster will be available for normal use at 10:30am this morning. The system remained stable and error-free over the weekend. We were also able to catch up on tape backup operations.

Update, 10/22/2021 1pm: Based on input we received from the vendor last night and comparing the available options, we have physically moved the 60 disks of the affected LUNs out of the disk enclosure that was causing the issues this morning. They have been installed and configured in open slots on the other disk enclosures that are visible to the associated storage sub-system. All of the transplanted LUNs are visible and looking good. The only exception is the LUN damaged in the Oct 8th outage, for which we are 96% through restoring from tape backup.

Our plan is to perform additional tests and diagnostics and to reopen the ACCRE environment at 2pm without Slurm usage. You will be able to access your data, but if you have large files you may experience errors since those are more likely to be part of the 4% still needing to be restored. We will provide another update once we determine when cluster jobs can be resumed.

Additionally, we will resume the restore from backup that had been running since Oct 11th and then perform a fresh backup of /home and /data.

Update, 10/21/2021 7pm: As of this evening we are continuing to explore options with the hardware vendor to safely restore the system without the defective hardware. We will give another update tomorrow once we have more information from the vendor.

Update, 10/21/2021 10am: After we mitigated the hardware problems yesterday evening, the hardware in one of the disk enclosures (a JBOD) caused several of its LUNs to be flagged as broken. This put the storage system at risk of broad failure. We are taking the storage offline to prevent further damage. We will then move the affected LUNs to other disk enclosures in order to vacate the problematic enclosure. This should allow the storage to come back online while we work with the vendor to replace the bad hardware.

Update, 10/21/2021: We plan on shutting gpfs52 off again in order to physically move the storage to another JBOD, isolating the hardware that’s having issues while we try to get a replacement. We will post more details shortly.

Update, 10/20/2021 5pm: We apologize for the delay in getting the update out. The storage is again mounted across the cluster. We are monitoring all activity to see if the steps we took will successfully prevent further issues and are working with our vendor to have parts on hand in case we need to do additional hardware maintenance.

Update, 10/20/2021 3pm: The storage mounts for /home and /data held for two hours, then began to gradually disconnect across the cluster. We are performing a controlled unmount on the rest of the systems, followed by a full cycle and log capture of the associated GPFS sub-system. At this point we suspect this is related to the hardware in one of the disk enclosures. We will provide another update by 3:30pm.

Update, 10/20/2021: Over the last 70 minutes we’ve detected a series of system incidents indicating that /data and /home are unavailable in the cluster environment. We have cycled the related GPFS servers and forced a remount on all connected nodes. We are also looking through the logs to see if this is related to the hardware recovery effort from the recent outage. Please check your jobs and/or files and report any issues via our helpdesk.

Update, 10/19/2021 12pm: We are over 90% complete with tape restoration on /data.

Update, 10/19/2021: We have completed tape restoration for damaged files in /home and are two-thirds complete for /data. We will keep you updated as the tape recovery progresses.

Update, 10/12/2021: The process for restoring the damaged files in /home and /data is still underway. We expect all of the /home files to be finished by late tomorrow afternoon. We don’t yet have a clear ETA for the /data files, but a rough estimate would be sometime on Friday.

We greatly appreciate your patience; please open a ticket if you have a time-sensitive matter.

Update, 10/8/2021: After consultation with the hardware vendor, it was determined that recovery of the third and final disk group would take longer than restoring the affected files from tape backup and was not guaranteed to succeed. Therefore, we will restore this disk group’s data from backup, and we have restored general cluster access to /home and /data today. Please note that approximately 3.4% of all files are temporarily missing from the filesystem and will be restored from tape backup.

We have provided a list of temporarily unavailable files in a text file on the cluster at /gpfs52/files-to-be-restored.txt

This is a large file, so if you wish to check only the files in your home directory that were affected you can use the following command to list them, replacing VUNETID with your ACCRE username:

grep /home/VUNETID /gpfs52/files-to-be-restored.txt
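The same filtering approach works for directories under /data as well. The sketch below illustrates it against a small, hypothetical sample list (the usernames and file paths are made up for the example; on the cluster you would run grep against the real /gpfs52/files-to-be-restored.txt):

```shell
# Illustrative sketch only: the paths and usernames below are hypothetical.
# On the cluster, point grep at /gpfs52/files-to-be-restored.txt instead.
printf '%s\n' \
  /home/alice/results/run1.dat \
  /home/bob/thesis/model.ckpt \
  /data/lab_group/shared/table.csv > sample-restore-list.txt

# List the entries under a given prefix (substitute /home/VUNETID or your
# group's /data directory when running against the real list):
grep '^/home/alice' sample-restore-list.txt

# Or just count how many of your files are pending restore:
grep -c '^/home/alice' sample-restore-list.txt
```

Anchoring the pattern with ^ avoids matching your directory name when it appears later in another user’s path.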

Once the tape restore is progressing, we will provide further information about the restored files as well as some tools to help users move recovered files back into their directories on /home or /data as desired.

We will continue to monitor the system this morning and plan on resuming pending Slurm jobs at 2:00pm. If you have currently scheduled jobs, you may wish to check your /home or /data directories for missing files that may impact your jobs and, if needed, suspend or cancel them until the restore completes.

We apologize for the impact that this unscheduled outage has had on your research and appreciate your patience as we have worked with the hardware vendor to diagnose and recover from this hardware fault.

Update, 10/6/2021: We have successfully repaired two of the disk groups (i.e. LUNs) without data loss and are working on the third and final disk group. All of the individual drives report as healthy and we continue to work with the vendor to restore system stability.

Update, 10/4/2021: We are still working with the vendor to resolve the hardware issue related to the GPFS storage sub-system. We will send out another update once we’ve begun testing the fix and hope to have an estimated time of availability at that point.

At approximately 6pm Saturday, the disk hardware that /data and /home utilize began having stability issues, which worsened over the evening. ACCRE staff worked over the next several hours to isolate the problem, to no avail, and we have now engaged the hardware vendor to continue troubleshooting. At this time /data and /home are unmounted everywhere.