
[Resolved] Tape recovery on files impacted by May 29 disk failure is now complete

Posted on Monday, July 23, 2018 in Cluster Status Notice.

Update, 7/23/2018: After nearly two months, the tape recovery on /data and /scratch is virtually complete, so we are marking this issue as resolved. This notice concerns the logical disk failure on May 29 that caused 5% of files on /data and /scratch to become unavailable. It is unrelated to the recent sluggishness on /data and /scratch.
Please open a helpdesk ticket if you are unable to access files on /data and /scratch that were impacted by the disk failure. Thank you for your patience!

Update, 7/18/2018: Tape recovery resumed following /data and /scratch sluggishness and is nearly complete
We have resumed tape recovery on files that were inaccessible following the logical disk failure on May 29. We expect it to finish in the next few days.

Update, 7/3/2018: Tape recovery on hold due to /data and /scratch sluggishness, but nearly complete
The tape recoveries are very close to being done. We had to stop the last few as a result of the /scratch and /data sluggishness to avoid stressing the system.

Update, 6/25/2018: Tape recovery ongoing
By the end of the week, we expect to have completed restores of files impacted by the hardware failure that occurred several weeks ago. As before, please let us know if you need any impacted files urgently so we can prioritize them.
If next week you are still seeing I/O errors on files in /scratch or /data (or if you find zero-byte files that you believe should contain data), please open a helpdesk ticket with us. It is possible that your file(s) may not be recoverable if they were created shortly before (e.g. the day of) the hardware failure, but we will double check if requested.
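If you want to look for zero-byte files before opening a ticket, a scan along the following lines may help. This is only a minimal sketch, not an ACCRE-provided tool: the function name `find_zero_byte_files` is hypothetical, and it assumes you only care about regular files (symbolic links and directories are ignored).

```python
import os
import stat
from typing import Iterator


def find_zero_byte_files(root: str) -> Iterator[str]:
    """Yield paths of regular files under `root` whose size is 0 bytes.

    Hypothetical helper for illustration only.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are not
                # mistaken for empty regular files.
                info = os.lstat(path)
            except OSError:
                # Metadata could not be read; skip this entry here.
                continue
            if stat.S_ISREG(info.st_mode) and info.st_size == 0:
                yield path
```

For example, `sorted(find_zero_byte_files("/scratch/your_username"))` would list the empty files under that directory (the path is a placeholder). Some empty files are legitimate, so only report those you believe should contain data.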

Update, 6/13/2018, 6PM: The cluster is now accessible again. It appears that jobs accessing data from /dors were not impacted, but those accessing data on /home, /scratch, or /data were likely lost. Some may have been automatically restarted by SLURM.

We have resumed restoring files impacted by the hardware failure.

Please reach out to us if you have any concerns, questions, or pressing needs.

Update, 6/13/2018: The hardware vendor fixed the bad firmware issue on the storage array last night. This morning, however, when we attempted to bring the logical volumes back online in GPFS, the command hung.

This has impacted the ability to log in to the cluster and to run commands accessing home directories. We are working with IBM support to resolve this issue and will update you as we have more information.

Update, 6/12/2018: Tape recovery of /scratch and /data ongoing; some files will be unavailable from 4:30pm this afternoon until 8am tomorrow morning

The hardware vendors were unable to recover the logical volume in their last attempt. We will continue to restore files from our tape library as quickly as we can.

This afternoon beginning at 4:30PM we will be bringing down five additional logical volumes (same ones as last week) so engineers from the hardware vendor can correct a failed firmware patch they attempted to apply to one of the controllers in the impacted storage appliance. As a result, you may lose access to some files on /scratch and /data until tomorrow morning at 8AM. I apologize for the short notice, but we only recently received a response from the vendor and this work is urgent as we only have a single functioning controller within that storage appliance at the moment.

Update, 6/8/2018: Logical volume with /scratch and /data files unable to be restored; tape recovery ongoing

Unfortunately the hardware vendor was again unable to restore the logical volume. It’s possible they may try one last time next week but at this point we are operating under the assumption that all impacted data needs to be restored from our tape library.

We have already restored a large number of files and will continue to do so over the weekend. We will send an update early next week. In the meantime, please let us know if you need access to any critical files that were impacted so we can prioritize that restore (to the best of our ability).

Next week we also plan to work with the vendor to address the faulty hardware / firmware to ensure this problem does not occur again.

Please reach out if you have any questions or concerns.

Update, 6/7/2018: 5% of files on /scratch and /data still unavailable, more files will be offline starting at 4:30pm and going overnight

The hardware vendor was again unsuccessful last night in recovering the logical volume. They are going to try one last time tonight. This means that the other five logical volumes will again be going offline around 4:30PM today.

While we are hopeful that engineers will be able to resolve the problem tonight, we are prepared to be told that the logical disk is unrecoverable. Fortunately, all data are backed up to tape, and we have been restoring impacted files since early this week.

The total amount of impacted data is on the order of 65TB (/scratch and /data together hold close to 1PB). However, recovering these data will require reading 216TB of tape, and we expect this process to take 1-2 weeks. If you have data you need urgently, please let us know and we will do our best to prioritize these recoveries, although we are at times limited depending on exactly how and where the data are stored in our tape library.
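To put these figures in perspective, the short calculation below works out the share of storage affected, how much tape must be read per terabyte recovered, and the sustained read rate that a 1-2 week window would imply. It is a rough sketch only: it assumes decimal units (1PB = 1000TB, 1TB = 10^6 MB), treats "close to 1PB" as exactly 1000TB, and assumes tape is read continuously, none of which is stated in the notice.

```python
IMPACTED_TB = 65      # impacted data, from the notice
TOTAL_TB = 1000       # "close to 1PB", approximated as 1000TB (assumption)
TAPE_READ_TB = 216    # tape that must be read, from the notice

# Share of /scratch and /data capacity affected, by volume.
impacted_fraction = IMPACTED_TB / TOTAL_TB

# Terabytes of tape read per terabyte of data recovered.
read_amplification = TAPE_READ_TB / IMPACTED_TB


def required_throughput_mb_s(tape_tb: float, days: float) -> float:
    """Average MB/s needed to read `tape_tb` terabytes in `days` days,
    assuming continuous reading and decimal units."""
    seconds = days * 24 * 60 * 60
    return tape_tb * 1e6 / seconds


print(f"impacted fraction:  {impacted_fraction:.1%}")
print(f"read amplification: {read_amplification:.2f}x")
print(f"1 week:  {required_throughput_mb_s(TAPE_READ_TB, 7):.0f} MB/s")
print(f"2 weeks: {required_throughput_mb_s(TAPE_READ_TB, 14):.0f} MB/s")
```

Under these assumptions, roughly 3.3TB of tape is read for every terabyte restored, which is why the recovery takes far longer than the 65TB figure alone would suggest.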

Update, 6/6/2018: 5% of files on /scratch and /data still unavailable, more files will be offline starting at 4:30pm and going overnight

Engineers from the hardware vendor worked overnight on this issue and have identified a bug in the storage array’s controller that they believe to be the root cause of the issues that began last week. They will be working again tonight in an attempt to correct this bug.

The five logical volumes within this storage array that were not impacted by this bug are back online today for read access only (any writes of new files or modifications to existing files will automatically be moved to other volumes on other storage arrays).

During the maintenance tonight these disks will again be brought offline, so older files that already existed on these disks will again be inaccessible.

We will be taking these five volumes offline this afternoon beginning at 4:30PM, and plan to bring them back online tomorrow morning around 8AM.

We again apologize for this interruption to your research. We understand this is frustrating and can assure you we are doing all we can. Please reach out if you need special assistance of any kind.

Update, 6/5/2018: Maintenance on /scratch and /data starting at 8pm tonight to attempt fix on disk issue; request a ticket if you need access to an affected file

Unfortunately we have been unable to restore the logical disk used by /scratch and /data on the ACCRE cluster. Under guidance from the hardware vendor we will be performing maintenance tonight in an attempt to restore the disk. This will require us to temporarily take additional logical disks offline, so more files on /scratch and /data may be unavailable between 8PM tonight and 8AM tomorrow morning. We apologize if you are impacted by this work.

Update, 6/3/2018: 5% of files on /scratch and /data are inaccessible due to disk failure; request a ticket if you need access to an affected file

We have continued working on this hardware problem throughout the weekend. Specifically, we have made several attempts (with guidance from the hardware vendor) to make the logical disk visible to one or both controllers within the storage appliance, but to no avail. At the direction of the vendor, we have updated the firmware on the impacted device in order to produce additional logging information, which, we hope, will lead to a diagnosis of the problem.

We have also determined that roughly 5% of all files on /scratch and /data are impacted. Please reach out to us via our Helpdesk if you have critical files impacted by this hardware problem.

Based on interactions with the vendor, it appears unlikely this issue will be fixed today or even within the next few days, so please reach out to us if you have been impacted and need assistance of any kind.

We are very sorry for the inconvenience and the impact on your research that this may be causing.

Update, 6/1/2018: Parts of /scratch and /data may be inaccessible due to disk failure

We have been working with the vendor in an attempt to resolve the hardware problem of an unrecognized logical disk as quickly as possible. Unfortunately, parts of /scratch and /data are still inaccessible. In most cases the symptom will be an error such as “input/output error” when attempting to read data from /scratch or /data. We will keep everyone updated as we continue working to diagnose and resolve this problem.
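To check whether a particular file shows this symptom, something like the following sketch could be used. The function name `probe_read` is hypothetical, and reading the file to the end is an assumption on our part: it presumes that a partially affected file might fail only partway through. Please probe only the files you actually need, since each probe reads the whole file.

```python
import errno


def probe_read(path: str, chunk_size: int = 1 << 20) -> str:
    """Read `path` to the end and classify the outcome.

    Returns "ok", "io_error" (errno EIO, i.e. "input/output error"),
    or "other_error: <reason>". Hypothetical helper for illustration.
    """
    try:
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks and discard the data; only success
            # or failure of the reads matters here.
            while handle.read(chunk_size):
                pass
    except OSError as exc:
        if exc.errno == errno.EIO:
            return "io_error"
        return f"other_error: {exc.strerror}"
    return "ok"
```

A result of `"io_error"` for a file on /scratch or /data matches the symptom described above and is worth mentioning in a Helpdesk ticket.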

If you are working under a tight deadline and the inaccessibility of a critical file is preventing you from moving forward, please open a Helpdesk ticket with us. In some cases we may be able to help.

Original post, 5/29/2018:

At approximately 8:30pm tonight a disk controller in one of our storage arrays failed. The array automatically failed over to the redundant controller, but one of the six logical disks was not recognized by the redundant controller.

The logical disk that is not accessible right now is part of the /scratch and /data filesystem. Therefore, while the cluster is still up, some files stored on /scratch and /data may be inaccessible until further notice. We are actively working with the hardware vendor to diagnose the problem and replace any faulty hardware.