Shared storage is now fully online

Posted on Monday, December 9, 2019 in Cluster Status Notice.

Update, 12/12/2019: We have restored full access to the shared storage system after working with our vendors to identify the problem. Users whose files were on the portion of storage that was unavailable (less than 5% of the system) should now be able to access all of their files without issues.

Although service has been restored, some hardware changes still need to be made. They were deferred because they require additional downtime. Scheduling for this downtime will be discussed next week at the monthly FAB meeting, and a notification will follow.

Update, 12/9/2019: After the system recovery on Thursday, the shared storage management system disconnected one of the disk volumes used by /scratch and /data. We were able to verify that the data on the disk volume is intact, but we have not been able to force the storage system to reconnect the volume. The vendor is analyzing our diagnostic data and logs and will provide options for resolving the issue.

This affects only users whose files happen to be stored on that disk volume, which represents less than 5% of the system.

Update, 12/6/2019: One of the storage servers in the shared storage cluster started having errors last night after the system came back online. Users who have a portion of their files on that server are experiencing problems ranging from missing files to files that appear corrupted. We are currently working with the vendor on ways to restore the health of the server. We will send out another update once we finish the first recovery attempt.

Update, 12/5/2019 7pm: The storage is back online and the cluster is available for use. Please contact us if you need further details. Happy computing!

Update, 12/5/2019 3pm: We are still resolving issues in the storage network that are preventing the storage service from stabilizing. The re-cabling is finished, and we are checking to make sure the storage management servers can access all the resources they need.

We realize that time is critical, especially for end-of-semester projects, and we apologize for the disruption.

Update, 12/5/2019 11am: During the recovery of the network, we have noticed that connectivity between some of the storage servers is still unstable, and we are actively troubleshooting. If you have already regained access to your files, please hold off on resubmitting any jobs or starting any work until further notice.

Update, 12/5/2019 10am: The network device that disrupted access to the storage system was identified and recovered. As the storage nodes reconnect, your files will become available. If you do not see your files within the next hour, please open a help desk ticket and we will investigate the issue.

Original post:

The /home, /data, and /scratch directories have unmounted across the cluster. The shared storage started to have issues early this morning, and while we were investigating the cause, the directories unmounted around 8:15am.

No data has been lost; only connectivity to the storage service is affected. This will likely impact almost all jobs.

We are currently checking the servers and network to isolate the problem. Updates will be provided as troubleshooting progresses.