Cluster access and visualization portal are down to restore files from backup following hardware failure; no data loss to /home; /data to be restored from tape

Posted on Monday, January 13, 2020 in Cluster Down or Unavailable, Cluster Status Notice.

Update, 1/16/2020 10pm: The new shared storage system is in place, and we have been restoring files from backup for the last 24 hours at approximately 1 GB/s. On Monday we will determine whether we can go ahead and provide access to /home. We will provide additional information about the time remaining on the restore and how to access files on the old storage.
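For a rough sense of scale, the restore figures above can be turned into an approximate volume. This is only a back-of-the-envelope sketch: it assumes the roughly 1 GB/s rate was sustained for the full 24 hours and uses decimal units (1 TB = 1000 GB). The function name is ours, for illustration only.

```python
def restored_terabytes(rate_gb_per_s: float, hours: float) -> float:
    """Approximate data moved at a sustained rate, in decimal terabytes."""
    seconds = hours * 3600
    # Assumption: decimal units, 1 TB = 1000 GB.
    return rate_gb_per_s * seconds / 1000


# About 1 GB/s for 24 hours, as reported in the update above.
print(f"{restored_terabytes(1.0, 24):.1f} TB")  # prints "86.4 TB"
```

Actual throughput during a tape restore typically varies, so this should be read as an order-of-magnitude estimate rather than a measured total.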

Update, 1/14/2020 4pm: During the recovery process for the shared storage (GPFS) yesterday, we had another drive failure in the same RAID group. With three drives down, it is impossible to recover that disk group. The impact is a loss of about 6% of the files.
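Why a third failure is fatal can be illustrated with a minimal sketch. It assumes a dual-parity layout (RAID 6 style), which this notice does not state explicitly; it is inferred only from the fact that the group was still rebuilding with two failed drives. The function and parameter names are hypothetical.

```python
def group_recoverable(failed_drives: int, parity_drives: int = 2) -> bool:
    """Return True if a parity RAID group can still be rebuilt.

    Assumption: the group tolerates as many simultaneous drive
    failures as it has parity drives (2 for a RAID 6 style layout).
    """
    return failed_drives <= parity_drives


print(group_recoverable(2))  # two failed drives: still rebuildable
print(group_recoverable(3))  # three failed drives: the group is lost
```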

Due to these circumstances, we have decided to go ahead and replace the shared storage system with completely new hardware. Our team began preparing the new system last night and we expect to start migrating user files soon. We will have a tentative estimate for completion once the data transfers are underway. The new system’s network connections are 2.5x faster, and it uses an optimization policy for staging smaller files on the SSDs.

Files located in /home will be copied over from the old system to the new one. There was no data loss to /home. Files from /data will be restored from tape. While we don’t back up /scratch, we plan on making the old /scratch available to users in case they are able to copy files over to their new /home, /data, or /scratch.

We regret that this happened this way, but we are looking forward to having the new hardware in place and providing a much-needed boost to the cluster.

Update, 1/13/2020 5pm: The recovery process for the shared storage (GPFS) is still underway. There are two steps to the process and our plan was to bring the cluster back online after the first step was done. We are still waiting for the first step to conclude and are reevaluating this strategy in favor of fully recovering the storage before connecting it to the cluster. Our hardware vendor is helping to assess the current state in order to mitigate other potential risks and recommend next steps.

Regarding the scheduler, testing has gone well and you may notice activity on the graphs on the ACCRE website. This just means that SLURM is functional and is running. Looking at the graphs you can see it’s not using the full cluster.

We will be providing another update before lunch tomorrow and apologize for the disruption.

Update, 1/13/2020 10pm: During a check of the shared storage in preparation for bringing the cluster back up, we detected one RAID set reporting two bad disk drives and another reporting a single bad drive. Those drives have been replaced and the storage system has started rebuilding the RAID sets. We are going to allow that process to finish before bringing the cluster back online. We expect to be able to do that tomorrow morning.

Update, 1/13/2020 4pm: The new SLURM version has been built and installed. The next phase will be testing. So far things look good and we hope to have the cluster restored later this evening.

As part of the effort to make sure the cluster is restored to good health, we are also checking the shared storage system (i.e., GPFS) for any residual issues caused by SLURM’s erratic behavior. We will send out another notice once SLURM testing has concluded.

Update, 1/13/2020 2pm: We’ve disabled access to the ACCRE gateways in order to resolve issues with GPFS and SLURM that started over the weekend. Access to the ACCRE Visualization Portal is also down. We apologize for the interruption to your research and hope to bring back access to the cluster soon.

Update, 1/13/2020 9am: This morning the vendor sent us a patched version for SLURM and we are in the process of building, deploying, and then testing it. We will send out an update as soon as testing is completed.

Update, 1/12/2020 9pm: The scheduler software for the cluster, SLURM, continues to have issues and we continue to troubleshoot the problem along with vendor support. Many of the jobs that had already started before SLURM went offline are still running and may complete. However, users may experience difficulty logging into the gateways due to strain on the shared storage system.

Original post:

The cluster’s GPFS file system is extremely overloaded, and at the same time the SLURM scheduler failed overnight and can’t be restarted. We’ve opened a ticket with SchedMD to resolve this issue.

We are currently rebooting nodes that are inaccessible to stabilize GPFS.

We apologize for any inconvenience to your research.
