[Resolved] Downtime for /scratch storage on Wednesday 11/4 from 6am to 2pm CT
Update, 11/4 2pm: The /scratch storage was successfully repaired during this morning's specially scheduled downtime. Please feel free to resume activities that use /scratch and report any issues you find.
One of the GPFS disk groups of the /scratch storage had a drive failure, which is not unusual. We replaced the failed drive with a new one, which should have kicked off a RAID rebuild. However, the storage server is not detecting the new drive and will not rebuild the RAID. After troubleshooting with the vendor, we determined that we need to halt activity on that server in order to upgrade the firmware and rebuild the RAID6. This is scheduled for next week, on Wednesday, November 4th, starting at 6am CT. We expect to have /scratch remounted on the cluster by 2pm CT.
If you or your cluster jobs need access to any of your files on /scratch during that time, please copy them over to a different storage system.
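As an illustration of copying a directory off /scratch ahead of the downtime, here is a minimal Python sketch. The function name `copy_off_scratch` is hypothetical and not a cluster-provided tool, and any paths you pass to it are your own; the sketch only assumes that the destination is on a storage system that stays available, such as /data.

```python
import shutil
from pathlib import Path


def copy_off_scratch(src: Path, dest: Path) -> Path:
    """Copy the directory tree at src to dest and return dest.

    Hypothetical helper for illustration only. It refuses to overwrite
    an existing destination so an earlier copy is never clobbered.
    """
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")
    # copytree uses shutil.copy2 by default, which preserves timestamps
    # and permissions; symlinks=True copies links as links instead of
    # following them.
    shutil.copytree(src, dest, symlinks=True)
    return dest
```

A call might look like `copy_off_scratch(Path("/scratch/<your_dir>"), Path("/data/<your_dir>_copy"))`, with both placeholders replaced by your own directories. For very large directories, a dedicated transfer tool may be a better fit than a single Python process.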
A few important items to note:
- This will not impact the use of /data storage or networked storage (aka DORS).
- This has not resulted in any data loss. The disk groups are configured to survive multiple disk failures; we just need to restore the affected disk group to full health.
- Files on /scratch are not backed up.