
[Resolved] Some compute nodes offline for rebooting; /data and /scratch maintenance complete

Posted on Monday, July 9, 2018 in Cluster Status Notice.

Update, 7/15/2018: The reboot is nearly complete; we will go ahead and mark this as resolved.

Update, 7/15/2018: The maintenance is complete, and /scratch and /data are available on all gateways and most of the rest of the cluster. However, a moderate number of compute nodes must be rebooted before they can remount these filesystems. We have marked those nodes offline and will reboot them over the next few days as their currently running jobs complete.
Thank you for your patience and understanding as we performed this critical maintenance work.

Original post:
This Sunday (July 15) we will be performing maintenance on the storage devices hosting /scratch and /data. We will begin at 8:00 AM and plan to finish late Sunday evening. During this time, /scratch and /data will be inaccessible (including via SAMBA), and any attempted I/O to or from /scratch or /data will fail.
/home and /dors will remain available throughout the weekend, and SLURM jobs will not be impacted (so long as they are not attempting I/O to or from /scratch or /data).
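If you want to check whether a job's input or output locations fall on the affected filesystems, a minimal sketch along these lines may help. It is an illustration only, not an ACCRE-provided tool: the function names are hypothetical, the mount list is taken from this notice, and the check is purely lexical on absolute paths, so it will not catch symlinks in /home or /dors that point into /scratch or /data.

```python
from pathlib import PurePosixPath

# Filesystems named in this notice as unavailable during the maintenance window.
AFFECTED_MOUNTS = ("/scratch", "/data")


def touches_affected_storage(path, mounts=AFFECTED_MOUNTS):
    """Return True if `path` is an affected mount point or lies beneath one.

    Hypothetical helper: compares path components only, without touching
    the filesystem, so symlinks are not resolved.
    """
    p = PurePosixPath(path)
    return any(
        p == PurePosixPath(m) or PurePosixPath(m) in p.parents
        for m in mounts
    )


def affected_paths(paths, mounts=AFFECTED_MOUNTS):
    """Return the subset of `paths` that sit on an affected filesystem."""
    return [p for p in paths if touches_affected_storage(p, mounts)]
```

For example, calling affected_paths on the list of directories a job reads from and writes to returns only those under /scratch or /data; an empty result suggests the job should be safe to run during the window, subject to the symlink caveat above.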
During this maintenance we will be applying critical firmware upgrades to each of the controllers within our GPFS storage appliances. These upgrades are needed to prevent a recurrence of the recent problems with our storage hardware. They will also help the hardware vendors diagnose problems more easily when they occur (e.g., the occasional sluggishness we began experiencing last week).
We apologize for the short notice, but these upgrades are critical to improving the reliability of ACCRE storage. We chose Sunday to minimize interruption to your work as much as possible.
Please let us know if you have any questions or concerns.