[Resolved] /scratch and /data are back online following weekend maintenance
Update, 2/12/2019:
/scratch and /data are back online and we are now accepting new jobs.
We were never able to get the maintenance command to run successfully, but we were able to verify (with IBM’s assistance) the integrity of /scratch and /data, which is great news and means we will not need to take another scheduled downtime in the near future.
Update, 2/11/2019, 3pm:
We apologize for the delay, but we are still working with IBM to try to salvage some useful information from the failed command so that the failure can be diagnosed and fixed. If they cannot come up with anything by the end of the day (6-8pm tonight), we will give up and attempt to bring /scratch and /data back online tonight.
Update, 2/11/2019:
/scratch and /data are still offline this morning but we are aiming to have them online before lunchtime. Please read below for more details.
Long story short, a GPFS maintenance command that we attempted to run this weekend, intended to ensure the continued integrity and reliability of /scratch and /data, failed to complete. We worked with IBM support engineers throughout the weekend to run this command in several different ways, but it failed in every case, and in one case took down /home in the process. It appears we have hit an edge-case bug within GPFS that is triggering this failure. We are running the command one last time this morning with tracing enabled, in the hope that this will give IBM the information they need to fix the bug. We will need to take yet another scheduled downtime in the not-so-distant future to attempt the process again once IBM has fixed the bug and we have deployed the fix to our GPFS installation.
We apologize for the inconvenience. We’re frustrated, as I’m sure many of you are. We plan to move this winter/spring to a newer version of GPFS (version 5) that promises better performance and reliability. It will also run with a new storage appliance that has a very large amount of cache space for better performance.
Please let us know if you have questions or concerns via our helpdesk. We will send another update once /scratch and /data are online again.
Update, 1/28/2019:
We plan to perform maintenance on /scratch and /data on Saturday, Feb 9, beginning at 8am. Due to the size of /scratch and /data, this maintenance will likely last through Sunday and could spill over into Monday. Please plan to be without access to data on /scratch and /data for at least Feb 9-10, and copy off any data you will need during that time. Jobs accessing /scratch or /data during this maintenance will be impacted.
The maintenance work on /home is complete, and the compute nodes have resumed running jobs. Thank you for your patience as we performed this critical work.
Please reach out with any questions or concerns via our helpdesk.
Original post on /home maintenance on Jan 26:
This Saturday (Jan 26), beginning at 8am, we will be performing emergency maintenance on /home. We expect the maintenance to last until sometime Saturday afternoon or evening. During this time, /home will be inaccessible, and jobs attempting I/O to or from /home will be impacted. Jobs accessing /scratch, /data, or /dors will not be impacted. No new jobs will run during the maintenance. Please migrate data off of /home if you wish to retain access to it on Saturday.
We apologize for the late notice, but this is important work that is necessary to maintain the performance and reliability of /home.
We will send a notice on Saturday afternoon or evening when /home is available again.
Please reach out with any questions or concerns via our helpdesk.