[Resolved] Brief cluster maintenance for Monday, July 23rd has been cancelled
Update, 7/22/2018: We are canceling the maintenance described below. Since this email was sent, we learned that a drive in one of our storage appliances was in a bad state (not bad enough to trigger our monitoring, but bad enough to impact performance). On Saturday morning we proactively failed over to a spare drive and began rebuilding the storage array that contained the bad drive. The rebuild completed a few hours ago, and we are hopeful it will remedy the intermittent sluggishness that we have been experiencing on /scratch and /data recently.
If the problem persists, it is very likely that we will need to reschedule the maintenance described below for later in the week.
We will continue to monitor the situation closely and keep you updated.
Original post: Brief cluster maintenance Monday, July 23rd at 6:30 AM
The maintenance work that we performed this past Sunday (July 15th) appears to have solved the issue with one of the two storage arrays that have been causing the significant performance impacts on our /scratch and /data filesystems.
However, the other storage array is still experiencing issues despite the work we did this past Sunday, as many of you are unfortunately well aware. We are continuing to work with the hardware vendor on this issue, and they have requested that we swap the controllers around in the unit to try to pinpoint the hardware component that is the root cause of the issues.
Doing so requires taking the storage array entirely down for a period of approximately 30 minutes. We are scheduling this work for Monday, July 23rd, beginning at 6:30 AM. Prior to beginning the work we will stop all I/O to this storage array.
Any jobs or interactive sessions on cluster gateways that attempt I/O to either /scratch or /data during this time frame will fail. However, as with the work we performed last Sunday, any jobs accessing /home or /dors should not be impacted by this work.
We have chosen early Monday morning because we believe this time will have the least impact on jobs. Once the maintenance work is complete, we will resume I/O to the storage array.
We apologize for the interruption to your research and appreciate your understanding as we take these necessary measures to work towards resolving these issues. Thank you…