/data and /scratch Sluggishness
Update, 7/6/2018, 5PM: One last update before the weekend…
We made a lot of good progress today and have determined that there is indeed a hardware issue at play. Specifically, one of our storage appliances is showing significantly longer response times due to a hardware malfunction. The storage is still functioning, albeit more slowly than it should. We have a ticket open with the vendor, have shared our findings with them, and are awaiting their response.
Due to the complexity of our data storage environment, the issues with this single storage device caused similar slowdowns on other devices, which initially led us to believe that the problem was not specific to a single device. However, exhaustive troubleshooting throughout the week indicates that the issue is in fact caused by a single storage device.
For those impacted by this sluggishness, response times do appear to have improved somewhat over the last 2-3 days, but we understand it’s frustrating, especially in light of the frequent hardware issues we have experienced recently. It’s extremely frustrating for us as well; please rest assured that we are doing everything in our power to stabilize our cluster storage so you can continue working at a reasonable rate.
Update, 7/6/2018: The intermittent sluggishness described below is still impacting /scratch and /data on the ACCRE cluster. At this point, we have narrowed the possible causes to one of the following (or a combination of the two):
- Low-level hardware problem
- User processes abusing /scratch and/or /data
We are continuing to work with our hardware vendors to diagnose a possible hardware issue, and we are also enhancing our monitoring to determine whether the second possibility is in play. If you are running new processes or jobs that are I/O intensive on /scratch or /data, please reach out to us proactively so we can inspect these processes and rule them out as culprits.
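If you are unsure whether one of your processes is I/O intensive, a quick way to check on a Linux compute node is to read its /proc I/O counters. This is a minimal sketch, not an official ACCRE tool; it assumes /proc is mounted and that you own the process being inspected (here the current shell itself is used as an illustrative PID):

```shell
#!/bin/sh
# Rough per-process I/O check on Linux.
# read_bytes / write_bytes count bytes that actually hit the storage layer
# (as opposed to rchar / wchar, which include cached I/O).
PID=$$   # illustrative: replace with the PID of your job process
grep -E 'read_bytes|write_bytes' /proc/$PID/io
```

Sampling these counters twice a few seconds apart and taking the difference gives an approximate I/O rate for the process.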
Update, 7/4/2018: Unfortunately we don’t have much new information to share this morning. The intermittent sluggishness on /scratch and /data continues despite our best efforts. We will continue to keep everyone posted.
Update, 7/3/2018: We are still battling this problem this morning. We have attempted several measures to isolate the issue but have yet to determine the root cause.
We again apologize for the interruption and will keep you updated with any new information.
We have been experiencing intermittent sluggishness on /scratch and /data throughout the day, and are actively investigating possible causes.
We apologize for any delay to your work.