[Resolved] /data and /scratch sluggishness
Update, 10/1/2018, 2pm: The cluster is back up again.
Update, 10/1/2018: The cluster is down due to today’s maintenance. See: Cluster Inaccessible
Update, 9/27/2018: We will be performing cache upgrades of our GPFS storage appliances on Monday, Oct 1, beginning at 4:30AM. We expect the maintenance to last between 3 and 4.5 hours. During this time, reads from /scratch or /data may fail.
Update, 9/23/2018: Due to staff illness we are cancelling tomorrow morning’s maintenance. We will send an update when this maintenance has been rescheduled; we are tentatively planning on a week from tomorrow (Oct 1).
Update, 9/18/2018: Maintenance on GPFS scheduled for early Monday morning; reads from /data and /scratch may fail
This upcoming Monday (Sept 24), beginning at 4:30AM, we will be upgrading the cache on our GPFS storage appliances. We expect this maintenance to last between 3 and 4.5 hours. During this time, reads from /scratch or /data may fail. We are performing the maintenance early in the morning to minimize the impact; however, any running jobs that are actively reading from /scratch or /data during the maintenance window may fail.
I/O on /home and /dors will not be affected. SLURM jobs will continue to run so long as they are not reading data from /scratch or /data.
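If you are not sure whether a running job keeps files open on these filesystems, one rough check is to inspect the job’s open file descriptors on the compute node. Below is a minimal sketch (an illustration only, not an ACCRE-supported tool) that scans /proc/<pid>/fd for paths under /scratch or /data:

```python
#!/usr/bin/env python3
"""Rough check: does a process have files open under /scratch or /data?

Illustrative sketch only, not an ACCRE-supported tool. Run it on the
compute node where your job is executing; the PID argument is the ID of
one of your job's processes.
"""
import os
import sys

WATCHED = ("/scratch", "/data")  # the GPFS filesystems under maintenance

def open_paths(pid: int):
    """Yield the resolved paths of a process's open file descriptors."""
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            yield os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor closed between listing and readlink

if __name__ == "__main__":
    pid = int(sys.argv[1])
    hits = [p for p in open_paths(pid) if p.startswith(WATCHED)]
    if hits:
        print(f"PID {pid} has files open on GPFS:")
        for path in hits:
            print("  " + path)
    else:
        print(f"PID {pid} has nothing open under {' or '.join(WATCHED)}")
```

Note that this is only a snapshot: a job with nothing open at the moment you check may still open a file on /scratch or /data during the maintenance window, so treat the result as a hint rather than a guarantee.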
For an update on the status of GPFS storage at ACCRE, please continue reading:
In addition to the hardware problems we experienced early in the summer, in recent months we have also observed that the volume of writes to /scratch and /data has increased to the point where it is overrunning the capacity of the cache in some of our storage arrays, something that in the past happened only in very rare circumstances. We have already doubled the cache in one of our storage appliances and have observed improved performance as a result, so we are hopeful that these additional upgrades will further improve /scratch and /data responsiveness under heavy load.
We also placed a relatively large hardware order this past week that will allow us to (i) increase the capacity of /scratch and /data, (ii) lifecycle some older hardware, and (iii) improve the performance on /scratch and /data. This hardware should arrive within the next 1-2 months and we hope to have these new machines incorporated into our environment by the end of the calendar year.
We know that many have been frustrated by intermittent filesystem (/scratch and /data in particular) sluggishness over the last several months, and understandably so. Thank you for bearing with us as we work to expand, adapt, and design storage platforms that are performant, cost effective, and more reliable.
Update, 9/14/2018: /data and /scratch sluggishness
This week we successfully upgraded the cache modules in one of our storage appliances. We have ordered additional modules (which we expect to arrive next week) to upgrade the remaining appliances. We are hopeful that these upgrades will help improve the consistency of /scratch and /data responsiveness when under heavy load.
In addition, we also placed a relatively large hardware order this past week that will allow us to (i) increase the capacity of /scratch and /data, (ii) lifecycle some older hardware, and (iii) improve the performance on /scratch and /data. This hardware should arrive within the next 1-2 months and we hope to have these new machines incorporated into our environment by the end of the calendar year.
Update, 8/9/2018: /data and /scratch offline
The hardware maintenance on the storage array was completed successfully. The issue we are having now is with bringing the disks back online in GPFS.
We apologize for the interruption to your research…
Update, 8/6/2018, 2pm: /data and /scratch sluggishness; maintenance on /data and /scratch this Thursday, 6-7AM
Update, 8/1/2018: We are still experiencing some intermittent sluggishness and continuing to work with the vendor to identify and eliminate potential causes.
Update, 7/26/2018: /data and /scratch sluggishness; SAMBA back online
We have brought our SAMBA server back online this morning, so SAMBA users should now be able to mount their shares again.
We have also completed some critical data rebalancing to alleviate the limited-capacity issues on /scratch and /data that we were experiencing over the weekend and early in the week. Capacity was limited because of the logical volume problem that occurred last month (the volume contained 30 TB of data), and it was exacerbated by large restores of data from files impacted by that problem. We also suspect this limited capacity was contributing to the sluggishness on /data and /scratch.
Performance on /scratch and /data appears better this morning, but we are not ready to declare victory yet. We are continuing to work with the storage hardware vendors to determine whether there are additional factors at play, and we are monitoring the system closely.
Over the next week, we will be rebuilding the problematic logical volume and will eventually reincorporate it into our GPFS23 filesystem. Note that the problem was not with the logical volume itself but rather a bug in a controller’s firmware (which has since been upgraded in all controllers within our GPFS storage appliances) that impacted the logical volume.
Please let us know if you have any questions or concerns.
Update, 7/23/2018: /data and /scratch sluggishness; SAMBA down for maintenance
Our SAMBA server is currently down for maintenance as we attempt to make progress with /scratch and /data sluggishness. We expect to have it back online in the next day or two. Users will be unable to map/mount their SAMBA shares to their local machines until the server is back online.
Update, 7/18/2018: /data and /scratch are still sluggish despite this past Sunday’s maintenance work. We will announce another maintenance period soon in the hope of fixing this issue.
Update, 7/6/2018, 5PM: One last update before the weekend…
Update, 7/6/2018: The intermittent sluggishness described below is still impacting /scratch and /data on the ACCRE cluster. At this point, we have narrowed the possible causes to one of the following (or a combination of the two):
- Low-level hardware problem
- User processes abusing /scratch and/or /data (see the sketch below this list)
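For anyone curious what the second possibility looks like in practice: on Linux, the kernel exposes cumulative per-process read/write byte counters in /proc/<pid>/io, which is one kind of signal that can separate a runaway user process from a hardware-level problem. Here is a minimal sketch (illustrative only, not the tooling we use internally):

```python
#!/usr/bin/env python3
"""Snapshot per-process I/O byte counters from /proc/<pid>/io.

Illustrative sketch only, not ACCRE's internal tooling. The counters are
cumulative since process start and are not broken out per filesystem;
without root you can only read your own processes.
"""
import os

def io_counters(pid: str):
    """Return (read_bytes, write_bytes) for a PID, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/io") as f:
            fields = dict(line.split(": ") for line in f.read().splitlines())
        return int(fields["read_bytes"]), int(fields["write_bytes"])
    except (OSError, KeyError, ValueError):
        return None  # process exited, permission denied, or odd format

if __name__ == "__main__":
    rows = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        counters = io_counters(pid)
        if counters:
            rows.append((pid, *counters))
    # Print the ten processes that have written the most bytes so far.
    for pid, rd, wr in sorted(rows, key=lambda r: r[2], reverse=True)[:10]:
        print(f"PID {pid:>7}: read {rd:>15,} B, written {wr:>15,} B")
```

Because the counters are cumulative and cover all filesystems, two snapshots taken a few seconds apart are needed to estimate current rates, and a large number here only suggests where to look next.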
Update, 7/4/2018: Unfortunately we don’t have much new information to share this morning. The intermittent sluggishness on /scratch and /data continues despite our best efforts. We will continue to keep everyone posted.
Update, 7/3/2018: We are still battling this problem this morning. We have attempted several measures to isolate the issue but have yet to determine the root cause.
We again apologize for the interruption and will keep you updated with any new information.
We have been experiencing intermittent sluggishness on /scratch and /data throughout the day, and are actively investigating possible causes.
We apologize for the disruption to your work.