Maintenance on GPFS scheduled for early Monday morning; /data and /scratch unavailable for read tasks

Posted on Monday, July 2, 2018 in Active Cluster Status Notice, Cluster Status Notice.

Update, 9/18/2018: This upcoming Monday (Sept 24) beginning at 4:30AM we will be upgrading the cache on our GPFS storage appliances. We expect this maintenance to last between 3 and 4.5 hours. During this time, reads from /scratch or /data may fail. We are performing the maintenance early in the morning to minimize impact as much as possible; however, any running jobs that are actively reading from /scratch or /data during the maintenance window may fail.

I/O on /home and /dors will not be affected. SLURM jobs will continue to run so long as they are not reading data from /scratch or /data.
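
If you would like to check whether any of your own processes currently have files open under /scratch or /data before the window starts, a quick script along the following lines can help. This is a minimal sketch, not an ACCRE-provided tool; it assumes the psutil Python package is available on the node where you run it (the lsof command is another option).

```python
# Minimal sketch (not an ACCRE tool): list your processes that currently
# hold files open under /scratch or /data, so you can judge whether your
# jobs are exposed during the maintenance window.
# Assumes psutil is installed (e.g. `pip install --user psutil`).
import getpass
import psutil

WATCHED = ("/scratch", "/data")
me = getpass.getuser()

for proc in psutil.process_iter(["pid", "name", "username"]):
    if proc.info["username"] != me:
        continue
    try:
        hits = [f.path for f in proc.open_files() if f.path.startswith(WATCHED)]
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        continue
    if hits:
        print(f"PID {proc.info['pid']} ({proc.info['name']}): {hits}")
```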

For an update on the status of GPFS storage at ACCRE, please continue reading:

In addition to the hardware problems we experienced early in the summer, in recent months we have also observed that the volume of writes to /scratch and /data has increased to the point where it is overrunning the capacity of the cache in some of our storage arrays, something that has happened only in very rare circumstances in the past. We have already doubled the cache in one of our storage appliances and have observed improved performance as a result, so we are hopeful that these additional upgrades will further improve /scratch and /data responsiveness under heavy load.

We also placed a relatively large hardware order this past week that will allow us to (i) increase the capacity of /scratch and /data, (ii) lifecycle some older hardware, and (iii) improve the performance on /scratch and /data. This hardware should arrive within the next 1-2 months and we hope to have these new machines incorporated into our environment by the end of the calendar year.

We know that many have been frustrated by intermittent filesystem (/scratch and /data in particular) sluggishness over the last several months, and understandably so. Thank you for bearing with us as we work to expand, adapt, and design storage platforms that are performant, cost effective, and more reliable.


Update, 9/14/2018: /data and /scratch sluggishness

This week we successfully upgraded the cache modules in one of our storage appliances. We have ordered additional modules (which we expect to arrive next week) to upgrade the remaining appliances. We are hopeful that these upgrades will help improve the consistency of /scratch and /data responsiveness when under heavy load.

In addition, we also placed a relatively large hardware order this past week that will allow us to (i) increase the capacity of /scratch and /data, (ii) lifecycle some older hardware, and (iii) improve the performance on /scratch and /data. This hardware should arrive within the next 1-2 months and we hope to have these new machines incorporated into our environment by the end of the calendar year.


Update, 9/5/2018: In addition to the hardware problems we experienced early in the summer, we have also observed that the volume of writes to /scratch and /data has increased to the point where it is overrunning the capacity of the cache in some of our storage arrays, something that has happened only in very rare circumstances in the past. We will be testing this hypothesis very carefully over the next few weeks.

Update, 8/20/2018: The sluggishness on /data and /scratch returned over the weekend. We are in the process of trying to determine why.

Update, 8/13/2018: /data and /scratch back online and hardware maintenance complete; performance is being monitored
Performance seems better so far since the work last week, but we won’t know for sure until the system is under heavier load.

Update, 8/9/2018, 2pm: A short while ago we resolved the issue preventing the disks from being brought back online in GPFS.  You should now be able to access your /scratch and /data directories.
The issue turned out to be neither with the hardware nor with GPFS, but rather with the part of the Linux operating system that allows redundant access to the disks from the GPFS servers.
We will be carefully monitoring /data and /scratch to see if the maintenance we did this morning may have fixed the hardware issue with one storage array that originally prompted today's events.
Thank you for your patience…

Update, 8/9/2018: /data and /scratch offline

The maintenance scheduled for this morning has hit an issue and, unfortunately, /scratch and /data are still inaccessible at this point. We have engaged the vendor, and the ticket has been assigned their highest priority.

The hardware maintenance on the storage array was completed successfully. The issue we are having is with bringing the disks back online in GPFS.

We apologize for the interruption to your research…


Update, 8/6/2018, 2pm: /data and /scratch sluggishness; maintenance on /data and /scratch this Thursday, 6-7AM

This Thursday, Aug 9 between 6-7AM we will be performing maintenance on a storage device used by /scratch and /data. During this time, attempted reads or writes from/to this device will fail. We have suspended writes to the device for now, to reduce the chance of attempted reads from or modifications to files on this device on Thursday morning. However, any attempted reads from, or modifications to existing files on, /scratch or /data may fail between 6-7AM on Thursday.
This is the same storage device we have been working on for the past month or two that we believe is the cause of intermittent sluggishness on /scratch and /data. Because we (and the vendor) suspect a hardware issue of some sort, any maintenance requires us to take the entire unit offline. There are two separate maintenance tasks we will be performing on Thursday in an attempt to correct the sluggishness, or at least to further isolate the root cause.
Any jobs accessing /home or /dors will not be impacted by this work, but SLURM jobs accessing /scratch or /data will be at risk.
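
For jobs that absolutely must read from /scratch or /data around the maintenance window, one option is to wrap those reads in retry logic so that a transient failure does not kill the job outright. The snippet below is a minimal sketch under that assumption, not an ACCRE-provided utility; the path in the usage comment is hypothetical, and the retry counts and delays should be tuned to your own workflow.

```python
# Minimal sketch: retry a read from /scratch or /data with exponential
# backoff so a transient failure during the maintenance window does not
# immediately kill the job.
import time

def read_with_retry(path, attempts=5, initial_delay=60):
    """Return the file contents as bytes, retrying on I/O errors."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as fh:
                return fh.read()
        except OSError as exc:
            if attempt == attempts:
                raise
            print(f"read of {path} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)
            delay *= 2

# Hypothetical usage:
# data = read_with_retry("/scratch/mygroup/input.dat")
```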

Update, 8/6/2018: Performance is better this morning, but the issues come and go. We have recovered the storage capacity we had prior to the logical volume failure, and will be rebalancing data this week for improved performance.

Update, 8/1/2018: We are still experiencing some intermittent sluggishness and continuing to work with the vendor to eliminate and identify potential causes.


Update, 7/26/2018: /data and /scratch sluggishness; SAMBA back online

We have brought our SAMBA server back online this morning, so SAMBA users should now be able to mount their shares again.

We have also completed some critical data rebalancing to alleviate the limited-capacity issues on /scratch and /data we were experiencing over the weekend and early in the week. Capacity was limited because of the logical volume problem that occurred last month (the volume contained 30 TB of data), and the shortfall was exacerbated by large restores of data from files impacted by the logical volume issue. We also suspect this limited capacity was contributing to the sluggishness on /data and /scratch.

Performance on /scratch and /data appears better this morning, but we are not ready to declare victory yet. We are continuing to work with the storage hardware vendors to determine whether there are additional factors at play, and we are monitoring the system closely.

Over the next week, we will be rebuilding the problematic logical volume and will eventually reincorporate it into our GPFS23 filesystem. Note that the problem was not with the logical volume itself but rather a bug in a controller’s firmware (which has since been upgraded in all controllers within our GPFS storage appliances) that impacted the logical volume.

Please let us know if you have any questions or concerns.


Update, 7/23/2018: /data and /scratch sluggishness; SAMBA down for maintenance

Our SAMBA server is currently down for maintenance as we attempt to make progress with /scratch and /data sluggishness. We expect to have it back online in the next day or two. Users will be unable to map/mount their SAMBA shares to their local machines until the server is back online.


Update, 7/18/2018: /data and /scratch are still sluggish despite this past Sunday’s maintenance work. We will announce another maintenance period soon in the hope of fixing this issue.


Update, 7/6/2018, 5PM: One last update before the weekend…

We made a lot of good progress today and have determined that there is indeed a hardware issue at play. Specifically, there are significantly longer response times from one of our storage appliances due to a hardware malfunction of some sort. The storage is still functioning, albeit at a slower rate than it should. We have a ticket open with the vendor and have let them know what we have determined and are awaiting their response.
Due to the complexity of our data storage environment, the issues with this single storage device have caused similar slowdowns on other devices, which initially led us to believe that the problem was not specific to a single device. However, through exhaustive troubleshooting over the course of the week, we have determined that the issue does in fact appear to be caused by a single storage device.
For those impacted by this sluggishness, response times do appear to have improved somewhat over the last 2-3 days, but we understand it's frustrating, especially in light of the frequent hardware issues we have experienced recently. It's extremely frustrating for us as well; please rest assured that we are doing everything in our power to stabilize our cluster storage so you can all continue working at a more reasonable rate.

Update, 7/6/2018: The intermittent sluggishness described below is still impacting /scratch and /data on the ACCRE cluster. At this point, we have narrowed the possible causes to one of the following (or a combination of both):

  • Low-level hardware problem
  • User processes abusing /scratch and/or /data

We are continuing to work with our hardware vendors to diagnose a possible hardware issue, and we are also in the process of enhancing our monitoring to determine whether the second possibility is in play. If you are running new processes or jobs that are I/O intensive on /scratch or /data, it would be helpful if you reached out to us proactively so we can inspect these processes and rule them out as culprits. A rough way to gauge this yourself is sketched below.
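
If you are unsure whether one of your processes counts as I/O intensive, one rough check is to sample its I/O counters over a short interval. This is a minimal sketch, not an official diagnostic; it assumes psutil is available on a Linux node and that the PID belongs to one of your own processes (the PID shown is a placeholder).

```python
# Minimal sketch (assumes psutil is installed): estimate a process's
# read/write rate by sampling its I/O counters twice, `interval` seconds apart.
import time
import psutil

def io_rate(pid, interval=10):
    proc = psutil.Process(pid)
    before = proc.io_counters()
    time.sleep(interval)
    after = proc.io_counters()
    read_mb = (after.read_bytes - before.read_bytes) / 1e6 / interval
    write_mb = (after.write_bytes - before.write_bytes) / 1e6 / interval
    return read_mb, write_mb

# Replace 12345 with the PID of one of your own job processes.
reads, writes = io_rate(12345)
print(f"~{reads:.1f} MB/s read, ~{writes:.1f} MB/s write")
```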

Update, 7/4/2018: Unfortunately we don’t have much new information to share this morning. The intermittent sluggishness on /scratch and /data continues despite our best efforts. We will continue to keep everyone posted.


Update, 7/3/2018: We are still battling with this problem this morning. We have attempted several measures to isolate the issue but have yet to determine the root cause.

We again apologize for the interruption and will keep you updated with any new information.


Original post:

We have been experiencing intermittent sluggishness on /scratch and /data throughout the day, and are actively investigating possible causes.

We apologize for the delay to your work.
