[Resolved] Configuration setting to be applied Tuesday to resolve GPFS lagginess
Update, 7/30/2020: We have identified and tested a configuration setting that decreases the time it takes to resolve user and group names by an order of magnitude. According to the vendor, we can apply the new configuration without downtime.
We plan to perform this configuration update after hours next Tuesday, Aug 4th.
This new configuration addresses the most widely reported latency issue: the delay in resolving user and group names. This issue is exclusive to those using VUNetID credentials to access their data. We continue to investigate a separate issue involving delays in I/O operations and will post updates once we have identified the cause.
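As a rough self-check of the name-resolution delay, affected users can time a lookup from a login node. This is a minimal sketch using standard Linux NSS lookup commands (`id` and `getent`); it is not the vendor's diagnostic procedure:

```shell
# Time user and group resolution for the current account.
# On an affected node these lookups may take many seconds;
# after the configuration change they should return almost instantly.
time id "$(id -un)"
time getent passwd "$(id -un)"
```

If both commands return quickly after Tuesday's change, the fix is working for your account.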
There is an issue in one of our shared storage sub-systems that is causing considerable lag, with write operations sometimes taking up to 2 minutes. This appears to be isolated to the hardware hosting accounts migrated from DORS to ACCRE; we have no reports of it affecting /data. Our team has also determined that:
- the lag is experienced by both remote users and users on the cluster
- the lag is not caused by GPFS long-waiters (the most common cause of GPFS slowdowns)
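Users who want to gauge the write lag themselves can time a small synchronous write. This is a sketch, not an official test: the target path is a placeholder you should point at a directory on the affected storage, and the GPFS `mmdiag --waiters` check (shown commented out) requires Spectrum Scale admin access:

```shell
# Time a small synchronous write on the affected filesystem.
# TARGET is a placeholder; point it at your directory on the slow storage.
TARGET="$HOME/gpfs_lag_test"
time dd if=/dev/zero of="$TARGET" bs=1M count=16 conv=fsync
rm -f "$TARGET"

# For admins: list GPFS long-waiters (threads blocked inside GPFS).
# An empty or short list is consistent with our finding that
# long-waiters are not the cause here. Requires root on a GPFS node.
# mmdiag --waiters
```

On healthy storage the `dd` should complete in well under a second; multi-second (or multi-minute) times match the reported symptom.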
The next step is to identify which layer of the storage system is causing the bottleneck so that we can reproduce the issue on demand. We are currently working with the vendor on two other reproducible issues (related to permissions) that are blocking active research. There is a chance these issues are related, and solving one may move us toward resolving the others.