Networked storage refusing to connect
Further updates will be added to the downtime announcement.
Update, 2/9 1pm: To avoid two general downtimes in quick succession, ACCRE will begin both the GPFS fix and the planned maintenance items at noon tomorrow, Feb 10th. Please plan accordingly and open a helpdesk ticket if you need any assistance preparing.
Having weighed the available options, we believe this is the quickest path to resolving the system issues. Since yesterday, the storage problems have begun to affect other critical systems, resulting in a steady degradation of compute capacity and user experience.
Update, 2/9: Troubleshooting continues on the GPFS system, and we have submitted additional snapshots of other aspects of the system to the vendor for analysis. We have also presented some of the options being considered, along with their implications, to ACCRE’s Faculty Advisory Board.
We apologize for the delay in this update; we were hoping to have conclusive results to share.
Update, 2/8 1pm: This morning we received the first analysis results of the snapshot from the vendor and are reviewing network and quota messages from specific time frames. Additional analysis by the vendor and a proper chronology of events will lead to the root cause. A quick summary of the findings so far:
- The current issue does not present the same patterns as the RPC storm from December that disabled the export services (i.e. the 3 CES servers)
- There were no changes to the GPFS configuration other than quota updates (which are performed almost daily)
- At least 2 of the error messages point to known bugs in GPFS addressed in an April 2020 update
- The system reports a deadlock condition that persists despite having restarted the export services on the CES servers
Our goal is to work with the vendor to avoid a shutdown of the entire storage system ahead of next week’s scheduled downtime. Files are still accessible via cluster access and sshfs, but your experience may vary due to ongoing troubleshooting. We will provide another update later this afternoon.
Original post:
Earlier today we noticed a degradation in the storage sub-system related to the networked storage services (aka DORS). While investigating and monitoring the situation, we saw the 3 servers that handle the networked storage connections (i.e. SMB and NFS) become unreachable around 6pm. This in effect severed all active connections of those types. The GPFS sub-system reports that it needs to take a snapshot before further troubleshooting can take place. We have initiated the snap and have engaged vendor support. An update will be provided as soon as the analysis of the snap is done and the next steps are determined.
We did verify that files in /dors directories are still accessible via the cluster (i.e. logging into a gateway or the Visualization Portal) and mountable using sshfs connections for those who have ACCRE credentials. However, it is possible that you may experience some sluggishness due to the snapshot activity happening in the background.