
Network issues resolved and network storage stabilized; emergency update to GPFS applied

Posted on Sunday, September 20, 2020 in Cluster Status Notice.

9/22 3:05pm: ACCRE’s networked storage (aka DORS) has been stabilized. We are working with the hardware vendor to determine why one of the controllers for that subsystem performed a service halt instead of failing over gracefully to the redundant controller. Please submit a helpdesk ticket for any new issues you experience.


9/22 9:26am: At 8:54 this morning, our monitoring indicated that the storage subsystem for ACCRE’s networked storage (aka DORS) had unmounted across the cluster. From what we can tell, this is related to the GPFS update applied to the controllers (the topic of one of Sunday’s announcements). We are recovering the system and investigating whether the impact was full or partial.


9/21 11:50am: All services have been restored and cleared for use. We are bringing the compute nodes back online in groups so as not to slam SLURM as cluster activity ramps up. This means cluster jobs may wait a little longer in the queue before they are scheduled. If you are still experiencing any issues, please report them via the ACCRE helpdesk page on our website.


9/21 12:53am: Connectivity between the two server rooms in the data center has been re-established. We would like to thank the VUIT NOC staff for their rapid response and cooperation throughout the evening. ACCRE staff has begun reviewing the extent of the network impact on our systems and is working to make sure services are responsive. We thank you for your patience and we will provide a final update once we’ve checked all systems. Until then, users may experience intermittent issues in the event some services need to be restarted in order to recover.


9/21 12:18am: Just after 2am this morning we experienced a partial network issue that has severely limited connectivity between critical services in the ACCRE environment. Components of the storage service are among those impacted, which in turn creates other issues (e.g. login slowness or failures, nodes draining). We have identified what we believe to be the network segments with the problem and are troubleshooting the related network hardware.


9/20 8:11pm: We are going to apply an emergency update to GPFS tonight. This will cause a disruption to ACCRE’s networked storage (aka DORS) and is needed to re-establish fault tolerance in one of the disk arrays. This work will be done in parallel with the replacement of a network switch that failed this morning. Please check the website for updates related to that issue.


9/20 7:08pm: We will replace a key piece of network equipment to re-establish the connection between the two server rooms in the data center. Neither the primary nor the secondary switch that handles this traffic will reliably resume operation. We expect to have the replacement hardware up and running within the next 2 hours. We apologize for the interruption and will send out an announcement when that work is done.


9/20 8:47am: Just after 2am this morning we experienced a partial network issue that has severely limited connectivity between critical services in the ACCRE environment. Components of the storage service are among those impacted, which in turn creates other issues (e.g. login slowness or failures, nodes draining). We have identified what we believe to be the network segments with the problem and are troubleshooting the related network hardware.