Cluster access restored following downtime on May 18-20
Update, 5/21/2021 5pm: While we wait for the results of the vendor’s analysis of the system events, we are restoring job submissions to the cluster so they can run over the weekend. Jobs will be scheduled if they are expected to complete before 8am on Monday. A couple of things to keep in mind:
- There’s still a possibility GPFS may unmount /data due to the unknown condition
- ACCRE will likely need to schedule a downtime to perform troubleshooting next week
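For jobs submitted during this window, the key constraint above is requesting a walltime short enough that the job finishes before 8am Monday. A minimal batch-script sketch, assuming a SLURM scheduler; the resource values and `my_program` are illustrative placeholders, not site defaults:

```shell
#!/bin/bash
#SBATCH --time=24:00:00   # request at most 24 hours so the job can finish before Monday 8am
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G          # memory request is a placeholder; adjust for your workload

srun ./my_program         # my_program is a hypothetical executable
```

Jobs requesting more walltime than remains before the Monday 8am cutoff will simply be held by the scheduler rather than started.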
We continue to monitor storage and cluster behavior and will provide updates if we notice any new issues.
Update, 5/21/2021 2pm: The cluster has been reopened with restricted access until 6pm. All gateways and filesystems are available, but computational jobs cannot be submitted. We expect to block access to the cluster and take GPFS offline this evening in order to actively troubleshoot the system using diagnostic information from the vendor. This may also involve taking DORS offline.
We will provide another update around 5pm regarding the timing and whether DORS availability will be impacted.
Update, 5/21/2021 11am: We are working with the vendor to determine which troubleshooting path to take. There’s a possibility that we may need to take all storage sub-systems offline at some point, but we will communicate that and a timeline as soon as we know.
Update, 5/21/2021: One of the tasks in scope for the GPFS maintenance was upgrading the OS on the servers to keep them in step with supported GPFS versions (if the operating system becomes too outdated, we cannot update the GPFS software). Early this afternoon we updated two of the servers and noticed unexpected instability in the mounts for /home. Further investigation showed that certain administrative commands and a few user commands could cause the compute nodes and gateways to lose those mounts. Unfortunately, this event does not generate any error messages in the logs or the terminal session, and it is unclear whether it is related to the OS updates. We are therefore extending the downtime through noon tomorrow while we gather information and work with the vendor to identify what triggers this condition.
The Scheduled Downtime dates for the Spring have moved up from June 1-3 to May 18-20. This is to reduce the chances that the /scratch storage sub-systems will experience major issues related to controller failure. By replacing the controllers that are reporting problems in each system, we will restore the hardware and connectivity redundancy needed for full functionality.
The other maintenance tasks that will be performed during this downtime include:
- upgrade software on
- configure dual-fiber connectivity to the group storage servers
- relocate existing GPFS hardware in accordance with updated data center layout
- upgrade LStore storage servers
- upgrade Linux on custom gateways and group storage servers
This Scheduled Downtime will not include the DORS storage sub-system. DORS will continue to be available on those dates.
In case you haven’t seen it yet, we have a status page for our systems available.