Cluster access restored following downtime on May 18-20

Posted on Monday, May 3, 2021 in Website.

The May 2021 scheduled downtime has concluded. Updates on the subsequent storage issues have been moved to a separate post.

Update, 5/21/2021 5pm: While we wait for the results of the vendor’s analysis of the system events, we are restoring job submissions to the cluster so they can run over the weekend. Jobs will be scheduled if they are expected to complete before 8am on Monday. A couple of things to keep in mind:

  • There’s still a possibility GPFS may unmount /home and /data due to the unknown condition
  • ACCRE will likely need to schedule a downtime to perform troubleshooting next week
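
To gauge whether a job fits the weekend window described above, the arithmetic can be sketched in Python. This is only an illustration of the "complete before 8am on Monday" condition, not the scheduler's actual logic; the function names `next_monday_8am` and `fits_before_cutoff` are hypothetical.

```python
from datetime import datetime, timedelta


def next_monday_8am(now: datetime) -> datetime:
    """Return the first Monday 08:00 that falls strictly after `now`."""
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    cutoff = (now + timedelta(days=days_ahead)).replace(
        hour=8, minute=0, second=0, microsecond=0
    )
    if cutoff <= now:
        # It is already Monday at or past 08:00, so use next week's Monday.
        cutoff += timedelta(days=7)
    return cutoff


def fits_before_cutoff(now: datetime, requested_walltime: timedelta) -> bool:
    """True if a job starting at `now` would end by the Monday 08:00 cutoff.

    Assumption: the job starts immediately; time spent waiting in the
    queue would shrink the usable window further.
    """
    return now + requested_walltime <= next_monday_8am(now)
```

For example, a job starting at 5pm on Friday, 5/21/2021 has at most 63 hours before the cutoff, so a request longer than that would not fit the window under this sketch.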

We continue to monitor storage and cluster behavior and will provide updates if we notice any new issues.
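
Because an unmount of /home or /data may happen without warning, a job can check for both filesystems before doing any work. The following is a minimal sketch, assuming that /home and /data are themselves mount points on the node (if they are symlinks or subdirectories of a mount, `os.path.ismount` will report them as missing). The names `REQUIRED_MOUNTS`, `missing_mounts`, and `main` are hypothetical and not part of any ACCRE tooling.

```python
import os
import sys

# Paths named in this post; adjust to the filesystems a job depends on.
REQUIRED_MOUNTS = ("/home", "/data")


def missing_mounts(paths=REQUIRED_MOUNTS, is_mount=os.path.ismount):
    """Return the subset of `paths` that are not currently mount points.

    `is_mount` is injectable so the check can be exercised without
    touching the real filesystem.
    """
    return [p for p in paths if not is_mount(p)]


def main() -> int:
    """Report missing mounts on stderr; return a non-zero code if any."""
    missing = missing_mounts()
    if missing:
        print("Missing mounts: " + ", ".join(missing), file=sys.stderr)
        return 1
    return 0
```

Calling `sys.exit(main())` at the top of a job lets it stop early instead of failing partway through. Note that this only detects a mount that is absent; a filesystem that is mounted but unresponsive could still cause the check itself to hang.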

Update, 5/21/2021 2pm: The cluster has been reopened with restricted access until 6pm. All gateways and filesystems are available, but computational jobs cannot be submitted. We expect to block access to the cluster and take GPFS offline this evening in order to actively troubleshoot the system using diagnostic information from the vendor. This may also involve taking DORS offline.

We will provide another update around 5pm regarding the timing and whether DORS availability will be affected.

Update, 5/21/2021 11am: We are working with the vendor to determine which troubleshooting path to take. There’s a possibility that we may need to take all storage sub-systems offline at some point, but we will communicate that and a timeline as soon as we know.

Update, 5/21/2021: One of the tasks in scope for the GPFS maintenance was upgrading the OS on the servers in order to keep them in step with supported GPFS versions (i.e., if the operating system becomes too outdated, we cannot update the GPFS software). Early this afternoon we updated two of the servers and noticed unexpected instability in the mounts for /data and /home. Further investigation showed that certain administrative commands and a few user commands could result in the compute nodes and gateways losing those mounts. Unfortunately, this event does not generate any error messages in the logs or terminal session, and it is unclear whether it is related to the OS updates. We are therefore extending the downtime through noon tomorrow while we gather information and work with the vendor to identify what is triggering this condition.

The Scheduled Downtime dates for the Spring have moved up from June 1-3 to May 18-20. This is to reduce the chances that the /data and /scratch storage sub-systems will experience major issues related to controller failure. By replacing the controllers that are reporting problems in each system, we will restore the hardware and connectivity redundancy needed for full functionality.

The other maintenance tasks that will be performed during this downtime include:

  • upgrade software on maxwell and pascal GPU nodes
  • configure dual-fiber connectivity to the group storage servers
  • relocate existing GPFS hardware in accordance with updated data center layout
  • upgrade LStore storage servers
  • upgrade Linux on custom gateways and group storage servers

This Scheduled Downtime will not include the DORS storage sub-system. DORS will continue to be available on those dates.

Also, we have posted information on how to map your ACCRE storage to your desktop/laptop for those who would like an easier way to move files between work environments.

In case you haven’t seen it yet, we have a status page for our systems available.

Also, our three required training classes are now available online and can be taken at any time.