February 2021 maintenance: All ACCRE systems are back online

Posted on Friday, December 11, 2020 in Website.

Update, 2/12: We have finished all the work in scope for the planned 3-day downtime and restored the network storage service. All our production and ACCREx systems are back online. New hardware for /scratch has been installed and full network redundancy between the rooms in the data center has been reestablished.

We did not shut down Slurm, so all jobs submitted prior to the downtime were preserved in the queue, which has resumed scheduling.

DORS users: you may need to reload your network mappings or reboot your device to reconnect to the storage system.

To avoid having two general downtimes in quick succession, ACCRE started both the GPFS fix and the planned maintenance items at noon on Wednesday, Feb 10th.

Having weighed the available options, we determined that this is the quickest path to resolving the system issues. Since Monday, we have seen the storage problems begin to affect other critical systems, resulting in a steady degradation of compute capacity and user experience.

This will impact all gateways, storage systems, and compute. The four main work items in scope are:

Please plan accordingly and contact us via our helpdesk if you need any assistance. Additional reminders will be sent out prior to the downtime.

Updated to reflect rescheduling. Last update February 12.