February 2021 maintenance: All ACCRE systems are back online
Update, 2/12: We have finished all the work in scope for the planned 3-day downtime and restored the network storage service. All our production and ACCREx systems are back online. New hardware for /scratch has been installed, and full network redundancy between the rooms in the data center has been reestablished.
We did not shut down Slurm, so all jobs submitted prior to the downtime were preserved in the queue, which has resumed scheduling.
DORS users: you may need to reload your network mappings or reboot your device to reconnect to the storage system.
To avoid having two general downtimes in quick succession, ACCRE started both the GPFS fix and the planned maintenance items at noon on Wednesday, Feb 10th.
After weighing the available options, we determined that this is the quickest path to resolving the system issues. Since Monday, we have seen the storage problems begin to affect other critical systems, resulting in a steady degradation of compute capacity and user experience.
This will impact all gateways, storage systems, and compute. The four main work items in scope are:
- Hardware updates for /scratch storage
- Finish repairing the intra-datacenter connection that failed in September
- Update servers to CentOS 7.9 and build new Lmod
- Fix this week’s degradation to cluster performance related to the networked storage services (aka DORS)
Please plan accordingly and contact us via our helpdesk if you need any assistance. Additional reminders will be sent out prior to the downtime.
Updated to reflect rescheduling. Last update February 12.