Cluster Status Notices
ACCRE cluster and visualization portal are back online
Jan. 13, 2020—Update, 1/21/2020 10am: The cluster and visualization portal are back online. Further updates will be posted here. Update, 1/20/2020 4pm: Tomorrow we will start providing limited cluster availability. The current bottleneck is determining which files in /scratch have missing blocks. This process requires intensive scanning of all metadata and places a heavy burden on the old...
Shared storage is now fully online
Dec. 9, 2019—Update, 12/12/2019: We have restored full access to the shared storage system after working with our vendors to identify the problem. Users whose files were on the 5% of storage that was unavailable should now be able to access all their files without issue. While the service has been restored, there are still some...
DORS offline through the weekend for maintenance; some processes and nodes on the cluster were affected by DORS unmount
Nov. 8, 2019—Update, 11/11/2019: DORS is now back online. The filesystem check finished and the filesystem has been repaired. We learned that 8 files were lost during the problematic disk rebuild. We are running a process now that should identify these files so they can be restored from backup. This process can be run while...
Custom/private gateway work is complete
May 1, 2019—Update, 5/10: The work is now complete. On Friday, May 10, beginning at 8:30 AM, we will be taking custom/private gateways offline in order to reboot and upgrade the operating system to CentOS 7.6 from 7.4 and the GPFS filesystem to 5.0.2 from 4.2.3. We expect these upgrades to last roughly 1-2 hours. If this time...
GPFS outage resolved; check output of running jobs
Apr. 19, 2019—Around 2 PM today a GPFS manager node had an issue that caused the GPFS filesystem to hang across the cluster, making logins and file access unresponsive. The issue was corrected at 3 PM today and all compute nodes appear to have recovered. Please check the output of running jobs just to be safe.
Scheduled maintenance on public gateways complete
Apr. 18, 2019—Original post: This Saturday morning, April 20th, we will be taking the public gateway and portal servers offline from 7 am to 9 am in order to reboot and upgrade the operating system to CentOS 7.6 from 7.4 and the GPFS filesystem to 5.0.2 from 4.2.3. Updating the entire cluster to GPFS 5 is an...
ACCRE networking problems fixed; note the rules of thumb for reading or writing data on the cluster
Apr. 9, 2019—Update, 4/10/2019: Early this morning we applied some changes that appear to have resolved the network stability issues we were having yesterday. Feel free to resume normal activities on the cluster. We apologize for the interruption! On a related note, we have been observing intermittent sluggishness on /scratch and /data over the last several weeks....
[Resolved] Visualization portal maintenance Saturday morning is now complete
Mar. 14, 2019—Update, March 16: This maintenance is now complete. The ACCRE Visualization Portal will go down for scheduled maintenance on Saturday, March 16th, from 6 AM to 10 AM. This will only affect web access through the Visualization Portal, so users may still run jobs on the cluster and login through the gateway nodes via SSH....
[Resolved] SLURM scheduler is back online following outage
Mar. 5, 2019—Update, 3/5/2019: The scheduler is now operational. The impact on the cluster queue has been minimal. We are investigating the exact cause of the stuck jobs in order to prevent this from happening again. Thank you for your patience. We are currently experiencing a SLURM overload caused by issues in killing processes related...
[Resolved] /scratch and /data are back online following weekend maintenance
Jan. 24, 2019—Update, 2/12/2019: /scratch and /data are back online and we are now accepting new jobs. We were never able to get the maintenance command to run successfully, but we were able to verify (with IBM’s assistance) the integrity of /scratch and /data, which is great news and means we will not need to take another...