Planned downtime scheduled for Aug 16-18th
Jul. 28, 2022—Our next scheduled downtime will start in the early morning on August 16th and finish late on August 18th. The main tasks in scope for this downtime include: replacing key network equipment (spine and possibly external); replacing a failed controller in DORS; software upgrades; and changes to network services. We expect all systems to be unavailable...
Updated – new issue with /scratch (June 20th)
Jun. 16, 2022—One of the automated tasks the storage hardware performs to correct drive issues did not terminate properly and put /scratch in a bad state. This had a domino effect and triggered additional issues as it tried to recover. We have taken /scratch offline again in order to clear the bad state of one of the...
MATLAB license update on May 26th; the MATLAB license program will be restarted during the update
May. 25, 2022—We are going to update the license for MATLAB on the license server. This renewal is performed annually. As part of this process we will restart the license daemon, which may cause MATLAB jobs to fail.
Slurm maintenance from 1PM to 3PM on April 28th
Apr. 27, 2022—ACCRE recently experienced an issue with Slurm, so we plan urgent Slurm maintenance tomorrow, April 28th, between 1PM and 3PM. During this window we will bring down the Slurm controllers and the slurmd services to perform the maintenance. We expect that the maintenance will not affect submitted jobs, but users will experience error...
MATLAB license file to be updated; some jobs may fail
Jan. 28, 2022—We are going to install MATLAB 2021a and 2021b on ACCRE, which will require us to update the license file on the license server. As part of this process we will restart the license daemon, which may cause some MATLAB jobs to fail.
ACCRE systems back online early from scheduled downtime
Jan. 6, 2022—Update, 1/12: Our systems are back online and ready for use. Please open a support ticket if you notice any issues. We will be taking our systems offline for our scheduled downtime on January 11-13th. This will include all gateways, all storage systems, and all compute clusters. The scope of the work includes: firmware updates...
I/O delays affecting jobs resolved; Visualization Portal back online
Dec. 10, 2021—Update, 12/13/2021: These issues have been resolved. Please open up a helpdesk ticket if you run into any issues. Update, 12/10/2021: The Visualization Portal is being taken offline for emergency preventive security updates. You can check the status of the service here. We have received reports of compute jobs failing due to I/O delays during...
Sai Medury joins ACCRE as Associate System Administrator
Nov. 10, 2021—Sai Medury joined ACCRE in Nov 2021. He helps configure, test, and troubleshoot systems at ACCRE. He also works closely with the Centre for Structural Biology (CSB) and helps maintain CSB systems at ACCRE. He is currently completing his Ph.D. in Computational Science from the University of Tennessee at Chattanooga and has a Master of...
/scratch restored following outage this morning
Nov. 1, 2021—Update, 11/1/2021 3pm: The /scratch storage sub-system has been remounted across the cluster and the public gateways. There are a few custom gateways that will need to be rebooted and that will be coordinated with the respective groups. One of the components for the /scratch storage sub-system entered a bad state over the weekend. Our...
Storage issues: cluster available for normal use
Oct. 3, 2021—Update, 10/25/2021: The cluster will be available for normal use at 10:30am this morning. The system remained stable and error free over the weekend. We were also able to catch up on tape backup operations. Update, 10/22/2021 1pm: Based on input we received from the vendor last night and comparing the available options, we...