Sai Medury joins ACCRE as Associate System Administrator
Nov. 10, 2021—Sai Medury joined ACCRE in November 2021. He helps configure, test, and troubleshoot systems at ACCRE. He also works closely with the Center for Structural Biology (CSB) and helps maintain CSB systems at ACCRE. He is currently completing his Ph.D. in Computational Science from the University of Tennessee at Chattanooga and has a Master of...
/scratch restored following outage this morning
Nov. 1, 2021—Update, 11/1/2021 3pm: The /scratch storage sub-system has been remounted across the cluster and the public gateways. There are a few custom gateways that will need to be rebooted and that will be coordinated with the respective groups. One of the components for the /scratch storage sub-system entered a bad state over the weekend. Our...
Storage issues: cluster available for normal use
Oct. 3, 2021—Update, 10/25/2021: The cluster will be available for normal use at 10:30am this morning. The system remained stable and error-free over the weekend. We were also able to catch up on tape backup operations. Update, 10/22/2021 1pm: Based on input we received from the vendor last night and comparing the available options, we...
New software stack “2020b” is available on the cluster
Sep. 27, 2021—The new software stack 2020b is available on the ACCRE cluster. It includes the following features: GCC version 10.2; new Intel compiler, MKL libraries, etc. released in 2020; R 4.0.5; Python 3.8.6 and Python 2.7.18; SciPy-bundle/2020.11 (which includes the newer version of numpy), etc. The 2020b software stack is built based on the GCC...
ACCRE Downtime scheduled for September 17-18 has been completed
Aug. 27, 2021—Update, 9/18: We have finished all the work within the scope of this scheduled downtime and successfully completed all the system tests. All ACCRE systems are now available. You can monitor system availability here; please report any odd behavior via our helpdesk. Our next scheduled downtime will exceptionally only last 2 days and fall on...
ACCRE welcomes Mark Keever, executive director of research IT for Vanderbilt
Aug. 4, 2021—Mark Keever is the Executive Director of Research IT for Vanderbilt University. Mark comes to Vanderbilt from Oregon State University, where he was the Director of Digital Research Infrastructure, and Co-PI on an NSF cyberinfrastructure datacenter. Prior experience from Georgia Tech contributes to his 20 years of research computing experience.
May/June 2021 storage issues resolved; /scratch and Slurm are back to normal use
Jun. 1, 2021—We have verified that /scratch is performing at the same level it was prior to the downtime. You may resume using it as you normally would, and you may submit jobs to Slurm without using the special AtRiskUnstableEnvironment reservation. Prior to the downtime /scratch was operating with only 1 of the 2 controllers in...
Some programs on “maxwell” and “pascal” GPU nodes may need to be upgraded following May downtime
May. 4, 2021—We are going to upgrade the underlying operating system on the GPU nodes during the next downtime, two weeks from now. During testing, we found that the OpenMPI associated with GCC 5.4 (OpenMPI/1.10.3) does not work with the newly installed driver, so if the system gets upgraded then the 1.10.3 version of the MPI...
Cluster access restored following downtime on May 18-20
May. 3, 2021—The May 2021 scheduled downtime has concluded. Updates on the subsequent storage issues have been moved to a separate post. Update, 5/21/2021 5pm: While we wait for the results of the vendor’s analysis of the system events, we are restoring job submissions to the cluster so they can run over the weekend. Jobs will be...
Info on /scratch outages from 4/14 and 4/23
Apr. 14, 2021—Update, 4/23 10pm: The /scratch storage subsystem is recovered. You may resume any affected jobs and report any issues via our helpdesk. Update, 4/23 7pm: The /scratch storage subsystem was unable to gracefully handle the controller failure and the remaining controller began losing connections to some of the LUNs. We are doing a clean shutdown...