Cluster access restored following downtime on May 18-20
May 3, 2021—The May 2021 scheduled downtime has concluded. Updates on the subsequent storage issues have been moved to a separate post. Update, 5/21/2021 5pm: While we wait for the results of the vendor’s analysis of the system events, we are restoring job submissions to the cluster so they can run over the weekend. Jobs will be...
Info on /scratch outages from 4/14 and 4/23
Apr. 14, 2021—Update, 4/23 10pm: The /scratch storage subsystem is recovered. You may resume any affected jobs and report any issues via our helpdesk. Update, 4/23 7pm: The /scratch storage subsystem was unable to gracefully handle the controller failure and the remaining controller began losing connections to some of the LUNs. We are doing a clean shutdown...
Ask ACCRE: How do I map my home directory on ACCRE as a network drive on my computer?
Apr. 12, 2021—By mapping ACCRE as a networked drive, we can move files from our computer to ACCRE, and from ACCRE to our computer, as if it were a drive on our computer. Here are instructions for mapping ACCRE as a networked drive on Windows and macOS. Mapping ACCRE to a computer running Windows We will be...
Improvements to the ACCRE onboarding process, including training classes offered on demand
Mar. 18, 2021—Starting today, our ACCRE training classes for new users will be offered in a new online, on-demand format hosted on RedCap. This applies to all three classes – Intro to Unix, Intro to the Cluster and Intro to Slurm – and will replace the three-day series of classes we have held in the...
/data and /home interruption, MATLAB jobs interrupted
Feb. 23, 2021—Update, 2/24: In a separate issue, we have updated the license file for MATLAB, and a couple of jobs have probably died in the process. Please check the output of your MATLAB jobs and try again if this was the case. The cluster experienced a momentary loss in connectivity to /home and /data at approximately 4:07pm...
Networked storage refusing to connect
Feb. 8, 2021—Further updates will be added to the downtime announcement. Update, 2/9 1pm: In order to avoid having two general downtimes in quick succession, ACCRE will begin both the GPFS fix and the planned maintenance items at noon tomorrow, Feb 10th. Please plan accordingly and open a helpdesk ticket if you need any assistance...
ACCRE networked storage connectivity
Dec. 11, 2020—Update, 12/12: We were able to identify the source of the problem that was preventing the export services from operating normally. An update was applied just after noon today, and that has cleared the errors that were causing the NFS service to crash. The workstations of one of the groups that use the service got into...
February 2021 maintenance: All ACCRE systems are back online
Dec. 11, 2020—Update, 2/12: We have finished all the work in scope for the planned 3-day downtime and restored the network storage service. All our production and ACCREx systems are back online. New hardware for /scratch has been installed and full network redundancy between the rooms in the data center has been reestablished. We did not shut down...
Storage issues with /scratch and networked storage
Nov. 20, 2020—At around 10am this morning we received alerts for the /scratch storage sub-system and subsequently for the networked storage sub-system. /scratch unmounted on 57 compute nodes and 20 GPU nodes as well as the gateways. An investigation of all three sub-systems (/data, /scratch, networked storage) showed that a few LUNs were unavailable due to three...
[Resolved] Downtime for /scratch storage on Wednesday 11/4 from 6am to 2pm CT
Oct. 27, 2020—Update, 11/4 2pm: The /scratch storage was successfully repaired during the specially scheduled downtime this morning. Please feel free to resume activities that use /scratch and report any issues you may find. One of the GPFS disk groups of the /scratch storage had a drive failure, which is not unusual. We replaced the failed drive...