Info on /scratch outages from 4/14 and 4/23
Update, 4/23 10pm: The /scratch storage subsystem is recovered. You may resume any affected jobs and report any issues via our help desk.
Update, 4/23 7pm: The /scratch storage subsystem was unable to gracefully handle the controller failure, and the remaining controller began losing connections to some of the LUNs. We are doing a clean shutdown of that system, which makes all of /scratch unavailable. We will begin its recovery as soon as we have checked all the components.
Update, 4/23 6pm: One of the controllers in the /scratch storage subsystem is reporting offline. We are investigating whether the second controller is handling the load and checking for any possible impacts to jobs.
Update, 4/14 3pm: The /scratch storage service is recovered and remounted across the cluster. One of the controllers was incorrectly reporting some of the disk arrays as offline. Please feel free to resume usage and report any issues you may encounter via our help desk.
We have received reports of issues with the /scratch storage service and are seeing them ourselves. The main symptom users will notice is a message indicating file corruption. This is caused by I/O failures when trying to read data from one of the logical unit numbers (LUNs) allocated to /scratch that shows as unavailable. Failure to connect to that LUN triggers the file corruption message.
We are working to stabilize that LUN and will likely need to unmount /scratch across the cluster to recover the system and prevent actual bad data from being written.