Storage issues with /scratch and networked storage
At around 10am this morning, we received alerts for the /scratch storage sub-system
and subsequently for the networked storage sub-system.
/scratch unmounted on 57 compute nodes and 20 GPU nodes, as well as on the gateways.
An investigation of all three sub-systems (/scratch, networked storage) showed that a few LUNs were unavailable because three of the servers were in a bad state. Those have been cleared and all three storage sub-systems have been stabilized. We continue to keep a close eye on them and are looking into any performance issues.