Storage issues with /scratch and networked storage
At around 10am this morning we received alerts for the /scratch storage sub-system and subsequently for the networked storage sub-system. /scratch unmounted on 57 compute nodes and 20 GPU nodes as well as the gateways. An investigation of all three sub-systems (/data, /scratch, networked storage) showed that a few LUNs were unavailable due to three of the servers being in a bad state. Those have been cleared and all three storage sub-systems have been stabilized. We continue to keep a close eye on them and are looking into any performance issues.