[Resolved] /scratch and /data partially offline
/scratch and /data are now fully available again, and we have successfully upgraded the cache in all of our GPFS storage appliances.
Please open a helpdesk ticket with us if you have any questions or notice anything unusual.
Update, 10/15/2018, 5pm:
We are still working to get parts of /scratch and /data back online, but unfortunately it looks like they will not be available until tomorrow at the earliest.
During the maintenance this morning, the old (non-upgraded) cache on one storage appliance was not correctly flushed (i.e. written to disk) before the system powered down. As a result, this appliance is not booting cleanly. We are working with the vendor to get the appliance back online as soon as possible, but have been informed that engineers will not be available to look into it in earnest until tonight.
We are very sorry for the inconvenience. If you have deadlines coming up, please open a helpdesk ticket with us and we will see what we can do.
Note that you are still free to use /scratch and /data, but some data may be unavailable (you may get I/O or bus errors on files you try to read).
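If you want to check whether specific input files are currently readable before a job depends on them, something like the following Python sketch may help. It is only an illustration, not an official tool: the function names and the chunk size are our own choices for this example. It assumes that an unavailable region of a file surfaces as an OSError (for example an I/O error) during an ordinary read. Bus errors typically arise when a file is memory-mapped and cannot be caught this way, so the sketch uses plain reads only.

```python
import sys

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice for this sketch


def is_fully_readable(path):
    """Try to read every byte of `path`.

    Returns (True, None) on success, or (False, error) if opening or
    reading the file raised an OSError.
    """
    try:
        with open(path, "rb") as handle:
            # Read until EOF; the data itself is discarded.
            while handle.read(CHUNK_SIZE):
                pass
    except OSError as exc:
        return False, exc
    return True, None


def find_unreadable(paths):
    """Return a list of (path, error) pairs for files that could not be read."""
    unreadable = []
    for path in paths:
        ok, err = is_fully_readable(path)
        if not ok:
            unreadable.append((path, err))
    return unreadable


if __name__ == "__main__":
    # Hypothetical usage: python check_files.py /scratch/<user>/input1 ...
    for path, err in find_unreadable(sys.argv[1:]):
        print(f"UNREADABLE {path}: {err}")
```

Because this reads each file in full, please run it only on the files you actually need rather than on entire directory trees, to avoid adding load to the file system while it is degraded.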
Update, 10/15/2018: We encountered a problem while performing the cache upgrades this morning, and as a result /scratch and /data remain partially offline. We will provide an update as soon as this is resolved.
This upcoming Monday (Oct 15), beginning at 6 AM, we will complete the cache upgrades on our GPFS storage appliances. We expect this maintenance to last between 1.5 and 2 hours. During this time, reads from /scratch or /data may fail. We are performing the maintenance early in the morning to minimize the impact as much as possible; however, any running jobs that are actively reading from /scratch or /data during the maintenance window may fail.
I/O on /home and /dors will not be affected. SLURM jobs will continue to run so long as they are not reading data from /scratch or /data.
Please let us know if you have any questions or concerns.