[Resolved] SLURM scheduler is back online following outage
Update, 3/5/2019: The scheduler is now operational. The impact on the cluster queue has been minimal. We are investigating the exact cause of the stuck jobs in order to prevent this from happening again. Thank you for your patience.
We are currently experiencing a SLURM overload caused by difficulties terminating processes belonging to roughly 1,500 jobs on more than 400 compute nodes. Jobs already in the queue are being scheduled as usual, but until these processes are cleared you will not be able to query the scheduler or submit new jobs.
We are working to resolve this issue as soon as possible. We apologize for the inconvenience.