[Resolved] SLURM Socket Timeout Errors
Updated 4/3, 2:44pm:
We will continue to monitor SLURM responsiveness closely, but for now it is very good. Please submit a ticket if you encounter any further problems. Thanks!
Updated 3/28, 1:35pm:
We are continuing to work on SLURM responsiveness, though it has improved since Sunday. We are in communication with the makers of SLURM and hope to resolve it soon. Thanks for your patience!
Updated 3/26, 6:12pm:
SLURM has overall been more responsive today. We have identified a few potentially problematic workflows and are working with those users/groups to make appropriate changes.
As a reminder:
– Please avoid large groups (>300) of jobs that do not use job arrays.
– Please avoid large groups of jobs that each run for less than 30 minutes. Instead, bundle multiple short-running jobs into a single job. We strongly prefer one 6-hour job over 360 one-minute jobs. Depending on how busy the cluster is and on your resource request, the single bundled job may actually finish sooner because it spends less time waiting in the queue.
– If your group uses an automated pipeline for submitting and monitoring jobs (we are aware of roughly 6-8 groups doing this, but there are probably more), please be prudent about how often you request information from SLURM with commands like squeue and scontrol.
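As an illustration of the bundling advice above, here is a minimal Python sketch that groups short tasks into batches sized to fill roughly one 6-hour job. The function name bundle_tasks, the script process_sample.sh, and the one-minute task duration are hypothetical placeholders, not part of any ACCRE tooling; adapt them to your own workflow.

```python
def bundle_tasks(tasks, task_minutes, target_job_minutes=360):
    """Group short tasks into batches that each fill roughly one job.

    task_minutes is your own estimate of how long a single task runs;
    target_job_minutes defaults to 6 hours, matching the example above.
    """
    if task_minutes <= 0 or target_job_minutes <= 0:
        raise ValueError("durations must be positive")
    # Number of tasks that fit in one job; always at least one.
    per_job = max(1, int(target_job_minutes // task_minutes))
    return [tasks[i:i + per_job] for i in range(0, len(tasks), per_job)]


# Hypothetical workload: 360 commands of about one minute each.
tasks = [f"./process_sample.sh {i}" for i in range(360)]
batches = bundle_tasks(tasks, task_minutes=1)
print(len(batches), "job(s),", len(batches[0]), "tasks in the first")
```

With more tasks than fit in one job, each batch could become one element of a job array (for example, sbatch --array=0-2 for three batches), with the job script picking its batch from the SLURM_ARRAY_TASK_ID environment variable and running the commands in that batch one after another.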
SLURM sluggishness is generally the cumulative effect of multiple users not following the best practices we outline above. If you are unsure, please reach out to us. And please also let us know if you have recommendations or suggestions. As I mentioned last night, we are in the process of moving SLURM to solid state drives, and we have a few other ideas we are considering internally.
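For groups running automated pipelines, one way to limit how often SLURM is queried is to cache the result of a status command and reuse it until a minimum interval has passed. The sketch below is only an assumption about how a pipeline might do this: the class name ThrottledQuery, the 300-second interval, and the username myuser are hypothetical, and the interval is not an ACCRE policy. The demonstration uses a stand-in function instead of a real squeue call so that it runs anywhere.

```python
import subprocess
import time


class ThrottledQuery:
    """Cache a command's output and rerun it at most once per interval."""

    def __init__(self, command, min_interval_s=300.0, runner=None,
                 clock=time.monotonic):
        self.command = command
        self.min_interval_s = min_interval_s
        self._runner = runner or self._run
        self._clock = clock
        self._last_time = None
        self._last_output = None

    @staticmethod
    def _run(command):
        # Default runner: execute the command and return its standard output.
        result = subprocess.run(command, capture_output=True, text=True,
                                check=True)
        return result.stdout

    def get(self):
        """Return cached output, refreshing only if the interval has elapsed."""
        now = self._clock()
        if (self._last_time is None
                or now - self._last_time >= self.min_interval_s):
            self._last_output = self._runner(self.command)
            self._last_time = now
        return self._last_output


# Demonstration with a stand-in runner so the sketch runs without SLURM.
calls = []


def fake_squeue(command):
    calls.append(command)
    return "stand-in squeue output"


query = ThrottledQuery(["squeue", "-u", "myuser"], min_interval_s=300.0,
                       runner=fake_squeue)
for _ in range(10):
    query.get()
print(len(calls))  # 1: ten requests, but the command ran only once
```

In a real pipeline you would omit the runner argument so that the command is actually executed, and have every part of the pipeline share one ThrottledQuery object rather than each calling squeue on its own.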
There are roughly 900 unique researchers across VU and VUMC making use of ACCRE resources. Please be responsible and mindful that this is a shared resource and your decisions may impact other researchers.
For about the past week, our job scheduler, SLURM, has been sluggish. The sluggishness was especially bad over the weekend. SLURM commands (e.g. squeue or sbatch) may often time out with “socket timeout” errors or be very slow to complete. We are very sorry for the inconvenience and frustration this has caused.
If you are actively running jobs, please make sure you review the link above carefully and make changes to your workflow if necessary. We are available to assist.
Note that jobs that are already running should not be impacted by these problems (unless you are invoking other SLURM commands like squeue or srun from within your SLURM job).
We have plans to move SLURM to dedicated solid state drives in the very near future, which we expect to improve responsiveness and greatly reduce the occurrence of socket timeout errors.