
[Resolved] SLURM Socket Timeout Errors

Posted on Monday, March 26, 2018 in Cluster Status Notice.

Updated 4/3, 2:44pm:

We will continue to monitor SLURM responsiveness closely, but for now it is very good. Please submit a ticket if you encounter any further problems. Thanks!


Updated 3/28, 1:35pm:

We are continuing to work on SLURM responsiveness, though it has improved since Sunday. We are in communication with the SLURM developers and hope to resolve the issue soon. Thanks for your patience!


Updated 3/26, 6:12pm:

SLURM has overall been more responsive today. We have identified a few potentially problematic workflows and are working with those users/groups to make appropriate changes.
As a reminder:
– Please avoid large groups (>300) of jobs that do not use job arrays (see the job array sketch after this list).
– Please avoid large groups of jobs that each run for less than 30 minutes; bundle multiple short-running tasks into a single job instead. We strongly prefer one 6-hour job to 360 one-minute jobs. Depending on how busy the cluster is and on your resource request, the single bundled job may actually finish sooner because it spends less total time waiting in the queue.
– If your group uses an automated pipeline for submitting and monitoring jobs (we are aware of roughly 6-8 groups doing this, but there are probably more), please be prudent about how often you request information from SLURM with commands like squeue and scontrol (see the polling sketch at the end of this update).
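As a concrete illustration of the first two points, here is a minimal sketch of a job array that bundles many short work units into a smaller number of longer array tasks. The script name, input file names, task counts, and resource requests below are hypothetical; adapt them to your own workflow.

```bash
#!/bin/bash
#SBATCH --job-name=bundled-array       # hypothetical job name
#SBATCH --array=0-35                   # 36 array tasks instead of 360 separate jobs
#SBATCH --time=00:30:00                # each task runs ~10 one-minute work units
#SBATCH --mem=2G
#SBATCH --output=logs/task_%A_%a.out   # %A = array job ID, %a = array task index

# Each array task handles a contiguous block of 10 inputs, so a single
# sbatch call replaces 360 short, separate jobs.
START=$(( SLURM_ARRAY_TASK_ID * 10 ))
END=$(( START + 9 ))

for i in $(seq "$START" "$END"); do
    # process_one.sh is a hypothetical script that does one unit of work
    ./process_one.sh "input_${i}.dat"
done
```

Submitted once with sbatch, this creates a single array job whose 36 tasks are far cheaper for the scheduler to track than 360 independent jobs.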
SLURM sluggishness is generally the cumulative effect of multiple users not following the best practices we outline above. If you are unsure, please reach out to us. And please also let us know if you have recommendations or suggestions. As I mentioned last night, we are in the process of moving SLURM to solid state drives, and we have a few other ideas we are considering internally.
There are roughly 900 unique researchers across VU and VUMC making use of ACCRE resources. Please be responsible and mindful that this is a shared resource and your decisions may impact other researchers.
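For pipelines that poll the scheduler (the third point in the list above), the key is to query only the specific job you care about and to leave several minutes between queries. The loop below is a minimal sketch, not an ACCRE-provided tool; the job ID argument and the five-minute interval are assumptions you should tune for your own pipeline.

```bash
#!/bin/bash
# Hypothetical monitoring loop: check one specific job every 5 minutes
# instead of polling the full queue every few seconds.
JOBID="$1"    # job ID recorded by the submitting pipeline (assumed)

# squeue -h suppresses the header; -j limits the query to a single job.
while squeue -h -j "$JOBID" 2>/dev/null | grep -q .; do
    sleep 300    # five minutes between squeue calls keeps the load on the controller low
done

echo "Job $JOBID is no longer in the queue."
```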

Original post:
For about the past week our job scheduler, SLURM, has been sluggish, and the sluggishness has been especially bad over the weekend. SLURM commands (e.g. squeue or sbatch) may time out with “socket timeout” errors or be very slow to complete. We are very sorry for the inconvenience and frustration this has caused.
This problem generally occurs when the job scheduler is overloaded by large batches of non-array jobs or by extremely short jobs (see: https://www.vanderbilt.edu/accre/support/faq/#a-slurm-command-fails-with-a-socket-timeout-message-whats-the-problem); however, as far as we can tell, neither of these factors appears to be playing a role in this case. We have made a few changes tonight in an attempt to improve SLURM responsiveness. If things have not improved by the morning, we will open a ticket with the SLURM developers for input and advice.
If you are actively running jobs, please make sure you review the link above carefully and make changes to your workflow if necessary. We are available to assist.
Note that jobs that are already running should not be impacted by these problems (unless you are invoking other SLURM commands like squeue or srun from within your SLURM job).
We have plans to move SLURM to dedicated solid state drives in the very near future, which we expect to improve responsiveness and greatly reduce the occurrence of socket timeout errors.
