Kernel issue results in paging
Over the last few weeks we have been investigating incidents reported by several groups regarding compute node sluggishness. This was a particularly challenging analysis since the symptoms were not easily correlated.
We have determined the root cause to be Linux kernel behavior, and we can now reliably reproduce it using scheduled jobs. When a SLURM job approaches its maximum memory limit, the kernel starts paging data from that process to disk, bringing all compute activity on the node to a crawl, including other users' jobs on the same node. A quick summary can be found here. The node behaves this way despite still having plenty of RAM available, and it continues to accept new jobs until it becomes so slow that SLURM removes it from service.
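If you suspect a node you are on has entered this state, one quick check (a sketch, assuming a standard Linux `/proc` interface) is to look at the kernel's cumulative swap counters; values that keep climbing while jobs are running indicate the paging behavior described above:

```shell
# pswpin / pswpout in /proc/vmstat count pages swapped in/out since boot.
# Run this twice a few seconds apart: growing numbers mean active paging.
grep -E '^pswp(in|out)' /proc/vmstat
```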
Our team has identified a couple of ways to mitigate this, and we plan to apply them during the next scheduled downtime. In the meantime, here are three things you can do to reduce the chance of getting caught in this state:
- Add `smemwatch -k 95 -d 50 $$ &` as the first shell command in your script so that your job will internally halt with exit code 9 if it consumes over 95% of its memory allocation
- If your job is large enough, configure it to take up an entire compute node
- Increase your RAM request by 10-15%
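Putting the first suggestion into practice, a job script might look like the following sketch. The `#SBATCH` values and `./my_program` are placeholders for your own settings and workload; the `smemwatch` flags are those recommended above:

```shell
#!/bin/bash
#SBATCH --job-name=example   # placeholder job name
#SBATCH --mem=9G             # placeholder RAM request, padded ~10-15% over measured need
#SBATCH --time=01:00:00      # placeholder walltime

# Start the memory watchdog first, in the background, watching this
# shell ($$). With the flags above it halts the job (exit code 9)
# if it consumes over 95% of its memory allocation.
smemwatch -k 95 -d 50 $$ &

srun ./my_program            # placeholder for your actual workload
```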