20240514 / slurmctl and array jobs

slurmctl is back ONLINE

TL;DR:

There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one of them is left stuck in FAIL/REQUEUE state indefinitely. This can segfault the Slurm controller at restart (when rotating logs, for example).

Things then go wrong in the accounting database very fast: it took only 3 seconds to hang the database and segfault the controller.

Workaround:

  • on our part: a daily script to clean up job states
  • on YOUR part: do not submit job arrays to multiple partitions, stick to a single one.
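
As an illustration of the second point, here is a minimal sketch of a pre-submission check for a batch script. It is not an official tool. It assumes the array and partition requests are written as #SBATCH directives (--array / -a and --partition / -p) inside the script, and it does not see options passed on the sbatch command line. The function names and the partition names "cpu" and "gpu" are hypothetical.

```python
import re

# Assumption: options appear on "#SBATCH" lines inside the script.
# Matches "--partition=a,b", "--partition a,b" and "-p a,b".
_PARTITION = re.compile(r"(?:^|\s)(?:--partition(?:=|\s+)|-p\s*)(\S+)")
# Matches "--array=0-99", "--array 0-99" and "-a 0-99".
_ARRAY = re.compile(r"(?:^|\s)(?:--array(?:=|\s+)|-a\s*)\S+")


def array_job_partitions(script_text):
    """Return the partitions requested by an array job script.

    Returns None when the script does not declare a job array.
    """
    is_array = False
    partitions = []
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        options = line[len("#SBATCH"):]
        if _ARRAY.search(options):
            is_array = True
        match = _PARTITION.search(options)
        if match:
            # A comma-separated value means several partitions were requested.
            partitions = [p for p in match.group(1).split(",") if p]
    return partitions if is_array else None


def is_safe(script_text):
    """True unless the script is a job array targeting more than one partition."""
    partitions = array_job_partitions(script_text)
    return partitions is None or len(partitions) <= 1


if __name__ == "__main__":
    # Hypothetical script: a job array submitted to two partitions.
    example = (
        "#!/bin/bash\n"
        "#SBATCH --array=0-99\n"
        "#SBATCH --partition=cpu,gpu\n"
        "srun ./task\n"
    )
    if not is_safe(example):
        print("job array requests several partitions:",
              ",".join(array_job_partitions(example)))
```

A script flagged this way can be fixed by keeping a single name in its partition directive.
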
newsfeed/20240514.txt · Last modified: 2024/05/14 14:03 by ltaulell