====== 20240514 / slurmctl and array jobs ======

**slurmctl is back ONLINE**

TL;DR:

Our version of Slurm has a known bug: in a large array job, if two subtasks fail at the same time, one of them is left stuck in FAIL/REQUEUE state indefinitely. This //can// segfault the Slurm controller when it restarts (for example, during log rotation).

When that happens, things go sideways in the accounting database very fast: it took only 3 seconds to hang the database and segfault the controller.

Workaround:

  * on our part: a daily script to clean up the job states
  * on **YOUR** part: do not submit job arrays to multiple partitions; stick to a single one.
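
If you want to check a batch script before submitting it, something along these lines may help. It is only a rough sketch, not an official tool: it assumes the options are given as ''#SBATCH'' lines inside the script (options passed on the ''sbatch'' command line are not seen), and the function names and the partition names ''short'' and ''long'' are made up for the example.

```python
import re

# Matches "#SBATCH ..." directive lines and captures the options after the keyword.
_SBATCH = re.compile(r"^\s*#SBATCH\s+(.*)$")


def sbatch_options(script_text):
    """Collect the whitespace-separated tokens found on #SBATCH lines."""
    tokens = []
    for line in script_text.splitlines():
        match = _SBATCH.match(line)
        if match:
            tokens.extend(match.group(1).split())
    return tokens


def partitions(tokens):
    """Return the partitions requested with --partition=..., --partition ..., -p ... or -p<name>."""
    found = []
    for i, tok in enumerate(tokens):
        if tok.startswith("--partition="):
            value = tok.split("=", 1)[1]
        elif tok in ("-p", "--partition") and i + 1 < len(tokens):
            value = tokens[i + 1]
        elif tok.startswith("-p") and len(tok) > 2:
            value = tok[2:]
        else:
            continue
        # sbatch accepts a comma-separated list of partition names.
        found.extend(name for name in value.split(",") if name)
    return found


def is_array_job(tokens):
    """True if the script asks for a job array (--array or its short form -a)."""
    return any(tok.startswith("--array") or tok.startswith("-a") for tok in tokens)


def is_risky(script_text):
    """True for the pattern to avoid: a job array sent to more than one partition."""
    tokens = sbatch_options(script_text)
    return is_array_job(tokens) and len(set(partitions(tokens))) > 1


if __name__ == "__main__":
    # Hypothetical script: an array job submitted to two partitions at once.
    example = (
        "#!/bin/bash\n"
        "#SBATCH --array=0-99\n"
        "#SBATCH --partition=short,long\n"
        "srun ./task\n"
    )
    if is_risky(example):
        print("job array on multiple partitions: please pick a single partition")
```

If the check flags your script, keep the ''--array'' line and reduce ''--partition'' to a single name.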

{{tag> slurm}}