====== 20240514 / slurmctl and array jobs ======

**slurmctl is back ONLINE**

TL;DR: There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one of them is left stuck in FAIL/REQUEUE state indefinitely. This //can// segfault the Slurm controller when it restarts (during log rotation, for example). Things then go sideways in the accounting database very fast: it took only 3 seconds to hang the database and segfault the controller.

Workaround:
  * on our part: a daily script that cleans up the job states
  * on **YOUR** part: do not submit job arrays to multiple partitions, stick to a single one.

{{tag> slurm}}