slurmctld is back ONLINE
TL;DR:
There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one will be left stuck in FAIL/REQUEUE mode indefinitely. This can segfault the Slurm controller at restart (when rotating logs, for example).
Things then go sideways in the accounting database very fast: it took only 3 seconds to hang the database and segfault the controller.
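For anyone who wants to check the queue for array tasks in this kind of stuck state, here is a rough Python sketch. It assumes that "FAIL/REQUEUE mode" corresponds to one of squeue's requeue-related state names (`REQUEUED`, `REQUEUE_HOLD`, `REQUEUE_FED`, `SPECIAL_EXIT`); that mapping is a guess, not something confirmed for this bug. The function names are made up for illustration.

```python
import subprocess

# Assumption: these squeue state names approximate the "FAIL/REQUEUE"
# condition described above. Adjust the set to whatever state the
# stuck task actually reports on your cluster.
SUSPECT_STATES = {"REQUEUED", "REQUEUE_HOLD", "REQUEUE_FED", "SPECIAL_EXIT"}


def find_suspect_tasks(squeue_output):
    """Return (job_id, state) pairs for array tasks in a suspect state.

    Expects one "<job_id> <state>" pair per line, as produced by
    `squeue -h -r -o "%i %T"`.
    """
    suspects = []
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        job_id, state = parts
        # With -r, array tasks are listed as <jobid>_<taskid>.
        if "_" in job_id and state in SUSPECT_STATES:
            suspects.append((job_id, state))
    return suspects


def query_squeue():
    """Run squeue with one line per array task: job ID and long-form state."""
    result = subprocess.run(
        ["squeue", "-h", "-r", "-o", "%i %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a login node, `find_suspect_tasks(query_squeue())` would return the list of candidate tasks. This only reports what it finds and does not touch the jobs or the controller.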
Workaround: