An InfiniBand switch died this morning (mostly affecting E5-GPU), cutting off scratch filesystems and MPI connections. It has been replaced.
REMINDER:
These modifications might take the entire week (due to draining jobs).
Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium later (probably during the mandatory October 2024 shutdown).
Documentation has been updated and reflects the future state.
E5 is in DRAIN mode, to be modified
E5 wall time will be reduced to 4 hours
E5-short, with 30 minutes walltime
E5-long, with 8 days walltime
These modifications might take the entire week (due to draining jobs).
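As an illustration of the limits listed above, here is a minimal Python sketch that picks the shortest E5 partition whose walltime limit covers a job. The three limits come from this announcement; the function name `pick_e5_partition`, the `E5_LIMITS` table, and the "shortest partition that fits" rule are assumptions made for the example, not site policy.

```python
from datetime import timedelta

# Walltime limits as stated in the announcement. Ordering the table from
# shortest to longest limit is an assumption made for this sketch.
E5_LIMITS = [
    ("E5-short", timedelta(minutes=30)),
    ("E5", timedelta(hours=4)),
    ("E5-long", timedelta(days=8)),
]


def pick_e5_partition(walltime):
    """Return the first E5 partition whose limit covers `walltime`.

    Hypothetical helper: it only encodes the announced limits and does
    not query the scheduler.
    """
    for name, limit in E5_LIMITS:
        if walltime <= limit:
            return name
    raise ValueError("requested walltime exceeds the 8-day E5-long limit")
```

For example, `pick_e5_partition(timedelta(hours=2))` selects `E5`, while a job asking for three days would land in `E5-long`.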
E5, Lake, Lake-bigmem, Epyc, Cascade.
E5 wall time will be reduced to 4 hours
E5-short, with 30 minutes walltime
Reminder: access to Lake-bigmem is subject to authorization; use our forms to request it.
slurmctl is back ONLINE
TL;DR:
There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one of them is left stuck in FAIL/REQUEUE mode indefinitely. This can segfault the Slurm controller at restart (when rotating logs, for example).
Things then go sideways in the accounting database very fast (it took only 3 seconds to hang the database and segfault the controller).
Workaround:
So, following an “electronic shower” of some sort, a network device decided to stop working last night, cutting access to the SSH gateways. It has been handled, with hammer force.
Database badly corrupted by a job array; manual cleaning in progress.