Have you seen the PRACE Training Portal yet? Check it out!
Upcoming and current training events by PRACE Advanced Training Centres
An InfiniBand switch died this morning (mostly affecting E5-GPU), cutting off scratch filesystems and MPI connections. It has been replaced.
REMINDER:
These modifications might take the entire week (due to draining jobs).
Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium later (probably during the mandatory October 2024 shutdown).
Documentation has been updated and reflects the future state.
E5 is in DRAIN mode, to be modified
E5 wall time will be reduced to 4 hours
E5-short, with 30 minutes walltime
E5-long, with 8 days walltime
These modifications might take the entire week (due to draining jobs).
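As a rough illustration of how the three E5 wall-time limits announced here (30 minutes, 4 hours, 8 days) relate to each other, the sketch below picks the smallest E5 partition that fits a requested wall time. The function names `pick_e5_partition` and `slurm_time` are hypothetical helpers, not part of any cluster tooling, and the sketch assumes the limits end up exactly as announced.

```python
from datetime import timedelta

# Limits copied from the announcement; ordering (shortest first) is our choice.
E5_LIMITS = [
    ("E5-short", timedelta(minutes=30)),
    ("E5", timedelta(hours=4)),
    ("E5-long", timedelta(days=8)),
]


def pick_e5_partition(walltime: timedelta) -> str:
    """Return the first (shortest) E5 partition whose limit covers walltime."""
    if walltime <= timedelta(0):
        raise ValueError("wall time must be positive")
    for name, limit in E5_LIMITS:
        if walltime <= limit:
            return name
    raise ValueError(f"{walltime} exceeds the longest E5 limit")


def slurm_time(walltime: timedelta) -> str:
    """Format a duration as D-HH:MM:SS, a form accepted by sbatch --time."""
    total = int(walltime.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    wanted = timedelta(hours=2)
    # Prints the two sbatch options one would put in a job script header.
    print(f"#SBATCH --partition={pick_e5_partition(wanted)}")
    print(f"#SBATCH --time={slurm_time(wanted)}")
```

Requesting the shortest partition that fits is only one possible policy; check the updated documentation for the actual limits before relying on it.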
E5, Lake, Lake-bigmem, Epyc, Cascade.
E5 wall time will be reduced to 4 hours
E5-short, with 30 minutes walltime
Reminder: access to Lake-bigmem is subject to authorization; use our forms to request it.
slurmctl is back ONLINE
TL;DR:
There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one is left stuck in FAIL/REQUEUE mode indefinitely. This can segfault the Slurm controller at restart (for example, when rotating logs).
And things go sideways in the accounting database very fast (it took only 3 seconds to hang the database and segfault the controller).
Workaround:
So, following an “electronic shower” of some sort, a network device decided to stop last night, cutting access to ssh gateways. It has been handled, with hammer force.
Database badly corrupted by a job array; manual cleaning in progress.