There's a ghost network glitch that makes compute nodes unavailable.
“connection reset by peer”, “launch failed requeued held”, “JobHeldAdmin” → all the same cause: the node has gone rogue.
E5, Lake and Cascade partitions are in “DRAIN” mode, awaiting a general reboot of compute nodes.
E5-GPU, Epyc and Cascade are already OK, as are the login nodes and visualization nodes.
Stay tuned to this newsfeed.
EDIT: never mind, Cascade needs a reboot too…