Oddly enough, the network problems seem to be gone.
“BOFH excuse #51: Cosmic ray particles crashed through the hard disk platter network system.”
Everything seems to be OK, except for a few nodes that are still running jobs but are unreachable (and therefore in DRAINING mode).
Stay tuned to this newsfeed.
There's a ghost network glitch that makes compute nodes unavailable.
“connection reset by peer”, “launch failed requeued held”, “JobHeldAdmin” → all the same thing: the node has gone rogue.
E5, Lake and Cascade partitions are in “DRAIN” mode, awaiting a general reboot of compute nodes.
E5-GPU, Epyc and Cascade are already OK, as are the login and visualization nodes.
Stay tuned to this newsfeed.
EDIT: never mind, Cascade needs a reboot too…
rm ~/.ssh/known_hosts
ssh-keygen -f ~/.ssh/known_hosts -R <hostname>
EDIT: inline
rm ~/.ssh/known_hosts
ssh-keygen -R <hostname> -f ~/.ssh/known_hosts
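The two commands above are alternatives: the first deletes every saved host key, while the second removes only the entries for one host. As a minimal sketch, assuming the aim is to clear a single stale host key, the targeted form can be wrapped in a small helper. The function name `forget_host` and its optional second argument are hypothetical and not part of the original post.

```shell
# Hypothetical helper (not from the original post): remove the saved key(s)
# for one host instead of deleting the whole known_hosts file.
forget_host() {
    host="$1"
    # Optional second argument: path to the known_hosts file to edit.
    known_hosts="${2:-$HOME/.ssh/known_hosts}"

    # Nothing to clean up if the file does not exist.
    [ -f "$known_hosts" ] || return 0

    # -R removes all keys belonging to the given host, -f selects the file.
    ssh-keygen -R "$host" -f "$known_hosts"
}
```

Usage would look like `forget_host <hostname>`. The next `ssh` connection to that host then asks to confirm its new key, while the keys saved for other hosts stay in place.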
PSMN boot will take some time
We are upgrading all backend servers ($HOME, /Xnfs, services), and it's taking soooo long…
We MUST wait for all $HOME and shared volumes (/Xnfs) to be OK before starting login machines.
Some clusters might start up without their scratch filesystems at first…
EDIT 17:30: “the thing about troubles is that they fly in squadrons…”
The schedule is delayed…