20240110 / network down, HPC down

Something went wrong, don't know what yet. On it.

EDIT: Master switch rebooted unexpectedly, killing all network connections between nodes and SIDUS-master ('/' on nodes) → general reboot (in progress) of all nodes, comp & visu.

All running jobs are lost.

EDIT2: expect some delay before everything is back to normal…

EDIT3: except for a few nodes, back to nominal.

EDIT4: WATCH YOUR JOBS! A large number of jobs have been “REQUEUED” by Slurm. This may result in “unexpected behaviors”.

newsfeed/20240110.txt · Last modified: 2024/01/10 14:12 by ltaulell