A large group of Cascade nodes went down (16h15 and later), probably due to a power spike (large jobs aren't good for the hardware).
The problem will be handled tomorrow, as no one is on site today.
EDIT 12/12/2023: 4 PSUs died, with cascading effects on both nodes and network. Back to nominal.
We are encountering strange network behaviors, and nodes are crashing one after another.
We might need to perform a global reboot of all nodes…
EDIT [09:50]
One of our main NFS servers (/applis/PSMN) had been stuck in a loop since yesterday evening, blocking all access to it.
Things should be back to normal (no global reboot \o/). Jobs may have been blocked, doing nothing all night.
Lake-flix and Cascade-flix are open to everybody for small parallel and sequential jobs of short duration (no longer than 2 days is best, but standard walltimes apply), with requeue in case of high-priority jobs (see documentation).
example:
#SBATCH --partition=Lake,Lake-flix # or #SBATCH --partition=Cascade,Cascade-flix
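As a rough sketch (the job name, resources, time limit and program below are placeholders, not PSMN recommendations), a short requeue-able job targeting these partitions could look like:

#!/bin/bash
#SBATCH --job-name=short_test        # hypothetical job name
#SBATCH --partition=Lake,Lake-flix   # or Cascade,Cascade-flix
#SBATCH --ntasks=1                   # small sequential or parallel job
#SBATCH --time=1-00:00:00            # keep it short: 2 days or less is best
#SBATCH --requeue                    # let Slurm requeue the job if a high-priority job preempts it

srun ./my_program                    # hypothetical executable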
The /Xnfs/abc volume will be moved to a new server on Thursday, 16th of November, in the morning.
Any nf (NextFlow) pipeline running at that time might need a restart if it crashed.
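If a NextFlow pipeline did crash during the move, it can usually be resumed from its cached work directory rather than rerun from scratch, for example (the pipeline script name is a placeholder):

nextflow run pipeline.nf -resume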
A disk on data8 (the main CRAL fileserver) was… not well. After a good hammer blow, all exports were restarted.
Homes and exports may have been unavailable for a few moments.