While doing a big software install on the Debian 11 master system image, I crashed it… (long story short: a vicious 'apt-get -y upgrade' was not commented out…)
All Debian 11 nodes are impacted; most have crashed or are unavailable.
I'll finish the software update before cleaning up my mess. Sorry.
UPDATE: upgrade done, cleanup done. Nodes OK. Slurm OK.
The Debian 11/Slurm upgrade was never planned as a one-day operation:
then summer holidays…
In the meantime:
Please test and prepare your Slurm scripts…
Be aware that home directories and group/team storage (/Xnfs) are shared between the systems.
Then, in September, we'll see (final migrations of the E5 and Lake clusters, scratch migrations).
You are doing it wrong (mostly).
DO NOT store scripts, SGE/Slurm logs, small files, source code, or binaries on scratches: it degrades general performance VERY fast, for everyone.
Scratches are meant for large temporary files and large I/O operations, WITHIN a job. That's all.
DO clean up!!! Every time a job finishes, its scratch space should be cleaned up (with an exception for long workflows).
DO VERIFY your cleanup operations!!
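As a minimal sketch of the rules above (not an official site recipe): a job can create its own private directory on scratch, do its large I/O there, remove the directory when it finishes, and then check that it is really gone. The helper name `run_in_scratch` and the `scratch_root` argument are hypothetical; pass whichever scratch mount your job actually uses.

```python
import os
import shutil
import tempfile


def run_in_scratch(scratch_root, work):
    """Run work(job_dir) in a private directory under scratch_root,
    then remove that directory and verify the removal.

    scratch_root is a placeholder: it stands for the scratch mount
    point assigned to your cluster, not a path taken from this page.
    """
    # Private per-job directory, so cleanup never touches anyone else's files.
    job_dir = tempfile.mkdtemp(prefix="job_", dir=scratch_root)
    try:
        return work(job_dir)
    finally:
        # Cleanup runs even if work() raised an exception.
        shutil.rmtree(job_dir, ignore_errors=True)
        # DO VERIFY: fail loudly if anything was left behind.
        if os.path.exists(job_dir):
            raise RuntimeError(f"scratch cleanup failed: {job_dir}")
```

Keeping everything under one per-job directory means the cleanup and the verification are each a single operation, instead of chasing individual files.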
General-purpose scratches (E5N/, Lake/) are full *again*. We will erase files older than 90 days next week (shooting blind).
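To see what you would lose before the purge, something like the following sketch can list your old files. It assumes "older than 90 days" refers to the last modification time (mtime); the actual criterion used by the purge is not stated here. The function name `files_older_than` is hypothetical.

```python
import os
import time


def files_older_than(root, days=90, now=None):
    """Yield paths of files under root not modified in the last `days` days.

    Assumption: age is measured by mtime; the real purge may use
    another timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat: do not follow symlinks out of the scratch tree.
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
```

Call it on your own directory under the scratch (for example `for p in files_older_than(my_dir): print(p)`, where `my_dir` is your own path) and copy anything you still need elsewhere.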
Queues E5-2670deb128A to D are now disabled and will be powered off permanently next week.
The sliding block puzzle is starting…
We will stop parts of the E5 cluster (older nodes), beginning the week of 2 to 6 May 2022.
The E5 scratch, visu nodes and 'newer' E5 nodes will stay on the deb9/SGE system until further notice.
An S92 chassis burned out its power supply unit, which tripped the main power unit.
S92node[01-04,09-12] went down, along with the jobs running on them…