
News feed

20220511 / Debian11 crashed

While doing a large software install on the Debian 11 master system image, I crashed it… (long story short: a vicious 'apt-get -y upgrade' had not been commented out…)

All Debian 11 nodes are affected; most have crashed or are unavailable.

I'll finish the software update before cleaning up my mess. Sorry.

UPDATE: upgrade done, cleanup done. Nodes OK. Slurm OK.

2022/05/11 16:02 · ltaulell

20220506 / Migration News, scratches

  • deb9-deb11 migration: Plans are made to be changed…

The Debian 11/Slurm upgrade was never planned as a one-day operation:

  • the E5 cluster will be shut down, piece by piece, to make room for the Cascade extension (starting yesterday)
  • the Cascade extension will be powered up gradually (mostly during June 2022)

Then, summer holidays…

In the meantime:

  • E5 and Lake test nodes *are available* for testing and migration purposes
    • E5: c82gluster1 is the login node (for now)
    • Lake: c6420node171 is the login node (for now)
    • Cascade: s92node01 is the login node.

Please do test and prepare your Slurm scripts…
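Since jobs are moving from SGE to Slurm, a common first step is translating the `#$` directives at the top of job scripts into `#SBATCH` lines. Below is a minimal Python sketch covering only a few common options. The mapping is an assumption (in particular, treating an SGE queue as a Slurm partition), the names `translate_directive` and `translate_script` are hypothetical helpers, and queue/partition names on the new clusters may differ, so review every generated line by hand.

```python
import re

# Partial, assumed mapping from SGE options to sbatch long options.
# Only single-argument options are handled; everything else is flagged.
SGE_TO_SLURM = {
    "-N": "--job-name",
    "-o": "--output",
    "-e": "--error",
    "-q": "--partition",  # assumption: an SGE queue maps to a Slurm partition
}

# "#$ -opt value": captures the option and the rest of the line
DIRECTIVE = re.compile(r"#\$\s+(-\S+)\s*(.*)$")


def translate_directive(line: str) -> str:
    """Translate one SGE '#$' directive into an '#SBATCH' line."""
    match = DIRECTIVE.match(line.strip())
    if match is None:
        return line  # not an SGE directive: keep as-is
    option, value = match.group(1), match.group(2).strip()
    if option == "-cwd":
        # sbatch already starts the job in the submission directory
        return "# (-cwd dropped: default behaviour of sbatch)"
    if option in SGE_TO_SLURM and value:
        return f"#SBATCH {SGE_TO_SLURM[option]}={value}"
    # e.g. -pe, -l, -V: no safe automatic translation here
    return f"# TODO translate by hand: {line.strip()}"


def translate_script(text: str) -> str:
    """Translate every SGE directive in a job script, line by line."""
    return "\n".join(translate_directive(line) for line in text.splitlines())
```

Note that values are copied verbatim: an SGE output name such as `out.$JOB_ID` still has to be rewritten by hand (Slurm uses `%j` for the job ID in `--output`/`--error` file names).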

Be aware that home directories and group/team storage (/Xnfs) are shared between the old and new systems.

Then, in September, we'll see about the rest (final migrations of the E5 and Lake clusters, migration of the scratches).

  • Scratches

You are doing it wrong (mostly).

DO NOT store scripts, SGE/Slurm logs, small files, source code or binaries on Scratches: it degrades overall performance VERY fast, for everyone.

Scratches are meant for large temporary files, and large I/O operations, WITHIN a job. That's all.

DO clean up!!! Every time a job finishes, its scratch files should be cleaned up (except for long workflows).

DO VERIFY your cleanup operations!!
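The cleanup rules above can be sketched as a job-side pattern: create a private directory on scratch for the job, do the large I/O there, copy results out, then remove the directory and check that it is really gone. This is a minimal Python sketch, not a site-provided tool: the scratch root you pass in and the directory naming (based on the `SLURM_JOB_ID` variable that Slurm sets inside jobs, with a `nojob` fallback) are assumptions, and `job_scratch` is a hypothetical helper.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def job_scratch(scratch_root: str):
    """Yield a per-job directory under scratch_root, then always remove it.

    Assumption: one private directory per job is enough for your workflow
    (long multi-job workflows are the exception and need their own handling).
    """
    job_id = os.environ.get("SLURM_JOB_ID", "nojob")
    workdir = Path(tempfile.mkdtemp(prefix=f"job_{job_id}_", dir=scratch_root))
    try:
        yield workdir
    finally:
        # Cleanup runs on success AND on failure of the job body
        shutil.rmtree(workdir, ignore_errors=True)
        # DO VERIFY: fail loudly if anything was left behind
        if workdir.exists():
            raise RuntimeError(f"scratch cleanup failed: {workdir} still exists")
```

Inside a job, `with job_scratch(path) as tmp:` wraps the I/O-heavy part; results must be copied out of `tmp` (to your home or team storage) before the `with` block ends, since the whole directory is deleted afterwards.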

General-purpose scratches (E5N/, Lake/) are full *again*. We will erase files older than 90 days next week (blindly).
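Before the purge, you can list your own files on a scratch that look older than 90 days. A minimal Python sketch follows. It assumes "older" means last modification time (mtime), which the announcement does not state; the function name `old_files` is hypothetical and the directory you scan is up to you.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

NINETY_DAYS = 90 * 24 * 3600  # seconds


def old_files(root: str, max_age: float = NINETY_DAYS,
              now: Optional[float] = None) -> List[Path]:
    """Return files under root whose mtime is older than max_age seconds.

    Assumption: age is judged on modification time. Symlinks are not
    followed (os.walk default, and lstat on each entry).
    """
    now = time.time() if now is None else now
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()
            except FileNotFoundError:
                continue  # removed while we were scanning
            if now - st.st_mtime > max_age:
                found.append(path)
    return sorted(found)
```

For example, `for p in old_files("/path/to/your/scratch/dir"): print(p)` (hypothetical path) prints the candidates so you can copy what matters elsewhere and delete the rest yourself.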

2022/05/06 11:05 · ltaulell

20220505 / E5 cluster, partial poweroff

Queues E5-2670deb128A to D are now disabled and will be permanently powered off next week.

The sliding block puzzle is starting…

2022/05/05 11:24 · ltaulell

20220428 / E5 cluster migration

We will stop parts of the E5 cluster (older nodes), beginning the week of 2 to 6 May 2022.

The E5 scratch, the visualization nodes and the 'newer' E5 nodes will stay on the deb9/SGE system until further notice.

2022/04/28 10:26 · ltaulell

20220331 / Cascade power outage

An S92 chassis burned out its power supply unit, causing the main power unit to trip.

S92node[01-04,09-12] went down, including jobs…

2022/03/31 14:36 · ltaulell
news/blog.txt · Last modified: 2020/08/25 17:58 (external edit)