News feed

20240118 / scratches on Cascade

Two scratch servers on the Cascade cluster have a problem: one serving /scratch/Cral and one serving /scratch/Cascade. Both have a dead InfiniBand network card. We are waiting for replacement parts to repair them.

Symptoms: files and/or directories are unavailable on both /scratch/Cral and /scratch/Cascade.

EDIT: We found out that both InfiniBand cables are dead.

2024/01/18 10:30 · ltaulell

20240110 / network down, HPC down

Something went wrong, we don't know what yet. We're on it.

EDIT: The master switch rebooted unexpectedly, killing all network connections between the nodes and SIDUS-master ('/' on nodes) → general reboot (in progress) of all nodes, comp & visu.

All running jobs are lost.

EDIT2: Expect some delay before everything is back to normal…

EDIT3: Except for a few nodes, everything is back to nominal.

EDIT4: WATCH YOUR JOBS! A large number of jobs have been requeued (“REQUEUE”) by Slurm. This may result in “unexpected behaviors”.
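
One way a job script could notice that it has been requeued is sketched below. It relies on the `SLURM_RESTART_COUNT` environment variable, which the sbatch documentation describes as set when a job has been restarted or requeued. The helper name `was_requeued` and the messages are hypothetical, and what to do on a restart (clean up, resume, abort) depends on your own workflow.

```shell
#!/bin/bash
# Sketch: detect at job start whether Slurm restarted or requeued this job.
# Assumption: SLURM_RESTART_COUNT is only set after a restart/requeue.

was_requeued() {
    # Succeeds (exit 0) when the restart counter is set and greater than zero.
    [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]
}

if was_requeued; then
    echo "Job ${SLURM_JOB_ID:-unknown} was requeued ${SLURM_RESTART_COUNT} time(s); check partial outputs first."
else
    echo "First run of job ${SLURM_JOB_ID:-unknown}."
fi
```

Placing such a check at the top of a batch script gives a restarted job the chance to inspect or discard partial results before it overwrites them.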

2024/01/10 09:38 · ltaulell

20231211 / Cascade partially down

A large number of Cascade nodes went down (from 16:15 onward), probably due to a power spike (large jobs aren't good).

Problem will be handled tomorrow, as no one is on site today.

EDIT 12/12/2023: 4 PSUs died, with cascading effects on both nodes and network. Back to nominal.

2023/12/11 16:18 · ltaulell

20231206 / global breakdown

We are encountering strange network behaviors, and nodes are crashing one after another.

We might need to perform a global reboot of all nodes…

EDIT [09:50]

One of our main NFS servers (/applis/PSMN) had been stuck in a loop since yesterday evening, blocking all / access.

Things should be back to normal (no global reboot \o/). Jobs may have been blocked doing nothing all night.

2023/12/06 08:31 · ltaulell

20231116 / flix partitions

  • Did you know you can use the flix partitions?

Lake-flix and Cascade-flix are open to everybody for short (2 days or less is best, although the standard walltime applies), small parallel and sequential jobs. Jobs may be requeued in case of high-priority jobs (see documentation).

example:

#SBATCH --partition=Lake,Lake-flix
# or
#SBATCH --partition=Cascade,Cascade-flix

  • The /Xnfs/abc volume has been moved and is usable.
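
Since jobs on the flix partitions can be requeued, a batch script there benefits from being able to resume where it stopped. The following is a minimal sketch under several assumptions: the job name, the checkpoint file, the `next_step` helper and the five placeholder steps are all hypothetical, and whether `--requeue` has to be requested explicitly on these partitions is not stated here (see the documentation). `--requeue` and `--open-mode=append` are standard sbatch options.

```shell
#!/bin/bash
#SBATCH --job-name=flix-example        # hypothetical job name
#SBATCH --partition=Lake,Lake-flix     # partition list from the example above
#SBATCH --ntasks=1
#SBATCH --time=0-12:00:00              # well under the suggested 2 days
#SBATCH --requeue                      # assumption: allow Slurm to requeue this job
#SBATCH --open-mode=append             # keep log output from earlier attempts

# Hypothetical checkpoint file: one completed step number per line.
CHECKPOINT="${CHECKPOINT:-./checkpoint.txt}"
TOTAL_STEPS=5

next_step() {
    # Print the step to resume from: 1 if the checkpoint file is missing
    # or empty, otherwise the last recorded step plus one.
    if [ -s "$1" ]; then
        echo $(( $(tail -n 1 "$1") + 1 ))
    else
        echo 1
    fi
}

step=$(next_step "$CHECKPOINT")
while [ "$step" -le "$TOTAL_STEPS" ]; do
    echo "running step $step"          # placeholder for the real work
    echo "$step" >> "$CHECKPOINT"      # record the step as completed
    step=$((step + 1))
done
```

If the job is requeued after step 3, the next attempt reads the checkpoint file and continues at step 4 instead of starting over.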
2023/11/16 16:25 · ltaulell
news/blog.txt · Last modified: 2020/08/25 15:58 by 127.0.0.1