
News feed

20240214 / upgrade ongoing on iRods

We are upgrading the iRods gateway server. Expect connections to be unavailable for the day.

EDIT: Upgrade done, all services are working. Users of this service, please read the updated documentation in /data/psmn/ about the configuration file format, which has changed.

2024/02/14 08:45 · ltaulell

20240123 / scratches on Cascade

While the crew was half-brained by COVID, nobody thought to check the little SubnetManager daemon, which was OFF…

Cascade is back to NORMINAL state.

2024/01/23 11:18 · ltaulell

20240118 / scratches on Cascade

We have a problem on 2 scratch servers of the Cascade cluster: one serving /scratch/Cral, one serving /scratch/Cascade. Both have a dead InfiniBand network card. We are waiting for replacement parts to repair them.

Symptoms: Files and/or directories are not available from either /scratch/Cral or /scratch/Cascade.

EDIT: We found out that both InfiniBand cables are dead.

2024/01/18 10:30 · ltaulell

20240110 / network down, HPC down

Something went wrong, don't know what yet. On it.

EDIT: Master switch rebooted unexpectedly, killing all network connections between nodes and SIDUS-master ('/' on nodes) → general reboot (in progress) of all nodes, comp & visu.

All running jobs are lost.

EDIT2: Expect some delay before everything is back to normal…

EDIT3: Except for a few nodes, back to norminal.

EDIT4: WATCH YOUR JOBS! A large number of jobs have been requeued (“REQUEUE”) by Slurm. This may result in “unexpected behaviors”.
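
For users who want to check which of their jobs were affected, here is a minimal sketch (not an official PSMN tool). It assumes that `scontrol show job` is available on the login node and that its output carries the usual `JobId=` and `Restarts=` fields, with one blank line between job records. The function names `parse_restarts` and `requeued_jobs` are hypothetical, chosen for illustration only.

```python
import re
import subprocess


def parse_restarts(scontrol_output: str) -> dict[str, int]:
    """Map each JobId to its Restarts count, from `scontrol show job` text."""
    restarts = {}
    # Assumption: job records are separated by blank lines.
    for record in re.split(r"\n\s*\n", scontrol_output.strip()):
        job = re.search(r"\bJobId=(\S+)", record)
        count = re.search(r"\bRestarts=(\d+)", record)
        if job and count:
            restarts[job.group(1)] = int(count.group(1))
    return restarts


def requeued_jobs(scontrol_output: str) -> list[str]:
    """Return the ids of jobs that were restarted at least once."""
    found = parse_restarts(scontrol_output)
    return sorted(job_id for job_id, n in found.items() if n > 0)


if __name__ == "__main__":
    # Only works where the Slurm client tools are installed.
    out = subprocess.run(
        ["scontrol", "show", "job"],
        capture_output=True, text=True, check=True,
    ).stdout
    for job_id in requeued_jobs(out):
        print(job_id)
```

A non-zero restart count only tells you that a job was started more than once. Whether its partial outputs are still usable depends on the job itself, so check them before trusting the results.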

2024/01/10 09:38 · ltaulell

20231211 / Cascade partially down

A large number of Cascade nodes went down (from 16:15 onwards), probably due to a power spike (large jobs aren't good).

The problem will be handled tomorrow, as no one is on site today.

EDIT 12/12/2023: 4 PSUs died, with cascading effects on both nodes and network. Back to norminal.

2023/12/11 16:18 · ltaulell
news/blog.txt · Last modified: 2020/08/25 15:58 by 127.0.0.1