We are upgrading the iRODS gateway server. Expect connections to be unavailable for the day.
EDIT: Upgrade done, all services are working. Users of this service, please read the updated documentation in /data/psmn/ about the configuration file format, which has changed.
While the crew was knocked half-senseless by COVID, nobody thought to check the little SubnetManager daemon, which was OFF…
Cascade is back to NORMINAL state.
We have a problem on 2 scratch servers of the Cascade cluster: one serving /scratch/Cral, one serving /scratch/Cascade.
Both have a dead InfiniBand network card. We are waiting for replacement parts to repair them.
Symptoms: files and/or directories are unavailable on /scratch/Cral or /scratch/Cascade.
EDIT: We found out that both InfiniBand cables are dead.
Something went wrong, we don't know what yet. On it.
EDIT: The master switch rebooted unexpectedly, killing all network connections between the nodes and SIDUS-master ('/' on nodes) → general reboot (in progress) of all nodes, comp & visu.
All running jobs are lost.
EDIT2: Expect some delay before everything is back to normal…
EDIT3: Except for a few nodes, back to norminal.
EDIT4: WATCH YOUR JOBS! A large number of jobs have been requeued (“REQUEUE”) by Slurm. This may result in “unexpected behaviors”.
A large number of Cascade nodes went down (16h15+), probably due to a power spike (large jobs aren't good).
The problem will be handled tomorrow, as no one is on site today.
EDIT 12/12/2023: 4 PSUs died, with cascading effects on both the nodes and the network. Back to norminal.