


News feed

PRACE Training Portal

Have you seen the PRACE Training Portal yet? Check it out!

Upcoming and current training events by PRACE Advanced Training Centres




20240611 / Dead of the day

An InfiniBand switch died this morning (mostly affecting E5-GPU), cutting off scratch filesystems and MPI connections. It has been replaced.

REMINDER:

  • starting yesterday:
    • E5 is in DRAIN mode, to be modified
    • E5 walltime will be reduced to 4 hours
    • new partition E5-short, with 30 minutes walltime
    • new partition E5-long, with 8 days walltime
  • starting today (this morning): CANCELLED
    • Lake-flix and Cascade-flix will be in DRAIN mode, to be renamed Lake-premium and Cascade-premium

These modifications might take the entire week (due to draining jobs).
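Once the new partitions are open, the limits above translate into standard sbatch directives. A minimal sketch (only the partition names E5-short/E5-long and their walltimes come from this announcement; the job name and command are illustrative):

```shell
#!/bin/bash
# Sketch of a job script for the new E5-short partition
# (30-minute walltime limit, per the announcement above).
#SBATCH --partition=E5-short
#SBATCH --time=00:30:00       # must not exceed the 30-minute limit
#SBATCH --job-name=quick_test # hypothetical job name

srun hostname
```

For longer runs, the same script would target E5-long with `--time` up to `8-00:00:00` (8 days).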

Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium later (probably during the mandatory October 2024 shutdown).

Documentation has been updated and reflects the future state.

2024/06/11 13:29 · ltaulell

20240610 / Partitions modifications

  • starting today:
    • E5 is in DRAIN mode, to be modified
    • E5 walltime will be reduced to 4 hours
    • new partition E5-short, with 30 minutes walltime
    • new partition E5-long, with 8 days walltime
  • starting tomorrow:
    • Lake-flix and Cascade-flix will be in DRAIN mode, to be renamed Lake-premium and Cascade-premium

These modifications might take the entire week (due to draining jobs).

2024/06/10 09:09 · ltaulell

20240604 / Energy crisis

  • As a test, starting tonight, the following partitions will no longer be in “hold mode” (PartitionDown) during the day (06:00→22:00):
    • E5, Lake, Lake-bigmem, Epyc, Cascade.
  • starting next week: (probably Tuesday 11th of June)
    • E5 wall time will be reduced to 4 hours
    • new partition E5-short, with 30 minutes walltime
    • Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium

Reminder: access to Lake-bigmem is subject to authorization; use our forms to request it.
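Whether a partition is currently up or held down can be checked with sinfo (the partition names come from the announcement above; column layout depends on the local configuration):

```shell
# Show availability and time limit of the partitions affected by the test.
# %P = partition, %a = avail (up/down), %l = timelimit, %D = node count
sinfo --partition=E5,Lake,Lake-bigmem,Epyc,Cascade \
      --format="%P %a %l %D"
```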

2024/06/04 14:40 · ltaulell

20240514 / slurmctl and array jobs

slurmctl is back ONLINE

TL;DR:

There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one is left stuck in FAIL/REQUEUE state indefinitely. This can segfault the Slurm controller at restart (when rotating logs, for example).

Things then go sideways in the accounting database very fast (it took only 3 seconds to hang the database and segfault the controller).

Workaround:

  • on our part: a daily script to clean up job states
  • on YOUR part: do not submit job arrays on multiple partitions; stick to one only.
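Until the bug is fixed, a safe submission binds the array to exactly one partition (partition and script names below are illustrative):

```shell
# GOOD: a 100-task array on a single partition
sbatch --array=0-99 --partition=E5 my_job.sh

# BAD: listing several partitions is what triggers the FAIL/REQUEUE bug
# sbatch --array=0-99 --partition=E5,Lake my_job.sh
```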
2024/05/14 14:03 · ltaulell

20240513 / Network failure

  • BOFH excuses, Chapter 6 -Solar Flares-

So, following an “electronic shower” of some sort, a network device decided to stop last night, cutting access to ssh gateways. It has been handled, with hammer force.

  • slurm controller is DOWN, we are working on it

The database was badly corrupted by a job array; manual cleaning is in progress.

2024/05/13 08:26 · ltaulell