

Subscribe to the news feed (RSS)

News feed

20240611 / Dead of the day

An InfiniBand switch died this morning (mostly affecting E5-GPU), cutting off the scratch filesystems and MPI connections. It has been replaced.

REMINDER:

  • starting yesterday:
    • E5 is in DRAIN mode, to be modified
    • E5 walltime will be reduced to 4 hours
    • new partition E5-short, with a 30-minute walltime
    • new partition E5-long, with an 8-day walltime
  • starting today (this morning): CANCELLED
    • Lake-flix and Cascade-flix will be in DRAIN mode, to be renamed Lake-premium and Cascade-premium

These modifications might take the entire week (due to draining jobs).

Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium later (probably during the mandatory October 2024 shutdown).

Documentation has been updated and reflects the future state; a submission sketch for the new partitions follows.
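For reference, a minimal sbatch sketch targeting the new partitions (the job name and executable are placeholders, not taken from our documentation; partition names and walltimes are those announced above):

  #!/bin/bash
  # Short job on the new E5-short partition (30-minute walltime cap).
  #SBATCH --partition=E5-short
  #SBATCH --time=00:30:00
  #SBATCH --job-name=quick_test   # placeholder name

  srun ./my_program               # placeholder executable

Jobs that no longer fit the reduced 4-hour E5 walltime would request E5-long instead, e.g. --partition=E5-long --time=8-00:00:00 (up to the 8-day cap).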

2024/06/11 13:29 · ltaulell

20240610 / Partition modifications

  • starting today:
    • E5 is in DRAIN mode, to be modified
    • E5 walltime will be reduced to 4 hours
    • new partition E5-short, with a 30-minute walltime
    • new partition E5-long, with an 8-day walltime
  • starting tomorrow:
    • Lake-flix and Cascade-flix will be in DRAIN mode, to be renamed Lake-premium and Cascade-premium

These modifications might take the entire week (due to draining jobs).

2024/06/10 09:09 · ltaulell

20240604 / Energy crisis

  • As a test, starting tonight, the following partitions will no longer be in “hold mode” (PartitionDown) during the day (06:00→22:00); see the sinfo sketch after this list:
    • E5, Lake, Lake-bigmem, Epyc, Cascade
  • starting next week (probably Tuesday, June 11th):
    • E5 wall time will be reduced to 4 hours
    • new partition E5-short, with 30 minutes walltime
    • Lake-flix and Cascade-flix will be renamed Lake-premium and Cascade-premium
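A quick way to check whether a partition is currently up or held is sinfo; a minimal sketch (the columns requested are partition, availability, and walltime limit; exact output depends on the local configuration):

  # Show availability (up/down) and walltime limit for the affected partitions.
  sinfo --partition=E5,Lake,Lake-bigmem,Epyc,Cascade --format="%P %a %l"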

Reminder: access to Lake-bigmem is subject to authorization; use our forms to request it.

2024/06/04 14:40 · ltaulell

20240514 / slurmctl and array jobs

slurmctl is back ONLINE

TL;DR:

There is a known bug in our version of Slurm where, in a large array job, if two subtasks fail at the same time, one is left stuck in FAIL/REQUEUE state indefinitely. This can segfault the Slurm controller at restart (for example, when rotating logs).

Things then go sideways in the accounting database very fast (it took only 3 seconds to hang the database and segfault the controller).

Workaround:

  • on our part, a daily script to clean up job states
  • on YOUR part: do not submit job arrays across multiple partitions; stick to a single one (see the sketch below).
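A minimal sketch of the safe pattern (the array range, job name, and executable are placeholders):

  #!/bin/bash
  # Submit the whole array to a SINGLE partition; listing several
  # (e.g. --partition=E5,Lake) is what can trigger the bug described above.
  #SBATCH --partition=E5            # one partition only
  #SBATCH --array=0-99              # placeholder range
  #SBATCH --time=01:00:00           # placeholder walltime

  srun ./my_task "${SLURM_ARRAY_TASK_ID}"   # placeholder executable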
2024/05/14 14:03 · ltaulell

20240513 / Network failure

  • BOFH excuses, Chapter 6: Solar Flares

So, following an “electronic shower” of some sort, a network device decided to stop last night, cutting access to the SSH gateways. It has been handled, with hammer force.

  • slurm controller is DOWN, we are working on it

The database was badly corrupted by a job array; manual cleaning is in progress.

2024/05/13 08:26 · ltaulell