
News feed

20180115 / General reboot TODAY

Beginning at 14:30 today:

  • A security upgrade will be installed on each server, followed by a reboot,
  • E5 and x55 /scratch will be destroyed (data loss) and rebuilt from zero,

EDIT:20180115@17:05

  • reboot is OK, except for the visualisation servers and some nodes,
  • all /scratch OK,
  • all fileservers OK,
  • all services OK,
  • compute nodes almost OK,
2018/01/15 08:43 · ltaulell

20180111 / Power blackout

We experienced our first power blackout today (about one minute without power). As there is currently no active supervision in the new datacenter (still in testing mode), we are… slightly in the dark.

  • Known issues with OpenMPI runs are solved (thanks to systemd, ulimit was not working as expected),
  • Many software packages to reinstall (more than 600)…
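
One way to see which limits a job really gets is to print them from inside the job rather than from a login shell. A minimal sketch, assuming Python 3 on Linux; the three limits listed are examples picked for illustration, as the note above does not say which one was involved.

```python
import resource

# Limits chosen for illustration only (assumption): the announcement
# does not say which limit was affected.
LIMITS = {
    "memlock (max locked memory, bytes)": resource.RLIMIT_MEMLOCK,
    "stack (bytes)": resource.RLIMIT_STACK,
    "nofile (open files)": resource.RLIMIT_NOFILE,
}


def format_limit(value):
    """Render a limit as a number, or 'unlimited' when no cap is set."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits of the current process."""
    report = {}
    for name, key in LIMITS.items():
        soft, hard = resource.getrlimit(key)
        report[name] = (format_limit(soft), format_limit(hard))
    return report


# Print the limits as seen by this process.
for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

Running the same script once in a terminal and once inside a submitted job makes any difference between the two environments visible.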
2018/01/11 11:27 · ltaulell

20180109 / Reboot next Monday

WE ARE IN TESTING MODE. REDUCED PRODUCTION. FULL ACCESS.

  • /scratch on E5 and x5 is in full FUBAR split-brain. Data will be erased next Monday (2018/01/15),
  • E5 cluster is working (except for /scratch),
  • software is being rebuilt,
  • documentation is being updated (with your help),
  • Security upgrades

Two security vulnerabilities were announced last week: Spectre and Meltdown.

Each server must be rebooted after the upgrade, as it is a kernel upgrade.
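
After the reboot, Linux kernels that carry the reporting interface expose the mitigation status under /sys/devices/system/cpu/vulnerabilities. A minimal sketch for reading it, assuming Python 3; whether a given machine has this directory depends on its kernel version, and the function name is only illustrative.

```python
from pathlib import Path

# Directory exposed by Linux kernels that report CPU vulnerability status.
# Kernels without this interface lack it; the function then returns {}.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def vulnerability_status(base=SYSFS_DIR):
    """Return {vulnerability_name: status_line} read from the sysfs files."""
    base = Path(base)
    if not base.is_dir():
        return {}
    status = {}
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            # Unreadable entry: record that fact instead of failing.
            status[entry.name] = "unreadable"
    return status


# Print one line per vulnerability known to the running kernel.
for name, line in vulnerability_status().items():
    print(f"{name}: {line}")
```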

  • /scratch on E5 and x55 clusters

The GlusterFS filesystem (which serves /scratch on the clusters) is beyond repair. There were so many misuses, issues and errors inherited from the previous system that the autoheal/autorepair included in the new software version cannot do much.

After analysing ~6 485 000 000 files (yep, billions), it found more than 440 000 files in error or in split-brain that can be neither auto-healed nor auto-repaired.

As a consequence, the E5 cluster will be restarted with an empty /scratch next Monday.

All data in both /scratch (E5 and x55) will be lost. If you need this data and can still access it, copy it elsewhere before next Monday.
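
Since some files may no longer be readable, a copy that stops at the first error is not very helpful here. Below is a minimal sketch, assuming Python 3, of a copy that keeps going and lists what it could not read. The function name and the paths in the usage comment are hypothetical; adapt them to your own directories.

```python
import os
import shutil
from pathlib import Path


def rescue_tree(src, dst):
    """Copy every readable file under src to dst, keeping the directory layout.

    Returns (copied, failed): two lists of paths relative to src.
    """
    src, dst = Path(src), Path(dst)
    copied, failed = [], []
    for root, _dirs, files in os.walk(src):
        for name in files:
            source_file = Path(root) / name
            relative = source_file.relative_to(src)
            target = dst / relative
            try:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source_file, target)  # keeps timestamps too
                copied.append(str(relative))
            except OSError:
                # Damaged files may raise I/O errors: note them and move on.
                failed.append(str(relative))
    return copied, failed


# Hypothetical usage (both paths are placeholders, not real locations):
# copied, failed = rescue_tree("/scratch/myproject", "/home/me/rescued")
# print(len(copied), "files copied;", len(failed), "could not be read")
```

The list of failed paths tells you which results would have to be recomputed once the new /scratch is available.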

2018/01/09 17:15 · ltaulell

20180104 / Startup in new datacenter

WE ARE IN TESTING MODE. REDUCED PRODUCTION. FULL ACCESS.

  • /scratch on E5 and x5 is still in split-brain (very loooong process), data is partially accessible,
  • E5 cluster is ready,
  • Mail is back (was broken today, my bad),
  • Almost all software needs to be rebuilt,
  • Visualisation machines are online,
  • Documentation is being updated…

Please report any problems or questions using the appropriate form.

2018/01/04 17:02 · ltaulell

20180103 / Relocation in the new datacenter (SING)

WE ARE IN TESTING MODE. REDUCED PRODUCTION. REDUCED ACCESS.

Today's menu:

  • allo-psmn.psmn.ens-lyon.fr (the new one) is up & running,
  • front machines are online too and they properly mount /scratch (which still has issues),
  • new logins/passwords will be sent to users,
  • only the E5 cluster is online and accepting jobs (check with qstat -g c),
2018/01/03 12:54 · ltaulell
news/blog.txt · Last modified: 2020/08/25 15:58 by 127.0.0.1