20200710 / Post-Mortem

The correct assumption is “Shitstorm hit the fan.” (I stand corrected)

We are not done yet:

  • ssh.psmn is under attack from a botnet, which is why you may see “maximum authentication attempts exceeded”,
  • master LDAP server is down. We are running from slave1 (backup from yesterday),
  • Scratch is back almost everywhere (except for some nodes on E5 and X5),
  • /homes and /Xnfs should be OK everywhere (“should”, as in “remount is ongoing”).

EDIT 13:00: master LDAP server is back online \o/ !

EDIT 13:50: Cluster X5 is fully up & running.

EDIT 14:05: Clusters E5 and Lake are up & running.

