

20141007 / /scratch on E5-2670

Explanations will be long: grab a coffee, a tea, or quit ;o)

Last week, we hit a threshold effect (again) while adding new nodes to the E5 cluster. It led us to reboot a large part of the Debian nodes. Doing this opened a Murphy's box (you know the law? It's the same, with a ribbon on it).

- an electrical problem appeared where we added the new nodes, on the same distribution unit as a block of x5570 nodes, including their scratch servers.
-> unexpected reboots while we were trying to figure out what was going on.

- new E5 nodes cannot connect to the “old” InfiniBand switches (firmware problem)
-> nodes need a new InfiniBand card firmware, and a reboot
-> the OS kernel needs an upgrade, which needs to be propagated
-> some InfiniBand cables won't work anymore (the cables may need an upgrade too)

- the new E5 InfiniBand switch cannot connect to the “old” InfiniBand switches (firmware problem)
-> old switches need a new firmware, and a reboot
-> a special machine, with a very early OS and experimental libs, is needed to upgrade the switches

- the BIOS of the old E5 nodes is incompatible with the new InfiniBand card firmware…
-> all E5 nodes need a BIOS upgrade, and a reboot

- with all of this, scratch became inconsistent, and a filesystem check (checkfs) was needed…

And now? The upgrades are (mostly) done, and scratch has been checked and is OK. Some E5 nodes are still not OK and have been removed from the queuing system.

newsfeed/20141007.1412678361.txt.gz · Last modified: 2020/08/25 17:58 (external edit)