Differences

Below are the differences between two revisions of the page.

en:newsfeed:20141007 [2020/08/25 15:58] – external edit 127.0.0.1
en:newsfeed:20141007 [2023/12/12 12:55] (current version) – deleted ltaulell
Line 1: Line 1:
-====== 20141007 / /scratch on E5-2670 ====== 
- 
-The explanation is long: grab a coffee or a tea, or quit ;o) 
- 
-Last week, we experienced a threshold effect (again) while adding new nodes to the E5 cluster. It led us to reboot a large part of the Debian nodes. Doing so opened a Murphy's box (you know the law? It's the same, with a ribbon on it). 
- 
-  * an electrical problem appeared where we added the new nodes, on the same distribution unit as a block of x5570, including their scratch servers. 
-  -> unexpected reboots while we were trying to figure out what was going on. 
- 
-  * new E5 nodes cannot connect to the "old" infiniband switches (firmware problem) 
-  -> nodes need a new infiniband card firmware, and a reboot 
-  -> the OS kernel needs an upgrade, which needs to be propagated 
-  -> some infiniband cables won't work anymore (wires need an upgrade, maybe) 
- 
-  * the new E5 infiniband switch cannot connect to the "old" infiniband switches (firmware problem) 
-  -> old switches need a new firmware, and a reboot 
-  -> a special machine, with a very early OS and experimental libs, is needed to upgrade the switches 
- 
-  * the BIOS of the old E5 nodes is incompatible with the new infiniband card firmware... 
-  -> all E5 nodes need a BIOS upgrade, and a reboot 
- 
-  * with all of this, scratch has become inconsistent, checkfs needed... 
- 
-And now? Upgrades are (mostly) OK, scratch has been checked and is OK. Some E5 nodes are not OK, and have been removed from the queuing system. 
- 
- 
- 
-{{tag> hard batch soft }} 
- 
  
en/newsfeed/20141007.1598371112.txt.gz · Last modified: 2020/08/25 15:58 by 127.0.0.1