
News feed

20221028 / Finally, it runs

Oddly enough, the network problems seem to be gone.

“BOFH excuse #51: Cosmic ray particles crashed through the hard disk platter network system.”

Except for a few nodes that are still running jobs but are unreachable (and therefore in DRAINING mode), everything seems to be OK.

DO NOT WRITE DIRECTLY TO PSMN STAFF, USE THE WEB FORMS: Formulaires du PSMN

Stay tuned to this newsfeed.

2022/10/28 08:26 · ltaulell

20221027 / prepare for reboot...

There is a ghost network glitch that makes compute nodes unavailable.

“connection reset by peer”, “launch failed requeued held”, “JobHeldAdmin” → all the same cause: the node has gone rogue.

E5, Lake and Cascade partitions are in “DRAIN” mode, awaiting a general reboot of compute nodes.

E5-GPU, Epyc and Cascade are already OK, as are the login nodes and visualization nodes.
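To see whether your own jobs are affected, you can filter the queue on the symptoms quoted above. This is only a rough sketch: it assumes the scheduler is Slurm (the messages above match Slurm reason strings) and that `squeue` is available on the login node. The function name `held_jobs` is made up for this example.

```shell
#!/bin/sh
# held_jobs: read "JOBID|STATE|REASON" lines on stdin and print the IDs of
# jobs whose reason matches one of the symptoms quoted in this news item.
held_jobs() {
    awk -F'|' '$3 ~ /JobHeldAdmin|launch failed requeued held/ { print $1 }'
}

# Possible use on a login node (assumption: Slurm's squeue is in PATH):
#   squeue -h -u "$USER" -o "%i|%T|%r" | held_jobs
```

Jobs listed this way were held because of the node problem, not because of anything in the job itself.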

DO NOT WRITE DIRECTLY TO PSMN STAFF, USE THE WEB FORMS: Formulaires du PSMN

Stay tuned to this newsfeed.

EDIT: never mind, Cascade needs a reboot too…

2022/10/27 13:36 · ltaulell

20221026 / Full startup

  • All $HOME and /Xnfs: UP
  • allo-psmn & ssh.psmn: UP
  • Cascade: UP, with scratch
  • E5: UP, with new scratch
  • E5-GPU: UP, with new scratch (Lake)
  • Lake: UP, with new scratch
  • Epyc: UP, with new scratch (Lake)
  • Visualization servers are ON, with scratch (E5N, Lake)
SSH host keys are new; please refresh them (on allo-psmn) with either:
  • rm ~/.ssh/known_hosts
  • ssh-keygen -f ~/.ssh/known_hosts -R <hostname>
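The second option can be repeated for several hosts without wiping the whole file. A minimal sketch, assuming a POSIX shell and OpenSSH's `ssh-keygen`; the function name `forget_hosts` and the hostnames in the usage comment are placeholders, not actual PSMN names.

```shell
#!/bin/sh
# forget_hosts FILE HOST...
# Remove the recorded keys of each HOST from the known_hosts FILE.
# ssh-keygen -R keeps the previous contents in FILE.old.
forget_hosts() {
    file=$1
    shift
    for host in "$@"; do
        ssh-keygen -f "$file" -R "$host"
    done
}

# Example (hypothetical hostnames):
#   forget_hosts ~/.ssh/known_hosts login1.example.org login2.example.org
```

On the next connection, ssh will ask you to accept the new host key for each host that was removed.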
2022/10/26 09:45 · ltaulell

20221025 / Slow startup

  • All $HOME and /Xnfs: UP
  • allo-psmn & ssh.psmn: UP
  • Cascade: UP, with scratch
  • E5: UP, with new scratch
  • E5-GPU: config problem
  • Lake: config problem (but new scratches UP)
  • Epyc: weirdly not really OFF, not really ON, config problem
  • Visualization servers (most) are ON, with scratch
  • Documentation is NOT up-to-date.

EDIT: inline

SSH host keys are new; please refresh them with either:
  • rm ~/.ssh/known_hosts
  • ssh-keygen -R <hostname> -f ~/.ssh/known_hosts
2022/10/25 12:21 · ltaulell

20221024 / upgrade before boot

PSMN boot will take some time

We are upgrading all backend servers ($HOME, /Xnfs, services), and it is taking soooo long…

We MUST wait for all $HOME and shared volumes (/Xnfs) to be OK before starting login machines.

Some clusters might start up without scratch at first…

EDIT 17:30: “the thing about troubles is that they fly in squadrons…”

  • one main switch is dead
  • one fileserver is dead (but data are OK, scripts and access paths need to be modified)

The schedule is delayed…

2022/10/24 10:04 · ltaulell
news/blog.txt · Last modified: 2020/08/25 15:58 by 127.0.0.1