
20240514 / slurmctl and array jobs

slurmctl is back ONLINE

TL;DR:

There is a known bug in our version of Slurm: in a large array job, if two subtasks fail at the same time, one of them is left stuck in FAIL/REQUEUE state indefinitely. This can segfault the Slurm controller when it restarts (during log rotation, for example).

Things then go wrong in the accounting database very quickly: it took only 3 seconds to hang the database and segfault the controller.

Workaround:

On our part: a daily script to clean up the job states.
On YOUR part: do not submit job arrays to multiple partitions; stick to a single one.
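
If you want to check your own batch scripts before submitting, the rule above can be sketched as a small Python helper. This is only an illustration, not a tool we provide: the function names and the partition names `part_a` / `part_b` are hypothetical, and it only looks at `#SBATCH` lines inside the script (options passed on the `sbatch` command line are not seen). It assumes the usual `--array` / `-a` and `--partition` / `-p` options, where a comma-separated partition list means "multiple partitions".

```python
import re

# Matches "#SBATCH ..." directive lines inside a batch script.
_DIRECTIVE = re.compile(r"^\s*#SBATCH\s+(.*)$")


def parse_sbatch_directives(script_text):
    """Return (is_array, partitions) found in the #SBATCH lines.

    Only the spaced or '=' forms are handled (e.g. '-p a,b',
    '--partition=a,b'); glued short forms such as '-pa,b' are ignored.
    """
    is_array = False
    partitions = []
    for line in script_text.splitlines():
        match = _DIRECTIVE.match(line)
        if match is None:
            continue
        tokens = match.group(1).split()
        for index, token in enumerate(tokens):
            if token in ("--array", "-a") or token.startswith("--array="):
                is_array = True
            elif token.startswith("--partition="):
                partitions = token.split("=", 1)[1].split(",")
            elif token in ("--partition", "-p") and index + 1 < len(tokens):
                partitions = tokens[index + 1].split(",")
    return is_array, [name for name in partitions if name]


def is_risky_array_job(script_text):
    """True when the script is a job array spanning several partitions."""
    is_array, partitions = parse_sbatch_directives(script_text)
    return is_array and len(partitions) > 1


if __name__ == "__main__":
    # Hypothetical script: an array job submitted to two partitions.
    example = (
        "#!/bin/bash\n"
        "#SBATCH --array=0-99\n"
        "#SBATCH --partition=part_a,part_b\n"
        "srun ./task\n"
    )
    if is_risky_array_job(example):
        print("job array on multiple partitions: keep only one partition")
```

A script flagged by this check can be fixed by keeping a single name in its `--partition` line.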

slurm

newsfeed/20240514.txt · Last modified: 2024/05/14 14:03 by ltaulell
