Below are the differences between two revisions of the page.
en:documentation:tools:sge [2018/09/04 09:51] – [Surveiller les jobs] cpetit | en:documentation:tools:sge [2020/05/11 16:34] – [GridEngine : Submitting jobs] fleroux | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== GridEngine ====== | ====== GridEngine ====== | ||
- | <WRAP center round todo 60%> | ||
- | Under construction... | ||
- | </ | ||
The job scheduler (or batch-queuing system) used in PSMN cluster is SGE, | The job scheduler (or batch-queuing system) used in PSMN cluster is SGE, | ||
Line 37: | Line 34: | ||
===== GridEngine : Submitting jobs ===== | ===== GridEngine : Submitting jobs ===== | ||
- | The normal way to submit jobs to the cluster is using the '' | + | < |
- | + | qsub programme <input >output | |
- | < | + | |
- | qsub myscript.sh | + | |
</ | </ | ||
- | The many options to the '' | ||
- | |||
- | For example a more complex submission: | ||
<code bash> | <code bash> | ||
- | qsub -V -m b -m e -e / | + | qsub -V -e / |
-V : export environment variables | -V : export environment variables | ||
- | -m b : mail @begin | ||
- | -m e : mail @end | ||
-e : where to put error files | -e : where to put error files | ||
-o : where to put output files | -o : where to put output files | ||
Line 57: | Line 47: | ||
</ | </ | ||
- | ** Nevertheless, | + | **It is simpler |
- | ==== How to choose the queues, test or production, | + | <note important> | ||
- | Due to the successive purchases of compute nodes with architectures of different generations, it was not | + | See [[documentation: | ||
- | In concrete terms, the choice of the queue | + | ==== How to choose the right queues for my needs? ==== | ||
- | * if the main criterion is the speed of | + | Due to successive purchases of compute nodes with core/CPU architectures of different generations, it was not possible to define a single queue. It is better to have a separate queue for each architecture, so that each queue achieves good performance. | ||
- | * if the main criterion is a large amount of resources (e.g. a job with many cores, a job with a lot of RAM), then you should rather | + | ||
- | Obviously, | ||
- | And of course, when tuning the code, you must choose a test queue that is as close as possible to the queue for " | + | In concrete terms, the choice of the " | ||
- | ===== GridEngine : Other useful commands | + | ||
+ | * if the main criterion is speed of execution, check which queues are available to accept the job. The use of commands of the type '' | ||
+ | * if the main criterion is a large amount of resources (//e.g.// a job with many cores or a lot of RAM), then target the queues that provide them (at least the resources requested by the job), even if the waiting time in the queue is longer. | ||
+ | |||
+ | Obviously, the above command ('' | ||
+ | |||
+ | And, of course, when tuning your code, you have to choose a test queue that is closest to the intended | ||
+ | ===== GridEngine : other useful commands | ||
==== Checking job status ==== | ==== Checking job status ==== | ||
Line 82: | Line 77: | ||
< | < | ||
+ | |||
+ | * display nodes status in a given queue: | ||
+ | |||
+ | < | ||
* display the running jobs of all users: | * display the running jobs of all users: | ||
Line 119: | Line 118: | ||
==== Accounting ==== | ==== Accounting ==== | ||
- | <note important> | + | <note important> |
- | * Job details for the last 30 days: | + | * Job details for the last 30 days: | ||
< | < | ||
- | * Consumption of | + | * CPU hours consumption | ||
< | < | ||
Line 134: | Line 133: | ||
< | < | ||
- | * Consumption of | + | * CPU hours consumption (utime | ||
< | < | ||
- | ==== Jobs that have problems | + | ==== Troubleshooting: ==== | ||
- | Run the following command | + | Run the command: | ||
< | < | ||
- | and look at the last two columns | + | in the output, look at the last two columns: | ||
- | * aoACD : Number of cores that are in at least one of the following states | + | ||
- | * a Load threshold alarm | + | |
- | * o Orphaned | + | |
- | * A Suspend threshold alarm | + | |
- | * C Suspended by calendar | + | |
- | * D Disabled by calendar | + | |
- | * cdsuE : Number of cores that are in at least one of the following states | + | ||
- | * c Configuration ambiguous | + | |
- | * d Disabled | + | |
- | * s Suspended | + | |
- | * u Unknown | + | |
- | * E Error | + | |
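The two columns described above can also be checked automatically. The following is a minimal sketch, not the command used on this page (which is elided in this diff): it assumes input laid out like a GridEngine cluster-queue summary (the layout printed by `qstat -g c`), that is, a header line, a dashed separator, then one row per queue ending with the aoACD and cdsuE counts. The function names, the queue names and the numbers in the sample are hypothetical.

```shell
#!/bin/bash
# Assumption: stdin is a cluster-queue summary with one header line,
# one dashed separator line, then one row per queue whose last two
# fields are the aoACD and cdsuE core counts.

# Print every queue that has at least one core in an alarm/suspended
# (aoACD) or disabled/unknown/error (cdsuE) state.
flag_problem_queues() {
    awk 'NR > 2 && NF >= 3 {
        alarm = $(NF-1); broken = $NF
        if (alarm + 0 > 0 || broken + 0 > 0)
            printf "%s aoACD=%s cdsuE=%s\n", $1, alarm, broken
    }'
}

# Hypothetical sample data, for illustration only.
sample_summary() {
    cat <<'EOF'
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACD  cdsuE
--------------------------------------------------------------------------------
queue_a                           0.50     32      0     32     64      0      0
queue_b                           0.10      8      0     40     64      0     16
EOF
}

sample_summary | flag_problem_queues
```

On a real cluster, the summary would be piped into `flag_problem_queues` in place of `sample_summary`.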
- | ==== Possible job statuses (states) | + | ==== Possible job statuses: ==== | ||
- | * d(eletion), | + | |
- | * E(rror), | + | |
- | * h(old), | + | |
- | * r(unning), | + | |
- | * R(estarted), | + | |
- | * s(uspended), | + | |
- | * S(uspended), | + | |
- | * t(ransfering), | + | |
- | * T(hreshold), | + | |
- | * w(aiting). | + | |
- | ===== GridEngine : Variables d' | + | ===== GridEngine: |
- | <note important>''# | + | <note important>''# |
- | * SGE_O_WORKDIR : directory from which the job was submitted, usable in scripts | + | * SGE_O_WORKDIR : directory from which the job was submitted, reusable in scripts | ||
- | * NSLOTS : number of cores requested | + | * NSLOTS : number of slots/cores requested | ||
- | * JOB_ID : job ID (unique) | + | * JOB_ID : job ID (unique) | ||
- | * JOB_NAME : job name (-N) | + | * JOB_NAME : name of the job (-N) | ||
- | * PE_HOSTFILE : hosts file | + | * PE_HOSTFILE : hosts file (for MPI jobs) | ||
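As an illustration of how the variables listed above might be used inside a job script, here is a minimal sketch. The function names and the fallback values after `:-` are assumptions added so that the fragment also runs outside GridEngine; the host count assumes the usual PE_HOSTFILE layout, with the host name in the first column of each line.

```shell
#!/bin/bash
# Sketch of a job-script fragment. The defaults after ':-' are
# assumptions used only when the script runs outside the scheduler.

# Print a one-line summary of the job environment.
job_summary() {
    local workdir="${SGE_O_WORKDIR:-$PWD}"   # submission directory
    local slots="${NSLOTS:-1}"               # slots/cores requested
    local id="${JOB_ID:-none}"               # unique job id
    local name="${JOB_NAME:-interactive}"    # job name (-N)
    echo "job=${name} id=${id} slots=${slots} workdir=${workdir}"
}

# Count the distinct hosts listed in PE_HOSTFILE (assumed layout:
# host name in the first column); report 1 when no file is available.
count_hosts() {
    local hostfile="${PE_HOSTFILE:-}"
    if [ -n "$hostfile" ] && [ -r "$hostfile" ]; then
        awk '{print $1}' "$hostfile" | sort -u | wc -l | tr -d ' '
    else
        echo 1
    fi
}

job_summary
echo "hosts=$(count_hosts)"
```

Reading the variables through defaults keeps the same script usable for a quick interactive test before it is submitted.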
- | ===== References | + | ===== References | ||
* http:// | * http:// | ||
Line 196: | Line 195: | ||
* http:// | * http:// | ||
* https:// | * https:// | ||
- | * http:// | + | * http:// |