Below are the differences between two revisions of the page.
Both previous revisions · Previous revision · Next revision | Previous revision | ||
en:documentation:tools:sge [2018/09/04 11:41] – [GridEngine : Variables d'environnement] cpetit | en:documentation:tools:sge [2023/12/12 12:55] (current version) – deleted ltaulell | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== GridEngine ====== | ||
- | |||
- | <WRAP center round todo 60%> | ||
- | Under construction... | ||
- | </ | ||
- | |||
- | The job scheduler (or batch-queuing system) used on the PSMN cluster is SGE, | ||
- | |||
- | ● __Infrastructure__: | ||
- | |||
- | |||
- | * the **compilation servers** are described [[documentation: | ||
- | * the **cluster hardware configuration** is described [[documentation: | ||
- | * the **queues for job submission** are described [[documentation: | ||
- | {{ : | ||
- | ==== Optimum use of resources ==== | ||
- | |||
- | To make the best use of resources, it is important to fill up the servers. There are two strategies for this: | ||
- | |||
- | |||
- | * fill "at best" (a best-effort strategy), | ||
- | * fill in multiples of n cores (where n is the number of physical cores per server). | ||
- | |||
- | Best-effort filling quickly fragments parallel MPI applications across servers. | ||
- | |||
- | Best-effort filling is therefore only implemented on a few queues for parallel applications. On the other queues, jobs are placed in multiples of n cores, so that each job occupies entire servers. | ||
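As an illustration of the "multiple of n cores" rule, the following minimal sketch rounds a core request up to a whole number of servers. The function name `round_up_cores` and the value of 16 cores per server are assumptions made for this example; the real value of n depends on the hardware behind each queue.

```shell
# Round a requested core count up to the next multiple of n,
# where n is the number of physical cores per server.
# $1 = cores requested, $2 = cores per server (hypothetical value in the calls below)
round_up_cores() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

round_up_cores 20 16   # prints 32: 20 cores need two full 16-core servers
round_up_cores 32 16   # prints 32: already a multiple of 16
```

Requesting the rounded value keeps a parallel job on entire servers rather than scattering it across partially filled ones.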
- | |||
- | ==== Priorities ==== | ||
- | |||
- | Job priority is: | ||
- | |||
- | * inversely proportional to the compute time already consumed, | ||
- | * proportional to the waiting time and the number of cores requested. | ||
- | |||
- | This is to distribute the available resources more equitably. | ||
- | |||
- | ===== GridEngine : Submitting jobs ===== | ||
- | |||
- | The normal way to submit jobs to the cluster is using the '' | ||
- | |||
- | <code bash> | ||
- | qsub myscript.sh | ||
- | </ | ||
- | |||
- | The many options to the '' | ||
- | |||
- | For example, a more complex submission: | ||
- | <code bash> | ||
- | qsub -V -m b -m e -e / | ||
- | |||
- | -V : export environment variables | ||
- | -m b : mail @begin | ||
- | -m e : mail @end | ||
- | -e : where to put error files | ||
- | -o : where to put output files | ||
- | -q : queue (file d' | ||
- | </ | ||
- | |||
- | ** Nevertheless, | ||
- | |||
- | ==== How to choose the queues
- | |||
- | Because compute nodes with architectures of different generations were purchased successively, | ||
- | |||
- | In practice, the choice of the queue | ||
- | |||
- | * if the main criterion is the speed of | ||
- | * if the main criterion is a large amount of resources (e.g. a job with many cores, a job with a lot of RAM), then you should rather | ||
- | |||
- | Obviously, | ||
- | |||
- | And of course, for debugging the code, you should choose a queue | ||
- | ===== GridEngine : Other useful commands ===== | ||
- | |||
- | ==== Checking job status ==== | ||
- | |||
- | * display job status of a specific user: | ||
- | |||
- | < | ||
- | |||
- | * display queues status (and list of queues): | ||
- | |||
- | < | ||
- | |||
- | * display the running jobs of all users: | ||
- | |||
- | < | ||
- | |||
- | * display the pending jobs of all users: | ||
- | |||
- | < | ||
- | |||
- | |||
- | * display the status of a job in progress: | ||
- | |||
- | < | ||
- | |||
- | * display the status of a job in progress with more details (longer): | ||
- | |||
- | < | ||
- | |||
- | * display the status of a job in progress with even more details (even longer): | ||
- | |||
- | < | ||
- | |||
- | |||
- | * display information on a job after its completion (long): | ||
- | |||
- | < | ||
- | |||
- | * delete a job: | ||
- | |||
- | < | ||
- | |||
- | * delete a job (force deletion) : | ||
- | |||
- | < | ||
- | |||
- | ==== Accounting ==== | ||
- | |||
- | <note important> | ||
- | |||
- | |||
- | * Job details for the last 30 days: | ||
- | |||
- | < | ||
- | |||
- | * Consumption of | ||
- | |||
- | < | ||
- | |||
- | or | ||
- | |||
- | < | ||
- | |||
- | * Consumption of | ||
- | |||
- | < | ||
- | |||
- | ==== Troubleshooting: | ||
- | |||
- | Run the command: | ||
- | < | ||
- | |||
- | In the output, look at the last two columns: | ||
- | |||
- | * **aoACD** : Number of cores that are in at least one of the following states: | ||
- | * **a** Load threshold alarm | ||
- | * **o** Orphaned | ||
- | * **A** Suspend threshold alarm | ||
- | * **C** Suspended by calendar | ||
- | * **D** Disabled by calendar | ||
- | |||
- | * **cdsuE** : Number of cores that are in at least one of the following states: | ||
- | * **c** Configuration ambiguous | ||
- | * **d** Disabled | ||
- | * **s** Suspended | ||
- | * **u** Unknown | ||
- | * **E** Error | ||
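The two letter lists above can be turned into a small lookup helper. The sketch below is only an illustration: the function name `describe_queue_state` and the sample input `au` are made up for this example, and it uses only the letters and descriptions listed above.

```shell
# Print one description per letter of a queue state string.
describe_queue_state() {
    qs_rest=$1
    while [ -n "$qs_rest" ]; do
        qs_letter=${qs_rest%"${qs_rest#?}"}   # first character of the remainder
        qs_rest=${qs_rest#?}                  # drop that character
        case $qs_letter in
            a) echo "a: load threshold alarm" ;;
            o) echo "o: orphaned" ;;
            A) echo "A: suspend threshold alarm" ;;
            C) echo "C: suspended by calendar" ;;
            D) echo "D: disabled by calendar" ;;
            c) echo "c: configuration ambiguous" ;;
            d) echo "d: disabled" ;;
            s) echo "s: suspended" ;;
            u) echo "u: unknown" ;;
            E) echo "E: error" ;;
            *) echo "$qs_letter: not in the two lists above" ;;
        esac
    done
}

describe_queue_state "au"
```

The match is case-sensitive, which matters here because `a`/`A`, `c`/`C` and `d`/`D` denote different states.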
- | |||
- | ==== Possible job statuses: ==== | ||
- | |||
- | * **d**(eletion), | ||
- | * **E**(rror), | ||
- | * **h**(old), | ||
- | * **r**(unning), | ||
- | * **R**(estarted), | ||
- | * **s**(uspended), | ||
- | * **S**(uspended), | ||
- | * **t**(ransferring), | ||
- | * **T**(hreshold), | ||
- | * **w**(aiting). | ||
- | |||
- | ===== GridEngine: Environment variables ===== | ||
- | |||
- | <note important>''# | ||
- | |||
- | * SGE_O_WORKDIR : directory from which the job was submitted, re-usable in scripts | ||
- | * NSLOTS : number of slots/cores requested | ||
- | * JOB_ID : job id (unique) assigned by GridEngine | ||
- | * JOB_NAME : name of the job (-N) | ||
- | * PE_HOSTFILE : hosts file (for MPI jobs) | ||
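A minimal sketch of how a job script might read these variables is shown below. The function name `job_summary` and the fallback values (`interactive`, `0`, `1`, the current directory) are assumptions added so that the snippet also runs outside GridEngine; inside a job, the scheduler is expected to set the variables itself.

```shell
# Print a one-line summary built from the GridEngine variables listed above.
# The values after ":-" are illustrative fallbacks, not GridEngine defaults.
job_summary() {
    echo "job ${JOB_NAME:-interactive} (id ${JOB_ID:-0}) using ${NSLOTS:-1} slot(s) in ${SGE_O_WORKDIR:-$PWD}"
}

job_summary

# PE_HOSTFILE is only expected for parallel (e.g. MPI) jobs, so test before reading it.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "${PE_HOSTFILE}" ]; then
    echo "hosts file contents:"
    cat "${PE_HOSTFILE}"
fi
```

Logging these values at the top of a job script makes it easier to match output files to a given submission afterwards.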
- | |||
- | |||