Below are the differences between two revisions of the page.
en:documentation:tools:sge [2018/09/04 11:56] – [How to choose the adapted queues for my needs?] cpetit | en:documentation:tools:sge [2023/12/12 12:55] (current version) – deleted ltaulell | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== GridEngine ====== | ||
- | |||
- | <WRAP center round todo 60%> | ||
- | Under construction... | ||
- | </ | ||
- | |||
- | The job scheduler (or batch-queuing system) used on the PSMN cluster is SGE, | ||
- | |||
- | ● __Infrastructure__: | ||
- | |||
- | |||
- | * the **compilation servers** are described [[documentation: | ||
- | * the **cluster hardware configuration** is described [[documentation: | ||
- | * the **queues for job submission** are described [[documentation: | ||
- | ==== Optimum use of resources ==== | ||
- | |||
- | To make the best use of resources, it is important to fill up the servers. There are two strategies for this: | ||
- | |||
- | |||
- | * fill "at best" (a best-effort strategy), | ||
- | * fill in multiples of n cores (where n is the number of physical cores per server). | ||
- | |||
- | Best-effort filling quickly fragments parallel MPI applications across servers. | ||
- | |||
- | Best-effort filling is therefore enabled only on a few queues for parallel applications. On the other queues, jobs must request a multiple of n cores so that they fill entire servers. | ||
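The "multiple of n cores" rule can be sketched as a small helper that rounds a request up to whole servers. This is only an illustration: the function name `round_up_cores` and the example values are hypothetical, and the real value of n depends on the queue.

```shell
# Round a requested core count up to the next multiple of n,
# where n is the number of physical cores per server.
# $1 = cores wanted, $2 = cores per server (both hypothetical values).
round_up_cores() {
    requested=$1
    per_node=$2
    echo $(( (requested + per_node - 1) / per_node * per_node ))
}

round_up_cores 20 16   # prints 32: two full 16-core servers
round_up_cores 32 16   # prints 32: already a multiple of 16
```

Requesting the rounded value keeps a parallel job on entire servers instead of scattering a few leftover cores onto another node.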
- | |||
- | ==== Priorities ==== | ||
- | |||
- | Job priority is: | ||
- | |||
- | * inversely proportional to the compute time already consumed, | ||
- | * proportional to the waiting time and to the number of cores requested. | ||
- | |||
- | This distributes the available resources more equitably. | ||
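To make the two proportionality rules concrete, here is a toy score. It is not the scheduler's actual formula, which this page does not give; the function `toy_priority` and its weighting are assumptions chosen only to show the direction of each effect.

```shell
# Toy illustration only: NOT the real GridEngine priority computation.
# The score grows with waiting time and requested cores, and shrinks
# as the compute time already consumed grows.
# $1 = waiting time (s), $2 = cores requested, $3 = CPU hours already consumed
toy_priority() {
    awk -v w="$1" -v c="$2" -v u="$3" 'BEGIN { printf "%d\n", (w * c) / (1 + u) }'
}

toy_priority 3600 16 99     # prints 576
toy_priority 3600 16 999    # prints 57: heavier past usage, lower score
```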
- | |||
- | ===== GridEngine : Submitting jobs ===== | ||
- | |||
- | The normal way to submit jobs to the cluster is using the '' | ||
- | |||
- | <code bash> | ||
- | qsub myscript.sh | ||
- | </ | ||
- | |||
- | The many options to the '' | ||
- | |||
- | For example, a more complex submission: | ||
- | <code bash> | ||
- | qsub -V -m b -m e -e / | ||
- | |||
- | -V : export environment variables | ||
- | -m b : mail @begin | ||
- | -m e : mail @end | ||
- | -e : where to put error files | ||
- | -o : where to put output files | ||
- | -q : queue (file d' | ||
- | </ | ||
- | |||
- | ** Nevertheless, | ||
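The same options can also be embedded in the job script itself, on lines starting with `#$`, which `qsub` reads as if they were given on the command line. The sketch below writes such a script; the job name and the output and error file names are hypothetical, and the helper `write_job_script` exists only for this illustration.

```shell
# Write a hypothetical job script to the path given as $1.
# Lines starting with '#$' are read by qsub as submission options.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
# job name (hypothetical)
#$ -N myjob
# export environment variables
#$ -V
# run from the submission directory
#$ -cwd
# mail at begin and end
#$ -m be
# output and error files (hypothetical names)
#$ -o myjob.out
#$ -e myjob.err
echo "running on $(hostname) with ${NSLOTS:-1} slot(s)"
EOF
}

job=$(mktemp)
write_job_script "$job"

# Submit only when qsub is actually available on this machine.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job"
else
    echo "qsub not found; job script written to $job"
fi
```

Keeping the options in the script makes a submission reproducible: the same file can be resubmitted later with a plain `qsub`.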
- | |||
- | ==== How to choose the adapted queues for my needs? ==== | ||
- | |||
- | Because compute nodes were purchased in successive batches, with cores/CPU architectures of different generations, it was not possible to define a single queue. It is better to have different queues for each architecture, | ||
- | |||
- | |||
- | In concrete terms, the choice of the " | ||
- | |||
- | * if the main criterion is speed of execution, check which queues are currently available to accept the job. The use of commands of the type '' | ||
- | * if the main criterion is a large amount of resources (//e.g.// a job with many cores, a job with a lot of RAM, etc.), then choose queues that offer at least the resources requested by the job, even if the waiting time in the queue is longer. | ||
- | |||
- | Obviously, the above command ('' | ||
- | |||
- | And, of course, when tuning your code, you have to choose the test queue that is closest to the intended " | ||
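One possible way to see which queues can accept a job right away is to filter the cluster queue summary printed by `qstat -g c`. The sketch below assumes the usual layout of that summary (two header lines, then one row per queue with AVAIL in the fifth column); the helper `queues_with_room` and the queue names in the demo are made up.

```shell
# Print the names of queues whose AVAIL column is at least the number of
# slots needed ($1). Reads `qstat -g c` style output on stdin and skips
# the two header lines. Column 5 = AVAIL is an assumption about the layout.
queues_with_room() {
    awk -v need="$1" 'NR > 2 && $5 + 0 >= need + 0 { print $1 }'
}

# On the cluster: qstat -g c | queues_with_room 32
# Demo with made-up queue names and numbers:
printf '%s\n' \
    'CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACD  cdsuE' \
    '---------------------------------------------------------------' \
    'queueA            0.50     48      0     16     64      0      0' \
    'queueB            0.10      8      0     56     64      0      0' \
    | queues_with_room 32      # prints queueB
```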
- | ===== GridEngine : other useful commands ===== | ||
- | |||
- | ==== Checking job status ==== | ||
- | |||
- | * display job status of a specific user: | ||
- | |||
- | < | ||
- | |||
- | * display queues status (and list of queues): | ||
- | |||
- | < | ||
- | |||
- | * display the running jobs of all users: | ||
- | |||
- | < | ||
- | |||
- | * display the pending jobs of all users: | ||
- | |||
- | < | ||
- | |||
- | |||
- | * display the status of a job in progress: | ||
- | |||
- | < | ||
- | |||
- | * display the status of a job in progress with more details (longer): | ||
- | |||
- | < | ||
- | |||
- | * display the status of a job in progress with even more details (even longer): | ||
- | |||
- | < | ||
- | |||
- | |||
- | * display information on a job after its completion (long): | ||
- | |||
- | < | ||
- | |||
- | * delete a job: | ||
- | |||
- | < | ||
- | |||
- | * delete a job (force deletion) : | ||
- | |||
- | < | ||
- | |||
- | ==== Accounting ==== | ||
- | |||
- | <note important> | ||
- | |||
- | |||
- | * Job details for the last 30 days: | ||
- | |||
- | < | ||
- | |||
- | * CPU hours consumption (utime on the last 30 days): | ||
- | |||
- | < | ||
- | |||
- | or | ||
- | |||
- | < | ||
- | |||
- | * CPU hours consumption (utime from date to date; in this example, year 2012): | ||
- | |||
- | < | ||
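Accounting tools such as `qacct` report utime in seconds, so a conversion is needed to obtain CPU hours. The sketch below assumes the usual per-owner summary layout (a header line, a separator line, then rows with UTIME in the third column); the helper `utime_hours` and the sample row are hypothetical.

```shell
# Convert the UTIME column (seconds) of a qacct per-owner summary into hours.
# Assumes: header line, separator line, then one row per owner with UTIME
# in column 3. LC_ALL=C keeps the decimal point independent of the locale.
utime_hours() {
    LC_ALL=C awk 'NR > 2 && NF >= 3 { printf "%s %.1f\n", $1, $3 / 3600 }'
}

# On the cluster (last 30 days): qacct -o "$USER" -d 30 | utime_hours
# Demo with a made-up row:
printf '%s\n' \
    'OWNER   WALLCLOCK       UTIME     STIME        CPU  MEMORY     IO    IOW' \
    '========================================================================' \
    'alice       90000   72000.000   360.000  72360.000   1.000  0.500  0.000' \
    | utime_hours     # prints: alice 20.0
```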
- | |||
- | ==== Troubleshooting: | ||
- | |||
- | Run the command: | ||
- | < | ||
- | |||
- | In the output, look at the last two columns: | ||
- | |||
- | * **aoACD** : Number of slots/cores that are in at least one of the following states: | ||
- | * **a** Load threshold alarm | ||
- | * **o** Orphaned | ||
- | * **A** Suspend threshold alarm | ||
- | * **C** Suspended by calendar | ||
- | * **D** Disabled by calendar | ||
- | |||
- | * **cdsuE** : Number of slots/cores that are in at least one of the following states: | ||
- | * **c** Configuration ambiguous | ||
- | * **d** Disabled | ||
- | * **s** Suspended | ||
- | * **u** Unknown | ||
- | * **E** Error | ||
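Reading the last two columns by eye gets tedious on a cluster with many queues. As a sketch, assuming the summary comes from `qstat -g c` with aoACD and cdsuE as columns 7 and 8 after two header lines, the hypothetical helper below lists only the queues where either counter is non-zero.

```shell
# List queues whose aoACD (column 7) or cdsuE (column 8) counter is non-zero.
# Reads `qstat -g c` style output on stdin; the column positions are an
# assumption about that layout.
queues_with_problems() {
    awk 'NR > 2 && ($7 + 0 > 0 || $8 + 0 > 0) { print $1, "aoACD=" $7, "cdsuE=" $8 }'
}

# On the cluster: qstat -g c | queues_with_problems
# Demo with made-up queue names and numbers:
printf '%s\n' \
    'CLUSTER QUEUE   CQLOAD   USED    RES  AVAIL  TOTAL aoACD  cdsuE' \
    '---------------------------------------------------------------' \
    'queueA            0.50     48      0     16     64      0      0' \
    'queueC            0.90     32      0      0     64      4      2' \
    | queues_with_problems     # prints: queueC aoACD=4 cdsuE=2
```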
- | |||
- | ==== Possible job status: ==== | ||
- | |||
- | * **d**(eletion), | ||
- | * **E**(rror), | ||
- | * **h**(old), | ||
- | * **r**(unning), | ||
- | * **R**(estarted), | ||
- | * **s**(uspended), | ||
- | * **S**(uspended), | ||
- | * **t**(ransferring), | ||
- | * **T**(hreshold), | ||
- | * **w**(aiting). | ||
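Job states are usually shown as combined letters, such as `Eqw` or `dr`. The helper below is a minimal sketch that spells out each letter using the list above. Two points are assumptions: the letter `q` (queued), which does not appear in the list but is common in GridEngine status output, and the reading of both `s` and `S` as "suspended". The function name `decode_state` is made up.

```shell
# Spell out a combined job state string, one letter at a time.
decode_state() {
    state=$1
    out=""
    while [ -n "$state" ]; do
        rest=${state#?}              # everything after the first letter
        letter=${state%"$rest"}      # the first letter itself
        case $letter in
            d) name=deletion ;;
            E) name=error ;;
            h) name=hold ;;
            q) name=queued ;;        # assumption: not in the list above
            r) name=running ;;
            R) name=restarted ;;
            s|S) name=suspended ;;
            t) name=transferring ;;
            T) name=threshold ;;
            w) name=waiting ;;
            *) name="unknown($letter)" ;;
        esac
        out="${out:+$out }$name"
        state=$rest
    done
    echo "$out"
}

decode_state Eqw   # prints: error queued waiting
decode_state dr    # prints: deletion running
```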
- | |||
- | ===== GridEngine: Environment variables ===== | ||
- | |||
- | <note important>''# | ||
- | |||
- | * SGE_O_WORKDIR : directory from which the job was submitted, re-usable in scripts | ||
- | * NSLOTS : number of slots/cores requested | ||
- | * JOB_ID : job id (unique) assigned by GridEngine | ||
- | * JOB_NAME : name of the job (-N) | ||
- | * PE_HOSTFILE : host file (for MPI jobs) | ||
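A short sketch of how these variables might be used inside a job script follows. The fallback values after `:-` are only there so that the snippet also runs outside GridEngine. The helper `pe_hosts_summary` is hypothetical and assumes the usual host file layout of one line per host, with the host name in the first column and the slot count in the second.

```shell
# Summarise a PE host file as "host:slots", one line per host.
# Assumes column 1 = host name and column 2 = number of slots.
pe_hosts_summary() {
    awk '{ printf "%s:%s\n", $1, $2 }' "$1"
}

# Inside a job these variables are set by GridEngine.
echo "job ${JOB_ID:-none} (${JOB_NAME:-unnamed}) using ${NSLOTS:-1} slot(s)"
echo "submitted from ${SGE_O_WORKDIR:-$PWD}"

# Only parallel jobs get a host file, so check before reading it.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    pe_hosts_summary "$PE_HOSTFILE"
fi
```

The "host:slots" form is convenient for building the machine file that some MPI launchers expect, though the exact format required depends on the MPI implementation.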
- | |||
- | |||