Differences

Below are the differences between two revisions of the page.

en:documentation:tools:sge [2018/09/04 11:56] cpetit
en:documentation:tools:sge [2023/12/12 12:55] (current version) – deleted ltaulell
Line 1: Line 1:
-====== GridEngine ====== 
- 
- 
-The job scheduler (or batch-queuing system) used on the PSMN cluster is SGE (previously Sun Grid Engine, now Son of Grid Engine); it manages the execution of non-interactive jobs.
- 
-● __Infrastructure__:  
- 
- 
-  * the **compilation servers** are described [[documentation:clusters:services| here]] 
-  * the **cluster hardware configuration** is described [[documentation:clusters:hardware| here]] 
-  * the **queues for job submission** are described [[documentation:clusters:batch|here]] 
-{{ :salle_des_machines_2.png?nolink&400 |}} 
-==== Optimum use of resources ==== 
- 
-To make the best use of resources, it is important to fill up the servers. There are two strategies for this:
-
-
-  * fill "at best" (a best-effort strategy),
-  * fill in multiples of n cores (where n is the number of physical cores per server).
-
-Best-effort filling quickly fragments parallel MPI applications across servers.
-
-Best-effort filling is therefore only used on a few queues for parallel applications. On the other queues, jobs are placed in multiples of n cores, so that they occupy entire servers.
- 
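As an illustration of the "multiple of n cores" strategy, the arithmetic can be sketched as a small shell helper. The function name and the example values (20 cores wanted, 16 cores per server) are hypothetical; use the actual core count of the servers behind your queue.

```shell
# Round a requested core count up to the next multiple of n.
# $1 = cores wanted, $2 = physical cores per server (n).
# Both values are illustrative, not PSMN settings.
round_up_to_node() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

round_up_to_node 20 16   # prints 32, i.e. two full 16-core servers
```

Requesting the rounded value keeps a job on whole servers instead of leaving a few isolated cores behind.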
-==== Priorities ==== 
- 
-Job priority is:
-
-  * inversely proportional to the computation time already consumed,
-  * proportional to the waiting time and to the number of cores requested.
-
-The aim is to distribute the available resources more equitably.
- 
-===== GridEngine: Submitting jobs =====
- 
-The normal way to submit jobs to the cluster is using the ''qsub'' command. For example: 
- 
-<code bash> 
-qsub myscript.sh 
-</code> 
- 
-The many options of the ''qsub'' command are described in its man page (''man qsub'').
- 
-For example, a more complex submission:
-<code bash>
-# -V    : export the current environment variables to the job
-# -m be : send mail at the beginning (b) and at the end (e) of the job
-# -e    : where to put error files
-# -o    : where to put output files
-# -q    : queue to submit to
-qsub -V -m be -e /path/to/workdir/ -o /path/to/workdir/ -q "$QUEUE" script
-</code>
- 
-**Nevertheless, it is easier to submit a script that already contains the desired options to GridEngine.** You can follow the documentation on [[documentation:tutorials:submit|how to submit a job (full documentation)]]. Also take a look at [[documentation:clusters:batch&#les_files_d_attente|the list of available queues for submission]].
- 
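A job script with embedded options might look like the minimal sketch below. The job name, the queue name and the paths are placeholders, not actual PSMN values. Each ''#$'' line carries one of the ''qsub'' options listed earlier. Outside GridEngine those lines are plain comments, so the script can also be run directly for a quick check.

```shell
#!/bin/bash
# Hypothetical job script: job name, queue and paths are placeholders.
# Job name
#$ -N my_job
# Queue to submit to
#$ -q my_queue
# Export the current environment variables
#$ -V
# Run the job from the submission directory
#$ -cwd
# Send mail at the beginning and at the end of the job
#$ -m be
# Where to put error and output files
#$ -e /path/to/workdir/
#$ -o /path/to/workdir/

# Outside GridEngine these variables are unset, so fall back to defaults.
job_id=${JOB_ID:-interactive}
slots=${NSLOTS:-1}
echo "job ${job_id} running with ${slots} slot(s)"
```

Such a script would then be submitted with ''qsub myscript.sh'', with no further options on the command line.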
-==== How to choose the right queue for my needs? ====
-
-Because compute nodes were purchased in successive batches, with cores and CPU architectures of different generations, a single queue could not be defined. Instead, each architecture has its own queues, so that each queue delivers good performance.
-
-
-In concrete terms, the choice of the "production" queue depends on your objective:
-
-  * if the main criterion is speed of execution, check which queues are currently able to accept the job. Commands such as ''qstat -g c'' should help you choose the queue;
-  * if the main criterion is a large amount of resources (//e.g.// a job with many cores or a lot of RAM), choose a queue that offers at least the resources requested by the job, even if the waiting time in that queue is longer.
-
-The output of ''qstat -g c'' and [[documentation:clusters:batch&#les_files_d_attente|the list of queues]] should guide your choice.
- 
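When speed of execution is the criterion, one way to read the ''qstat -g c'' output is to pick the queue with the most available slots. The helper below is only a sketch: its name is hypothetical, and it assumes the usual column layout in which AVAIL is the fourth field from the end (''... AVAIL TOTAL aoACD cdsuE''), after two header lines.

```shell
# Print the cluster queue with the most available slots.
# Reads "qstat -g c" output on stdin. Assumption: two header lines,
# then one row per queue ending in "... AVAIL TOTAL aoACD cdsuE".
most_available_queue() {
    awk 'NR > 2 && NF >= 7 {
             avail = $(NF - 3) + 0
             if (name == "" || avail > best) { best = avail; name = $1 }
         }
         END { if (name != "") print name, best }'
}

# Usage on the cluster: qstat -g c | most_available_queue
```

The result is only a hint: the queue must also match the architecture and the resources your job needs.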
-And, of course, when tuning your code, choose the test queue that is closest to the intended "production" queue (//i.e.// the same type of compute nodes). //E.g.// if ''r815lin128ib'' is the chosen production queue, run your tests on ''r815_ib_test''.
-===== GridEngine: Other useful commands =====
- 
-==== Checking job status ==== 
- 
-  * display job status of a specific user:  
- 
-<code>qstat -u login </code> 
- 
-  * display queues status (and list of queues):  
- 
-<code>qstat -g c </code> 
- 
-  * display the running jobs of all users:
- 
-<code>qstat -u "*" -s r </code> 
- 
-  * display the pending jobs of all users:  
- 
-<code>qstat -u "*" -s p </code> 
- 
- 
-  * display the status of a job in progress:  
- 
-<code>qstat -j <job_id> | less </code> 
- 
-  * display the status of a job in progress with more details (longer):  
- 
-<code>qstat -j <job_id> -g t | less </code> 
- 
-  * display the status of a job in progress with even more details (even longer): 
- 
-<code>qstat -j <job_id> -g t -s r | less </code> 
- 
- 
-  * display information on a job after its completion (long):
- 
-<code>qacct -j <job_id> -f /gridware/psmn/accounting | less</code> 
- 
-  * delete a job:  
- 
-<code>qdel <job_id> </code> 
- 
-  * delete a job (force deletion) :  
- 
-<code>qdel -f <job_id> </code> 
- 
-==== Accounting ==== 
- 
-<note important>The accounting file is available at ''/gridware/psmn/accounting''.</note>
- 
- 
-  * Job details for the last 30 days: 
- 
-<code>qacct -f /gridware/psmn/accounting -d 30 -o <login> -j </code> 
- 
-  * CPU hours consumed (utime over the last 30 days):
- 
-<code>qacct -f /gridware/psmn/accounting -d 30 -o <login> | tail -1 | awk '{print $3/3600}'</code> 
- 
-or
- 
-<code>qacct -f /gridware/psmn/accounting -q "*" -o <login> -d 30 | awk '{ SUM += $5} END {print SUM/3600}'</code> 
- 
-  * CPU hours consumed (utime between two dates; in this example, the year 2012):
- 
-<code>qacct -f /gridware/psmn/accounting -b 201201010000 -e 201212312359 -o <login> | tail -1 | awk '{print $3/3600}'</code> 
- 
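The ''tail -1 | awk'' pipeline used above can be wrapped in a small helper so that it is not retyped for each period. This is a sketch under one assumption: the last line of the ''qacct -o <login>'' summary holds UTIME, in seconds, in its third column, as the ''$3'' in the commands above suggests. The function name is hypothetical.

```shell
# Convert the UTIME column (seconds) of a "qacct -o <login>" summary to hours.
# Reads the qacct output on stdin and keeps the third field of the last
# non-empty line, which mirrors "tail -1 | awk '{print $3/3600}'".
utime_hours() {
    awk 'NF { last = $3 } END { printf "%.2f\n", last / 3600 }'
}

# Usage on the cluster:
#   qacct -f /gridware/psmn/accounting -d 30 -o <login> | utime_hours
```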
-==== Troubleshooting ====
-
-Run the command:
-<code>qstat -g c </code>
-
-In the output, look at the last two columns:
-
-  * **aoACD**: number of slots/cores that are in at least one of the following states:
-    * **a** Load threshold alarm
-    * **o** Orphaned
-    * **A** Suspend threshold alarm
-    * **C** Suspended by calendar
-    * **D** Disabled by calendar
-
-  * **cdsuE**: number of slots/cores that are in at least one of the following states:
-    * **c** Configuration ambiguous  
-    * **d** Disabled  
-    * **s** Suspended  
-    * **u** Unknown  
-    * **E** Error 
- 
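Checking the last two columns by eye becomes tedious when there are many queues. The following sketch lists only the queues where one of the two counters is non-zero. It assumes two header lines in the ''qstat -g c'' output, and the function name is hypothetical.

```shell
# List cluster queues whose aoACD or cdsuE count is non-zero.
# Reads "qstat -g c" output on stdin. Assumption: two header lines,
# then one row per queue with aoACD and cdsuE as the last two fields.
queues_with_problems() {
    awk 'NR > 2 && NF >= 3 && ($(NF - 1) + 0 > 0 || $NF + 0 > 0) {
             print $1, "aoACD=" $(NF - 1), "cdsuE=" $NF
         }'
}

# Usage on the cluster: qstat -g c | queues_with_problems
```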
-==== Possible job states ====
-
-  * **d**(eletion),
-  * **E**(rror),
-  * **h**(old),
-  * **r**(unning),
-  * **R**(estarted),
-  * **s**(uspended),
-  * **S**(uspended, because the queue holding the job is suspended),
-  * **t**(ransferring),
-  * **T**(hreshold),
-  * **w**(aiting).
- 
-===== GridEngine: Environment variables ===== 
- 
-<note important>Lines starting with ''#$'' in a job script pass options to GridEngine (e.g. ''#$ -cwd'' or ''#$ -V'').</note>
- 
-  * SGE_O_WORKDIR: directory from which the job was submitted, reusable in scripts
-  * NSLOTS: number of slots/cores requested
-  * JOB_ID: unique job id assigned by GridEngine
-  * JOB_NAME: name of the job (-N)
-  * PE_HOSTFILE: file listing the hosts allocated to the job (for MPI jobs)
- 
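For MPI jobs, ''PE_HOSTFILE'' is often converted into a machine file for the MPI launcher. The sketch below assumes that each line of the host file starts with ''<hostname> <slots>'', and that a ''host:slots'' format suits your launcher. Both points should be checked against your MPI implementation, and the function name is hypothetical.

```shell
# Turn a PE_HOSTFILE into a "host:slots" machine file.
# $1 = path to the host file (normally "$PE_HOSTFILE" inside a job).
# Assumption: each line starts with "<hostname> <slots>".
pe_hostfile_to_machinefile() {
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

# Usage inside a job script:
#   pe_hostfile_to_machinefile "$PE_HOSTFILE" > machines."$JOB_ID"
```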
- 
  
en/documentation/tools/sge.1536062200.txt.gz · Last modified: 2020/08/25 15:58 (external edit)