The job scheduler (or batch-queuing system) used on the PSMN cluster is SGE (previously Sun Grid Engine, now Son of Grid Engine); it manages the execution of non-interactive jobs.
● Infrastructure:
To make the best use of resources, it is important to fill up the servers. For this, there are two strategies:
Filling at best quickly fragments parallel (MPI) applications across servers.
Filling at best is therefore only implemented on a few queues for parallel applications. On the other queues, filling by multiples of n cores (the core count of an entire server) is used.
Job priority is:
This is to distribute the available resources more equitably.
The normal way to submit jobs to the cluster is to use the qsub command. For example:
qsub myscript.sh
The many options of the qsub command are described in its man page (man qsub).
For example, a more complex submission:
qsub -V -m b -m e -e /path/to/workdir/ -o /path/to/workdir/ -q $QUEUE script
-V : export environment variables
-m b : send mail when the job begins
-m e : send mail when the job ends
-e : where to put error files
-o : where to put output files
-q : queue to submit to
It is nevertheless easier to submit to GridEngine a script that already contains the desired options. See the documentation on how to submit a job (full documentation), and take a look at the list of available queues for submission.
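As an illustration, here is a minimal sketch of a script carrying its own options. The job name, the mail setting and the queue (r815_ib_test, the test queue named on this page) are assumptions to adapt, and run_job is a hypothetical stand-in for your real workload. Lines starting with #$ are read by GridEngine and ignored by the shell, so the script also runs outside the scheduler.

```shell
#!/bin/bash
# Hypothetical job script, e.g. saved as myjob.sh.
# The #$ lines carry the qsub options; every value below is an assumption to adapt.

# Job name
#$ -N example_job
# Run the job from the submission directory
#$ -cwd
# Export the submission environment to the job
#$ -V
# Send mail at the beginning and at the end of the job
#$ -m be
# Queue: the test queue named on this page, replace with yours
#$ -q r815_ib_test

# Stand-in for the real workload: report where and how the job runs.
# JOB_ID, JOB_NAME and NSLOTS are set by GridEngine; defaults apply outside of it.
run_job() {
    echo "Job ${JOB_ID:-none} (${JOB_NAME:-none}) on $(uname -n), ${NSLOTS:-1} slot(s)"
}

run_job
```

Such a script is submitted with qsub myjob.sh, without repeating the options on the command line.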
Because compute nodes were purchased in successive batches, with CPU architectures of different generations, it was not possible to define a single queue. Having a separate queue for each architecture gives better performance on each of them.
In concrete terms, the choice of the “production” queue should be made according to the desired objective:
qstat -g c
should help you choose the intended queue.
Obviously, the above command (qstat -g c) and the list of queues should guide your choice.
When tuning your code, choose the test queue that is closest to the intended “production” queue (i.e. same type of compute nodes). For example, if r815lin128ib was chosen as the production queue, run your tests on r815_ib_test.
qstat -u <login>
qstat -g c
qstat -q <queue_name> -f
qstat -u "*" -s r
qstat -u "*" -s p
qstat -j <job_id> | less
qstat -j <job_id> -g t | less
qstat -j <job_id> -g t -s r | less
qacct -j <job_id> -f /gridware/psmn/accounting | less
qdel <job_id>
qdel -f <job_id>
/gridware/psmn/accounting
qacct -f /gridware/psmn/accounting -d 30 -o <login> -j
qacct -f /gridware/psmn/accounting -d 30 -o <login> | tail -1 | awk '{print $3/3600}'
or
qacct -f /gridware/psmn/accounting -q "*" -o <login> -d 30 | awk '{ SUM += $5} END {print SUM/3600}'
qacct -f /gridware/psmn/accounting -b 201201010000 -e 201212312359 -o <login> | tail -1 | awk '{print $3/3600}'
Run the command:
qstat -g c
In the output, look at the last two columns:
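To spot affected queues quickly, the check can be scripted. The sketch below assumes the usual layout of the qstat -g c summary: two header lines (column titles, then a dashed rule) followed by one row per queue, with the two state counters (typically labeled aoACDS and cdsuE) as the last two fields. The function name flag_unavailable_queues and the sample data are made up for the demonstration.

```shell
#!/bin/bash
# Sketch: list the queues whose last two `qstat -g c` columns are non-zero.
# Assumption: two header lines, then one row per queue, counters in the last two fields.

flag_unavailable_queues() {
    # Skip the two header lines, then print queue name and the last two fields
    # for every row where at least one of them is greater than zero.
    awk 'NR > 2 && NF >= 3 && ($(NF-1) > 0 || $NF > 0) {
        printf "%s: %s %s\n", $1, $(NF-1), $NF
    }'
}

# On the cluster: qstat -g c | flag_unavailable_queues
# Here, a made-up sample stands in for the real output.
sample='CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
queue_a                           0.50     16      0     16     32      0      0
queue_b                           0.10      0      0     24     32      0      8'

printf '%s\n' "$sample" | flag_unavailable_queues
```

With the sample above, only queue_b is reported, since queue_a has zeros in both columns.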
The #$ prefix is reserved for passing parameters to GridEngine (e.g. #$ -cwd or #$ -V).
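To check which lines of a script will be passed to GridEngine as parameters, the #$ lines can be extracted before submission. This is only a sketch: list_sge_directives is a hypothetical helper, and the demonstration script is written to a temporary file.

```shell
#!/bin/bash
# Sketch: show which lines of a script carry GridEngine parameters.

# Print the options found on "#$" lines of the given file, one per line.
list_sge_directives() {
    sed -n 's/^#\$[[:space:]]*//p' "$1"
}

# Hypothetical job script, written to a temporary file for the demonstration.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
# An ordinary comment: not a GridEngine parameter
#$ -cwd
#$ -V
echo "hello"
EOF

list_sge_directives "$demo_script"
rm -f "$demo_script"
```

For the demonstration script, this prints -cwd and -V on separate lines; ordinary comments and commands are left out.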