Submitting a job
================

For those familiar with GridEngine, the Slurm documentation provides a `Rosetta Stone for schedulers `_ to ease the transition.

Slurm commands
--------------

:term:`Slurm` allows requesting resources and submitting jobs in a variety of ways. The main Slurm commands to submit jobs are:

* srun

  * Requests resources and **runs a command** on the allocated compute node(s)
  * **Blocking**: will not return until the command ends

* sbatch

  * Requests resources and **runs a script** on the allocated compute node(s)
  * **Asynchronous**: will return as soon as the job is submitted

.. TIP::
    **Slurm Basics**

    .. _slurm_basics:

    * **Job**

      A Job is an allocation of resources (CPUs, RAM, time, etc.) reserved for the execution of a specific process:

      * The allocation is defined in the submission script as the number of Tasks (``--ntasks``) multiplied by the number of CPUs per Task (``--cpus-per-task``) and corresponds to the maximum resources that can be used in parallel,
      * The submission script, via ``sbatch``, creates one or more Job Steps and manages the distribution of Tasks on Compute Nodes.

    * **Tasks**

      A Task is a process to which the resources defined in the script (via the ``--cpus-per-task``, ``--mem`` and ``--mem-per-cpu`` options) are allocated. A Task can use these resources like any other process (creating threads, or sub-processes that may themselves be multi-threaded). This is the Job's resource allocation unit. CPUs not used by a Task are **lost**, not usable by any other Task or Step. If the Task creates more processes/threads than allocated CPUs, these threads share the allocation.

    * **Job Steps**

      A Job Step represents a stage, or section, of the processing performed by the Job. It executes one or more Tasks. This division into Job Steps offers great flexibility in the organization of the steps in the Job and in the management, and analysis, of the allocated resources:

      * Steps can be executed sequentially or in parallel,
      * one Step can initiate one or more Tasks, executed sequentially or in parallel,
      * Steps are tracked by the ``sstat``/``sacct`` commands, allowing both Step-by-Step progress tracking of a Job during its execution, and detailed resource usage statistics for each Step (during and after execution).

      Using ``srun`` for a single task, inside a submission script, is not mandatory.

    * **Partition**

      A Partition is a logical grouping of Compute Nodes. This grouping makes it possible to specialize and optimize each partition for a particular type of job. See :doc:`computing_resources` and :doc:`partitions_overview` for more details.

.. _job_script:

Job script
----------

To run a job on the system you need to create a ``submission script`` (or job script, or batch script). This script is a regular shell script (bash) with some directives specifying the number of CPUs, memory, etc., that will be interpreted by the scheduling system upon submission.

A very simple example:

.. code-block:: bash

    #!/bin/bash
    #
    #SBATCH --job-name=test

    hostname -s
    sleep 60s

Writing submission scripts can be tricky; see :doc:`batch_scripts` for more details. See also our `repository of examples scripts `_.

First job
---------

Submit your job script with:

.. code-block:: bash

    $ sbatch myfirstjob.sh
    Submitted batch job 623

:term:`Slurm` will return a ``$JOBID`` if the job is accepted, else an error message. Without any output option, the job's output will default to ``slurm-$JOBID.out`` (``slurm-623.out`` with the above example), in the submission directory.
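For reference, a slightly fuller submission script might look like the sketch below. The resource amounts, time limit and partition are placeholders to adapt to your needs (see :doc:`batch_scripts` and :doc:`partitions_overview`); the ``%j`` filename pattern is replaced by the job ID.

.. code-block:: bash

    #!/bin/bash
    #
    #SBATCH --job-name=test
    #SBATCH --output=test-%j.out     # overrides the default slurm-$JOBID.out
    #SBATCH --partition=Lake         # placeholder: pick a partition suited to the job
    #SBATCH --ntasks=1               # one Task...
    #SBATCH --cpus-per-task=1        # ...using one CPU
    #SBATCH --mem=1G                 # placeholder memory request
    #SBATCH --time=01:00:00          # placeholder time limit (hh:mm:ss)

    hostname -s
    sleep 60s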
Once submitted, the job enters the queue in the *PENDING* (PD) state. When resources become available and the job has sufficient priority, an allocation is created for it and it moves to the *RUNNING* (R) state. If the job completes correctly, it goes to the *COMPLETED* state; otherwise, its state is set to *FAILED*.

.. TIP::
    **You can submit jobs from any login node to any partition. Login nodes are only segregated for build (CPU µarch) and scratch access.**

Monitor your jobs
-----------------

You can monitor your job using either its name (``#SBATCH --job-name``) or its ``$JOBID`` with Slurm's ``squeue`` [#squeue]_ command:

.. code-block:: bash

    $ squeue -j 623
    JOBID PARTITION  NAME     USER ST  TIME  NODES NODELIST(REASON)
      623        E5  test ltaulell  R  0:04      1 c82gluster2

By default, ``squeue`` shows every pending and running job. You can filter for your own jobs only, using the ``-u $USER`` or ``--me`` option:

.. code-block:: bash

    $ squeue --me
    JOBID PARTITION  NAME     USER ST  TIME  NODES NODELIST(REASON)
      623        E5  test ltaulell  R  0:04      1 c82gluster2

If needed, you can modify the output of ``squeue`` [#squeue]_. Here's an example (adding CPUs to the default output):

.. code-block:: bash

    $ squeue --me --format="%.7i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %N"
    JOBID PARTITION  NAME     USER ST  TIME  NODES CPUS NODELIST
    38956      Lake  test ltaulell  R  0:41      1    1 c6420node172

Useful bash aliases:

.. code-block:: bash

    alias pending='squeue --me --states=PENDING --sort=S,Q --format="%.10i %.12P %.8j %.8u %.6D %.4C %.20R %Q %.19S"'  # my pending jobs
    alias running='squeue --me --states=RUNNING --format="%.10i %.12P %.8j %.8u %.2t %.10M %.6D %.4C %R %.19e"'  # my running jobs

Analyzing currently running jobs
--------------------------------

The ``sstat`` [#sstat]_ command allows users to easily pull up status information about their currently running jobs. This includes information about **CPU usage**, **task information**, **node information**, **resident set size (RSS)**, and **virtual memory (VM)**.

You can invoke the ``sstat`` command as follows:

.. code-block:: bash

    $ sstat --jobs=$JOB_ID

By default, ``sstat`` prints significantly more information than is usually needed. To remedy this, use the ``--format`` flag to choose what appears in the output (see the format options in ``man sstat``). Some relevant variables are listed in the table below:

+-----------+----------------------------------------------------------+
| Variable  | Description                                              |
+===========+==========================================================+
| avecpu    | Average CPU time of all tasks in job.                    |
+-----------+----------------------------------------------------------+
| averss    | Average resident set size of all tasks.                  |
+-----------+----------------------------------------------------------+
| avevmsize | Average virtual memory of all tasks in a job.            |
+-----------+----------------------------------------------------------+
| jobid     | The id of the Job.                                       |
+-----------+----------------------------------------------------------+
| maxrss    | Maximum resident set size of all tasks in the job.       |
+-----------+----------------------------------------------------------+
| maxvsize  | Maximum virtual memory size of all tasks in the job.     |
+-----------+----------------------------------------------------------+
| ntasks    | Number of tasks in a job.                                |
+-----------+----------------------------------------------------------+
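The table above is not exhaustive; ``sstat`` can list every field name it accepts. The snippet below is only a convenience sketch using its ``--helpformat`` and ``--parsable2`` options:

.. code-block:: bash

    # list every field name accepted by --format (see also man sstat)
    $ sstat --helpformat

    # pipe-separated output, convenient when post-processing in scripts
    $ sstat --jobs=$JOB_ID --format=jobid,avecpu,maxrss,ntasks --parsable2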
For example, let's print out a job's ID, CPU time, maximum RSS, and number of tasks:

.. code-block:: bash

    sstat --jobs=$JOB_ID --format=jobid,cputime,maxrss,ntasks

You can obtain more detailed information about a job using Slurm's ``scontrol`` [#scontrol]_ command. This can be very useful for troubleshooting.

.. code-block:: bash

    $ scontrol show jobid $JOB_ID

    $ scontrol show jobid 38956
    JobId=38956 JobName=test
       UserId=ltaulell(*****) GroupId=psmn(*****) MCS_label=N/A
       Priority=8628 Nice=0 Account=staff QOS=normal
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
       RunTime=00:00:08 TimeLimit=8-00:00:00 TimeMin=N/A
       SubmitTime=2022-07-08T12:00:20 EligibleTime=2022-07-08T12:00:20
       AccrueTime=2022-07-08T12:00:20
       StartTime=2022-07-08T12:00:22 EndTime=2022-07-16T12:00:22 Deadline=N/A
       SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-08T12:00:22
       Partition=Lake AllocNode:Sid=x5570comp2:446203
       ReqNodeList=(null) ExcNodeList=(null)
       NodeList=c6420node172
       BatchHost=c6420node172
       NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
       TRES=cpu=1,mem=385582M,node=1,billing=1
       Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
       MinCPUsNode=1 MinMemoryNode=385582M MinTmpDiskNode=0
       Features=(null) DelayBoot=00:00:00
       OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
       Command=/home/ltaulell/tests/env.sh
       WorkDir=/home/ltaulell/tests
       StdErr=/home/ltaulell/tests/slurm-38956.out
       StdIn=/dev/null
       StdOut=/home/ltaulell/tests/slurm-38956.out
       Power=
       NtasksPerTRES:0

.. [#squeue] You can get the complete list of parameters by referring to the ``squeue`` manual page (``man squeue``).

.. [#scontrol] You can get the complete list of parameters by referring to the ``scontrol`` manual page (``man scontrol``).

.. [#sstat] You can get the complete list of parameters by referring to the ``sstat`` manual page (``man sstat``).
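.. TIP::
    The full ``scontrol`` output above is verbose. When only a few fields matter (state, run time, allocated TRES), a plain shell filter on its output is often enough. This is only a convenience sketch; the field names are taken from the example output above:

    .. code-block:: bash

        # keep only the state, timing and resource lines
        $ scontrol show jobid 38956 | grep -E 'JobState|RunTime|TRES='

        # if watch is available, refresh the same view every 30 seconds
        $ watch -n 30 "scontrol show jobid 38956 | grep -E 'JobState|RunTime|TRES='"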