
ACCRE Commands for Job Monitoring

ACCRE staff have written a number of useful commands that are available for use on the cluster.

rtracejob

rtracejob is used to compare resource requests to resource usage for an individual job. It takes a job id as its single argument. For example:

[bob@vmps12 ~]$ rtracejob 1234567
+------------------+--------------------------+
|  User: bob       |      JobID: 1234567      |
+------------------+--------------------------+
| Account          | chemistry                |
| Job Name         | python.slurm             |
| State            | Completed                |
| Exit Code        | 0:0                      |
| Wall Time        | 00:10:00                 |
| Requested Memory | 1000Mc                   |
| Memory Used      | 13712K                   |
| CPUs Requested   | 1                        |
| CPUs Used        | 1                        |
| Nodes            | 1                        |
| Node List        | vmp505                   |
| Wait Time        | 0.4 minutes              |
| Run Time         | 0.4 minutes              |
| Submit Time      | Thu Jun 18 09:23:32 2015 |
| Start Time       | Thu Jun 18 09:23:57 2015 |
| End Time         | Thu Jun 18 09:24:23 2015 |
+------------------+--------------------------+
| Today's Date     | Thu Jun 18 09:25:08 2015 |
+------------------+--------------------------+

rtracejob is useful for troubleshooting when something goes wrong with your job. For example, you might want to check how much memory a job used compared to how much was requested, or how long a job took to execute relative to the wall time requested. In this example, note that the Requested Memory reported is 1000Mc, meaning 1000 megabytes per core (the “c” stands for “core”). This is the default for jobs that specify no memory requirement. A lowercase “n” on the Requested Memory line instead stands for “node” and appears when a --mem= line is included in a SLURM script, which allocates the listed amount of memory per node in the allocation.
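
For reference, here is a sketch of how the two request styles look in a batch script. The directives are standard SLURM options; the job step and values shown are only illustrative:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00

# Per-core request: 1000 MB for each of the 4 cores.
# rtracejob would report this as 1000Mc.
#SBATCH --mem-per-cpu=1000M

# Alternatively, a per-node request (commented out here);
# rtracejob would report it with an "n" suffix, e.g. 4000Mn.
##SBATCH --mem=4000M

python myscript.py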

qSummary

qSummary provides a summary of the jobs and cores that are active and pending across all groups on the cluster. The results can be filtered to a specific account with the -g option. For example:

[jill@vmps12 ~]$ qSummary
GROUP      USER        ACTIVE_JOBS  ACTIVE_CORES  PENDING_JOBS  PENDING_CORES
-----------------------------------------------------------------------------
science                    18            34             5             7
           jack             5             5             4             4
           jill            13            29             1             3
-----------------------------------------------------------------------------
economics                  88           200           100           100
           emily           88           200           100           100
-----------------------------------------------------------------------------
Totals:                   106           234           105           107

As shown, the output from qSummary provides a basic view of the active and pending jobs and cores across groups and the users within each group. qSummary also supports a -g argument followed by the name of a group, a -p argument followed by a partition name, and a -gpu switch if you would like to see GPU rather than CPU information. For example:

[jill@vmps12 ~]$ qSummary -p pascal -gpu
GROUP      USER        ACTIVE_JOBS  ACTIVE_GPUS  PENDING_JOBS  PENDING_GPUS
-----------------------------------------------------------------------------
science                     4             8             1             2
           jack             0             0             1             2
           jill             4             8             0             0
-----------------------------------------------------------------------------
economics                   4            16             0             0
           emily            4            16             0             0
-----------------------------------------------------------------------------
Totals:                     8            24             1             2
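
Similarly, the -g option restricts the report to a single group. A sketch based on the first example above (the exact formatting may differ):

[jill@vmps12 ~]$ qSummary -g science
GROUP      USER        ACTIVE_JOBS  ACTIVE_CORES  PENDING_JOBS  PENDING_CORES
-----------------------------------------------------------------------------
science                    18            34             5             7
           jack             5             5             4             4
           jill            13            29             1             3
-----------------------------------------------------------------------------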

showLimits

As the name suggests, showLimits displays the resource limits imposed on accounts and groups on the cluster. Running the command without any arguments lists the limits for all accounts and groups on the cluster. Optionally, showLimits also accepts a -g argument followed by the name of a group or account. For example, to see the resource limits imposed on an account named science_account (this account does not actually exist on the cluster):

[jill@vmps12 ~]$ showLimits -g science_account
ACCOUNT         GROUP       FAIRSHARE   MAXCPUS   MAXMEM(GB)  MAXCPUTIME(HRS)
-----------------------------------------------------------------------------
science_account                12        3600       2400           23040
               biology          1        2400       1800               -
               chemistry        1         800        600               -
               physics          1         600        600            8640
               science          1           -       2200           20000
-----------------------------------------------------------------------------

Limits are always imposed on the account level, and occasionally on the group level when multiple groups fall under a single account. If a particular limit is not defined on the group level, the group is allowed access to the entire limit under its parent account. For example, the science group does not have a MAXCPUS limit defined, and therefore can run across a maximum of 3600 cores so long as no other groups under science_account are running and no other limits (MAXMEM or MAXCPUTIME) are exceeded.

We leave FAIRSHARE defined on the account level only, so groups within the same account do not receive elevated priority relative to one another. The value 1 for FAIRSHARE defined at the group level means that all groups under the account receive equal relative priority.

SlurmActive

SlurmActive displays a concise summary of the percentage of CPU cores and nodes currently allocated to jobs, along with the number of memory-starved CPU cores on the cluster. For GPU-accelerated nodes it shows the number of GPUs currently in use.

[bob@vmps12 ~]$ SlurmActive

Standard Nodes Info:      564 of   567 nodes active                     ( 99.47%)
                         5744 of  7188 processors in use by local jobs  ( 79.91%)
                          253 of  7188 processors are memory-starved    (  3.52%)
                         1191 of  7188 available processors             ( 16.57%)

GPU Nodes Info:         Pascal:  18 of 47 GPUs in use                   ( 38.30%)
                        Maxwell: 18 of 48 GPUs in use                   ( 37.50%)

Phi Nodes Info:           0 of   0 nodes active                         (  0.00%)
                          0 of   0 processors in use by local jobs      (  0.00%)
                          0 of   0 processors are memory-starved        (  0.00%)

ACCRE Cluster Totals:     576 of   591 nodes active                     ( 97.46%)
                         5813 of  7428 processors in use by local jobs  ( 78.26%)
                          253 of  7428 processors are memory-starved    (  3.41%)
                         1362 of  7428 available processors             ( 18.34%)

2387 running jobs, 2519 pending jobs, 7 jobs in unrecognized state

Multiple sections are reported. In general, the Standard Nodes Info section is the one most users are interested in, as it corresponds to the default production partition on the ACCRE cluster. The GPU Nodes Info section reports the availability of GPU nodes on the cluster, while the Phi Nodes Info section reports the availability of the Intel Xeon Phi nodes.

SlurmActive also reports the number of memory-starved cores in each section. A core is considered memory-starved if it is available for jobs but does not have access to at least 1GB of RAM (by default, jobs are allocated 1GB of RAM per core). Requesting less than 1GB of RAM per core may provide access to these cores. Note that SlurmActive accepts a -m option followed by an amount of RAM (in GB) if you would like memory-starved cores computed against a different threshold; for example, SlurmActive -m 2 will report cores as memory-starved if they do not have access to at least 2GB of RAM.
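
If your job needs less memory than the default, requesting less than 1GB per core in your SLURM script may allow the scheduler to place it on cores that SlurmActive counts as memory-starved. A minimal sketch (the 500M value is only illustrative; set it to what your job actually needs):

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=500M    # request 500 MB per core instead of the 1 GB default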