
Job Monitoring

A number of commands can be used to monitor jobs, queues, and the machines that jobs run on (a summary sheet of common job monitoring commands is also available).

The qstat command is used to show a summary of jobs on a cluster.

The checkjob command shows detailed information for a single job.

The mdiag -n command shows a summary of the condition of machines in a cluster.  It is generally for use by administrators and requires special permissions.

qstat

The qstat command shows a limited amount of information on a number of jobs in the scheduling system.

The "qstat -a" command shows information on all jobs in the scheduling system, both queued and running.
The "qstat -r" command shows information on all running jobs.
The "qstat -i" command shows information on all non-running jobs (for example, jobs that are queued or held).

The "qstat -f [job_identifier]" command shows a full status display of the given job.  See also "checkjob" below.
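The qstat variants above can also be called from a script. The following is a minimal sketch in Python, assuming qstat is on the PATH; the names QSTAT_VIEWS, qstat_command, and run_qstat are hypothetical helpers invented for this example and are not part of PBS.

```python
import subprocess

# Flags taken from the list above; the dictionary keys are hypothetical labels.
QSTAT_VIEWS = {
    "all": "-a",          # summary listing of jobs
    "running": "-r",      # running jobs only
    "not_running": "-i",  # non-running jobs only
}


def qstat_command(view="all", job_id=None):
    """Build the argument list for a qstat call without running it."""
    if job_id is not None:
        # Full status display for a single job.
        return ["qstat", "-f", str(job_id)]
    return ["qstat", QSTAT_VIEWS[view]]


def run_qstat(view="all", job_id=None):
    """Run qstat and return its text output (only works where qstat is installed)."""
    result = subprocess.run(
        qstat_command(view, job_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the argument list separately from running it makes it possible to check the command on a machine that has no scheduler installed.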


Job ID refers to the job identifier assigned by PBS.
Username refers to the job owner.
Queue refers to the queue in which the job currently resides.
Jobname refers to the job name given by the owner.
SessID refers to the session id (if the job is running).
NDS refers to the number of nodes requested by the job.
TSK refers to the number of CPUs or tasks requested by the job.
Req’d Memory is the amount of memory requested by the job.
Req’d Time is the wall time requested by the job (hh:mm).
S refers to the job's current state:
   E – Job is exiting.
   H – Job is held.
   Q – Job is queued.
   R – Job is running.
Elap Time is the elapsed time since the job started (hh:mm).
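The columns described above can be read into a small record for scripting. The sketch below assumes that each job row consists of exactly eleven whitespace-separated columns in the order listed above; the sample row, the job name, and the names JobRow, STATE_NAMES, and parse_job_row are made up for illustration and are not real scheduler output.

```python
from dataclasses import dataclass

# One-letter state codes from the list above.
STATE_NAMES = {"E": "exiting", "H": "held", "Q": "queued", "R": "running"}


@dataclass
class JobRow:
    job_id: str
    username: str
    queue: str
    jobname: str
    sess_id: str
    nds: str
    tsk: str
    reqd_memory: str
    reqd_time: str
    state: str
    elap_time: str


def parse_job_row(line):
    """Split one job row into its eleven columns (assumed layout)."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    # Values are kept as strings because unset columns may not be numeric.
    return JobRow(*fields)


# Hypothetical sample row, not real output.
SAMPLE_ROW = "12345.head  alice  batch  sim_run  4021  2  16  8gb  12:00  R  03:27"

row = parse_job_row(SAMPLE_ROW)
print(row.jobname, STATE_NAMES.get(row.state, "unknown"), row.elap_time)
```

Header and separator lines in real output would need to be skipped before calling parse_job_row.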

checkjob

The checkjob command shows a large amount of detailed information for a single job.
It is invoked as "checkjob <jobid>"; one or more -v flags can be added for more detail.
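As a small illustration of the invocation form, the following Python sketch builds the argument list for checkjob with a chosen number of -v flags. The function name checkjob_command and its verbosity parameter are hypothetical; only the "checkjob", "-v", and job identifier arguments come from the text above.

```python
def checkjob_command(job_id, verbosity=0):
    """Build the checkjob argument list; verbosity is the number of -v flags."""
    if verbosity < 0:
        raise ValueError("verbosity must not be negative")
    return ["checkjob"] + ["-v"] * verbosity + [str(job_id)]


# Two -v flags, with a made-up job identifier.
print(" ".join(checkjob_command(12345, verbosity=2)))
```

The resulting list can be passed to subprocess.run on a system where checkjob is installed.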

Class refers to the queue that the job is currently running in.
State refers to the state of the job: Idle, Starting, or Running.
Time Queued Total refers to the amount of time the job has spent in the queue.
Time Queued Eligible refers to the amount of queue time that counts when job priority based on queue time is calculated.
If a user submits a large number of jobs, it is possible that only the first few jobs gain eligible queue time.
Required Hostlist is a list of hosts on one or more of which the job must run.
Reserved Nodes is the list of hosts that the job is currently reserved to run on.
StartPriority is the current priority of the job; the higher the value, the better.
The end of the "checkjob -v -v" output lists, for each node, a reason that the job is not being started there.
There may be, and usually is, more than one reason that the job is not being started on a node, but only one reason is shown.

mdiag -n

The "mdiag -n" command shows a summary of the current condition of the machines in a cluster.  It is only available to administrators.

This command is useful on large SMP machines or small clusters where one can get an overview of what is happening on the cluster in a single page.

The State field refers to the state each machine is in: Idle, Busy, or Down.
The Name field refers to the machine (cluster node) name.
The Procs field contains two numbers separated by a colon: the first is the number of processors that are currently free, and the second is the total number of processors available on the machine.
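The two numbers in the Procs field can be turned into a utilization figure. This is a minimal Python sketch that assumes the field is a plain "free:total" string as described above; the function names parse_procs and busy_fraction and the value "4:8" are hypothetical.

```python
def parse_procs(field):
    """Split a Procs value such as '4:8' into (free, total) processor counts."""
    free_text, total_text = field.split(":")
    return int(free_text), int(total_text)


def busy_fraction(field):
    """Return the fraction of processors in use on the machine."""
    free, total = parse_procs(field)
    if total == 0:
        return 0.0
    return (total - free) / total


# Made-up value: 4 of 8 processors are free.
print(parse_procs("4:8"), busy_fraction("4:8"))
```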