About the Scheduler
WestGrid Cluster Scheduler Configuration Options
Rob Simmonds, December 19, 2005
This document provides a very brief overview of some of the functionality of the Moab scheduler used at the WestGrid (WG) compute sites. Moab has many features which will not be covered here. The Moab documentation can be found at:
http://www.clusterresources.com/products/maui/docs/mauiadmin.shtml .
In general, the scheduler starts jobs by taking the first job in a list of pending jobs for which resources are available and starting it. The scheduler continues down the list on each scheduler cycle and starts any jobs for which resources are available, although the number of jobs started per cycle may be limited by a parameter setting to prevent the scheduler from becoming unresponsive. The list of jobs considered by the scheduler is sorted, with the ordering depending on weightings that are set in the scheduler’s configuration file.
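As a rough illustration of the cycle described above, the following Python sketch sorts pending jobs by priority and starts those that fit, up to a per-cycle limit. It is not Moab code: the Job class, the run_cycle function and the max_starts parameter are hypothetical names chosen for this example, and reservations and backfill are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: float  # higher value means earlier in the sorted list
    procs: int       # processors requested


def run_cycle(pending, free_procs, max_starts):
    """One simplified scheduler cycle.

    Walks the pending jobs in descending priority order and starts each
    job that fits in the free processors, stopping once max_starts jobs
    have been started. Returns (names of started jobs, processors left).
    """
    started = []
    for job in sorted(pending, key=lambda j: j.priority, reverse=True):
        if len(started) >= max_starts:
            break
        if job.procs <= free_procs:
            free_procs -= job.procs
            started.append(job.name)
    return started, free_procs


pending = [Job("a", 10, 64), Job("b", 30, 128), Job("c", 20, 32), Job("d", 5, 16)]
started, free = run_cycle(pending, free_procs=100, max_starts=2)
print(started, free)
```

In this example job "b" has the highest priority but does not fit in the 100 free processors, so the cycle moves on and starts "c" and then "a" before the per-cycle limit stops it.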
The main priority mechanism used on WG systems is fair share (FS). This can be configured in many ways, but we base the FS decisions on the use by the members of an accounting group (AG). The FS values are computed over a number of windows, where use in an earlier window has less effect than use in a more recent window. Our systems are configured with seven windows of two days each. Therefore, after two weeks without using a system, an AG’s FS use value will be zero. Each AG has a FS target, and the closer the AG’s use gets to its FS target, the lower the priority given to jobs belonging to that AG. Note that the FS target has to be greater than zero to have ANY effect on the AG’s priority (i.e., if an AG’s FS target is set to zero it will not be part of the fair share calculations).
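The windowed calculation can be sketched in Python as a weighted average in which older windows count for less. The decay factor of 0.8 and the function name fs_usage are assumptions made for this example; this document does not state the decay actually configured on WG systems, and the formula is only one plausible form of such a calculation.

```python
WINDOWS = 7       # number of fair share windows (from the text above)
WINDOW_DAYS = 2   # length of each window in days (from the text above)


def fs_usage(window_usage, decay=0.8):
    """Decay-weighted average of per-window usage.

    window_usage[0] is the most recent window and window_usage[-1] the
    oldest. The decay value is a hypothetical choice for illustration.
    """
    weights = [decay ** i for i in range(len(window_usage))]
    weighted = sum(u * w for u, w in zip(window_usage, weights))
    return weighted / sum(weights)


recent = [40.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # used 40% in the newest window
old = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 40.0]     # used 40% in the oldest window
idle = [0.0] * WINDOWS                          # no use for WINDOWS * WINDOW_DAYS days

print(fs_usage(recent), fs_usage(old), fs_usage(idle))
```

The same amount of use counts for more when it is recent, and an AG that has been idle for all seven windows (14 days) ends up with a use value of zero.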
There are a number of different ways that a FS target can work. Either priority is boosted when use is below the FS target and reduced when it is above, or the target can be set so that only one of these, a boost or a reduction, occurs. These options can be set per AG. The weight that is used to boost or reduce priorities can also be set per AG. Currently on WG systems we set a higher weight on FS targets for AGs that have RAC allocations. We also do not configure AGs with RAC allocations to have their priority reduced when they go over their FS target; this can be changed at the discretion of the RAC.
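A minimal Python sketch of these target modes follows. The linear formula, the mode names and the function fs_priority_delta are all assumptions made for illustration; they are not taken from the Moab configuration syntax.

```python
def fs_priority_delta(usage, target, weight, mode="both"):
    """Priority adjustment from fair share (hypothetical linear form).

    mode "both":        boost below the target, reduce above it
    mode "boost_only":  boost below the target, never reduce
    mode "reduce_only": reduce above the target, never boost
    A target of zero or less takes the AG out of the calculation.
    """
    if target <= 0:
        return 0.0
    delta = weight * (target - usage)
    if mode == "boost_only":
        return max(delta, 0.0)
    if mode == "reduce_only":
        return min(delta, 0.0)
    return delta


# An AG with a 20% target, shown under and over its target.
print(fs_priority_delta(10, 20, weight=2))                     # under target
print(fs_priority_delta(30, 20, weight=2))                     # over target
print(fs_priority_delta(30, 20, weight=2, mode="boost_only"))  # over, no penalty
```

The "boost_only" mode corresponds to the way RAC allocated AGs are described above: their priority rises when they are under target but is not reduced when they go over.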
There are other factors that can affect which jobs can start. The most relevant are reservations, restrictions and backfill.
Reservations can be configured in many ways. They are applied either to a set of specific resources, to a set of general resources, or even to virtual resources. For allocation of CPUs the first two of these are relevant. We can place a reservation on specific systems (e.g., nodes xc001-xc064) or on a number of nodes (e.g., any 64 nodes). The advantage of the latter shows up when a node fails. Suppose we are mapping 128 processor jobs to the reservation. In the first case, if one of the reserved nodes fails, no 128 processor job mapped to this reservation can run. In the second case, when another node becomes available it is added to the set of reserved nodes.
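The difference between the two kinds of reservation can be sketched in Python. The cluster size of 70 nodes, the figure of 2 processors per node (implied by 128 processor jobs on 64 nodes, but not stated here) and both function names are assumptions for this example.

```python
PROCS_PER_NODE = 2  # assumption: 64 nodes serve a 128 processor job


def reserved_nodes_specific(named, up):
    """Reservation on named nodes: a failed node just drops out."""
    return [n for n in named if n in up]


def reserved_nodes_by_count(count, up):
    """Reservation on a node count: any healthy node can fill a gap."""
    return sorted(up)[:count]


all_nodes = [f"xc{i:03d}" for i in range(1, 71)]  # xc001..xc070, hypothetical
named = all_nodes[:64]                            # xc001..xc064
up = set(all_nodes) - {"xc010"}                   # one reserved node has failed

specific = reserved_nodes_specific(named, up)
by_count = reserved_nodes_by_count(64, up)
print(len(specific) * PROCS_PER_NODE, len(by_count) * PROCS_PER_NODE)
```

With xc010 down, the named reservation holds only 126 processors, so a 128 processor job cannot run, while the count based reservation picks up xc065 and still holds 128.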
Jobs can be mapped to reservations in a large number of ways. Reservations are usually configured so that a certain number of Quality of Service (QoS) configurations map to the reservation, and each QoS has rules that determine which jobs map to it. Examples could be jobs using a specified range of processor counts, members of particular AGs, a list of specific users, or jobs needing particular resources, such as access to a particular interconnect. A specific example would be a QoS that accepts 128 processor jobs requesting up to 24 hours of wall clock time, and only when they are submitted by user U102 or user U342. Another QoS could be created that accepts jobs from all users but has a maximum wall clock limit of 10 minutes. Both of these QoS configurations could then be mapped to a single reservation that keeps 128 processors available for jobs that fit into one of these two QoS configurations, with the order in which they start based on their priority. The QoS configurations are very flexible and there are many other properties that could be matched.
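The two example QoS configurations above can be modelled in Python as simple matching rules. The QoS class, its field names and the QoS names "big128" and "short" are hypothetical and do not reflect how Moab represents a QoS internally.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class QoS:
    name: str
    max_walltime_s: int                     # longest wall clock request accepted
    procs: Optional[int] = None             # exact processor count, None = any
    users: Optional[FrozenSet[str]] = None  # allowed users, None = everyone

    def accepts(self, user, procs, walltime_s):
        """True if a job with these properties maps to this QoS."""
        if walltime_s > self.max_walltime_s:
            return False
        if self.procs is not None and procs != self.procs:
            return False
        if self.users is not None and user not in self.users:
            return False
        return True


big = QoS("big128", 24 * 3600, procs=128, users=frozenset({"U102", "U342"}))
short = QoS("short", 10 * 60)
reservation_qos = [big, short]  # both QoS configurations map to one reservation


def eligible(user, procs, walltime_s):
    """Names of the QoS configurations through which a job can use the reservation."""
    return [q.name for q in reservation_qos if q.accepts(user, procs, walltime_s)]


print(eligible("U102", 128, 12 * 3600))
print(eligible("U999", 128, 12 * 3600))
print(eligible("U999", 16, 5 * 60))
```

A job that matches neither QoS gets an empty list and cannot use the reserved processors.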
Reservations can also be made with simpler mapping rules not involving a QoS configuration. For example, a reservation can be made so that only jobs belonging to a single user can run. Suppose a reservation is installed such that, starting from midnight, only jobs belonging to user U103 can run. Any normal job submitted by another user will then not start if its requested wall clock time means that it could still be running after midnight. Jobs belonging to user U103 could start, since they would be able to run during the reserved time. When the system is going to be taken down for maintenance, it is usual to install a reservation to stop any job that could run until after the shutdown time from starting.
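The start decision in the midnight example can be sketched in Python. Times are in hours on a single day for simplicity, the reservation is assumed to be open ended, and the function name can_start is hypothetical.

```python
def can_start(user, now_h, walltime_h, res_start_h, res_users):
    """Decide whether a normal job may start ahead of a reservation.

    A job from a user allowed in the reservation may start. Any other
    job may start only if its full requested wall clock time ends no
    later than the start of the reservation.
    """
    if user in res_users:
        return True
    return now_h + walltime_h <= res_start_h


MIDNIGHT = 24
print(can_start("U200", 20, 6, MIDNIGHT, {"U103"}))  # would run past midnight
print(can_start("U200", 20, 3, MIDNIGHT, {"U103"}))  # finishes by 23:00
print(can_start("U103", 20, 6, MIDNIGHT, {"U103"}))  # allowed in the reservation
print(can_start("U103", 20, 6, MIDNIGHT, set()))     # maintenance: nobody allowed
```

A maintenance reservation is modelled here as one that no user is allowed to run in, so only jobs that finish before the shutdown time can start.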
It should be noted that the Moab scheduler has recently been modified so that a “restartable” job can run even if its run time would overlap with a reservation that it cannot run in. In this case the job can be given a minimum wall clock time and it will start if this minimum time is available. The job will be killed when the reservation time is reached. This scheduling option needs more experimentation before it is recommended to users and is really intended for filling in holes that would not be filled using normal jobs.
Another option in the scheduler configuration is constraints, most of which can be applied per user, per AG or per QoS (note that not all constraints work on all object types). A typical constraint is that a user can use a maximum of X processors at a time, or that they can run a maximum of Y jobs at a time. A constraint can optionally take two values: one that is observed when the system is busy and a second that is used if the system is not fully utilized. For example, you could allow a user to use a maximum of 128 processors if there are no other jobs waiting to be run, or 96 if there are jobs from other users waiting to run.
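The two-valued constraint in this example can be sketched in Python as follows. The function names and the use of a simple "other jobs are waiting" flag to decide which value applies are assumptions made for illustration.

```python
BUSY_LIMIT = 96   # processors per user while other users' jobs are waiting
IDLE_LIMIT = 128  # processors per user when nothing else is waiting


def max_procs_allowed(others_waiting):
    """Pick the limit that applies given the current queue state."""
    return BUSY_LIMIT if others_waiting else IDLE_LIMIT


def may_start(user_procs_in_use, job_procs, others_waiting):
    """True if starting the job keeps the user within the active limit."""
    total = user_procs_in_use + job_procs
    return total <= max_procs_allowed(others_waiting)


# A user already running on 64 processors submits a 64 processor job.
print(may_start(64, 64, others_waiting=False))
print(may_start(64, 64, others_waiting=True))
```

The same job is allowed on a quiet system (128 processors in total) but held back when other users have jobs waiting (limit of 96).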
The final scheduling option that is explained in this document is backfill. The scheduler can be configured such that when a parallel job is ready to be started, a reservation is inserted for the job. This reservation gathers enough processors for the job to run and can lead to some processors being idle while waiting for all the required processors to become available. The number of jobs that can have outstanding reservations is controlled by a scheduler parameter and can be applied to all jobs or to jobs that are mapped to a particular QoS setting. Backfill is used to start short jobs (or restartable jobs) on the processors that would otherwise be idle waiting for the parallel jobs to start.
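A simplified Python sketch of backfill follows. It first finds the time at which a waiting parallel job's reservation can begin, then starts only those short jobs that fit on the idle processors and finish before that time. Both function names are hypothetical, times are in hours from now, and the sketch conservatively treats every currently idle processor as held for the reservation, which a real scheduler need not do.

```python
def reservation_start(free_now, needed, running):
    """Earliest time at which `needed` processors are free.

    running is a list of (end_time, procs) for jobs now executing.
    Returns None if the request can never be satisfied.
    """
    free = free_now
    if free >= needed:
        return 0
    for end, procs in sorted(running):
        free += procs
        if free >= needed:
            return end
    return None


def backfill(candidates, free_now, start):
    """Choose jobs that fit on idle processors and end before `start`.

    candidates is a list of (name, procs, walltime) in priority order.
    """
    chosen = []
    for name, procs, walltime in candidates:
        if procs <= free_now and walltime <= start:
            chosen.append(name)
            free_now -= procs
    return chosen


running = [(4, 32), (2, 16), (6, 64)]
start = reservation_start(free_now=16, needed=64, running=running)
short_jobs = [("x", 8, 3), ("y", 8, 5), ("z", 16, 4), ("w", 8, 1)]
filled = backfill(short_jobs, free_now=16, start=start)
print(start, filled)
```

Here the 64 processor job can start in 4 hours, so "x" and "w" are backfilled onto the 16 idle processors, "y" is skipped because it would run past the reservation, and "z" is skipped because only 8 processors remain after "x" starts.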
