1998 ASR Home
Back
SCD ASR Index
Next
SCD Home

Unified Queuing System and BPS

The primary purpose of the Unified Queuing System project is to establish a consistent batch environment for all systems serving as computational engines within the NCAR environment. In addition, the Unified Queuing System project is designed to establish a flexible, priority-based job queuing and scheduling structure that can be easily adapted to the evolving needs of the NCAR user communities.

Background

Since the UNIX-based UNICOS system was adopted as the operating system on the Cray computers at NCAR, the Network Queuing System (NQS) and its successor product, the Network Queuing Environment (NQE), have been used for batch job scheduling and resource allocation. NQS has become a common product supplied by high-end computer vendors for managing batch jobs. Because of this, SCD believes that as new computational systems are installed and old systems are decommissioned, NQS (or similar batch queuing subsystems) will provide a common subsystem for maintaining some form of continuity in the user environment.

SCD has developed and implemented a layer of software named BPS (Batch Priority Scheduler) to provide enhanced job scheduling capabilities that have not been available in vendor-released NQS systems. BPS provides a number of features that include priority-based pre-emptive job load management, time-based job queue scheduling, and the ability to schedule jobs in a round-robin fashion within the priority-ordered queues. Round-robin scheduling uses NCAR-defined projects, or proposals, to fairly distribute limited computational resources. BPS also gives SCD the ability to respond to special scheduling needs; for example, accommodating jobs with unusually large resource requirements that might not be able to run effectively in a normal job mix.

The Unified Queuing System project unifies the batch processing user interface across all of the computational platforms at NCAR with a common set of queues and job submission rules for all users. This project uses the NQE batch subsystem, front-ended by the SCD-developed BPS system, to meet the unique, demanding, and evolving computational requirements of the NCAR user community.

Standardized queues

A common set of priority-based queues was established on the Parallel Vector Processor (PVP) systems to which users submit batch jobs; the BPS and NQE subsystems schedule system resources based on queue priority. There are four basic queues: "prem" (premium), "reg" (regular), "econ" (economy), and "sb" (standby). The regular queue is meant to be the primary target of most work, and should provide timely job turnaround. The other queues are available for users to selectively discriminate the priority, or urgency, of their work. Premium queue jobs receive top-priority scheduling and pre-empt lower-priority work, if necessary, so that sufficient system resources are made available. Economy queue jobs are allowed to run when regular and higher-priority work does not require pre-emptive access to system resources. Standby queue jobs receive system resources only when little or no other higher-priority work exists. Subqueues are implemented on some systems (in particular, the Community supercomputers chipeta, ouray, and paiute) to establish system-unique scheduling policies such as giving higher priority to large-memory jobs during the nighttime hours.

An example of this new queue structure, that for the Cray J90se systems (ouray and chipeta), is provided in the following table. On these systems, users submit jobs to one of the four production job submission queues, NQS then categorizes the job into one of eight subqueues based on priority and requested memory. Thus a policy of, for example, giving preference to large-memory jobs during nighttime hours can be easily implemented through BPS and NQS.

In addition to the four basic job queues, the Unified Queuing System also recognizes the need for occasional special projects. All systems have a "spec" (special) queue and an "nd" (near-dedicated) queue. The ability of a user to submit work to these queues is enabled and disabled based on SCD policy decisions and special, refereed requirements. Limits for these special queues are relatively unrestrictive, thus allowing a single application access to the full resources of a given system.

Queue structure for the Cray J90se systems (ouray and chipeta)

   User Job              Memory
  Submission     NQS      Limit
    Queue      Subqueue   (MW)        CPU Limit (sec)

    spec**      spec_     1000        864000 (10 hours * 24 CPUs)

     nd**        nd_      1000        864000 (10 hours * 24 CPUs)

     prem        pr_       64         864000 (10 hours * 24 CPUs)

                pr_Mx      275        864000 (10 hours * 24 CPUs)

     reg         rg_       64         864000 (10 hours * 24 CPUs)

                rg_Mx      275        864000 (10 hours * 24 CPUs)

     econ        ec_       64         864000 (10 hours * 24 CPUs)

                ec_Mx      275        864000 (10 hours * 24 CPUs)

      sb         sb_       64         864000 (10 hours * 24 CPUs)

                sb_Mx      275        864000 (10 hours * 24 CPUs)

-----------------------------------------------------------------

** Available for special projects

-----------------------------------------------------------------

Enhancements for Distributed Shared Memory (DSM) architectures

During FY1998, SCD had planned to implement the Unified Queue Structure and the Batch Priority Software on our Hewlett Packard SPP-2000 Exemplar system (sioux) as we transitioned that system into the Community user environment. We had also intended to implement the Unified Queue Structure and the Batch Priority Software on our new Silicon Graphics Cray Research Origin2000 system (ute). However, after extensive testing and analysis, SCD determined that the Unified Queue Structure and the inherent priority-based, preemptive scheduling algorithm used on PVP systems would not run on the DSM systems. This was primarily due to numerous missing or unreliable operating system features in these new systems, such as job checkpoint/restart, high-speed swap filesystems and efficient CPU scheduling algorithms (such as gang and affinity scheduling) in the kernel. Without these supercomputer environment enhancements in the operating system, SCD was forced to substantially rethink methodologies for scheduling work on DSM architectures.

Due to the major differences in how the DSMs need to be scheduled compared with the PVPs, major algorithms in BPS had to be rewritten. SCD developed a new queue structure that better suits the DSM systems, and a new scheduler called Batch Dedicated Scheduler (BDS) was implemented.

DSM queue structures have been set up to allocate groups of processors to jobs, as shown in the following table, and job scheduling algorithms in BDS have been written to throttle the number of jobs so that the total number of executable threads does not exceed the total number of physical processors on the machine. This latter is referred to as reducing or eliminating processor "oversubscription." Testing and evaluation of existing DSM systems has demonstrated a remarkable degradation in individual application performance when processors are oversubscribed.

Queue structure for the Silicon Graphics Cray Origin2000 (ute)

                       Processor
   User Job              count    Memory
  Submission     NQS     Limit    Limit
    Queue     Subqueue    (#)      (MW)   CPU Limit (sec)

    spec**      spec_     128      none        none

   share_16   shar_16_    16       none     346,000 (6 hours * 16 CPUs)

    ded_16     ded_16_    16       none     346,000 (6 hours * 16 CPUs)

    ded_32     ded_32_    32       none     691,200 (6 hours * 32 CPUs)

    ded_64     ded_64_    64       none   1,382,400 (6 hours * 64 CPUs)

   res_64**    res_64_    64       none   1,382,400 (6 hours * 64 CPUs)

-----------------------------------------------------------------------

** Available for special projects

-----------------------------------------------------------------------

BDS was implemented to reduce processor oversubscription while attempting to keep the system highly utilized and maximize system throughput. Because of competing requirements, BDS has been enhanced to allow a flexible, priority-based process for selecting which jobs to run. In addition, it has maintained the round-robin-by-proposal algorithm for selecting those jobs from the input queue.

At the end of FY1998, BDS was running on the Silicon Graphics Cray Origin2000 (ute) in production, and within the limited capabilities provided by the host Irix operating system, providing a mechanism of controlling the job load on ute to allow jobs near-dedicated access to system resources and enforcing a 'fair-sharing' of the system among the six projects given allocations.

1998 ASR Home
Back
SCD ASR Index
Next
SCD Home