Richards Center at Yale University

Last Modified: Wednesday, 30-Apr-2003 15:57:45 EDT

Using Batch Queues in the CSB Core

The CSB policy on batch queues and compute-intensive jobs requires that batch queues be used in many instances. See the Batch Queues and Compute-Intensive Jobs page of the Policies and Practices section.

Summary of Changes, March 20, 2003

  • The command file that gets submitted to DQS (as in, qsub commandfile) must be located in /srv/people, /srv/temp or some other permanently-mounted filesystem. It may not be located in a /csb/$groupname/$N filesystem.
  • Queues have been added on Thor, an EV6 Alpha (like Ajax).
  • Sponge queues have been removed from all slower, EV5 computers.

    Summary of Changes for DQS 3.3.2, May 17, 2001

  • The categories fast and safe have been eliminated.
  • Categories ev5 and ev6 have been added. The fastest Alpha computers, janus and ajax, are ev6; all others are ev5. The category alpha matches any Alpha computer.
  • The qdel command works.

    What are Batch Queues?

    Batch queues are implemented in the CSB Core through the Distributed Queueing System (DQS). Batch queues allow one or more computer programs to be combined into a job, which is executed on one of several computers. Batch queues provide the following features:

  • Jobs can be started (submitted), monitored and controlled from any of the CSB unix computers, regardless of where they will be executed.
  • You can allow the DQS system to guess which computer is the best place to run your job.
  • Your job runs at an appropriate priority, guaranteeing an equitable distribution of computing time.
  • For most jobs which can be run using DQS, CSB policy requires that you use DQS.

  • What Types of Queues are Available?
  • Where Should I Run My Job?
  • Using qload to Find the Best Queue
  • What Specific Queues are Available?
  • Setting up a DQS Job
  • Common DQS Commands
  • Miscellaneous Information
  • How to Get More Information

    What Types of Queues are Available?

    When you submit a job to the DQS, it selects a queue on a particular host based on what you tell it about the resources your job needs. There are several categories of resources:

    1. Maximum CPU time -- the resources short, medium, long and sponge specify that your job will take no more than 1 hour, 8 hours, 5 days and 21 days of CPU time, respectively. On a given host, short jobs are allocated slightly more time than medium jobs, which in turn get slightly more time than long jobs. Sponge jobs, however, will generally get no time while a job in any other category is running. The DQS system will kill jobs which exceed the stated CPU time limit. Note that the time limits are the same for all hosts, regardless of CPU speed, except for queues on the EV6 Alphas (janus and ajax). These, being twice as fast, have half the time limits.
    2. Architecture -- These resources indicate what kind of computer your job is able to run on. Specify sgi if you want your job to run on an SGI computer; specify alpha for a DEC Alpha running Digital Unix. ev5 specifies the older Alpha processors; ev6 is for the new Alphas (about twice the speed of the ev5). linux is for PC/linux computers.
    3. Obsolete -- The categories safe and fast are obsolete and have been removed.
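
The time limits in item 1 can be turned into a small helper that picks the shortest category covering an estimated run time. This is only a sketch: pick_category is a hypothetical name, not part of DQS, and the minute values are simply the limits listed above (halved for ev6). The estimate is assumed to be CPU minutes on the host type you name.

```shell
#!/bin/sh
# pick_category MINUTES [ARCH]   (hypothetical helper, not a DQS command)
# Print the shortest time category whose CPU limit covers MINUTES.
# Limits are the ones stated above: 1 hour, 8 hours, 5 days, 21 days,
# expressed in minutes and halved when ARCH is ev6.
pick_category() {
    need=$1
    arch=${2:-alpha}
    for entry in short:60 medium:480 long:7200 sponge:30240; do
        name=${entry%%:*}
        limit=${entry##*:}
        if [ "$arch" = "ev6" ]; then
            limit=$((limit / 2))
        fi
        if [ "$need" -le "$limit" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no category allows $need CPU minutes on $arch" >&2
    return 1
}

# A job estimated at 6 CPU hours fits a medium queue on an ev5 host, but
# needs a long queue on an ev6 host, where the medium limit is 4 hours.
pick_category 360 ev5
pick_category 360 ev6
```

Because DQS kills jobs that exceed the limit, it is probably wise to leave some margin in the estimate you pass in.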

    To submit a job named test.com to a medium queue on any Alpha computer, you type:

    qsub -l medium,alpha test.com

    For other examples, see the section, Common DQS Commands.

    It is important to correctly specify the resource(s) that you need. For example, if you type, qsub -l med ... instead of qsub -l medium ... your job will wait for a queue with the resource med to become available. Since no such queues exist, your job will never be started.
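
One way to guard against a mistyped resource is to check the list before handing it to qsub. The following is a minimal sketch, assuming the only valid names are the ones described on this page; check_resources is a hypothetical helper, not a DQS command.

```shell
#!/bin/sh
# check_resources LIST   (hypothetical helper, not a DQS command)
# Succeed only if every name in the comma-separated LIST is one of the
# resource names described on this page; otherwise report the offender.
check_resources() {
    [ -n "$1" ] || { echo "no resources given" >&2; return 1; }
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the list on commas
    IFS=$old_ifs
    for res in "$@"; do
        case $res in
            short|medium|long|sponge|sgi|alpha|ev5|ev6|linux) ;;
            *) echo "unknown resource: $res" >&2
               return 1 ;;
        esac
    done
    return 0
}

# "med" is rejected here instead of leaving a job waiting forever.
check_resources medium,alpha && echo "medium,alpha: ok"
check_resources med,alpha || echo "med,alpha: rejected"
```

In practice you might combine the check and the submission, for example: check_resources medium,alpha && qsub -l medium,alpha test.com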

    Back to Top

    Where Should I Run My Job?

    Usually, you should specify -l alpha to run your job on a computer with alpha architecture. These are fastest for most jobs, and have the most queues available. Exceptions are if your job requires either SGI (-l sgi) or Linux (-l linux) architecture.

    Usually, you will want to run your job in the queue with the shortest time limit that still covers it, since shorter queues get higher priority. For example, if you know your job will take about 1 day, you will want to run it in a long queue. The simplest thing, if your job is set up to run on an alpha, is to use the command,

    qsub -l long,alpha (jobname)
    

    The DQS system will try to pick out the queue which will get your job done fastest. However, there are some limitations to DQS' ability to do so. So you may want to specify that your job run on one of the fastest computers:

    qsub -l long,ev6 (jobname)
    

    Of course, that could backfire if there are long waiting lines on all the fast servers, while nobody is using the ev5 computers.

    Finally, you may want to use qload to examine the status of all queues and the loads on each server, then choose exactly which server you want to use.

    Back to Top

    Using qload to Find the Best Queue

    Usually, you want to run on the fastest computer. Also, if your job is big, you want the computer with the largest available memory. Within those constraints, you want the computer with the least competition, that is, where the fewest other jobs are running.

    You can determine the speed and memory of each computer by looking at the table in the next section. The qload script (introduced in April, 2000) lets you know what is currently available. Suppose you want to run a job in a long queue. The following example shows the qload command and the output from it.

    qload long

    Getting qstat... this can take up to 5 seconds.
         used/  
    load avail queue
    ---- ----- ----------
    0.00 0/1   long-artemis
    0.00 0/1   long-athena
    0.00 0/1   long-atlas
    0.00 0/1   long-hercules
    0.00 0/2   long-phe
    0.97 1/1   long-ajax
    1.00 0/2   long-janus
    2.12 1/1   long-darwin
    

    The first column shows the current load on each computer with a long queue. Most have no load. Janus has a load of 1, but since it has 2 CPUs (which you know by reading the next section), it also has an available CPU.

    The second column shows how many long queue slots are in use and how many exist in total on each computer (used/avail). Note that even if you wanted to run your job on ajax, sharing the CPU with the currently-running job, you would have no luck, because its single long queue slot is occupied.

    In this example, your best bet is to specify long-janus. Janus is the fastest computer (slightly faster than ajax), and you will have one of its two CPUs all to yourself (at least until someone else submits a job there).
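
The first part of that reasoning (which queues still have a free slot, and how loaded are their hosts) can be automated. The sketch below assumes qload keeps the three-column layout shown above; free_queues is a hypothetical helper. It does not know CPU speeds, so you still need the table in the next section to choose among the candidates.

```shell
#!/bin/sh
# free_queues   (hypothetical helper)
# Read qload-style output on stdin and print "load queue" for each queue
# whose used count is below its available count, lowest load first.
free_queues() {
    awk '$2 ~ /^[0-9]+\/[0-9]+$/ {
        split($2, slots, "/")
        if (slots[1] + 0 < slots[2] + 0) print $1, $3
    }' | sort -n
}

# Sample input copied from the qload output above. The full queues,
# long-ajax and long-darwin, are dropped from the result.
free_queues <<'EOF'
     used/
load avail queue
---- ----- ----------
0.00 0/1   long-artemis
0.00 0/1   long-athena
0.00 0/1   long-atlas
0.00 0/1   long-hercules
0.00 0/2   long-phe
0.97 1/1   long-ajax
1.00 0/2   long-janus
2.12 1/1   long-darwin
EOF
```

With DQS available, the same filter would be used as: qload long | free_queues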

    Back to Top

    What Specific Queues are Available?

    These queues are available, as of 14 March, 2003. However, the configuration changes faster than the documentation. For a list of active queues, use the command, qstat -f.
     

    DQS Batch queues available in CSB Core. For speeds, higher numbers are faster.

    Host        Type [1]  Speed/#CPUs  Mem (MB)  short (1 hr)    medium (8 hrs)  long (5 days)  sponge (21 days)
    ----------  --------  -----------  --------  --------------  --------------  -------------  ----------------
    hercules    ev5       4/1          1024      short-hercules  --              long-hercules  --
    bhima       ev5       4/1          1024      short-bhima     medium-bhima    --             --
    atlas       ev5       4/1          1024      --              medium-atlas    long-atlas     --
    ajax        ev6       6/1          1536      --              medium-ajax     long-ajax      sponge-ajax
    thor        ev6       6/1          1536      --              medium-thor     long-thor      sponge-thor
    janus       ev6       6/2          2048      short-janus     medium-janus    long-janus     sponge-janus
    emperor[2]  linux     3/2          2048      short-emperor   medium-emperor  --             --
    darwin      sgi       2/2          384       --              medium-darwin   long-darwin    --

    [1] -- ev5 and ev6 queues are also alpha queues.
    [2] -- emperor is also used for interactive work. Your job on emperor may be delayed or killed by interactive users.

    Back to Top

    Setting up a DQS Job

    1. Create a DQS script file for your job. This is simply a script for your favorite shell, containing the commands that you want to execute in batch. Your job will be run from your home directory, so define your file paths accordingly. You can also include qsub command-line options in your script, as shown in the following example:

    #!/bin/csh
    #
    (your normal script goes here).
    

    2. Submit your job to the appropriate queue, using the qsub command.
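
Step 1 mentions that qsub options can be placed in the script itself. The sketch below shows what such a command file might look like. It assumes that DQS reads embedded options from lines beginning with #$ (check the DQS User Guide to confirm), and it is written for sh rather than csh so that it can be tried anywhere. The option values are only examples taken from elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical DQS command file. Lines starting with "#$" are assumed to be
# read by qsub as if they had been typed on the command line; to the shell
# they are ordinary comments.
#$ -l medium,alpha
#$ -cwd
#$ -e test.err
#$ -o test.out

# The real work goes here; these commands are placeholders.
job_host=$(uname -n)
echo "job running on $job_host"
```

If the assumption holds, the file can then be submitted without repeating the options, as in qsub commandfile.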

    Common DQS Commands

    Summary Examples

    qsub -l medium,alpha $cwd/test.com
    qstat -f -l medium,alpha
    qdel 67
    

    Submitting a job -- qsub

    qsub -cwd -l medium,alpha test.com
    qsub -cwd -e test.err -o $cwd/test.out -q short-janus $cwd/test.com
    

    See the DQS documentation or man pages for a list of all possible options. Generally, options can be included on the command line, or in your command file, as shown in the example in the section on Setting up a DQS Job.

    Discovering queue status -- qstat

    qstat
    qstat -f | grep -v "\*\*"
    qstat -f -l long,alpha | grep -v "\*\*"
    

    Typical output from qstat -f -l sponge,alpha | grep -v "\*\*"

    Queue Name                       Queue Type    Quan  Load          State
    ----------                       ----------    ----  ----          -----
    sponge-artemis                   batch         1/2   1.73  er      UP
    sponge-hercules                  batch         0/1   0.00  eru     UNKNOWN
    

    The output shows that there are two queues that are both sponge and alpha. They are on artemis and hercules. There is currently 1 job running on sponge-artemis out of a maximum of 2; sponge-hercules has 0 out of a maximum of 1. Artemis has an average of 1.73 processes competing for its CPU; hercules reports no active processes. The symbols er indicate that the queue is enabled and running. The symbols eru, together with the state UNKNOWN, mean that the queue is in an unknown state, that is, communication with the DQS system on hercules has been lost. If you see this, notify a staff member.
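
If you check queue status often, a short filter can pick out the queues that need attention. This is a sketch that assumes the column layout of the sample output above (queue name first, slot count third, state last); down_queues is a hypothetical helper, not a DQS command.

```shell
#!/bin/sh
# down_queues   (hypothetical helper)
# Read "qstat -f" style output on stdin and print the name and state of
# every queue line whose last column is not UP.
down_queues() {
    awk '$3 ~ /^[0-9]+\/[0-9]+$/ && $NF != "UP" { print $1, $NF }'
}

# Sample input copied from the qstat output above.
# Prints: sponge-hercules UNKNOWN
down_queues <<'EOF'
Queue Name                       Queue Type    Quan  Load          State
----------                       ----------    ----  ----          -----
sponge-artemis                   batch         1/2   1.73  er      UP
sponge-hercules                  batch         0/1   0.00  eru     UNKNOWN
EOF
```

With DQS available, the filter would be used as: qstat -f | grep -v "\*\*" | down_queues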

    Deleting a job -- qdel

    Each DQS job has a unique job-id. The job id is reported when you submit the job, and can be discovered at any time using the command qstat. To delete a job, use the command

    qdel jobid
    

    Back to Top

    Miscellaneous Information

    DQS Initialization Issues

    (Note -- this has not been verified under DQS 3.3.2, but is probably valid). When DQS starts your job, it initially uses sh instead of csh or tcsh. Therefore, it does not execute the system-wide /etc/cshrc file. That means that some environment variables don't get set the way they do when you start an interactive session. One of these is the path variable. In addition, you might get some (hopefully) harmless error messages in your output.
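
A simple defence is to set the search path explicitly near the top of your command file rather than relying on login files. The sketch below uses sh syntax, and the three directories are placeholders; substitute whatever your job actually needs. In a csh script the equivalent would be a set path = ( ... ) line.

```shell
#!/bin/sh
# Put the directories the job needs at the front of PATH instead of
# relying on login files. The three directories below are placeholders.
PATH=/usr/local/bin:/usr/bin:/bin${PATH:+:$PATH}
export PATH

# Record the setting in the job output to make later debugging easier.
echo "PATH=$PATH"
```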

    DQS Scheduling Algorithm

    (Note -- this has not been verified under DQS 3.3.2, but is probably valid). Presently, DQS Version 3.1.8 is implemented using the nice priority scheme. Under this scheme, on SGI computers, if a short job and a long job are running alone on a host, the short job will receive about 2/3 of the available time. A medium job will receive about 62% of the time in competition with a long job.

    However, DEC Alphas use a different scheduling algorithm. Tests indicate that all jobs get roughly an equal slice of available time, except sponge jobs. These will get no time, as long as any other job is competing.

    Back to Top

    How to Get More Information

    The documentation that came with the DQS system is available. DQS is a large and complex system, of which we use only a small part, so the full document set can be overwhelming.

    A User Guide describes the concepts and workings of DQS.

    The Reference Manual gives all the commands.

    The Installation and Maintenance manual might be of some help to the staff.

    Richards Center (www.rc.yale.edu) at Yale University (www.yale.edu)
    Contact: michael^strickler_at_yale^edu