Advanced Job Submission

This document covers more advanced job submission options with an emphasis on submitting parallel jobs. Submission of parallel jobs to SGE requires the specification of a "parallel environment" (PE). A parallel environment is a construct that tells SGE how to allocate processors and nodes when a parallel job is submitted. A parallel environment can also run scripts that set up a proper environment for the job, so there are times when a PE is used for a specific application.

Types of parallel jobs

Before discussing what the SGE parallel environments look like, it is useful to describe the different types of parallel jobs that can be run. 

  1. Shared Memory. This is a type of parallel job that runs multiple threads or processes on a single multi-core machine. OpenMP programs are a type of shared memory parallel program.

    The OMP_NUM_THREADS variable is set to '1' by default. If your code can take advantage of threading, then set OMP_NUM_THREADS equal to the number of cores requested per node.

  2. Distributed Memory. This type of parallel job runs multiple processes over multiple processors with communication between them. This can be on a single machine but is typically thought of as going across multiple machines. There are several methods of achieving this via a message passing protocol but the most common, by far, is MPI (Message Passing Interface). There are several implementations of MPI but we have standardized on OpenMPI. This integrates very well with SGE by
    1. natively parsing the hosts file provided by SGE
    2. yielding process control to SGE
  3. Hybrid Shared/Distributed Memory. This type of parallel job uses distributed memory parallelism across compute nodes, and shared memory parallelism within each compute node. There are several methods for achieving this but the most common is OpenMP/MPI.

Shared memory

Shared memory jobs are fairly straightforward to set up with SGE. There is a shared memory parallel environment called smp. This PE is set up to ensure that all of the requested slots reside on a single node.

It is important, therefore, to know how many slots are available in the machines. For example, a node with 16 cores will have 16 slots available in the queue instances of that host. Note that it may be necessary to specify the number of processors and/or number of threads that your computation will use either in the input file or an environment variable. Check the documentation of your software package for instructions on that. Once the number of processors and/or threads desired is determined simply pass that information to SGE with the smp parallel environment.

qsub -pe smp 12 myscript.job
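
If the program is threaded with OpenMP, the job script can set OMP_NUM_THREADS from the NSLOTS variable, which SGE sets to the number of slots allocated to the job. A minimal sketch of myscript.job, assuming myprogram is an OpenMP program:

myscript.job
#!/bin/sh
# NSLOTS is set by SGE to the number of slots allocated to the job (12 in the example above)
export OMP_NUM_THREADS=$NSLOTS
myprogram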

Distributed memory

Distributed memory jobs are a little more involved to set up than shared memory jobs, but they are well supported in SGE, particularly using MPI. The MPI implementations will get the hostfile (also called the machinefile), as well as the number of slots/cores requested, from SGE. As long as the number of processor cores requested is equal to the number of processor cores that you want the job to run on, an MPI job script can be as simple as

Basic MPI script
#!/bin/sh
mpirun myprogram

In the above case, the number of processors to use only needs to be specified as a parameter to the SGE parallel environment.

If you want to run your job with fewer processors than requested from SGE, or if you want control over how processes are distributed on the allocated nodes, then you will need to use a PE that specifies the number of cores to use per node. In the case of OpenMPI, there are also mpirun flags that can override the SGE allocation rules.

There are several SGE parallel environments that can be used for distributed memory parallel jobs. Note that parallel applications must make use of the host/machine file that SGE makes available. MPI is the most common framework for distributed parallel applications. The parallel environments are:

  • orte (Open Run Time Environment)
    This will grab whatever slots are available anywhere on the cluster. 

    As discussed below, this is seldom the best option.

  • Xcpn, where X is the number of cores per node
  • various application specific parallel environments

The orte PE is a common parallel environment normally associated with OpenMPI. While common, it has a very limited set of use cases and is usually not the best choice. It allows you to specify the number of slots needed for your job, and SGE will find available job slots from anywhere on the cluster, possibly including nodes currently running other jobs in other slots. It is useful if you just need available slots that can be allocated as quickly and easily as possible, and you don't care if your job shares nodes running other job processes at the same time. Despite technical safeguards, it is possible for processes from different jobs to interfere with each other on the same compute node, and sometimes even crash the node. Therefore, compared to other available PEs, a disadvantage of using the orte PE is that it exposes your job to the possibility of running on nodes with other jobs.
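
If the orte PE does fit your use case, submitting the basic MPI script shown above with 32 slots (the slot count is just illustrative) would look like:

qsub -pe orte 32 myscript.job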

MPI jobs also tend to be timing-sensitive, so it is usually desirable to run them on nodes without the presence of other job processes for performance reasons as well. This is the purpose of the Xcpn parallel environments, which only run on whole nodes where they occupy all of the slots. Therefore, the total number of slots requested must be a multiple of X for the Xcpn PEs. For example, to request an MPI job which will use all 40 slots on each of 2 nodes:

This will request exactly 2 nodes.

qsub -pe 40cpn 80  myscript.job

If your job can only take advantage of 64 total processes, and you are using 40-slot nodes but you still want whole nodes, request the same resources but modify your mpirun invocation as shown below:

Using fewer ranks than slots requested
#!/bin/sh
#$ -pe 40cpn 80 
mpirun -n 64 --map-by node myprogram

The example above will allocate all 40 slots on two 40-slot nodes, totaling 80 slots as before. Also as before, this means SGE allocates all slots on both nodes, leaving no slots free to run other jobs as long as your job is running. However, the -n 64 option passed to mpirun will limit the number of MPI ranks launched to 64, and the --map-by node flag will distribute them evenly in round-robin fashion among all involved nodes, placing 32 ranks on each of the two nodes in this example. With OpenMPI, there are additional mpirun parameters that can further control the distribution. As with any job, there must be enough slots to satisfy whichever of its memory and processor requirements is greater. If you have a job where each MPI rank requires more than one slot's worth of memory on the nodes you use, then the job will fill each node's memory without its ranks fully populating the node's processors. In this case you would need to limit your MPI program to use fewer ranks than the number of slots you requested on each node, and request more nodes to fit all of your ranks into memory.
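
As an illustrative sketch (the numbers are hypothetical, not a recommendation): if each rank of myprogram needs roughly two slots' worth of memory on 40-slot nodes, you could request four whole nodes but launch only half as many ranks as slots:

Fewer ranks per node for memory
#!/bin/sh
#$ -pe 40cpn 160
# 80 ranks mapped round-robin over 4 nodes = 20 ranks per node,
# leaving each rank about two slots' worth of memory
mpirun -n 80 --map-by node myprogram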

The orte PE cannot be used for jobs that need to run fewer processes than the number of slots allocated. This is because the distribution of job slots across nodes is unknown a priori.

Hybrid Shared/Distributed memory

This type of parallel job is the most complex to set up. The most common type of hybrid job uses MPI with OpenMP, so that is the example shown here. The first thing to determine is the appropriate mix of OpenMP threads and MPI processes. That will be very application-dependent and may require some experimentation. The number of OpenMP threads to use per MPI process needs to be specified by overriding the OMP_NUM_THREADS environment variable, otherwise the default value remains 1. The number of MPI processes is specified as an option to the mpirun command as in previous examples. You must also request from the SGE parallel environment at least as many slots as the total number of threads and processes.

Since one cannot know a priori how many nodes will be allocated when using the orte PE, it will not work for hybrid parallel applications.

For example, if you need to configure your 64-thread hybrid job to run on two 40-slot nodes with 32 OpenMP threads and 1 MPI rank on each node, the job script should contain the following:

myscript.job
#!/bin/sh
# Example hybrid parallel setup
#$ -pe 40cpn 80
export OMP_NUM_THREADS=32
# with OpenMPI, -x forwards OMP_NUM_THREADS to the ranks launched on remote nodes
mpirun -x OMP_NUM_THREADS -n 2 --map-by node myprogram

Other parallel environments

In addition to the SGE parallel environments mentioned above, there are a few other special purpose PEs, mostly for commercial software.

  • fluent → for running Ansys Fluent. This has hooks in it to handle fluent startup/shutdown, as well as limits for license tokens.
  • gaussian-sm → This is for running Gaussian09 in shared memory mode.
  • gaussian-linda → This is for running Gaussian09 in distributed memory and hybrid shared/distributed memory modes.

Resource requests

The Argon cluster is a heterogeneous cluster, meaning that it consists of different node types with varying amounts and types of resources. There are many resources that SGE keeps track of and most of them can be used in job submissions. However, the resource designations for machines based on CPU type, memory amount, and GPU are more likely to be used in practice. Note that there can be very different performance characteristics for different types of GPUs and CPUs. As noted above, the Argon cluster is split between two data centers,

  • ITF → Information Technology Facility
  • LC→ Lindquist Center

As we expand the capacity of the cluster, the datacenter selection could be important for multinode jobs, such as those that use MPI, that require a high speed interconnect fabric. The compute nodes used in those jobs would need to be in the same data center. Currently, all nodes with the OmniPath fabric are located in the LC datacenter, and all nodes with the InfiniBand fabric are in the ITF datacenter. If you are running MPI jobs, you will want to make sure that you do not run jobs with mixed fabrics if the job will run in a queue with multiple node types. The best way to do that is to include a fabric resource request in your job submission.
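
For example, a multinode MPI job could be pinned to a single fabric type by adding the fabric resource (described in the table below) to the submission; the PE and slot count here are just illustrative:

qsub -pe 40cpn 80 -l fabric=omnipath myscript.job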

The investor queues will have a more limited variability in machine types. However, the all.q queue contains all machines and when running jobs in that queue it may be desirable to request specific machine types. The following table lists the options for the more common resource requests. They would be selected with the '-l resource' flag to qsub. Note that using these is a way to filter nodes, restricting the pool of available nodes to more specific subsets of nodes. These are only required if you have a need for specific resources.

Full Resource Name    Shortcut Resource Name    Notes
std_mem               sm                        deprecated; use mem_128G
mid_mem               mm                        deprecated; use mem_256G
high_mem              hm                        deprecated; use mem_512G
mem_64G               64G
mem_96G               96G
mem_128G              128G
mem_192G              192G
mem_256G              256G
mem_384G              384G
mem_512G              512G
mem_768G              768G
mem_1536G             1536G
mem_per_slot          mps                       Select nodes based on the amount of memory per slot the node has.
cpu_arch              cpu_arch                  Select nodes by CPU architecture; see the list of values below.
datacenter            dc                        ITF or LC
fabric                fabric                    none*, omnipath, or infiniband
gpu                   gpu                       Select any available GPU type.
gpu_k80               k80
gpu_p100              p100
gpu_v100              v100
gpu_p40               p40
gpu_1080ti            1080ti
gpu_2080ti            2080ti
gpu_titanv            titanv
gpu_titanv_jhh        titanv_jhh                This is a special edition Titan V.
gpu_rtx8000           rtx8000
ngpus                 ngpus                     Specify the number of GPU devices that you wish to use.
nvlinks               nvl                       Specify the number of NVLink bridges that you wish to use. This will automatically set ngpus. There are two GPU cards per NVLink bridge.
localscratch_free     lscr_free                 Amount of scratch space needed for the job.

* no high speed interconnect fabric

Newer CPUs have support for higher performance streaming SIMD instructions; the order from lowest to highest is SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512. Note that if you compile your code with support for AVX2 or higher then you must run on a processor that supports it. Otherwise, you will get an illegal instruction message. The cpu_arch values are:

  • broadwell (supports up to AVX2 SIMD)
  • skylake_silver (supports up to AVX512 SIMD)
  • skylake_gold (supports up to AVX512 SIMD)
  • cascadelake_gold (supports up to AVX512 SIMD)

Boolean resources, such as the "gpu_*" resources, can be selected in a fine-grained way by specifying exactly the one that is desired, as in

qsub -l gpu_v100=true

If there are specific types to avoid, then use something like:

qsub -l gpu=true -l gpu_k80=false -l gpu_1080ti=false

For string resources, such as "cpu_arch", selection is not as fine-grained. You can select a specific CPU architecture with

qsub -l cpu_arch=cascadelake_gold

or you can exclude a single value with:

qsub -l cpu_arch='!skylake_silver'

Note that the quoting is important.

GPU resources

If you wish to use a compute node that contains a GPU then it must be explicitly requested in some form. The table above lists the Boolean resources for selecting a specific GPU, or any one of the types, with the generic gpu resource.

For example, if you run a job in the all.q queue and want to use a node with a GPU, but do not care which type,

qsub -l ngpus=1

If you specifically wanted to use a node with a P100 GPU,

qsub -l gpu_p100=true

or use the shortcut,

qsub -l p100=true

In all cases, requesting any of the GPU Boolean resources will set the ngpus resource value to 1 to signify to the scheduler that 1 GPU device is required. If your job needs more than one GPU, then that can be specified explicitly with the ngpus resource. For example,

qsub -l ngpus=2

Note that requesting one of the *-GPU queues will automatically set ngpus=1 if that resource is not otherwise set. However, you will have to know what types of GPUs are in those queues if you need a specific type. Investor queues that have a mix of GPU and non-GPU nodes, i.e., those without the -GPU suffix, will need an explicit GPU request. Since ngpus is a consumable resource, once the GPU device is in use it is not available for other jobs on that node until it is freed up. If you wish to run non-GPU jobs on the node in tandem with a GPU job then specify ngpus=false for the non-GPU job(s).


Having to set ngpus=false rather than ngpus=0 is due to a quirk in how the backend processes treat Boolean representations of TRUE/FALSE and 0/1. Using the value of 'false' is the only way to have it behave correctly.
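
For example, a non-GPU job submitted alongside GPU jobs on such a node might be marked like this (the script name is hypothetical):

qsub -l ngpus=false mycpujob.job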

In addition to the ngpus resource, there are some other non-Boolean resources for GPU nodes that could be useful to you. With the exception of requesting free memory on a GPU device, these are informational.

Resource               Description                                     Requestable
gpu.ncuda              number of CUDA GPUs on the host                 NO
gpu.ndev               total number of GPUs on the host                NO
gpu.cuda.N.mem_free    free memory on CUDA GPU N                       YES
gpu.cuda.N.procs       number of processes on CUDA GPU N               NO
gpu.cuda.N.clock       maximum clock speed of CUDA GPU N (in MHz)      NO
gpu.cuda.N.util        compute utilization of CUDA GPU N (in %)        NO
gpu.names              semi-colon-separated list of GPU model names    NO

For example, to request a node with at least 2G of memory available on the first GPU device:

qsub -l gpu.cuda.0.mem_free=2G

When there is more than one GPU device on a node, your job will only be presented with unused devices. Thus, if a node has two GPU devices and your job requests one (ngpus=1), then the job will only see a single free device. If the node is shared, then a second job requesting a single GPU will only see the device that is left available. Thus, you should not have to specify which GPU device to use for your job.

Memory

Job slots are a type of resource you can request when submitting a job to the cluster. Each compute node has some number of slots, and each requested slot gives the job permission to use one logical processor core and some amount of memory which depends on how many such processors and how much memory exist on the compute node.

Generally, jobs run in some number of slots using a parallel environment. If the job doesn't specify a request for PE and slot count, it generally receives one slot in the smp PE by default. Jobs using the smp PE occupy only the number of slots requested on their compute node, leaving any other slots available to run other jobs, effectively allowing multiple jobs to share a compute node without interfering. Note that jobs in the UI and all.q queues will have per process memory limits set by default, except those which request whole compute nodes as described here.

When determining how many slots your job should request, consider A) the number of slots needed for its maximum memory usage, and B) the maximum number of concurrent processes it runs. Request at least as many slots as the larger number, possibly adding a modest amount extra as a safety margin if memory usage isn't precisely known. It's important to request enough slots for your job to complete. However, if you request significantly more than required, those resources will be reserved for your job but remain idle, and other jobs will have to wait. The goal is to find a reasonable balance that lets your jobs run reliably.

Most programs run in the normal way, using the processors and memory of a single computer, whereas MPI programs typically span multiple computers. In either case, you need to request at least enough slots to provide the maximum amount of memory your job will use at any point during execution, otherwise the job will exceed its limit and the cluster software will terminate it to protect the system and any other jobs on the same compute node. If your job was terminated, you can inspect its statistics like so:

qacct -j $JOBID

If the maxvmem field shows it reached a memory limit, you should specify resource requests to ensure it receives enough memory when you resubmit it. Note that statistics for memory usage are reported for the entire job, regardless of how many processes the job runs.

Similarly, you can query a job's memory usage at a particular moment while it runs using the qstat command, looking at the vmem and maxvmem entries in the usage output to see if they are near a limit. This is useful in cases where you believe your job's memory usage is roughly constant over time, so that any moment you run the command yields a representative sample. For example:

qstat -j 411298 | grep usage
usage    1:                 cpu=00:23:19, mem=246.85762 GBs, io=0.01576, vmem=433.180M, maxvmem=433.180M

For example, if you know your job runs a single-threaded program which requires 12G of memory for your analysis, a good strategy is to request enough slots in the smp PE to guarantee that amount regardless of what type of node the cluster software finds to run it. If you know the smallest slots are 2G, then you might request 6 slots; or if you know the smallest slots are only 1G, you would need to request at least 12; etc.:

#$ -pe smp 12
myprogram ...

The same principle applies for MPI jobs using an Xcpn PE to request whole nodes. For example, a 24-process MPI job requiring 8 GB per process would have to use a request like the following to use 56-slot nodes with 128G each:

#$ -pe 56cpn 112
mpirun -n 24 --map-by node program ...

The above would allocate all 56 slots on each of two 56-core nodes and run 12 MPI processes on each node, using 96G per node. Requesting all slots on both nodes guarantees the nodes will not be shared with other jobs.

Another method is to request that a suitably large amount of certain types of memory be free when the job starts; this ensures the job will only be scheduled on a node you know has slots at least large enough that the number you request provides enough memory for the job. Examples:

#$ -l mf=128G  # Require machines with 128G of free memory, effectively allowing only nodes larger than 128G.
#$ -l sf=2G    # Require machines with 2G of free swap.
#$ -l mt=96G   # Require machines with a total of 96G of memory.

Note that these are not resource limits enforced by the cluster software; they are only a way to require that resources are in the specified state at the time the job is scheduled.

localscratch

If a job will be using a significant amount of space in the /localscratch directory then it is best practice to request that amount with the localscratch_free resource. For example, if 500G is needed:

#$ -l localscratch_free=500G

Unlike the similar mem_free resource, the localscratch_free resource is consumable and will immediately reduce the reported free space on /localscratch by the amount requested. That helps keep other jobs that also need SSD space from landing on a node with insufficient space. The more accurate the request is with regard to what is actually used, the more efficient the scheduling can be. If your job does use a lot of scratch space, please make sure to clean up the space when the job is done. Otherwise, future jobs will be blocked due to insufficient space. One easy way to accomplish this is to use the job directory pointed to by the $TMPDIR environment variable.
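
A minimal sketch of how that might look, assuming myprogram can be run from (or pointed at) a scratch directory:

#$ -l localscratch_free=500G
# $TMPDIR points to a per-job directory on the node's local scratch space,
# which is cleaned up automatically when the job ends
cd $TMPDIR
myprogram ...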

hosts

# to run on a specific machine
#$ -l h=compute-2-36

Queues

Queue requests are governed by access lists. Please make sure you have access to a queue before requesting it, or your job will fail to launch. To see which queues you have access to, run the following command.

whichq

# to request the CGRER investor queue.
#$ -q CGRER

If you have access to multiple queues then you can combine the resources by specifying multiple queues for the queue resource.

#$ -q CGRER,COE

One should not specify the all.q queue in a queue submission request along with an investor queue. In that scenario, the processes that land in the investor queue could immediately evict processes that land in the all.q queue, which will have the undesired effect of suspending the processes in the investor queue as well.

Also, the development queue should only be used for testing and not used for production jobs.

Job priorities

If you have not been using the cluster much recently, your jobs will likely have a higher priority than those of other users who are submitting lots of jobs, and your job should move towards the front of the queue. Conversely, if you have been submitting a lot of jobs, your job priorities will be lowered. Priority also depends on the number of job slots requested: parallel jobs get a higher priority in proportion to the number of requested slots. This is done in an effort to ensure that jobs with larger slot counts do not get starved while waiting for resources to free up.

Checkpoint/Restart

Many applications that run long-running computations will create checkpoint files. These checkpoint files are then used to restart a job at the point where the checkpoint was last written. This capability is particularly useful for jobs submitted to the all.q queue, as those jobs may be evicted so that investor jobs can run. SGE provides a mechanism to take advantage of checkpoints if your application supports them. SGE sets an environment variable (RESTARTED) that can be checked if the user specified that a job is restartable. To take advantage of the application's checkpoint capability, the job script should check the value of the $RESTARTED environment variable. A simple "if...then" block should be sufficient.

if [ "$RESTARTED" = "1" ]; then
    # ...
    # some commands to set up the computation for a restart
    # ...
fi

Additionally, the job needs to be submitted with the '-ckpt user' option to the qsub command. This is what tells SGE that it is possible to restart a job. Without that option, SGE will not look for the $RESTARTED environment variable. Even if your application does not perform checkpoint operations, the checkpoint/restart capability of SGE can still be used to restart your job. In this case, it would just restart from the beginning. This has the effect of migrating the job to available nodes (if there are any) after a job eviction or simply re-queueing the job. 
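
For example, to submit a job with checkpointing enabled (the script name is illustrative):

qsub -ckpt user myscript.job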

Not all jobs can be restarted and those that can need to be set up to do so. It is your responsibility to set up the commands in the job script file that will allow the computation to restart. In some cases, this is simply using the latest output file as an input file. In other cases an input file with different options is needed.

Note that there is a second circumstance under which the RESTARTED variable will get set to a non-zero value. That is when the -r (rerun) option is specified for qsub and the job aborts on a node that crashes. It is not always predictable how a job will react on a crashed node so this option may or may not work for any given job. It will not hurt to specify it but the -ckpt facility is the option to use for restarts in the event of job eviction from the all.q queue. Also, the ckpt facility will migrate jobs in the event of a clean shutdown of a compute node, such as a reboot, or any other event that cleanly shuts down the sge_execd process on the node.

Job Dependencies

There are times when you want to make sure one job is completed before another one is started. This can be accomplished by specifying job dependencies with the -hold_jid flag of qsub. That flag takes a JOB_ID as a parameter, and that JOB_ID is the job that needs to complete before the current job submission can be launched. So, assuming that job A needs to complete before job B can begin:

qsub test_A.job
Your job 3808141 ("test_A.job") has been submitted

That will return a JOB_ID in the output. Use that in the subsequent job submission

qsub -hold_jid 3808141 test_B.job

You will see something like the following in the qstat output

qstat -u gpjohnsn
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3808141 0.50233 test_A.job gpjohnsn     r     01/15/2014 15:40:30 sandbox@compute-6-174.local        1        
3808142 0.00000 test_B.job gpjohnsn     hqw   01/15/2014 15:40:53                                    1        

The test_B job will not begin until the test_A job is complete. There is a handy way to capture the JOB_ID in an automated way using the -terse flag of qsub.

hold_jid=$(qsub -terse test_A.job)
qsub -hold_jid $hold_jid test_B.job

The first line will capture the output of the JOB_ID and put it in the hold_jid variable. You can use whatever legal variable name you want for that. The second line will make use of the previously stored variable containing the JOB_ID.

If the first job is an array job you will have to do a little more filtering on the output from the first job.

hold_jid=$(qsub -terse test_A.job | awk -F. '{print $1}')
qsub -hold_jid $hold_jid test_B.job

If both jobs are array jobs then you can specify array dependencies with the -hold_jid_ad flag to qsub. In that case the array jobs have to be the same size, and the dependencies are on the tasks, such that when task 1 of test_A is complete, task 1 of test_B can start.
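
A minimal sketch, assuming test_A.job and test_B.job are array jobs of the same size:

hold_jid=$(qsub -terse test_A.job | awk -F. '{print $1}')
qsub -hold_jid_ad $hold_jid test_B.job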