Page Comparison

...

If your job does not use the system openmpi, or does not use MPI at all, then any desired core binding must be set up with whatever mechanism the software provides; otherwise, there will be no core binding. Again, that may not be a major issue. If your job does not work well with hyperthreading (HT), run it on a number of cores equal to half the number of slots requested, and the OS scheduler will minimize contention.
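The half-the-slots advice can be sketched in a job script as follows. This is a minimal illustration, not a site-provided recipe: NSLOTS is the slot count that SGE sets inside a running job, while the `half_slots` helper, the fallback of 2 slots, the use of OMP_NUM_THREADS (which only applies to OpenMP programs), and the program name are assumptions made for the example.

```shell
#!/bin/bash
# Sketch: run on half of the granted slots for software that does not
# benefit from hyperthreading.

half_slots() {
    # $1: number of slots granted to the job; the result is never below 1
    local slots=$1
    local half=$(( slots / 2 ))
    if [ "$half" -lt 1 ]; then
        half=1
    fi
    echo "$half"
}

# NSLOTS is set by SGE inside a job. The fallback of 2 is only here so the
# sketch also runs outside of a job (assumption for illustration).
threads=$(half_slots "${NSLOTS:-2}")

# ASSUMPTION: the program is threaded with OpenMP and honors OMP_NUM_THREADS.
export OMP_NUM_THREADS="$threads"
echo "running with $threads threads"

# ./my_program    # hypothetical executable
```

Software that is not OpenMP based will need its own option or environment variable for setting the thread or process count.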

New SGE utilities

While SoGE is very similar to previous versions of SGE, there are some new utilities that people may find of interest. There is a manual page for each of them.

...

In addition to the above, the HPC systems have some nodes that are not part of any investor queue. These are in the /wiki/spaces/hpcdocs/pages/76513448 and are used for node rentals and future purchases. The number of nodes for this purpose varies.

Resource requests

The Argon cluster is a heterogeneous cluster, meaning that it consists of different node types with varying amounts and types of resources. SGE keeps track of many resources, and most of them can be used in job submissions. However, the resource designations for machines based on CPU type, memory amount, and GPU are the ones most likely to be used in practice. Note that different types of GPUs and CPUs can have very different performance characteristics. As noted above, the Argon cluster is split between two data centers:

ITF → Information Technology Facility
LC → Lindquist Center

As we expand the capacity of the cluster, the data center selection could be important for multi-node jobs, such as those that use MPI, that require a high speed interconnect fabric. The compute nodes used in those jobs would need to be in the same data center.

Info
Currently, all nodes with the OmniPath fabric are located in the LC datacenter. All nodes that have the `56cpn` parallel environment have access to the OmniPath fabric.

...

hm
deprecated

...

broadwell
skylake_silver

...

ITF
LC

...

none^*
omnipath

* no high speed interconnect fabric

...

p40

...

GPU resources

If you wish to use a compute node that contains a GPU, then the GPU must be explicitly requested in some form. The table above lists the Boolean resources for selecting a specific GPU type, or any one of the types with the generic gpu resource.

For example, if you run a job in the all.q queue and want to use a node with a GPU, but do not care which type,

qsub -l ngpus=1

If you specifically wanted to use a node with a P100 GPU,

qsub -l gpu_p100=true

or use the shortcut,

qsub -l p100=true

In all cases, requesting any of the GPU Boolean resources will set the ngpus resource value to 1 to signify to the scheduler that 1 GPU device is required. If your job needs more than one GPU, then that can be specified explicitly with the ngpus resource. For example,

qsub -l ngpus=2

...

In addition to the ngpus resource, there are some other non-Boolean resources for GPU nodes that could be useful to you. With the exception of requesting free memory on a GPU device, these are informational.

...

number of CUDA GPUs on the host

...

number of OpenCL GPUs on the host

...

total number of GPUs on the host

...

free memory on CUDA GPU N

...

number of processes on CUDA GPU N

...

maximum clock speed of CUDA GPU N (in MHz)

...

gpu.cuda.N.util

...

compute utilization of CUDA GPU N (in %)

...

maximum clock speed of OpenCL GPU N (in MHz)

...

global memory of OpenCL GPU N

...

semicolon-separated list of GPU model names

...

For example, to request a node with at least 2G of memory available on the first GPU device:

qsub -l gpu.cuda.0.mem_free=2G
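Several resource requests can be given in one submission by repeating the -l option. The sketch below only prints the resulting command line so it can be checked before anything is submitted; the `build_gpu_qsub` helper and the job script name `myjob.sh` are hypothetical and used purely for illustration.

```shell
#!/bin/bash
# Sketch: assemble a qsub command line that combines a GPU count with a
# minimum amount of free memory on the first CUDA device.

build_gpu_qsub() {
    # $1: number of GPU devices (ngpus)
    # $2: minimum free memory on CUDA device 0, for example 2G
    # $3: job script to submit (hypothetical name in this example)
    echo "qsub -l ngpus=$1 -l gpu.cuda.0.mem_free=$2 $3"
}

# Print the command instead of running it, so it can be reviewed first.
build_gpu_qsub 1 2G myjob.sh
```

Once the printed command looks right, it can be pasted into the shell to submit the job.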

When there is more than one GPU device on a node, your job will only be presented with unused devices. Thus, if a node has two GPU devices and your job requests one, ngpus=1, then the job will only see a single free device. If the node is shared, then a second job requesting a single GPU will only see the device that is left available. As a result, you should not have to specify which GPU device to use for your job.
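One way to confirm what a job was given is to report the visible devices at the start of the job script. The following is a minimal sketch under a loudly stated assumption: it supposes that the assigned devices are exposed through the CUDA_VISIBLE_DEVICES environment variable, which this page does not confirm. The `count_devices` helper is hypothetical.

```shell
#!/bin/bash
# Embedded qsub option: request a single GPU device.
#$ -l ngpus=1

# Count the entries in a comma-separated device list such as "0" or "0,1".
count_devices() {
    local list=$1
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    # Turn commas into newlines, then count the resulting lines.
    echo "$list" | tr ',' '\n' | wc -l | tr -d ' '
}

# ASSUMPTION: the scheduler exposes the assigned devices through
# CUDA_VISIBLE_DEVICES. If it uses a different mechanism, the variable
# may be unset and this reports 0.
echo "visible GPU devices: $(count_devices "${CUDA_VISIBLE_DEVICES:-}")"
```

If the reported count does not match the ngpus value that was requested, the assumption about the mechanism probably does not hold on this cluster.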

Version	Old Version 162	New Version 163
Changes made by	glenn-johnson	glenn-johnson
Saved on	Jan 11, 2019	Jan 11, 2019
