
General Description

The Argon HPC system is the latest HPC system of the University of Iowa. It was deployed in February 2017. There are several compute node configurations:

  1. 40-core 96GB

  2. 40-core 192GB

  3. 64-core 192GB

  4. 64-core 384GB

  5. 64-core 768GB

  6. 80-core 96GB

  7. 80-core 192GB

  8. 80-core 384GB

  9. 80-core 768GB

  10. 80-core 1.5TB

  11. 112-core 256GB

  12. 112-core 512GB

  13. 112-core 1024GB

  14. 112-core 1.5TB

  15. 128-core 256GB

  16. 128-core 512GB

  17. 128-core 1TB

  18. 128-core 1.5TB

The Argon cluster is split between two data centers:

  • ITF → Information Technology Facility

  • LC → Lindquist Center

Most of the nodes in the LC data center are connected with the OmniPath high-speed interconnect fabric. The nodes in the ITF data center are connected with a Mellanox InfiniBand EDR fabric. There are two separate fabrics at ITF which do not interconnect. We refer to each of these fabrics as an island.

There are many machines with varying types of GPU accelerators:

  1. 21 machines with Nvidia P100 accelerators

  2. 2 machines with Nvidia K80 accelerators

  3. 2 machines with Nvidia P40 accelerators

  4. 17 machines with 1080Ti accelerators

  5. 19 machines with Titan V accelerators

  6. 14 machines with V100 accelerators

  7. 38 machines with 2080Ti accelerators

  8. 1 machine with RTX8000 accelerators

  9. 7 machines with A100 accelerators

  10. 5 machines with 4 A40 accelerators each

  11. 2 machines with 4 L40S accelerators each

  12. 1 machine with 4 L4 accelerators

Info

The Titan V is now considered as a supported configuration in Argon phase 1 GPU-capable compute nodes but is restricted to a single card per node. Staff have completed the qualification process for the 1080 Ti and concluded that it is not a viable solution to add to phase 1 Argon compute nodes.

Info

The Rpeak needs to be updated.

The Rpeak (theoretical Flops) is 385.0 TFlops, not including the accelerators, with 112 TB of memory. In addition, there are 2 login nodes of the Broadwell system architecture, with 256GB of memory each.


Heterogeneity

While previous HPC cluster systems at UI have been very homogeneous, the Argon HPC system has a heterogeneous mix of compute node types. In addition to the variability in the GPU accelerator types listed above, there are also differences in CPU architecture. We generally follow Intel marketing names, with the most important distinction being the AVX (Advanced Vector Extensions) unit on the processor. The following table lists the processors in increasing generational order.

Architecture             AVX level   Floating point operations per cycle
------------------------------------------------------------------------
Sandybridge/Ivybridge    AVX         8
Haswell/Broadwell        AVX2        16
Skylake Silver           AVX512      16 (1 AVX unit per processor core)
Skylake Gold             AVX512      32 (2 AVX units per processor core)
Cascadelake Gold         AVX512      32
Sapphire Rapids Gold     AVX512

Note that code must be optimized during compilation to take advantage of AVX instructions. The CPU architecture is important to keep in mind both in terms of potential performance and compatibility. For instance, code optimized for AVX512 instructions will not run on the Haswell/Broadwell architecture because it only supports AVX2, not AVX512. However, each successive generation is backward compatible, so code optimized with AVX2 instructions will run on Skylake/Cascadelake systems.
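One way to see which AVX level a given node supports is to inspect the CPU flags that Linux reports. The following is a minimal sketch rather than a site-provided tool: the function name avx_level is hypothetical, and it assumes the usual Linux flag names avx, avx2, and avx512f in /proc/cpuinfo.

```shell
#!/bin/bash
# Report the highest AVX level named in a space-separated list of CPU flags.
# avx_level is a hypothetical helper, not an Argon utility.
avx_level() {
    case " $1 " in
        *" avx512f "*) echo "AVX512" ;;
        *" avx2 "*)    echo "AVX2" ;;
        *" avx "*)     echo "AVX" ;;
        *)             echo "none" ;;
    esac
}

# On a Linux node, the first "flags" line of /proc/cpuinfo lists the extensions.
if [ -r /proc/cpuinfo ]; then
    avx_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

Running this inside a qlogin session or a job on the target node type shows the level that node actually supports, which can then be matched against the level the code was compiled for.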

...

Hyper Threaded Cores (HT)

One important difference between Argon and previous systems is that Argon has Hyper Threaded processor cores turned on. Hyper Threaded cores can be thought of as splitting a single processor core into two virtual cores, much as a Linux process can be split into threads. That oversimplifies it, but if your application is multithreaded then Hyper Threaded cores can potentially run the application more efficiently. For non-threaded applications you can think of any pair of Hyper Threaded cores as roughly equivalent to two cores at half the speed if both cores of the pair are in use. Again, that is an oversimplification, but the main point is that CPU bound processes perform better when not sharing a CPU core. Hyper Threaded cores can help ensure that the physical processor is kept busy for processes that do not always use the full capacity of a core. The reason for enabling HT for Argon is to try to increase system efficiency on the workloads that we have observed. There are some things to keep in mind as you are developing your workflows.

  1. For high throughput jobs the use of HT can increase overall throughput by keeping cores active as jobs come and go. These jobs can treat each HT core as a processor.

  2. For multithreaded applications, HT will provide more efficient handling of threads. You must make sure to request the appropriate number of job slots. Generally, the number of job slots requested should equal the number of cores that will be running.

  3. For non-threaded CPU bound processes that can keep a core busy all of the time, you probably want to only run one process per core, and not run processes on HT cores. This can be accomplished by taking advantage of the Linux kernel's ability to bind processes to cores. In order to minimize processes running on the HT cores of a machine make sure that only half of the total number of cores are used. See below for more details but requesting twice the number of job slots as the number of cores that will be used will accomplish this. A good example of this type of job is non-threaded MPI jobs, but really any non-threaded job.

...

  1. If your job script is written in bash syntax then you can use the $NSLOTS SGE variable as follows, using mpirun as an example:

    Code Block
    mpirun -np $(($NSLOTS/2)) ...
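The same calculation can be wrapped so that it still yields at least one rank when only a single slot was requested. This is a minimal sketch under stated assumptions: ranks_from_slots and ./my_mpi_app are hypothetical names, and the script only prints the mpirun command instead of running it.

```shell
#!/bin/bash
#$ -pe smp 8
#$ -cwd

# Half of the granted slots, but never fewer than one rank.
# ranks_from_slots is a hypothetical helper, not an SGE command.
ranks_from_slots() {
    ranks=$(( ${1:-1} / 2 ))
    [ "$ranks" -lt 1 ] && ranks=1
    echo "$ranks"
}

# NSLOTS is set by SGE inside a job; fall back to 1 when run outside of one.
NP=$(ranks_from_slots "${NSLOTS:-1}")
echo "would run: mpirun -np $NP ./my_mpi_app"
```

With the 8 slots requested above, the script would launch 4 ranks, leaving the other half of the slots idle so that ranks do not compete for a physical core.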


Job Scheduler/Resource Manager

Like previous UI HPC systems, Argon uses SGE, although this version is based on a slightly different code base. If you are interested in the history of SGE, there is a write-up at History of Grid Engine Development. The version of SGE that Argon uses is from the Son of Grid Engine project. For the most part it will be very familiar to people who have used previous generations of UI HPC systems. One thing that will look a little different is the output of the qhost command, which shows the CPU topology.

Code Block
qhost -h argon-compute-1-01
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
argon-compute-1-01      lx-amd64       56    2   28   56  0.03  125.5G    1.1G    2.0G     0.0
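The topology columns in that output (NSOC sockets, NCOR cores, NTHR hardware threads) can be reduced to a threads-per-core figure. The sketch below assumes the column order shown in the sample above; qhost_topology is a hypothetical helper name, and the sample line is copied from the output above so the script runs without access to the cluster.

```shell
#!/bin/bash
# Summarize sockets, cores, and threads for each host line of qhost output.
# Header, separator, and "global" lines are skipped because their NCOR field is not a number.
qhost_topology() {
    awk '$1 != "global" && $5 ~ /^[0-9]+$/ && $5 > 0 {
        printf "%s sockets=%d cores=%d threads=%d threads_per_core=%d\n", $1, $4, $5, $6, $6 / $5
    }'
}

sample='argon-compute-1-01      lx-amd64       56    2   28   56  0.03  125.5G    1.1G    2.0G     0.0'
echo "$sample" | qhost_topology
# prints: argon-compute-1-01 sockets=2 cores=28 threads=56 threads_per_core=2
```

On the cluster itself, the same function could be fed directly, for example with qhost -h argon-compute-1-01 | qhost_topology.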

...

You will need to be aware of the approximate amount of memory per job slot when setting up jobs if your job uses a significant amount of memory. The actual amount will vary due to OS overhead but the values below can be used for planning purposes.

Node memory (GB)   Job slots   Memory (GB) per slot
---------------------------------------------------
96                 40          2
96                 80          1
128                56          2
192                40          5
192                64          3
192                80          2
256                56          4
256                112         2
256                128         2
384                64          6
384                80          5
512                56          9
512                112         4
512                128         4
768                64          12
768                80          9
1024               112         9
1024               128         8
1536               80          19
1536               112         13
1536               128         12
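For planning, the per-slot figures can be turned into a slot count for a job with a known memory footprint. This is a rough sketch that assumes a job should request enough slots to cover its memory use, which the text implies but does not state outright; slots_for_memory is a hypothetical helper and works in whole GB only.

```shell
#!/bin/bash
# Slots needed to cover a memory footprint, rounded up.
# $1 = memory the job needs (GB), $2 = approximate memory per slot (GB) from the table.
slots_for_memory() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

slots_for_memory 20 4   # 20 GB on a 256 GB / 56-slot node (4 GB per slot): prints 5
slots_for_memory 20 2   # 20 GB on a 128 GB / 56-slot node (2 GB per slot): prints 10
```

Because the table values are approximate, it is reasonable to treat the result as a lower bound and add a slot or two of headroom.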


Using the Basic Job Submission and Advanced Job Submission pages as a reference, how would one submit jobs taking HT into account? For single-process, high-throughput jobs it probably does not matter; just request one slot per job. For multithreaded or MPI jobs, request one job slot per thread or process. So if your application runs best with 4 threads then request something like the following.

Code Block
qsub -pe smp 4

That will run on two physical cores and two HT cores. For non-threaded processes that are also CPU bound you can avoid running on HT cores by requesting 2x the number of slots as cores that will be used. So, if your process is a non-threaded MPI process, and you want to run 4 MPI ranks, your job submission would be something like the following.

Code Block
qsub -pe smp 8

and your job script would contain an mpirun command similar to

Code Block
mpirun -np 4 ...

That would run the 4 MPI ranks on physical cores and not HT cores. Note that this will work for non-MPI jobs as well. If you have a non-threaded process that you want to ensure runs on an actual core, you could use the same 2x slot request.

Code Block
qsub -pe smp 2
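The slot requests in the examples above follow one rule: one slot per process normally, and two slots per process when HT cores should be left idle. A minimal sketch of that rule follows; pe_request is a hypothetical helper that only prints the option string to pass to qsub.

```shell
#!/bin/bash
# Print the smp parallel environment request for a number of processes or threads.
# $1 = processes or threads, $2 = "yes" to leave HT cores idle (default "no").
pe_request() {
    if [ "${2:-no}" = "yes" ]; then
        echo "-pe smp $(( $1 * 2 ))"
    else
        echo "-pe smp $1"
    fi
}

pe_request 4        # 4 threads sharing HT cores: prints -pe smp 4
pe_request 4 yes    # 4 MPI ranks on physical cores only: prints -pe smp 8
pe_request 1 yes    # one non-threaded process on its own core: prints -pe smp 2
```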

Note that if you do not use the above strategy then it is possible that your job process will share cores with other job processes. That may be okay, and even preferred for high throughput jobs, but it is something to keep in mind. It is especially important when using the orte parallel environment, which is discussed further on the Advanced Job Submission page. In short, that parallel environment is used in node sharing scenarios, which implies potential core sharing as well. For MPI jobs, that is probably not what you want. As on previous systems, there is a parallel environment (Xcpn, where X is the number of cores per node) for requesting entire nodes. This is especially useful for MPI jobs to ensure the best performance. Note that there are some additional parallel environments that are akin to the smp and Xcpn ones but are specialized for certain software packages. These are

  • gaussian-sm and gaussian_lindaX, where X is the number of cores per node

  • turbomole_mpiX, where X is the number of cores per node

  • wien2k-sm and wien2k_mpiX, where X is the number of cores per node

For MPI jobs, the system-provided openmpi will not bind processes to cores by default, as would be the normal default for openmpi. It is set this way to avoid inadvertently oversubscribing processes on cores. In addition, the system openmpi settings will map processes by socket. This should give a good process distribution in all cases. However, if you wish to use a number of processes less than half of the slots per node in an MPI job then you may want to map by node to get the most even distribution of processes across nodes. You can do that with the --map-by node flag to mpirun.

Code Block
mpirun --map-by node ...
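The rule of thumb above can be expressed as a small check: when a job places fewer processes on each node than half of that node's slots, add the --map-by node flag. This is only a sketch; map_flag is a hypothetical helper, and the threshold is the one described in the text rather than anything openmpi enforces.

```shell
#!/bin/bash
# Print "--map-by node" when processes per node is below half of the slots per node.
# $1 = processes per node, $2 = slots per node. Prints nothing otherwise.
map_flag() {
    if [ $(( $1 * 2 )) -lt "$2" ]; then
        echo "--map-by node"
    fi
}

map_flag 8 56     # 8 processes on 56-slot nodes: prints --map-by node
map_flag 28 56    # exactly half of the slots: prints nothing, keep the default mapping
```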

If you wish to control mapping and binding in a more fine-grained manner, the default parameters can be overridden with options to mpirun. The defaults should be good in most cases, but openmpi provides options for

  • mapping → controls how processes are distributed across processing units

  • binding → binds processes to processing units

  • ranking → assigns MPI rank values to processes

See the mpirun manual page,

...

for more detailed information. The defaults should be fine for most cases but if you override them keep the topology in mind.

  • each node has 2 processor sockets

  • each processor socket has 14 processor cores

  • each processor core has 2 hardware threads (HT)

If you set your own binding, for instance --bind-to core, be aware that the number of cores is half of the total number of HT processors. Note that core binding in and of itself may not boost performance much. Generally speaking, if you want to minimize contention with hardware threads then simply request twice as many slots as the number of cores your job will use. Even if the processes are not bound to cores, the OS scheduler will do a good job of minimizing contention.

If your job does not use the system openmpi, or does not use MPI, then any desired core binding will need to be set up with whatever mechanism the software uses. Otherwise, there will be no core binding. Again, that may not be a major issue. If your job does not work well with HT then run on a number of cores equal to half of the number of slots requested and the OS scheduler will minimize contention. 

New SGE utilities

While SoGE (Son of Grid Engine) is very similar to previous versions of SGE, there are some new utilities that people may find of interest. There are manual pages for each of these.

  • qstatus: Reformats output of qstat and can calculate job statistics.

  • dead-nodes: This will tell you what nodes are not physically participating in the cluster.

  • idle-nodes: This will tell you what nodes do not have any activity on them.

  • busy-nodes: This will tell you what nodes are running jobs.

  • nodes-in-job: This is probably the most useful. Given a job ID it will list the nodes that are in use for that particular job.

SSH to compute nodes

On previous UI HPC systems it was possible to briefly ssh to any compute node before getting booted from that node if a registered job was not found. This was sufficient to run an ssh command, for instance, on any node. This is not the case for Argon. SSH connections to compute nodes will only be allowed if you have a registered job on that host. Of course, qlogin sessions will allow you to log in to a node directly as well. Again, if you have a job running on a node you can ssh to that node in order to check status, etc. You can find the nodes of a job with the nodes-in-job command mentioned above. We ask that you not do more than observe things while logged into the node, as it may have shared jobs on it.

...