General Description

Argon is the latest HPC system at the University of Iowa. It consists of 612 compute nodes running CentOS-7.4 Linux. There are several compute node configurations:

24-core 512GB
32-core 64GB
32-core 256GB
32-core 512GB
40-core 96GB
40-core 192GB
56-core 128GB
56-core 256GB
56-core 512GB
64-core 192GB

The system was deployed in February 2017. There are several compute node configurations:

40-core 96GB
40-core 192GB
64-core 192GB
64-core 384GB
64-core 768GB
80-core 96GB
80-core 192GB
80-core 384GB
80-core 768GB
80-core 1.5TB
112-core 256GB
112-core 512GB
112-core 1024GB
112-core 1.5TB
128-core 256GB
128-core 512GB
128-core 1TB
128-core 1.5TB

The Argon cluster is split between two data centers:

ITF → Information Technology Facility
LC → Lindquist Center

There are 21 machines with Nvidia P100 accelerators, 2 machines with Nvidia K80 accelerators, and 11 machines with Nvidia K20 accelerators. Most of the nodes in the LC data center are connected with the Omni-Path high-speed interconnect fabric. The nodes in the ITF data center are connected with a Mellanox InfiniBand EDR fabric. There are two separate fabrics at ITF which do not interconnect. We refer to each of these fabrics as an island.

There are many machines with varying types of GPU accelerators:

21 machines with Nvidia P100 accelerators
2 machines with Nvidia K80 accelerators
2 machines with Nvidia P40 accelerators

...

17 machines with 1080Ti accelerators

...

19 machines with Titan V accelerators

...

Info
The Titan V is now considered a supported configuration in Argon phase 1 GPU-capable compute nodes, but it is restricted to a single card per node. Staff have completed the qualification process for the 1080 Ti and concluded that it is not a viable solution to add to phase 1 Argon compute nodes.

Info
The R_peak needs to be updated.

The R_peak (theoretical Flops) is 385.0 TFlops, not including the accelerators, with 112 TB of memory. In addition, there are 2 login nodes of the Broadwell system architecture, with 256GB of memory each.

While Argon is a completely new architecture on the backend, the frontend should be very familiar to those who have used previous generation HPC systems at the University of Iowa. There are, however, a few key differences that are discussed on this page.

...

14 machines with V100 accelerators
38 machines with 2080Ti accelerators
1 machine with RTX8000 accelerators
7 machines with A100 accelerators
5 machines with 4 A40 accelerators each
2 machines with 4 L40S accelerators each
1 machine with 4 L4 accelerators

Heterogeneity

While previous HPC cluster systems at UI have been very homogeneous, the Argon HPC system has a heterogeneous mix of compute node types. In addition to the variability in the GPU accelerator types listed above, there are also differences in CPU architecture. We generally follow Intel marketing names, with the most important distinction being the AVX (Advanced Vector Extensions) unit on the processor. The following table lists the processors in increasing generational order.

Architecture	AVX level	Floating Point Operations per cycle
Haswell/Broadwell	AVX2	16
Skylake Silver	AVX512	16 (1 AVX unit per processor core)
Skylake Gold	AVX512	32 (2 AVX units per processor core)
Cascadelake Gold	AVX512	32
Sapphire Rapids Gold	AVX512

Note that code must be optimized during compilation to take advantage of AVX instructions. The CPU architecture is important to keep in mind both in terms of potential performance and compatibility. For instance, code optimized for AVX512 instructions will not run on the Haswell/Broadwell architecture because it only supports AVX2, not AVX512. However, each successive generation is backward compatible, so code optimized with AVX2 instructions will run on Skylake/Cascadelake systems.
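To see which AVX level a given node supports before choosing compiler optimizations, something like the following sketch may help. It assumes a Linux node where /proc/cpuinfo lists the usual x86 flag names (avx, avx2, avx512f); the function name avx_level is hypothetical and not part of any Argon tooling.

```shell
# Report the highest AVX level named in a space-separated CPU flags string.
# The flag names are the ones the Linux kernel prints in /proc/cpuinfo.
avx_level() {
    local flags=" $1 "
    case "$flags" in
        *" avx512f "*) echo "AVX512" ;;
        *" avx2 "*)    echo "AVX2" ;;
        *" avx "*)     echo "AVX" ;;
        *)             echo "none" ;;
    esac
}

# On a compute node, pass in the flags of the first processor entry.
if [ -r /proc/cpuinfo ]; then
    avx_level "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
```

Running this inside a job on each node type you plan to use would show whether a binary built with AVX512 optimizations can be expected to run there.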

...

Hyper Threaded Cores (HT)

One important difference between Argon and previous systems is that Argon has Hyper Threaded processor cores turned on. Hyper Threaded cores can be thought of as splitting a single processor core into two virtual cores, much as a Linux process can be split into threads. That oversimplifies it, but if your application is multithreaded then Hyper Threaded cores can potentially run the application more efficiently. For non-threaded applications you can think of any pair of Hyper Threaded cores as roughly equivalent to two cores at half the speed if both cores of the pair are in use. Again, that is an oversimplification, but the main point is that CPU-bound processes perform better when not sharing a CPU core. Hyper Threaded cores can help ensure that the physical processor is kept busy for processes that do not always use the full capacity of a core. The reason for enabling HT for Argon is to try to increase system efficiency on the workloads that we have observed. There are some things to keep in mind as you are developing your workflows.

For high throughput jobs the use of HT can increase overall throughput by keeping cores active as jobs come and go. These jobs can treat each HT core as a processor.
For multithreaded applications, HT will provide more efficient handling of threads. You must make sure to request the appropriate number of job slots. Generally, the number of job slots requested should equal the number of cores that will be running.
For non-threaded CPU-bound processes that can keep a core busy all of the time, you probably want to run only one process per core, and not run processes on HT cores. This can be accomplished by taking advantage of the Linux kernel's ability to bind processes to cores. To minimize processes running on the HT cores of a machine, make sure that only half of the total number of cores are used. See below for more details, but requesting twice as many job slots as the number of cores that will be used will accomplish this. Non-threaded MPI jobs are a good example of this type of job, but it applies to any non-threaded job.
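As a rough illustration of the core binding mentioned in the last item, the sketch below lists one logical CPU per physical core using the kernel's thread_siblings_list files. The function name primary_cpus is hypothetical, and the scheduler on Argon may already manage binding for jobs, so treat this as an exploration aid rather than a recommended job setup.

```shell
# Print a comma-separated list with one logical CPU id per physical core.
# The sysfs root is a parameter so the function can be pointed at a test
# directory; on a real node the default path is used.
primary_cpus() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        # Each file lists the logical CPUs sharing one core, e.g. "0,28"
        # or "0-1". Keep only the first id of each sibling set.
        sed -e 's/[,-].*//' "$f"
    done | sort -n -u | paste -sd, -
}

echo "one logical CPU per physical core: $(primary_cpus)"
```

The resulting list could then be handed to a binding tool, for example `taskset -c "$(primary_cpus)" ./my_program`, where my_program stands in for your own executable.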

...

If your job script is written in bash syntax then you can use the $NSLOTS SGE variable as follows, using mpirun as an example:
mpirun -np $(($NSLOTS/2)) ...
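A slightly more defensive variant of the same idea is sketched below. It assumes the job script runs under bash and that SGE has set NSLOTS; the helper name ranks_for_slots and the fallback value are illustrative only.

```shell
# Number of MPI ranks to start when only one process per physical core is
# wanted, given the number of job slots granted to the job.
ranks_for_slots() {
    local slots="$1"
    local ranks=$(( slots / 2 ))
    # Guard against a single-slot request, where the division gives 0.
    if [ "$ranks" -lt 1 ]; then
        ranks=1
    fi
    echo "$ranks"
}

# NSLOTS is set by SGE inside a job; the default of 2 is only here so the
# sketch can be tried outside of a job.
NP=$(ranks_for_slots "${NSLOTS:-2}")
echo "would run: mpirun -np $NP ..."
```

With 56 slots this yields 28 ranks, which matches the half-the-slots rule described above.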

Job Scheduler/Resource Manager

Like previous UI HPC systems, Argon uses SGE, although this version is based on a slightly different code base. If anyone is interested in the history of SGE, there is an interesting write-up at History of Grid Engine Development. The version of SGE that Argon uses is from the Son of Grid Engine project. For the most part this will be very familiar to people who have used previous generations of UI HPC systems. One thing that will look a little different is the output of the qhost command, which shows the CPU topology.

qhost -h argon-compute-1-01
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
argon-compute-1-01      lx-amd64       56    2   28   56  0.03  125.5G    1.1G    2.0G     0.0
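The topology columns can be used to check whether a host exposes Hyper Threaded cores, since NTHR divided by NCOR gives the threads per core. The sketch below assumes the column order shown in the sample output above; the function name threads_per_core is hypothetical.

```shell
# Read qhost output on stdin and print "<hostname> <threads per core>" for
# every host line whose NCOR (5th) and NTHR (6th) columns are numeric.
# Header, separator, and "global" lines are skipped by the numeric check.
threads_per_core() {
    awk '$5 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ && $5 > 0 { print $1, $6 / $5 }'
}

# Demonstration using the sample output from this page.
threads_per_core <<'EOF'
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
argon-compute-1-01      lx-amd64       56    2   28   56  0.03  125.5G    1.1G    2.0G     0.0
EOF
```

On the cluster the same function could be fed live data with `qhost | threads_per_core`; a value of 2 indicates two hardware threads per physical core.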

...

You will need to be aware of the approximate amount of memory per job slot when setting up jobs if your job uses a significant amount of memory. The actual amount will vary due to OS overhead but the values below can be used for planning purposes.

Node memory (GB)	Job slots	Memory (GB) per slot
96	40	2
96	80	1
128	56	2
192	40	5
192	64	3
192	80	2
256	56	4
256	112	2
256	128	2
384	64	6
384	80	5
512	56	9
512	112	4
512	128	4
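For planning, the per-slot figures can be turned into a slot count for a job with a known memory footprint. The sketch below assumes a job should request enough slots that slots multiplied by memory per slot covers its needs; that sizing rule and the function name slots_for_memory are assumptions for illustration, not a statement of Argon policy.

```shell
# Slots to request so that slots * per_slot_gb is at least needed_gb.
# Both arguments are whole numbers of GB; the result is rounded up.
slots_for_memory() {
    local needed_gb="$1" per_slot_gb="$2"
    echo $(( (needed_gb + per_slot_gb - 1) / per_slot_gb ))
}

# Example: a 20 GB job on a 256 GB node with 56 slots (about 4 GB per slot).
slots_for_memory 20 4
```

Because the actual memory per slot varies with OS overhead, it may be sensible to round the request up a little further than this calculation suggests.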

Version	Old Version 191	New Version Current
Changes made by	mckenna-kinley	John Saxton
Saved on	Feb 19, 2019	Nov 22, 2024

General Description

Heterogeneity

Hyper Threaded Cores (HT)

Job Scheduler/Resource Manager

new SGE utilities

SSH to compute nodes

Queues and Policies