...

40-core 96GB
40-core 192GB
56-core 128GB
56-core 256GB
56-core 512GB
64-core 192GB
64-core 384GB
64-core 768GB
80-core 96GB
80-core 192GB
80-core 384GB
80-core 768GB
80-core 1.5TB
112-core 256GB
112-core 512GB
112-core 1024GB
112-core 1.5TB
128-core 256GB
128-core 512GB
128-core 1TB
128-core 1.5TB

The Argon cluster is split between two data centers,

...

Most of the nodes in the LC data center are connected with the OmniPath high-speed interconnect fabric. The nodes in the ITF data center are connected with the InfiniPath fabric, with the latest nodes having a Mellanox InfiniBand EDR fabric. There are two separate fabrics at ITF, which do not interconnect. We refer to each of these fabrics as an island.

There are many machines with varying types of GPU accelerators:

21 machines with Nvidia P100 accelerators
2 machines with Nvidia K80 accelerators
2 machines with Nvidia P40 accelerators
17 machines with 1080Ti accelerators
19 machines with Titan V accelerators
14 machines with V100 accelerators
38 machines with 2080Ti accelerators
1 machine with RTX8000 accelerators
7 machines with A100 accelerators
5 machines with 4 A40 accelerators

...

each
2 machines with 4 L40S accelerators each
1 machine with 4 L4 accelerators

Heterogeneity

While previous HPC cluster systems at UI have been very homogeneous, the Argon HPC system has a heterogeneous mix of compute node types. In addition to the variability in the GPU accelerator types listed above, there are also differences in CPU architecture. We generally follow Intel marketing names, with the most important distinction being the AVX (Advanced Vector Extensions) unit on the processor. The following table lists the processors in increasing generational order.

Architecture	AVX level	Floating Point Operations per cycle
Haswell/Broadwell	AVX2	16
Skylake Silver	AVX512	16 (1 AVX unit per processor core)
Skylake Gold	AVX512	32 (2 AVX units per processor core)
Cascadelake Gold	AVX512	32
Sapphire Rapids Gold	AVX512

Note that code must be optimized during compilation to take advantage of AVX instructions. The CPU architecture is important to keep in mind in terms of both potential performance and compatibility. For instance, code optimized for AVX512 instructions will not run on the Haswell/Broadwell architecture because it only supports AVX2, not AVX512. However, each successive generation is backward compatible, so code optimized with AVX2 instructions will run on Skylake/Cascadelake systems.
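The compatibility rule above can be checked on a given node before running an optimized binary. The following is a minimal sketch, not part of the Argon tooling: it assumes a Linux node where /proc/cpuinfo exposes a "flags" line containing the kernel's feature names (avx2 for AVX2, avx512f for the AVX512 foundation instructions). The function names avx_levels and can_run are hypothetical, chosen for illustration.

```python
# Sketch: decide whether a build that needs a given AVX level can run on
# the current CPU, based on /proc/cpuinfo-style text (Linux, x86 only).

AVX_FLAGS = ("avx", "avx2", "avx512f")  # Linux kernel feature-flag names


def avx_levels(cpuinfo_text):
    """Return the set of AVX feature flags found in the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the first ':' is a space-separated flag list.
            _, _, value = line.partition(":")
            present = set(value.split())
            return {flag for flag in AVX_FLAGS if flag in present}
    return set()  # no 'flags' line found (for example, a non-x86 system)


def can_run(required_flag, cpuinfo_text):
    """True if the CPU described by cpuinfo_text advertises required_flag."""
    return required_flag in avx_levels(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    text = read_cpuinfo()
    print("AVX flags on this node:", sorted(avx_levels(text)))
    print("AVX512 build can run here:", can_run("avx512f", text))
```

Run on a Haswell/Broadwell node, a check like this would be expected to report avx2 but not avx512f, which matches the rule that AVX512 builds do not run there while AVX2 builds run on the later generations.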

...

Node memory (GB)	Job slots	Memory (GB) per slot
96	40	2
96	80	1
128	56	2
192	40	5
192	64	3
192	80	2
256	56	4
256	112	2
256	128	2
384	64	6
384	80	5
512	56	9
512	112	4
512	128	4
768	64	12
768	80	9
1024	112	9
1024	128	8
1536	80	19
1536	112	13
1536	128	12
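As a rough illustration of how to read the table, the sketch below looks up the per-slot figure for a node type and multiplies it by the number of slots requested. The values are copied from the table above. The function name job_memory_gb is hypothetical, and the assumption that a job's memory share scales linearly with its slot count is ours, not something stated by the scheduler configuration.

```python
# Sketch: estimate the memory share (GB) of a job from the table above.
# Keys are (node memory in GB, job slots); values are memory (GB) per slot.
MEM_PER_SLOT_GB = {
    (96, 40): 2, (96, 80): 1, (128, 56): 2,
    (192, 40): 5, (192, 64): 3, (192, 80): 2,
    (256, 56): 4, (256, 112): 2, (256, 128): 2,
    (384, 64): 6, (384, 80): 5,
    (512, 56): 9, (512, 112): 4, (512, 128): 4,
    (768, 64): 12, (768, 80): 9,
    (1024, 112): 9, (1024, 128): 8,
    (1536, 80): 19, (1536, 112): 13, (1536, 128): 12,
}


def job_memory_gb(node_memory_gb, job_slots, slots_requested):
    """Estimate memory (GB) for a job, assuming linear scaling per slot."""
    if (node_memory_gb, job_slots) not in MEM_PER_SLOT_GB:
        raise KeyError(f"no {node_memory_gb}GB node with {job_slots} slots in table")
    if not 1 <= slots_requested <= job_slots:
        raise ValueError("slots_requested must be between 1 and job_slots")
    return MEM_PER_SLOT_GB[(node_memory_gb, job_slots)] * slots_requested


if __name__ == "__main__":
    # A 4-slot job on a 192GB node with 40 job slots.
    print(job_memory_gb(192, 40, 4), "GB")
```

Using the tabulated per-slot values rather than dividing node memory by slot count keeps the estimate consistent with the figures published in the table.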

Using the Basic Job Submission and Advanced Job Submission pages as a reference, how would one submit jobs taking HT (Hyper-Threading) into account? For single-process, high-throughput jobs it probably does not matter; just request one slot per job. For multithreaded or MPI jobs, request one job slot per thread or process. So if your application runs best with 4 threads, request something like the following.

...
