General Description

The Argon HPC system is the latest HPC system at the University of Iowa. It consists of 366 compute nodes running CentOS-7.4 Linux. There are several compute node configurations:

  1. 24-core 512GB
  2. 32-core 64GB
  3. 32-core 256GB
  4. 32-core 512GB
  5. 40-core 96GB
  6. 40-core 192GB
  7. 56-core 128GB
  8. 56-core 256GB
  9. 56-core 512GB
  10. 64-core 192GB

The Argon cluster is split between two data centers, the LC data center and the ITGF data center.

There are 21 machines with Nvidia P100 accelerators, 2 machines with Nvidia K80 accelerators, 11 machines with Nvidia K20 accelerators, 2 machines with Nvidia P40 accelerators, 13 machines with 1080Ti accelerators, and 18 machines with Titan V accelerators. Most of the nodes in the LC data center are connected with the OmniPath high speed interconnect fabric, while most of those in the ITGF data center are connected with the InfiniPath fabric.

The Titan V is now considered a supported configuration for Argon phase 1 GPU-capable compute nodes, but is restricted to a single card per node. Staff have completed the qualification process for the 1080 Ti and concluded that it is not a viable option to add to phase 1 Argon compute nodes.


The Rpeak (theoretical peak floating point performance) is 385.0 TFLOPS, not including the accelerators, although this figure still needs to be updated. The system has 111 TB of memory. In addition, there are 2 login nodes of the Broadwell system architecture, with 256GB of memory each.

While on the back end Argon is a completely new architecture, the front end should be very familiar to those who have used previous generation HPC systems at the University of Iowa. There are, however, a few key differences, which are discussed on this page.

Heterogeneity

While previous HPC cluster systems at UI have been very homogeneous, the Argon HPC system has a heterogeneous mix of compute node types. In addition to the variability in GPU accelerator types listed above, there are also differences in CPU architecture. We generally follow Intel marketing names, with the most important distinction being the AVX (Advanced Vector Extensions) unit on the processor. The following table lists the processors in increasing generational order.

Architecture | AVX level | Floating point operations per cycle
Sandybridge, Ivybridge | AVX | 8
Haswell, Broadwell | AVX2 | 16
Skylake Silver | AVX512 | 16 ((1) AVX unit per processor core)
Skylake Gold | AVX512 | 32 ((2) AVX units per processor core)

Note that code must be optimized during compilation to take advantage of AVX instructions. The CPU architecture is important to keep in mind both in terms of potential performance and compatibility. For instance, code optimized for AVX2 instructions will not run on the Sandybridge/Ivybridge architecture because it only supports AVX, not AVX2. However, each successive generation is backward compatible so code optimized with AVX instructions will run on Haswell/Broadwell systems.
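
As an illustration, the lines below sketch how one might target a specific AVX level when building with GCC; the compiler choice and file names are assumptions, and the equivalent flags for the Intel compilers differ.

gcc -O2 -mavx -o myprog myprog.c          # AVX: runs on Sandybridge/Ivybridge and newer
gcc -O2 -mavx2 -o myprog myprog.c         # AVX2: requires Haswell/Broadwell or newer
gcc -O2 -march=native -o myprog myprog.c  # optimize for the CPU of the build host only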

Hyperthreaded Cores (HT)

One important difference between Argon and previous systems is that Argon has hyperthreaded (HT) processor cores turned on. Hyperthreading can be thought of as splitting a single processor core into two virtual cores, much as a Linux process can be split into threads. That is an oversimplification, but if your application is multithreaded then hyperthreaded cores can potentially run it more efficiently. For non-threaded applications you can think of any pair of hyperthreaded cores as roughly equivalent to two cores at half the speed if both cores of the pair are in use. This can help ensure that the physical processor is kept busy by processes that do not always use the full capacity of a core. HT was enabled on Argon to try to increase system efficiency on the workloads that we have observed. There are some things to keep in mind as you develop your workflows.

  1. For high throughput jobs the use of HT can increase overall throughput by keeping cores active as jobs come and go. These jobs can treat each HT core as a processor.
  2. For multithreaded applications, HT will provide more efficient handling of threads. You must make sure to request the appropriate number of job slots. Generally, the number of job slots requested should equal the number of threads that will be running.
  3. For non-threaded CPU-bound processes that can keep a core busy all of the time, you probably want to run only one process per core and not run processes on HT cores. This can be accomplished by taking advantage of the Linux kernel's ability to bind processes to cores. To minimize processes running on the HT cores of a machine, make sure that only half of the total number of cores is used. See below for more details, but requesting twice the number of job slots as the number of cores that will be used will accomplish this. A good example of this type of job is a non-threaded MPI job, but this applies to any non-threaded job.

After the merger of Argon and Neon, a few of the older nodes are not HT capable. These are the high memory nodes with cpu_arch=sandybridge/ivybridge.
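
If your code is built for a particular AVX level, you can steer it toward matching nodes by requesting the cpu_arch resource noted above. The line below is a sketch that assumes the cpu_arch complex is requestable at submission time and that myjob.sh is your job script; the values available on the cluster can be checked with qhost -F cpu_arch.

qsub -l cpu_arch=broadwell myjob.sh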


Job Scheduler/Resource Manager

Like previous UI HPC systems, Argon uses SGE, although this version is based on a slightly different code base. If anyone is interested in the history of SGE, there is an interesting write-up at History of Grid Engine Development. The version of SGE that Argon uses is from the Son of Grid Engine project. For the most part this will be very familiar to people who have used previous generations of UI HPC systems. One thing that will look a little different is the output of the qhost command, which shows the CPU topology.

qhost -h argon-compute-1-01
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
argon-compute-1-01      lx-amd64       56    2   28   56  0.03  125.5G    1.1G    2.0G     0.0

As you can see, the output shows the number of CPUs (NCPU), the number of CPU sockets (NSOC), the number of cores (NCOR), and the number of threads (NTHR). This information could be important as you plan jobs, but it essentially reflects what was said above in regard to HT cores.

You will need to be aware of the approximate amount of memory per job slot when setting up jobs if your job uses a significant amount of memory. The actual amount will vary due to OS overhead but the values below can be used for planning purposes.


Node memory (GB) | Job slots | Memory (GB) per slot
64 | 32 | 2
96 | 40 | 2
128 | 56 | 2
192 | 40 | 4
192 | 64 | 3
256 | 32 | 8
256 | 56 | 4
512 | 24 (no HT) | 20
512 | 32 (no HT) | 16
512 | 56 | 8
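
For example, on a 56-slot 256GB node each slot corresponds to roughly 4 GB of memory, so a job that needs about 16 GB should request at least 4 slots. A hypothetical submission (the script name is an assumption; the smp parallel environment is described below):

qsub -pe smp 4 myjob.sh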


Using the Basic Job Submission and Advanced Job Submission pages as a reference, how would one submit jobs taking HT into account? For single-process high throughput jobs it probably does not matter; just request one slot per job. For multithreaded or MPI jobs, request one job slot per thread or process. So if your application runs best with 4 threads then request something like the following.

qsub -pe smp 4

That will run on two physical cores and two HT cores. For non-threaded processes that are also CPU bound, you can avoid running on HT cores by requesting twice as many slots as the number of cores that will be used. So, if your process is a non-threaded MPI process, and you want to run 4 MPI ranks, your job submission would be something like the following.

qsub -pe smp 8

and your job script would contain an mpirun command similar to

mpirun -np 4 ...

That would run the 4 MPI ranks on physical cores and not HT cores. Note that this will work for non-MPI jobs as well. If you have a non-threaded process that you want to ensure runs on an actual core, you could use the same 2x slot request.

qsub -pe smp 2
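
Putting the pieces above together, a minimal job script for the 4-rank MPI example might look like the following. This is a sketch: the script and application names, the job name, and the directives beyond the parallel environment request are assumptions.

#!/bin/bash
#$ -N mpi_example           # job name (assumed)
#$ -pe smp 8                # 8 slots = 4 physical cores plus their HT siblings
#$ -cwd                     # run from the submission directory
mpirun -np 4 ./my_mpi_app   # 4 ranks land on physical cores, leaving the HT cores idle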

Note that if you do not use the above strategy then it is possible that your job processes will share cores with other job processes. That may be okay, and even preferred for high throughput jobs, but it is something to keep in mind. It is especially important to keep this in mind when using the orte parallel environment. There is more discussion of the orte parallel environment on the Advanced Job Submission page. In short, that parallel environment is used in node sharing scenarios, which implies potential core sharing as well. For MPI jobs, that is probably not what you want. As on previous systems, there is a parallel environment (56cpn) for requesting entire nodes. This is especially useful for MPI jobs to ensure the best performance.
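
As a sketch, requesting two entire nodes through the 56cpn parallel environment would look something like the following; this assumes slot counts are requested in multiples of 56 (one full node per 56 slots) and that myjob.sh is your job script.

qsub -pe 56cpn 112 myjob.sh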

For MPI jobs, the system provided openmpi will not bind processes to cores by default, as would otherwise be the normal default for openmpi. It is set this way to avoid inadvertently oversubscribing processes on cores. In addition, the system openmpi settings will map processes by socket. This should give a good process distribution in all cases. However, if you wish to use fewer than 28 processes per node in an MPI job then you may want to map by node to get the most even distribution of processes across nodes. You can do that with the --map-by node option flag to mpirun.

mpirun --map-by node ...

If you wish to control mapping and binding in a more fine-grained manner, the mapping and binding parameters can be overridden with parameters to mpirun. Openmpi provides many options for fine-grained control of process layout. The options that are set by default should be good in most cases but can be overridden with the openmpi options for process mapping and binding, such as --map-by and --bind-to.

See the mpirun manual page,

man mpirun

for more detailed information. The defaults should be fine for most cases but if you override them keep the topology in mind.

If you set your own binding, for instance --bind-to core, be aware that the number of cores is half of the number of total HT processors. Note that core binding in and of itself may not really boost performance much. Generally speaking, if you want to minimize contention with hardware threads then simply request twice as many slots as the cores your job will use. Even if the processes are not bound to cores, the OS scheduler will do a good job of minimizing contention.
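
For instance, on a 56-slot node (28 physical cores) you could request the whole node and, inside the job script, bind one rank per physical core. This is a sketch; the script and application names are assumptions.

qsub -pe 56cpn 56 myjob.sh
mpirun -np 28 --bind-to core ./my_mpi_app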

If your job does not use the system openmpi, or does not use MPI, then any desired core binding will need to be set up with whatever mechanism the software uses. Otherwise, there will be no core binding. Again, that may not be a major issue. If your job does not work well with HT then run on a number of cores equal to half of the number of slots requested and the OS scheduler will minimize contention. 

New SGE utilities

While SoGE is very similar to previous versions of SGE, there are some new utilities that people may find of interest, such as the nodes-in-job command referenced below. There are manual pages for each of these.

SSH to compute nodes

On previous UI HPC systems it was possible to briefly ssh to any compute node before getting booted from that node if a registered job was not found. This was sufficient to run an ssh command, for instance, on any node. This is not the case for Argon. SSH connections to compute nodes will only be allowed if you have a registered job on that host. Of course, qlogin sessions will allow you to log in to a node directly as well. Again, if you have a job running on a node you can ssh to that node in order to check status, etc. You can find the nodes of a job with the nodes-in-job command mentioned above. We ask that you not do more than observe things while logged into a node, as it may have other jobs sharing it.
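
A sketch of checking on a running job in this way is shown below. The nodes-in-job invocation is an assumed usage (see its manual page for the actual syntax), and the hostname is only an example.

qstat -u $USER            # list your jobs
nodes-in-job <job_id>     # assumed usage: show the nodes assigned to the job
ssh argon-compute-1-01    # allowed only while you have a job registered on that node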

Queues and Policies


Queue | Node Description | Queue Manager | Slots | Total memory (GB)
AML | (1) 56-core 256GB | Aaron Miller | 56 | 256
ANTH | (4) 56-core 128GB; (2) 32-core 64GB | Andrew Kitchen | 278 | 640
ARROMA | (8) 56-core 128GB | Jun Wang | 448 | 1024
AS | (5) 56-core 256GB; (7) 32-core 64GB | Katharine Corum | 504 | 1728
BH | (1) 56-core 512GB | Bin He | 56 | 512
BIGREDQ | (13) 56-core 256GB | Sara Mason | 728 | 3328
BIOLOGY | (1) 56-core 256GB | Matthew Brockman | 56 | 256
BIOSTAT | (2) 56-core 128GB; (2) 32-core 256GB; (1) 24-core 512GB | Patrick Breheny, Grant Brown, Yuan Huang, Dan Sewell, Brian Smith | 200 | 1280
BIO-INSTR | (3) 56-core 256GB | Brad Carson, Bin He, Jan Fassler | 168 | 768
CBIG | (1) 56-core 256GB with P100 accelerator; (1) 64-core 192GB, not yet equipped with accelerators | Mathews Jacob | 120 | 448
CBIG-HM | (1) 56-core 512GB with P100 accelerator | Mathews Jacob | 56 | 512
CCOM | (18) 56-core 512GB (5 running jobs per user) | Boyd Knosp | 1008 | 9216
CCOM-GPU | (2) 56-core 512GB with P100 accelerator | Boyd Knosp | 112 | 1024
CGRER | (10) 56-core 128GB; (6) 32-core 64GB | Jeremie Moen | 752 | 1664
CHEMISTRY | (3) 56-core 256GB | Brad Carson | 168 | 768
CLAS-INSTR | (2) 56-core 256GB | Brad Carson | 112 | 512
CLAS-INSTR-GPU | (2) 40-core 192GB with 1080Ti accelerators (one node with a single accelerator, one node with two) | Brad Carson | 80 | 384
CLL | (5) 56-core 128GB; (10) 32-core 64GB; (1) 32-core 256GB | Mark Wilson, Brian Miller | 632 | 1536
COB | (2) 56-core 256GB; (1) 32-core 64GB | Brian Heil | 144 | 576
COB-GPU | (1) 40-core 192GB with (2) Titan V accelerators | Brian Heil | 40 | 192
COE | (10) 56-core 256GB; (11) 32-core 256GB; (10) 32-core 64GB (users are restricted to no more than three running jobs in the COE queue) | Matt McLaughlin | 1232 | 6016
COE-GPU | (2) 40-core 192GB with (4) Titan V accelerators; (2) 40-core 192GB with (4) 1080Ti accelerators | Matt McLaughlin | 160 | 768
DARBROB | (1) 56-core 256GB | Benjamin Darbro | 56 | 256
FERBIN | (13) 56-core 128GB | Adrian Elcock | 728 | 1664
MF | (6) 56-core 128GB; (1) 32-core 64GB | Michael Flatte | 368 | 832
MF-HM | (2) 56-core 512GB | Michael Flatte | 112 | 1024
FLUIDSLAB | (8) 56-core 128GB | Mark Wilson, Brian Miller | 448 | 1024
AIS | (1) 56-core 256GB | Grant Brown | 56 | 256
GEOPHYSICS | (3) 56-core 128GB | William Barnhart | 168 | 384
GV | (2) 56-core 256GB | Mark Wilson, Brian Miller | 112 | 512
HJ | (10) 56-core 128GB; (11) 32-core 64GB; (4) 32-core 256GB; (1) 24-core 512GB | Hans Johnson | 1064 | 3520
HJ-GPU | (1) 56-core 512GB with P100 accelerator | Hans Johnson | 56 | 512
IFC | (10) 56-core 256GB | Mark Wilson, Brian Miller | 560 | 2560
IIHG | (10) 56-core 256GB | Diana Kolbe | 560 | 256
INFORMATICS | (12) 56-core 256GB; (15) 32-core 256GB | Ben Rogers, UI3 Faculty | 1152 | 6912
INFORMATICS-GPU | (2) 56-core 256GB with Titan V accelerators; (2) 40-core 192GB with (3) Titan V accelerators | Ben Rogers | 192 | 896
INFORMATICS-HM-GPU | (1) 56-core 512GB with (2) P100 accelerators | Ben Rogers | 56 | 512
IVR | (4) 56-core 256GB; (1) 56-core 512GB; (4) 32-core 64GB; (1) 32-core 256GB | Todd Scheetz | 440 | 2048
IVR-GPU | (1) 56-core 512GB with K80 accelerator | Todd Scheetz | 56 | 1536
IVRVOLTA | (4) 56-core 512GB with Titan V | Mike Schnieders | 224 | 2048
IWA | (11) 56-core 128GB | Mark Wilson, Brian Miller | 616 | 1408
JM | (3) 56-core 512GB; (2) 32-core 512GB; (2) 24-core 512GB | Jake Michaelson | 280 | 3584
JM-GPU | (1) 56-core 256GB with P100 accelerator | Jake Michaelson | 56 | 512
JP | (2) 56-core 512GB | Virginia Willour | 112 | 1024
JS | (10) 56-core 256GB | James Shepherd | 560 | 2560
LUNG | (2) 56-core 512GB with P40 accelerator | Joe Reinhardt | 112 | 1024
MANSCI | (1) 56-core 128GB; (1) 32-core 64GB | Qihang Lin | 88 | 192
MANSCI-GPU | (1) 56-core 512GB with P100 accelerator | Qihang Lin | 56 | 512
MANORG | (1) 56-core 128GB | Michele Williams/Brian Heil | 56 | 128
MORL | (5) 56-core 256GB | William (Daniel) Walls | 280 | 1280
MS | (5) 56-core 256GB with (2) P100 GPUs; (7) 40-core 96GB with (4) 1080Ti GPUs; (1) 40-core 96GB with (4) Titan V GPUs; (20) 32-core 64GB | Mike Schnieders | 1240 | 3328
NEURO | (1) 56-core 256GB | Marie Gaine/Ted Abel | 56 | 256
NOLA | (1) 56-core 512GB | Ed Sander | 56 | 512
PINC | (6) 56-core 256GB | Jason Evans | 336 | 1536
REX | (4) 56-core 128GB | Mark Wilson, Brian Miller | 224 | 512
REX-HM | (1) 56-core 512GB | Mark Wilson, Brian Miller | 56 | 512
SB | (4) 56-core 128GB; (2) 32-core 64GB | Scott Baalrud | 288 | 576
STATEPI | (1) 56-core 256GB | Linnea Polgreen | 56 | 256
UDAY | (4) 56-core 128GB | Mark Wilson, Brian Miller | 224 | 512
UI | (20) 56-core 256GB; (66) 32-core 64GB |  | 3232 | 9344
UI-DEVELOP | (1) 56-core 256GB; (1) 56-core 256GB with P100 accelerator |  | 112 | 512
UI-GPU | (5) 56-core 256GB with P100 accelerator; (2) 40-core 192GB with (4) 1080Ti accelerators; (4) 40-core 192GB with (4) Titan V accelerators; (1) 40-core 192GB with (2) Titan V accelerators |  | 560 | 2624
UI-HM | (5) 56-core 512GB; (3) 24-core 512GB |  | 352 | 4096
UI-MPI | (19) 56-core 256GB |  | 1064 | 4864
all.q | (174) 32-core 64GB; (115) 56-core 128GB; (2) 64-core 192GB; (49) 32-core 256GB; (154) 56-core 256GB; (8) 24-core 512GB; (2) 32-core 512GB; (42) 56-core 512GB; (10) 32-core 64GB with (1) K20 accelerator; (1) 32-core 256GB with (1) K20 accelerator; (2) 56-core 512GB with (1) K80 accelerator; (6) 56-core 256GB with (1) P100 accelerator; (4) 56-core 256GB with (2) P100 accelerators; (8) 56-core 512GB with (1) P100 accelerator; (2) 56-core 512GB with (2) P100 accelerators; (2) 56-core 512GB with (1) P40 accelerator; (3) 56-core 256GB with (1) Titan V accelerator; (4) 56-core 512GB with (1) Titan V accelerator; (1) 40-core 96GB with (4) Titan V accelerators; (6) 40-core 192GB with (4) Titan V accelerators; (2) 40-core 192GB with (3) Titan V accelerators; (2) 40-core 192GB with (2) Titan V accelerators; (7) 40-core 96GB with (4) 1080Ti accelerators; (4) 40-core 192GB with (4) 1080Ti accelerators; (1) 40-core 192GB with (2) 1080Ti accelerators; (1) 40-core 192GB with (1) 1080Ti accelerator |  | 28064 | 113728
NEUROSURGERY | (1) 56-core 512GB with K80 accelerator | Haiming Chen | 56 | 512
SEMI | (1) 56-core 128GB | Craig Pryor | 56 | 128
ACB | (1) 56-core 256GB | Adam Dupuy | 56 | 256
FFME | (16) 56-core 128GB | Mark Wilson | 896 | 2048
FFME-HM | (1) 56-core 512GB | Mark Wilson | 56 | 512
RP | (2) 56-core 512GB | Robert Philibert | 112 | 1024
LT | (2) 56-core 512GB with P100 accelerator; (1) 32-core 256GB with K20 accelerator; (1) 24-core 512GB | Luke Tierney | 168 | 1792
KA | (1) 56-core 512GB; (1) 32-core 256GB | Kin Fai Au | 88 | 768
BA | (2) 32-core 64GB | Bruce Ayati | 64 | 128
DAWSON | (1) 32-core 256GB | Jeff Dawson | 32 | 256
FISH | (9) 32-core 64GB | Larry Weber | 288 | 576
GW | (2) 32-core 256GB | Ginny Willour | 64 | 512
JES | (1) 56-core 512GB | Jacob Simmering | 56 | 512
MP | (1) 32-core 256GB | Miles Pufall | 32 | 256
PABLO | (6) 32-core 64GB; (1) 32-core 256GB |  | 224 | 640
SH | (5) 32-core 256GB | Shizhong Han | 160 | 1280
SHIP | (10) 32-core 64GB | Fred Stern | 320 | 640
TB | (1) 32-core 64GB | Terry Braun | 32 | 64



The University of Iowa (UI) queue

A significant portion of the HPC cluster systems at UI were funded centrally. These nodes are put into queues named UI or prefixed with UI-.

These queues are available to everyone who has an account on an HPC system. Since that is a fairly large user base there are limits placed on these shared queues. Also note that there is a limit of 10000 active (running and pending) jobs per user on the system.

Centrally funded queues | Node Description | Wall clock limit | Running jobs per user
UI | (20) 56-core 256GB; (66) 32-core 64GB | None | 2
UI-HM | (5) 56-core 512GB; (3) 24-core 512GB | None | 1
UI-MPI (56 slot minimum) | (19) 56-core 256GB | 48 hours |
UI-GPU | (5) 56-core 256GB with P100 accelerator; (2) 40-core 192GB with (4) 1080Ti accelerators; (4) 40-core 192GB with (4) Titan V accelerators; (1) 40-core 192GB with (2) Titan V accelerators | None | 1
UI-DEVELOP | (1) 56-core 256GB; (1) 56-core 256GB with P100 accelerator | 24 hours | 1

Note that the number of slots available in the UI queue can vary depending on whether anyone has purchased a reservation of nodes. The UI queue is the default queue and will be used if no queue is specified. This queue is available to everyone who has an account on a UI HPC cluster system. 

Please use the UI-DEVELOP queue for testing new jobs at a smaller scale before committing many nodes to your job.
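
For example, a hypothetical small-scale test submission to that queue (the script name and slot count are assumptions):

qsub -q UI-DEVELOP -pe smp 4 myjob.sh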

In addition to the above, the HPC systems have some nodes that are not part of any investor queue. These are described at /wiki/spaces/hpcdocs/pages/76513448 and are used for node rentals and future purchases. The number of nodes for this purpose varies.
