...
Like previous UI HPC systems, Argon uses SGE, although this version is based on a slightly different code base. If anyone is interested in the history of SGE there is an interesting writeup at History of Grid Engine Development. The version of SGE that Argon uses is from the Son of Grid Engine (SoGE) project. For the most part this will be very familiar to people who have used previous generations of UI HPC systems. One thing that will look a little different is the output of the qhost command, which will show the CPU topology.
...
As you can see, that shows the number of CPUs (NCPU), the number of CPU sockets (NSOC), the number of cores (NCOR) and the number of threads (NTHR). This information could be important as you plan jobs, but it essentially reflects what was said in regard to HT cores. Note that all Argon nodes have the same processor topology. SGE uses the concept of job slots, which serve as a proxy for the number of cores as well as the amount of memory on a machine. Job slots are one of the resources that are requested when submitting a job to the system. As a general rule, the number of job slots requested should be equal to or greater than the number of processes/threads that will actually consume resources.
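If you want to see how many job slots are in use or available, qstat can summarize them per cluster queue; a minimal sketch:

```
# Show a per-queue summary of used and available job slots
qstat -g c
```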
Using the Basic Job Submission and Advanced Job Submission pages as a reference, how would one submit jobs taking HT into account? For single-process, high-throughput jobs it probably does not matter; just request one slot per job. For multithreaded jobs, request one job slot per thread. So if your application runs best with 4 threads then request something like the following.
...
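For illustration, such a request might look like the following sketch; the parallel environment name (smp) and the script name are assumptions and may differ from what is actually configured on Argon.

```
# Hypothetical: request 4 slots for a job that runs 4 threads
# (one slot per thread); the smp PE name is an assumption
qsub -pe smp 4 -cwd my_threaded_job.sh
```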
That will run on two physical cores and two HT cores. For non-threaded processes that are also CPU bound, you can avoid running on HT cores by requesting twice as many slots as the number of cores that will be used. So, if your process is a non-threaded MPI process, and you want to run 4 MPI ranks, your job submission would be something like the following.
...
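As a sketch of such a request, again with an assumed parallel environment name (smp) and a hypothetical job script:

```
# Hypothetical: request 8 slots (2x the 4 MPI ranks) so that the
# HT cores are blocked off; the smp PE name is an assumption
qsub -pe smp 8 -cwd my_mpi_job.sh
```

Inside the job script you would still launch only the 4 ranks, e.g. mpirun -np 4 ./my_mpi_program.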
That would run the 4 MPI ranks on physical cores and not HT cores. That works because we have set the default SGE core binding strategy to linear. Unless overridden by the user, this strategy will bind processes to the next available core. Since HT cores are mapped after all physical cores, this will fill the actual cores first. Once the slots are used, as they will be because the number of slots is twice the number of cores, the HT cores are effectively blocked. Note that this will work for non-MPI jobs as well. If you have a non-threaded process that you want to ensure runs on an actual core, you could use the same 2x slot request.
...
Note that if you do not use the above strategy then it is possible that your job process will share a core with another job process. That may be okay, and even preferred for high-throughput jobs, but it is something to keep in mind.
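If you prefer to be explicit rather than rely on the default, SGE also accepts a -binding request on the qsub command line; a sketch, with the same assumed PE and script names:

```
# Hypothetical: explicitly request linear core binding for a 4-slot job
qsub -pe smp 4 -binding linear:4 -cwd my_threaded_job.sh
```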
...
While SoGE is very similar to previous versions of SGE there are some new utilities that people may find of interest. There are manual pages for each of these.
- qstatus: Reformats output of qstat and can calculate job statistics.
- dead-nodes: This will tell you what nodes are not physically participating in the cluster.
- idle-nodes: This will tell you what nodes do not have any activity on them.
- busy-nodes: This will tell you what nodes are running jobs.
- nodes-in-job: This is probably the most useful. Given a job ID it will list the nodes that are in use for that particular job.
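As a quick sketch of how a couple of these might be used (the job ID shown is hypothetical):

```
# Reformatted qstat output with job statistics
qstatus

# List the nodes in use by a particular job (job ID is hypothetical)
nodes-in-job 1234567
```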
...
On previous UI HPC systems it was possible to briefly ssh to any compute node before getting booted from that node if a registered job was not found. This was sufficient to run an ssh command, for instance, on any node. This is not the case for Argon. SSH connections to compute nodes will only be allowed if you have a registered job on that host. Of course, qlogin sessions will allow you to log in to a node directly as well. Again, if you have a job running on a node you can ssh to that node in order to check status, etc. You can find the nodes of a job with the nodes-in-job command mentioned above. We ask that you not do anything more than observe things while logged into the node, as the node may be shared with other jobs.
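For example, a sketch of checking on a running job (the job ID and node name are hypothetical):

```
# Find the nodes your job is running on (job ID is hypothetical)
nodes-in-job 1234567

# ssh to one of them (node name is hypothetical) and observe only,
# e.g. run top to check your processes, then log out
ssh argon-compute-1-01
```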
...
While there are many software applications installed from RPM packages, many commonly used packages, and their dependencies, are built from source. See the Argon Software List to view the packages and versions installed. Note that this list does not include all of the dependencies that are installed, which will often be newer versions than those installed via RPM. Use of these packages is facilitated through environment modules, which will set up the appropriate environment for the application, including loading required dependencies. Some packages, like Perl, Ruby, R and Python, are extendable. We build a set of extensions based on commonly used and requested extensions, so loading the module for one of those packages will load all of the extensions, as well as the dependencies needed for both the core package and the extensions. The number of extensions installed, particularly for Python and R, is too large to list here. You can use the standard tools of those packages to determine what extensions are installed.
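For instance, a sketch of listing installed extensions after loading a module; the module names here are assumptions, so check module avail for the actual names:

```
# Python: list the installed packages (module name is an assumption)
module load python
pip list

# R: list the installed packages from within R
module load R
Rscript -e 'rownames(installed.packages())'
```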
...
Like previous generation UI HPC systems, Argon uses environment modules for managing the shell environment needed by software packages. Argon uses Lmod rather than the TCL modules used in previous generation UI HPC systems. More information about Lmod can be found in the Lmod: A New Environment Module System — Lmod 6.0 documentation. Briefly, Lmod provides improvements over TCL modules in some key ways. One is that Lmod will automatically load and/or swap dependent environment modules when higher level modules are changed in the environment. It can also temporarily deactivate modules if a suitable alternative is not found, and can reactivate those modules when the environment changes back. We are not using all of the features that Lmod is capable of, so the behavior of modules should be very close to previous systems but with a more robust way of handling dependencies. There is a module spider command that can be used to list modules, but the module avail command is present as well. The module spider command is really designed for a hierarchical module layout, which Argon does not use, so there is little benefit to using module spider versus module avail to list the installed module files on Argon.
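As a sketch of the common listing commands (the package name is a placeholder):

```
# List all installed module files
module avail

# Search for a particular module by name
module spider somepackage

# Load a module and confirm what is now in the environment
module load somepackage
module list
```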
An important point needs to be made for those who like to load modules in their shell startup files, i.e., ~/.bashrc. One of the things that environment modules set up is $LD_LIBRARY_PATH. However, when a setuid/setgid program runs it unsets $LD_LIBRARY_PATH for security reasons. One such setgid program is the Duo login program that runs as part of an ssh session. This can leave you with a partially broken environment: a module is loaded and sets $LD_LIBRARY_PATH, but the variable is then unset before shell initialization is complete. This was worked around on previous systems by always forcing a reload of the environment module, but that is not very efficient. This scenario should not be an issue on Argon, as all software is built with RPATH support, meaning the library paths are embedded in the binaries. In theory, $LD_LIBRARY_PATH should not be needed, but this is something to keep in mind if you are loading modules from your ~/.bashrc or similar.
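If you want to confirm that a given binary has its library paths embedded (and so does not depend on $LD_LIBRARY_PATH), something like the following sketch can be used; the binary path is hypothetical:

```
# Look for an embedded RPATH/RUNPATH entry in a binary (path is hypothetical)
readelf -d /path/to/some/binary | grep -E 'RPATH|RUNPATH'
```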
Lmod provides a mechanism to save a set of modules that can then be restored. For those who wish to load modules at shell startup this provides a better mechanism than calling individual module files. The reasons are that
...
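As a sketch of the save/restore workflow (the module and collection names are arbitrary placeholders):

```
# Load the modules you want, then save them as a named collection
module load somepackage anotherpackage
module save my_default

# Later, e.g. from ~/.bashrc or at the start of a session, restore them
module restore my_default

# List the collections you have saved
module savelist
```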
Other than the above items, and some other additional features, the environment modules controlled by Lmod should behave very similarly to the TCL modules on previous UI HPC systems.
...
Unix attributes were recently added to the campus Active Directory Service and Argon makes use of those. One of those attributes is the default Unix shell. This can be set via the following HawkID tool: Set Login Shell - Conch. Most people will want the shell set to /bin/bash
so that would be a good choice if you are not sure. For reference, previous generation UI HPC systems set the shell to /bin/bash
for everyone, unless requested otherwise. We recommend that you check your shell setting via the Set Login Shell - Conch tool and set it as desired before logging in the first time. Note that changes to the shell setting may take up to 24 hours to become effective on Argon.
...
Finally, there is a cluster-wide queue called all.q. This queue encompasses all of the nodes and contains all of the available job slots. It is available to everyone with an account and there are no running job limits. However, it is a low priority queue on the same nodes as the higher priority investor and UI queues. The all.q queue is subordinate to these other queues and jobs running in it will relinquish the nodes they are running on when jobs in the high priority queues need them. The term we use for this is "job eviction". Jobs running in the all.q queue are the only ones subject to this.
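For example, a minimal sketch of submitting to all.q (the script name is hypothetical):

```
# Submit to the low-priority, cluster-wide queue; be aware the job
# can be evicted if higher priority queues need the nodes
qsub -q all.q -cwd my_job.sh
```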
...