2022.1 Stack
Summary
The back end of the 2022.1 Argon software stack differs substantially from that of previous stacks. This leads to some differences in behavior when interacting with the stack, which are discussed below. The goal is for this stack to feel familiar while working a little better. Many packages have been updated, new packages have been added, and more will be added over time. As always, please report any issues to research-computing@uiowa.edu.
What is new
Distribution Mechanism
Until now, all of the software stacks have been distributed to login and compute nodes via NFS (Network File System). Beginning with the 2022.1 stack, the distribution mechanism is the CernVM File System (CVMFS). CVMFS is designed for software distribution over wide-area networks and is expected to improve how software is delivered to the Argon login and compute nodes. This does not apply to the older software stacks, which will continue to be deployed via NFS. In addition, for technical reasons, the environment module files and links to binaries are still served via NFS. Finally, licensed software on Argon will continue to be distributed via NFS, to avoid violating license terms when the Argon stack is eventually made available to run on systems other than Argon.
Gentoo Prefix Middle Layer
A Gentoo Prefix layer sits on top of the OS layer, and the software packages are linked against the libraries of this middle layer. This layer is exported via CVMFS along with the software packages. The goal is to make the packages as independent of the underlying OS as possible, so that the OS can be upgraded or changed without breaking the ability of the software to run on the newer OS. This also lays the foundation for running the Argon software stack on non-Argon systems, although that is further down the road. Finally, it can also play a role in software reproducibility. This is effectively a cross-compiled environment, and there are some caveats that come with that, which are discussed below. Using a middle layer could cause some issues, so please report anything that seems odd or broken.
Native Lua module files for Lmod
Argon uses Lmod for environment modules, but previous software stacks used Tcl-based environment module files for historical and technical reasons. Those Tcl files had to be converted to Lua on the fly, and while that conversion is fast, it does add overhead. The native Lua files should make searching for and loading modules faster and also provide extra capabilities.
What this all means
Running software packages
In prior stacks, the only purpose of the "stack" environment module was to manipulate the module path so that the packages of that stack became available to load. With the middle layer, loading the "stack/2022.1" module not only manipulates the module path but also effectively loads a new Linux environment, which replaces many of the OS commands. One long-standing issue is that loading an environment module can alter the environment such that system commands break. Good examples were editors like vim and emacs. That was alleviated by providing vim and emacs packages to replace the system ones. The concept is the same here, except that the replacements are now in the stack module itself, with no need to load additional modules. In addition to many common system commands, the following important packages are loaded into the environment by the stack/2022.1 module:
GCC
autotools utilities
cmake
gmake
tar
meson
git
ninja
texinfo
perl
python
rust
emacs
vim
The caveat is that, because many commands are replaced with versions newer than the system ones, behavior may differ when switching between the 2022.1 stack and other stacks.
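One way to see which commands have been replaced is to check where a tool resolves on the PATH before and after loading the stack module. The following is a minimal sketch. The helper name show_tool_origin is hypothetical, and the paths printed will depend on the system.

```shell
# Hypothetical helper: print where each named tool resolves on the current PATH.
show_tool_origin() {
  for tool in "$@"; do
    # command -v prints the resolved path, or fails if the tool is not found.
    resolved=$(command -v "$tool" 2>/dev/null) || resolved="(not found)"
    printf '%-8s %s\n' "$tool" "$resolved"
  done
}
```

Running "show_tool_origin vim git python" once before and once after "module load stack/2022.1" shows which of those commands now come from the stack rather than from the OS.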
Another change in behavior concerns environment module dependencies. Because the 2022.1 stack uses native Lmod Lua files, module dependencies are handled more intelligently. As always, loading a module also loads the modules for its dependencies. What is different now is the unloading behavior. Previously, unloading a module would unload only the specified module, leaving the rest of the stack in place, which is probably more modules than desired. With the 2022.1 stack, unloading a module also unloads the dependencies that were loaded with it. In addition, if an underlying dependency is unloaded on its own, a message is printed so that you know about the now-missing dependency. Here is an example to illustrate.
module load stack/2022.1
module list
Currently Loaded Modules:
1) stack/2022.1
Load a module with dependencies.
module load py-tensorflow-estimator
module list
Currently Loaded Modules:
1) stack/2022.1 27) py-wheel/0.37.0_gcc-9.4.0 53) py-cffi/1.15.0_gcc-9.4.0
2) cuda/11.4.4_gcc-9.4.0 28) py-astunparse/1.6.3_gcc-9.4.0 54) py-semantic-version/2.8.2_gcc-9.4.0
3) cudnn/8.2.4.15-11.4_gcc-9.4.0 29) py-gast/0.4.0_gcc-9.4.0 55) py-toml/0.10.2_gcc-9.4.0
4) python/3.9.9_gcc-9.4.0 30) py-google-pasta/0.2.0_gcc-9.4.0 56) rust/1.58.1_gcc-9.4.0
5) flatbuffers/1.12.0_gcc-9.4.0 31) c-ares/1.15.0_gcc-9.4.0 57) py-setuptools-rust/0.12.1_gcc-9.4.0
6) intel-oneapi-mkl/2022.0.2_gcc-9.4.0 32) re2/2021-06-01_gcc-9.4.0 58) py-cryptography/36.0.1_gcc-9.4.0
7) rdma-core/39.0_gcc-9.4.0 33) py-grpcio/1.43.0_gcc-9.4.0 59) py-pyjwt/2.1.0_gcc-9.4.0
8) nccl/2.11.4-1_gcc-9.4.0 34) libaec/1.0.5_gcc-9.4.0 60) py-oauthlib/3.1.0_gcc-9.4.0
9) libpciaccess/0.16_gcc-9.4.0 35) pkgconf/1.8.0_gcc-9.4.0 61) py-certifi/2021.10.8_gcc-9.4.0
10) libiconv/1.16_gcc-9.4.0 36) hdf5/1.12.1_gcc-9.4.0-mpi 62) py-charset-normalizer/2.0.12_gcc-9.4.0
11) xz/5.2.5_gcc-9.4.0 37) py-mpi4py/3.1.2_gcc-9.4.0 63) py-idna/3.3_gcc-9.4.0
12) zlib/1.2.11_gcc-9.4.0 38) py-setuptools/59.4.0_gcc-9.4.0 64) py-urllib3/1.26.6_gcc-9.4.0
13) libxml2/2.9.12_gcc-9.4.0 39) py-numpy/1.21.5_gcc-9.4.0 65) py-requests/2.26.0_gcc-9.4.0
14) ncurses/6.2_gcc-9.4.0 40) py-h5py/3.6.0_gcc-9.4.0-mpi 66) py-requests-oauthlib/1.3.0_gcc-9.4.0
15) hwloc/2.7.0_gcc-9.4.0 41) py-keras-preprocessing/1.1.2_gcc-9.4.0 67) py-google-auth-oauthlib/0.4.6_gcc-9.4.0
16) openssl/1.1.1m_gcc-9.4.0 42) py-libclang/11.1.0_gcc-9.4.0 68) py-markdown/3.3.4_gcc-9.4.0
17) libevent/2.1.12_gcc-9.4.0 43) py-opt-einsum/3.3.0_gcc-9.4.0 69) py-tensorboard-data-server/0.6.1_gcc-9.4.0
18) numactl/2.0.14_gcc-9.4.0 44) py-protobuf/3.17.3_gcc-9.4.0 70) py-tensorboard-plugin-wit/1.8.1_gcc-9.4.0
19) opa-psm2/11.2.206_gcc-9.4.0 45) py-cachetools/4.2.4_gcc-9.4.0 71) py-werkzeug/2.0.2_gcc-9.4.0
20) ucx/1.10.1_gcc-9.4.0 46) py-pyasn1/0.4.8_gcc-9.4.0 72) py-tensorboard/2.7.0_gcc-9.4.0
21) libfabric/1.14.0_gcc-9.4.0 47) py-pyasn1-modules/0.2.8_gcc-9.4.0 73) py-termcolor/1.1.0_gcc-9.4.0
22) openssh/8.8p1_gcc-9.4.0 48) py-rsa/4.7.2_gcc-9.4.0 74) py-typing-extensions/3.10.0.2_gcc-9.4.0
23) openmpi/4.1.2_gcc-9.4.0 49) py-google-auth/2.3.2_gcc-9.4.0 75) py-wrapt/1.13.3_gcc-9.4.0
24) protobuf/3.17.3_gcc-9.4.0 50) py-blinker/1.4_gcc-9.4.0 76) py-tensorflow/2.7.0_gcc-9.4.0
25) py-six/1.16.0_gcc-9.4.0 51) libffi/3.4.2_gcc-9.4.0 77) py-keras/2.7.0_gcc-9.4.0
26) py-absl-py/0.13.0_gcc-9.4.0 52) py-pycparser/2.20_gcc-9.4.0 78) py-tensorflow-estimator/2.7.0_gcc-9.4.0
Unload the same module.
module unload py-tensorflow-estimator
module list
Currently Loaded Modules:
1) stack/2022.1
Reload the module.
Unload a dependency.
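As a minimal sketch of these last two steps, the commands might look like the following. The choice of py-numpy as the dependency to unload is an assumption made for illustration (it appears in the listing above), and the wrapper function name is hypothetical. The message that Lmod prints when the dependency is removed is not reproduced here.

```shell
# Hypothetical wrapper around the reload and dependency-unload steps.
# Assumes stack/2022.1 is already loaded, as in the steps above.
reload_then_unload_dependency() {
  # Bail out where Lmod's module command is unavailable.
  if ! command -v module >/dev/null 2>&1; then
    echo "module command not found" >&2
    return 1
  fi
  # Reload the module; its dependencies are loaded again with it.
  module load py-tensorflow-estimator
  # Unload one underlying dependency (py-numpy is an illustrative choice).
  module unload py-numpy
  # Show what remains loaded.
  module list
}
```

The three module commands can equally be typed directly at the prompt. The function only groups them and guards against running where Lmod is not present.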
CUDA GPU
Software that uses CUDA ultimately depends on a kernel module to access the hardware, and the library interfaces to that kernel module must be part of the OS layer. To facilitate communication between the software packages and the kernel drivers, those libraries are linked into the middle layer. This is the same thing that is done with container images. What is important to note is that there can be only one driver at a time, and it will likely change as OS updates happen. This is no different from past stacks, which had a locked version of CUDA but used whatever driver was installed in the OS. In other words, no NVIDIA CUDA driver version is part of the 2022.1 stack, or of any stack or container. The only difference is that a set of links is now part of the stack, but what they point to is part of the OS. This may also become important if the Argon software stack is run on a non-Argon system.
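To see this split in practice, one can compare the CUDA toolkit version that comes from the stack with the driver version that comes from the OS. The following is a minimal sketch, assuming a cuda module is loaded so that nvcc is on the PATH and that nvidia-smi is installed on the node. The function name is hypothetical.

```shell
# Hypothetical helper: report the toolkit (stack) and driver (OS) versions.
show_cuda_versions() {
  if command -v nvcc >/dev/null 2>&1; then
    # Toolkit release, fixed by the cuda module in the stack.
    nvcc --version | grep release
  else
    echo "nvcc not found (is a cuda module loaded?)"
  fi
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version, provided by the OS and subject to change with OS updates.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
  else
    echo "nvidia-smi not found (no NVIDIA driver on this node?)"
  fi
}
```

On a GPU node, the toolkit release should stay the same for a given stack, while the driver version may differ after an OS update or on another system.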
MPI
Similar to CUDA, there are drivers at the OS layer that provide access to the high-speed interconnect fabric. The abstracted fabric layers should be able to interface with the underlying drivers, but again there is a dependency on the underlying OS, where things could change over time or vary across systems.
Development
While the primary purpose of the software stack is running software, the environment modules also provide variables needed for development. This may look a little different with the 2022.1 stack because it is a cross-compiled environment. Software developed using previous stacks may not run in this environment and may need to be recompiled, because a different SYSROOT is used. If you are developing software that uses the 2022.1 stack, make sure your build system does not pick up any headers or libraries from the OS, i.e., from:
/lib
/lib64
/usr/lib
/usr/lib64
/usr/include
Instead, prefix the paths above with ${EPREFIX}, which points to the SYSROOT used by the stack. In addition, use the GCC or oneAPI compilers from this environment, which are already aware of the SYSROOT. Note that the NVHPC compiler is untested.
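A build script can guard against accidentally picking up OS headers or libraries with a small check on its compiler flags. The following is a minimal sketch. The function name check_flags is hypothetical, and the fallback value given to EPREFIX is a placeholder so that the snippet also runs outside the stack (inside the stack, EPREFIX is expected to be set already).

```shell
# Placeholder so the sketch runs outside the stack; this assignment
# does nothing when EPREFIX is already set to a non-empty value.
: "${EPREFIX:=/hypothetical/prefix}"

# Hypothetical helper: fail if any -I/-L flag starts with an OS directory
# (/lib*, /usr/lib*, /usr/include*) instead of a path under ${EPREFIX}.
# This is a simple prefix match, so it errs on the side of flagging.
check_flags() {
  rc=0
  for flag in "$@"; do
    case "$flag" in
      -[IL]/lib*|-[IL]/usr/lib*|-[IL]/usr/include*)
        printf 'OS path in flag: %s\n' "$flag"
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Flags built from EPREFIX pass the check and print nothing.
check_flags "-I${EPREFIX}/usr/include" "-L${EPREFIX}/usr/lib64"
```

A call such as "check_flags -L/usr/lib64" prints the offending flag and returns a non-zero status, which a build script can use to stop early.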