...
University of Iowa Cluster Systems
The University of Iowa currently has a shared HPC cluster available for campus researchers to use. The shared system is run primarily by ITS-Research Services. The cluster is capable of running both High Performance jobs and High Throughput jobs. The system comprises several hundred compute nodes with several thousand processor cores.
What are the differences between High Throughput Computing (Shared Memory) and High Performance Computing (Distributed Memory)?
High Performance Computing enables a user to solve a single, large problem by harnessing a large number of processors and memory across multiple compute nodes. These types of problems are typically broken down into pieces and processed in parallel, with each compute node working on a different part of the problem. Each node communicates with the other nodes working on the problem via a high-speed interconnect; in our case, we use OmniPath. Parallel processing typically requires code modification in order to utilize a library such as MPI, which in turn facilitates parallel communication between the nodes working to solve the problem. Examples of problems that use High Performance Computing are Computational Fluid Dynamics and Molecular Dynamics. More information on using MPI on our HPC systems is available here: MPI Implementation.
...
Getting started with Linux
Our HPC cluster runs CentOS Linux, in the version that was current at the time of system deployment. In order to make use of the cluster, users will need a basic understanding of how to interact with a Linux system at the command line. At a minimum, you will need to know how to move around the system and how to copy and edit files. There are many resources on the Internet devoted to helping you learn your way around a Linux system. One of the best resources available is a book called The Linux Command Line, which is available as a free PDF download here. For a quicker overview of basic Linux commands, there is a good Linux Cheat Sheet here.
...
Mapping your work to one of the clusters
If your compute problem is not tractable on a desktop or lab workstation, uses a large amount of memory, requires a rapid turnaround of results, or would benefit from being scheduled, then an HPC cluster may be a good fit for you. The next steps are to determine whether your computation runs on Linux, whether it can be run in batch mode (non-interactively), and whether it is a high performance (parallel) or high throughput (serial) job. Determining the answers to these questions will help decide how to go about requesting and utilizing HPC resources. Some additional questions to consider are:
- Will you need to recompile your code to run on our cluster?
  If you are bringing code over from another system, you may need to recompile it to work on our systems, especially if you are using MPI (of which we offer a few different varieties). We have some additional notes on compiling here: Compiling Software
- What software will your job need, and is it available centrally, or could it be installed in your home directory?
  Our list of installed software is here: Software Installations. If you don't see a package you need, please let us know. If it is broadly applicable to a number of users, we may install it centrally; otherwise we will help install it into your home directory.
- Can you estimate how much memory your job will need?
  Knowing approximately how many processes you will need or how much memory to request will help ensure you request enough resources for your job to complete. One way to discover this is to run a small version of the job to see how much memory it uses and then calculate how much it would use if you were to double or triple it in size. We also offer a small sandbox/development queue on the HPC cluster where you may submit small jobs to see how things go, and then tweak your resource requests accordingly.
Getting your data into the cluster
If your data is not large, the quickest way to get your data onto the cluster is to use scp, rsync or sftp from the command line or via an application such as Fetch (Mac based) or IPSwitch (Windows based). If you have larger data sets (larger meaning several Gigabytes or more), then you can utilize our Globus Online connection.
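As a rough illustration of the command-line route, the sketch below wraps rsync in a small helper. The function name `upload_dir`, the `DRY_RUN` switch, and the `<username>@<cluster-login-host>` destination are placeholders introduced here for illustration; substitute the login host and paths that apply to your account.

```shell
#!/bin/bash
# Hypothetical helper (not part of the cluster tooling): copy a local
# directory to a remote destination with rsync.
#   $1 = local source directory
#   $2 = remote destination, e.g. <username>@<cluster-login-host>:<path>
upload_dir() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it, to check it first.
        echo "rsync -avz --partial $1 $2"
    else
        # -a preserves permissions and timestamps, -v is verbose,
        # -z compresses in transit, --partial keeps partially
        # transferred files so an interrupted copy can resume.
        rsync -avz --partial "$1" "$2"
    fi
}

# Example (placeholders must be replaced before running):
#   upload_dir ./mydata/ "<username>@<cluster-login-host>:mydata/"
```

Running with `DRY_RUN=1` first is a cheap way to confirm the source and destination before any data moves.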
...
Once your data is uploaded to the cluster, you are ready to work on getting a job submitted to the cluster. If you are going to use one of our centrally installed software packages, you'll need to load the module for it into your environment. More in-depth information on this is on our Environment Modules page. Basically, you'd use the command
    module avail

to list the modules available to choose from. Then you'd use

    module load <module-name>
...
Another factor to consider is how many resources you need to request from the cluster for your job. In High Performance Computing, resources are parceled out in units called "slots". A slot is a combination of a CPU and a memory allocation, based on the memory available on the nodes where your job will be running. The cluster has different types of machines inside of it, which are defined by the number of cores and the amount of RAM that each offers. Slots from each resource will be defined accordingly. For example, for a node with 64G of memory, a slot will be 1 CPU & 4G RAM, while for a 256G memory node, a slot would be a proxy for 1 CPU & 16G RAM. Once you have an idea of how many processors and/or how much memory your computation will need, you can use this information to calculate how many slots you will need to request for your job.
For example, if your computational problem is to process data from thousands of large image files, you'd need to first figure out how much memory is required to process one file, and extrapolate accordingly. If processing each image requires 2G RAM, and a node offers 4G per slot, you could request one slot for each image.
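The slot arithmetic above can be sketched as a small shell function. The name `slots_needed` is made up for this example, and it assumes whole-gigabyte figures; it simply rounds the memory requirement up to the next full slot.

```shell
#!/bin/bash
# Hypothetical helper: how many slots cover a given memory requirement?
#   $1 = memory needed by the task, in whole GB
#   $2 = memory provided per slot on the target node type, in whole GB
slots_needed() {
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

slots_needed 2 4     # 2G image, 4G per slot   -> prints 1
slots_needed 10 4    # 10G task, 4G per slot   -> prints 3
slots_needed 40 16   # 40G task, 16G per slot  -> prints 3
```

Note that each slot also carries one CPU, so a memory-heavy serial job will hold cores it does not use; that is one reason to prototype before settling on a request.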
You may find that running small prototyping jobs is necessary in order to come up with an accurate resource request. For this, the cluster offers a small "sandbox", or development, queue where you may run small versions of your jobs. You may also use qlogin to run interactively in order to get an idea of how your job will run on the cluster nodes.
Launching Your Job
Our cluster uses the SGE scheduler to match job submissions with available resources. There is extensive documentation on using SGE and all the options available. We offer pages on Basic Job Submission and Advanced Job Submission for the cluster. Launching jobs is done via qsub, with options given on the command line or via special commands in the job script, which are then passed to the scheduler for controlling your job. A qsub script can be very simple, consisting of a few commands, or very complex, depending on your needs.
Note that if you use qsub options in the job script, then any additional options you pass to qsub on the command line when you launch the script will override those same settings inside the script. For example, if your script specifies the UI queue with #$ -q UI, but you would like to submit to the development queue for prototyping, you can override the UI queue on the command line with qsub:
    qsub -q UI-DEVELOP <myscript.sh>
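For reference, a very simple job script with embedded scheduler options might look like the sketch below. The job name `example_job` is a placeholder, and the particular set of `#$` options is only an illustration rather than a recommended configuration; the Basic Job Submission page remains the authoritative guide. Outside the scheduler the `#$` lines are ordinary comments, so the script can be tried directly in a shell.

```shell
#!/bin/bash
# Lines beginning with "#$" are read by qsub as scheduler options.
#$ -N example_job     # job name (placeholder)
#$ -q UI              # queue; can be overridden with "qsub -q ..."
#$ -cwd               # run from the directory the job was submitted in
#$ -j y               # merge stderr into stdout

# NSLOTS and JOB_ID are set by SGE when the job runs. The defaults
# below are an assumption added so the script also works when run
# by hand outside the scheduler.
slots="${NSLOTS:-1}"
job_id="${JOB_ID:-none}"

summary="job ${job_id} using ${slots} slot(s) on $(uname -n)"
echo "$summary"
```

Submitting this script with `qsub -q UI-DEVELOP` would send it to the development queue even though the script itself names UI.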
Monitoring Your Job
Once you have launched your job using qsub, you will want to be able to monitor its progress on the cluster. There are a couple of simple ways to do this. First, you need to find out what the "jobid" is of your job. A jobid is a unique number assigned by SGE to each job. To get this information, you can use the qstat command like so:
...
    qstat -u aarenas
    job-ID  prior    name    user     state  submit/start at      queue                    slots  ja-task-ID
    -----------------------------------------------------------------------------------------------------------------
      44348 1.00692  BC80_9  aarenas  r      01/10/2019 11:21:50  IWA@argon-lc-h21-16.hpc  56
Note that the "job-ID" is in the leftmost column. Another column of note is the "state" column, which tells you what state your job is in. In this case the job is in the "r", or "running", state, which means it has been assigned a node and is running. The "queue" column indicates which queue instance (queue+host) the job is running on. The "slots" column tells how many slots the job has requested. Use the qstat -j <jobid> command to view additional details about your job (note the below is abbreviated output):
    $ qstat -j 8853637
    ==============================================================
    job_number:                 8853637
    exec_file:                  job_scripts/8853637
    submission_time:            Tue Jan 6 09:57:10 2015
    owner:                      naomi
    uid:                        1205679
    group:                      its-rs-neon
    gid:                        899998927
    sge_o_home:                 /Users/naomi
    sge_o_log_name:             naomi
    sge_o_path:                 <path information here>
    sge_o_shell:                /bin/bash
    sge_o_workdir:              /Users/naomi/jobs/espresso
    sge_o_host:                 neon-login-0-1
    account:                    sge
    cwd:                        /Users/naomi/jobs/espresso
    merge:                      y
    mail_options:               abes
    mail_list:                  naomi-hospodarsky@uiowa.edu
    notify:                     FALSE
    job_name:                   QE-CO2-Test-time
    jobshare:                   0
    hard_queue_list:            sandbox
    shell_list:                 NONE:/bin/bash
    env_list:                   <environment information here>
    script_file:                espresso-test.sh
    parallel environment:       16cpn range: 16
    usage 1:                    cpu=00:01:57, mem=34.41240 GBs, io=0.18064, vmem=N/A, maxvmem=5.972G
    scheduling info:            (Collecting of scheduler job information is turned off)
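The maxvmem figure on the usage line is a handy input when sizing a memory request. As a minimal sketch, the filter below pulls that value out of qstat -j style output; `max_vmem` is a name invented for this example, and the sample line is copied from the abbreviated output above.

```shell
#!/bin/bash
# Hypothetical filter: read "qstat -j" output on stdin and print the
# value that follows "maxvmem=" on the usage line.
max_vmem() {
    sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p'
}

# Demonstration using the usage line from the sample output.
sample='usage 1: cpu=00:01:57, mem=34.41240 GBs, io=0.18064, vmem=N/A, maxvmem=5.972G'
echo "$sample" | max_vmem    # prints 5.972G

# On the cluster this would typically be used as:
#   qstat -j <jobid> | max_vmem
```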
...
Two additional commands which may be useful are "qdel" for deleting jobs:
    $ qdel <jobid>          # deletes a job by jobid
    $ qdel -u <username>    # deletes all jobs owned by user
...
This was a high-level introduction to HPC, and there are many topics not covered by this wiki page. Our other wiki pages offer more detail on various aspects of the system and how to use it. We also offer consulting services, so if you have questions about using our resources or about HPC in general, or would simply like additional assistance getting started, please do not hesitate to contact our staff at research-computing@uiowa.edu, and one of us will be happy to help you.