...
Once your data is uploaded to the cluster, you are ready to submit a job. If you are going to use one of our centrally installed software packages, you'll need to load its module into your environment. More in-depth information on this is on our Environment Modules page. Basically, you'd use the command
module avail
to list the modules available to choose from. Then you'd use
module load <module-name>
to load that module into your environment. Note that some modules are not compatible and will not load together. For example, openmpi_intel_1.4.1 will only work with intel_11.1.072, so attempting to load a newer intel module will fail. The environment module system is aware of most conflicts, however, and will automatically load the correct dependent modules. You may also use
module show <module-name>
to see what other modules will be loaded along with it, and also what modifications it will make to your environment.
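For example, using the module names mentioned above (the exact modules available vary by cluster), a session on a login node might look like this:

module avail
module load openmpi_intel_1.4.1
module list
module show openmpi_intel_1.4.1

Here, module list simply confirms which modules, including any automatically loaded dependencies such as intel_11.1.072, are currently active in your environment.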
...
Another factor to consider is how many resources you need to request from the cluster for your job. In High Performance Computing, resources are parceled out in units called "slots". A slot is a combination of a CPU and an amount of RAM, determined by the memory available on the nodes where your job will run. Each of our clusters contains different node types, defined by the number of cores and the amount of RAM each offers, and slots are defined accordingly. For example, Neon has 3 node types: standard (16 cores, 64G), medium memory (16 cores, 256G), and high memory (24 cores, 512G). On a standard node with 64G of memory, a slot is 1 CPU & 4G RAM, while on a medium memory (256G) node, a slot is 1 CPU & 16G RAM. More detailed information on slots is available in the /wiki/spaces/hpcdocs/pages/76514711. Once you have an idea of how many processors and/or how much memory your computation will need, you can use this information to calculate how many slots to request for your job.
For example, if your computational problem is to process data from thousands of large image files, you'd need to first figure out how much memory is required to process one file, and extrapolate accordingly. If processing each image requires 2G RAM, and a standard node offers 4G per slot, you could request one slot for each image on Neon. On Helium, slots are smaller, so you'd need to request 2 slots per image.
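The general rule is to divide the memory one task needs by the memory a slot provides and round up. A minimal sketch of that arithmetic in the shell, using the Neon standard-node numbers from the example above (the variable names are purely illustrative):

MEM_PER_TASK_G=2       # RAM (in G) needed to process one image
MEM_PER_SLOT_G=4       # RAM (in G) per slot on a Neon standard node (64G / 16 cores)
SLOTS_PER_TASK=$(( (MEM_PER_TASK_G + MEM_PER_SLOT_G - 1) / MEM_PER_SLOT_G ))   # ceiling division
echo "$SLOTS_PER_TASK slot(s) per image"   # prints: 1 slot(s) per image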
You may find that running small prototyping jobs is necessary in order to come up with an accurate resource request. For this, both ITS-RS clusters offer a small "sandbox" queue where you may run small versions of your jobs, or run interactively, in order to get an idea of how your jobs will behave on the clusters.
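For example, SGE's qlogin command can be used to request an interactive session for prototyping; assuming the queue is named sandbox as described above, the request might look like:

qlogin -q sandbox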
...
Our clusters use the SGE scheduler to match job submissions with available resources. There is extensive documentation on using SGE and all the options available. We offer pages on Basic Job Submission and Advanced Job Submission for both our clusters. Launching jobs is done via qsub, with options given on the command line or as special comments in the job submission script, which are then passed to the scheduler to control your job. A qsub script can be very simple, just a few commands, or very complex, depending on your needs.
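As a rough sketch, a minimal job script might look like the following; the parallel environment name, e-mail address, and program name are placeholders, so consult the Basic Job Submission page for the options appropriate to your cluster:

#!/bin/bash
#$ -q UI                  # queue to submit to
#$ -pe smp 16             # parallel environment request for 16 slots (PE name is cluster-specific)
#$ -cwd                   # run the job from the directory it was submitted from
#$ -M hawkid@uiowa.edu    # placeholder address for job status e-mail
#$ -m bea                 # send mail at the beginning, end, and abort of the job

module load <module-name>
./my_program              # placeholder for the command you actually want to run

You would then submit it with qsub <scriptname>.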
You may also forgo the use of a script entirely and simply use the qsub command with the desired options on the command line. Note that if you use options in the job script, any additional options you pass to qsub on the command line when you launch the script will override those same settings inside the script. For example, if your script specifies the UI queue with #$ -q UI
, but you would like to do a submission to the sandbox queue for prototyping, you can override the UI queue on the command line with qsub:
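Something along these lines would do it (the script name is a placeholder, and this assumes the prototyping queue is named sandbox):

qsub -q sandbox myscript.job

The -q sandbox given on the command line takes precedence over the #$ -q UI line inside the script.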
...
Once you have launched your job using qsub, you will want to be able to monitor its progress on the cluster. There are a couple of simple ways to do this. First, you need to find out the "jobid" of your job, a unique number assigned by SGE to each job. To get this information, you can use the qstat command like so:
qstat -u <username>
which will produce the following output:
[leslater@neon-login-0-1 ~]$ qstat -u leslater
job-ID  prior    name       user      state  submit/start at      queue                           slots  ja-task-ID
--------------------------------------------------------------------------------------------------------------------
 189293  0.50223  CNRM       leslater  r     11/30/2016 11:24:12  UI@neon-compute-7-34.local        1
 194873  0.50217  BCCCSM1    leslater  r     11/30/2016 13:11:53  UI@neon-compute-5-27.local        1
 379167  0.50094  CNRM_380   leslater  r     12/02/2016 10:03:07  all.q@neon-compute-2-34.local     1
Note the "job-ID" is in the leftmost column. Other columns of note are the "state" column, which tells you what state your job is in. In this case the job is in "qwr" or "queue waitrunning" state, which means it is waiting to launchhas been assigned a node and is running. The "queue" column in this case is blank as the job has not yet launched in any queue. Once the job launches, the queue will be listedindicates which queue instance (queue+host) the job is running on. The "slots" column tells how many slots the job is requestingrequested. Use Use the qstat -j <jobid> command to view additional details about your job (note the below is abbreviated output):
...
The above information gives an overview of how your job looks to the scheduler. You can see job submission & start times, queue requests, slot requests, and the environment loaded at the time of job submission. One of the most useful lines in this output, however, is the "usage" line, which shows the peak resource usage of your job. Pay special attention to "maxvmem", as this is the peak memory used by your job up to that point; you can use this information to help determine whether you have requested enough resources for your job.
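For a quick check on a running job, you might filter the qstat -j output down to just that line (the jobid is a placeholder):

qstat -j <jobid> | grep usage

If maxvmem is approaching the total memory your slot request provides, consider requesting more slots, or a larger-memory node type, the next time you submit.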
...
This was a high-level introduction to HPC, and there are many topics not covered on this page. Our other wiki pages offer much more in-depth detail on various aspects of our systems and how to use them. We also offer consulting services, so if you have questions about using our resources or HPC in general, or would simply like additional assistance getting started, please do not hesitate to contact our staff at hpc-sysadmins@iowa.uiowa.edu, and one of us will be happy to help you.