What are array jobs
Array jobs are typically thought of as a series of identical, or nearly identical, tasks that are run multiple times as part of a job. Each array task is assigned an index number, SGE_TASK_ID. This ID is used to differentiate the tasks, and the value of the variable can be used for various purposes as part of the job. In the simplest case, the ID is just used, along with the JOB_ID, in the names of the stdout/stderr files.
Each task in the array will inherit the same resource requests and attribute allocations as if it were entered independently as a Batch Job. These tasks will be run concurrently, provided enough resources are available. Array jobs are run by adding the following option to the qsub command (or as a #$ directive in the job script):
-t n[-m[:s]]
Where n=the lowest index number, m=the highest index number, and s=the step size. m and s are optional, which means that you could enter a single number (n), a simple range (n-m), or a range with step size (n-m:s).
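For example, assuming a job script named my_job.sh (the name is just a placeholder), all of the following are valid submissions:
qsub -t 1-100 my_job.sh
qsub -t 1-100:10 my_job.sh
qsub -t 50 my_job.sh
The first runs tasks 1 through 100, the second runs tasks 1, 11, 21, and so on up to 91, and the third runs a single task with index 50.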
Generating Array Jobs
There are essentially two ways of generating array jobs, or more accurately two types of array jobs that dictate how they are generated.
Natural Array Jobs
The first type is what will be referred to as a natural array job, or one that does not require any special handling to submit. For example, say that you want to run 100 simulations with your program using the same input file. Assume that the program generates a random seed and you do not care what the seed is, because you only care about the distribution metrics from the population of simulation results. You could have a job script that looks like:
#$ -q all.q
#$ -pe smp 20
#$ -cwd
# run simulation
my_prog -i my_input
You would launch an array job with:
qsub -t 1-100:1
That would launch a single job with 100 array tasks, with the output for each going to $JOB_ID.$SGE_TASK_ID.
What if you decided that you wanted to run 100 simulations with 100 slightly different input files? That will require a bit more work. Presumably, you could generate the input files via a script (a sketch of one follows this example), but when generating them, remember that the $SGE_TASK_ID variable can be used. So, if the input files are named in a sequence such as
my_input_1
my_input_2
...
my_input_100
your job file could look like
#$ -q all.q
#$ -pe smp 20
#$ -cwd
# run simulation
my_prog -i my_input_$SGE_TASK_ID
Again, the job submission would be
qsub -t 1-100:1
Each array task would use the input file referenced with the indexed value of $SGE_TASK_ID.
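A sketch of the input-generating script mentioned above is shown below. It assumes a hypothetical template file named my_input_template containing a placeholder string PARAM that is varied between runs; both names are illustrative only, not part of any provided tooling:
#!/bin/bash
# generate 100 numbered input files from a template (template name and
# PARAM placeholder are assumed for illustration)
for i in {1..100}
do
    sed "s/PARAM/$i/" my_input_template > my_input_$i
done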
Things can get complicated pretty quickly, so natural array jobs are of limited use. It would always be possible to maintain an index file mapping each task index to the parameters of its job, but there is a better way.
Array jobs from a task file
Once the limits are reached for indexing jobs easily, another technique is needed. The question is how to use an index even when the index number does not map neatly onto a file name. The answer is to use a two-step process:
- create a task file where each line represents a command to run
- create a job script around the task file to process it, line by line
Each line of the task file is incorporated into one array task of the job. The processing of the task file looks very similar for all jobs, so having a tool to handle it is useful; we provide one, called qbatch. Generating the task file, however, depends on the jobs and is up to each user, but it would generally be done with a script.
Generating a task file
Expanding on the simulation example above, suppose two parameters need to be set to different values. For instance, say we want to vary p1 and p2 over 10 values each (100 combinations in total) and want to submit all of the runs as a single array job. A loop such as the following could be used to generate a task file.
#!/bin/bash
# clear task list
cat /dev/null > my_taskfile
for p1 in {1..10}
do
    for p2 in {1..10}
    do
        echo "my_prog -i my_input -p1 $p1 -p2 $p2 >my_output_p1-${p1}_p2-${p2}" >> my_taskfile
    done
done
The example redirects stdout but that is not strictly necessary.
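The first few lines of the resulting my_taskfile would look like the following, with 100 lines in total, one per parameter combination:
my_prog -i my_input -p1 1 -p2 1 >my_output_p1-1_p2-1
my_prog -i my_input -p1 1 -p2 2 >my_output_p1-1_p2-2
my_prog -i my_input -p1 1 -p2 3 >my_output_p1-1_p2-3
...
my_prog -i my_input -p1 10 -p2 10 >my_output_p1-10_p2-10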
Creating the job script
The job script in this case will not specify the command for the computation, as that is in the task file. Instead, the job script will use the value of $SGE_TASK_ID and correlate it with the line numbers of the task file. The command line for that particular array task is captured and run when the queue system launches the task. The details of generating the job script will not be covered here, as we provide the qbatch tool to handle them.
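For reference, the core of such a job script can be sketched in a few lines. This is only an illustration of the technique, not the script that qbatch actually generates; my_taskfile is the task file created above:
#$ -q all.q
#$ -cwd
# select the line of the task file whose number matches this array task's index
CMD=$(sed -n "${SGE_TASK_ID}p" my_taskfile)
# run that command line (eval is used so any redirection in the line takes effect)
eval "$CMD"
A script like this would be submitted with qsub -t 1-N, where N is the number of lines in the task file.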
Using qbatch
The qbatch program is a tool to submit a list of commands in a file (a task file) to a queue. The official documentation can be found in README.md of the pipitone/qbatch repository on GitHub.
Some of the options of qbatch need a bit of explanation to make the best use of it on University of Iowa HPC clusters. There are defaults set for qbatch such that, in many cases, you could just execute the following:
qbatch my_taskfile
That will create a job script file in the .qbatch folder of the working directory and submit it as an array job. The job will request the all.q queue, using the name of the task file as the job name. It will set the current working directory and send stdout/stderr to the logs directory. Options for the queue, the parallel environment, the number of slots, the job name, and some other settings can be specified with arguments. It is also possible to specify qsub options and pass those on to the eventual call to qsub.
Default settings for qbatch
The relevant default settings for qbatch on Argon are:
Processors per job: QBATCH_PPJ=1
Chunksize: QBATCH_CHUNKSIZE=1
Cores: QBATCH_CORES=1
System: QBATCH_SYSTEM=sge
SGE_PE: QBATCH_SGE_PE=smp
Queue: QBATCH_QUEUE=all.q
Those, and other settings, can be changed either via environment variables or on the command line. You should not change the system setting.
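For example, to request more processors per job for a single submission without changing any configuration files, the relevant variable could be set on the command line (16 is just an example value):
QBATCH_PPJ=16 qbatch my_taskfile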