The HPC cluster systems use the Sun Grid Engine (SGE) queue scheduler. The feature of a queue scheduler that users interact with most is job submission. The SGE manual pages are very good and should be consulted for details. For this particular topic, the qsub manual page is the authoritative source.
    man qsub
This document provides a brief introduction to the most common options used to submit jobs to the SGE system. It focuses on single processor jobs, as that is the most basic case, though not necessarily the most common. Details on the submission of parallel jobs are covered in Advanced Job Submission.
...
render.sh would then be run 4 times, each with the default allocation of resources, with the input file corresponding to the basename + index number.
Exit status
Every process, and therefore every job, has an exit status. If the job completes normally, the exit status is that of the computation process. If the command instead exits because it received a signal, the exit status is 128 plus the numerical value of the signal. This is common for jobs running in the all.q queue, where a job can be evicted. A TERM signal is sent to a job when it is evicted. The numerical value of the TERM signal is 15, so the exit status of the job would be 128 + 15 = 143. Note that in some cases a TERM signal is not sufficient to remove a job, and a follow-up KILL signal has to be sent. The job exit status would then be 128 + 9 = 137.
A job will also exit with status 143 when it hits a memory limit. So if your job has an exit status of 143 and it was not running in the all.q queue, it probably hit the memory limit. This can be confirmed by examining the accounting record of the job with qacct.
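The 128 + signal arithmetic described above can be checked with a small shell sketch. The function name describe_exit_status is hypothetical and is not part of SGE. The sketch names only the two signals discussed here (TERM and KILL) and labels any other signal number as unknown.

```shell
#!/bin/sh
# Sketch: interpret an exit status using the 128 + signal convention.
# describe_exit_status is a hypothetical helper, not an SGE command.
describe_exit_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "completed normally"
    elif [ "$code" -gt 128 ]; then
        # Statuses above 128 encode the signal number as (status - 128).
        sig=$((code - 128))
        case "$sig" in
            15) name=TERM ;;
            9)  name=KILL ;;
            *)  name=unknown ;;
        esac
        echo "terminated by signal $sig ($name)"
    else
        echo "exited with error status $code"
    fi
}

# Demonstration: a child shell that sends itself TERM exits with 128 + 15.
rc=0
sh -c 'kill -TERM $$' || rc=$?
describe_exit_status "$rc"
```

For a finished job, the status to pass to such a helper would typically come from the exit_status field of the accounting record, for example as reported by qacct -j followed by the job ID.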