High Volume Jobs

High Volume jobs are a superset of High Throughput jobs. The difference is that High Volume jobs may be long-running; what sets them apart is simply that there are a lot of them. They are usually submitted to the cluster as a large group, typically by a script that runs qsub repeatedly in a loop. Occasionally, the SGE queue system has become unresponsive when many jobs of this type were submitted at the same time. This shows up as SGE commands taking a long time to respond, or even as timeouts of the form:

error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).

Without going into technical details, SGE has more work than it can complete in the time it allots to each scheduling cycle. It is natural to write a script that submits 100 or 1,000 jobs, but when several such scripts run at the same time, the rate of job submission overwhelms SGE. SGE spends most of its cycles processing the incoming submissions, but eventually it has to break away to schedule jobs onto the system. A high submission rate also produces a large number of pending jobs that must be considered for each scheduling decision, so the scheduler thread takes longer to complete. Meanwhile, more jobs keep arriving at a high rate, and a snowball effect begins that eventually leads to timeouts and failed commands, including qsub. SGE prioritizes scheduling over other events, so jobs are still scheduled even while commands time out. From a user's perspective, however, the system is not working, and indeed it is not working as it should.
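The feedback loop described above can be illustrated with a deliberately simplified toy model. It is not a description of SGE's internals: the budget of 100 time units per cycle, the cost of one unit per accepted submission, the cost of one unit per 10 pending jobs for the scheduling pass, and the 20 jobs dispatched per cycle are all hypothetical numbers chosen only to show the shape of the effect. Scheduling runs first in each cycle, mirroring the statement that SGE prioritizes scheduling.

```python
def simulate(submit_rate, cycles, budget=100, sched_per_10_pending=1, dispatch=20):
    """Toy model of a scheduler daemon (hypothetical numbers, not SGE internals).

    Each cycle the daemon has `budget` time units. The scheduling pass runs
    first and costs `sched_per_10_pending` units per 10 pending jobs; every
    accepted submission then costs 1 unit of whatever time is left. Up to
    `dispatch` pending jobs start running per cycle.
    Returns a list of (waiting_requests, pending_jobs, accepted) per cycle.
    """
    waiting = 0   # submission requests not yet processed (the ones that time out)
    pending = 0   # accepted jobs the scheduler must consider every cycle
    history = []
    for _ in range(cycles):
        waiting += submit_rate
        # Scheduling cost grows with the number of pending jobs.
        sched_time = min(budget, (pending // 10) * sched_per_10_pending)
        # Only the time left over is available for processing submissions.
        accepted = min(waiting, budget - sched_time)
        waiting -= accepted
        pending += accepted
        # Some pending jobs are dispatched to run and leave the queue.
        pending -= min(pending, dispatch)
        history.append((waiting, pending, accepted))
    return history


if __name__ == "__main__":
    for rate in (10, 200):
        hist = simulate(rate, cycles=30)
        for cycle in (0, 9, 29):
            waiting, pending, accepted = hist[cycle]
            print(f"rate={rate:3d} cycle={cycle + 1:2d} "
                  f"accepted={accepted:3d} pending={pending:4d} waiting={waiting:5d}")
```

In this model, a submission rate of 10 per cycle leaves no backlog at all. At 200 per cycle, the number of submissions accepted per cycle shrinks as the pending queue grows, while the count of waiting requests rises every cycle without bound, which is the toy model's analogue of qsub calls timing out.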

The only solution is to reduce the rate of job submission. The best way to do that is to use array jobs, which can reduce, for example, 1,000 job submissions to a single submission. If the jobs do not have complex dependencies, it is usually possible to create a task file that lists the computations and to submit that task file as an array job. To make it easier to convert scripts that submit many individual jobs into scripts that create and submit a task file, we have added a new tool called "qbatch" and expanded the documentation for array jobs, including how to use qbatch.
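A minimal sketch of the task-file approach follows, written in Python for illustration only. It is not qbatch, and it does not reproduce qbatch's options. It assumes the jobs are independent shell commands. The file name `tasks.txt`, the program `./my_analysis`, and the job script `run_task.sh` (standing for a script that calls `run_current_task("tasks.txt")`) are hypothetical. The only SGE features it relies on are the `qsub -t 1-N` array-job option and the `SGE_TASK_ID` environment variable, which SGE sets to each task's 1-based index.

```python
import os
import shlex
import subprocess
from pathlib import Path


def write_task_file(commands, path):
    """Write one shell command per line and return the number of tasks."""
    lines = [c.strip() for c in commands if c.strip()]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)


def select_task(path, task_id):
    """Return the command for a 1-based array task index (as in SGE_TASK_ID)."""
    lines = Path(path).read_text().splitlines()
    if not 1 <= task_id <= len(lines):
        raise IndexError(f"task id {task_id} out of range 1-{len(lines)}")
    return lines[task_id - 1]


def array_submit_args(n_tasks, job_script):
    """Argument list for ONE qsub call that covers tasks 1..n_tasks."""
    return ["qsub", "-t", f"1-{n_tasks}", job_script]


def run_current_task(task_file):
    """Inside each array task: run the line selected by SGE_TASK_ID."""
    task_id = int(os.environ["SGE_TASK_ID"])
    cmd = select_task(task_file, task_id)
    return subprocess.call(cmd, shell=True)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 independent runs of an example program.
    commands = [f"./my_analysis --input sample_{i}.dat" for i in range(1, 1001)]
    n = write_task_file(commands, "tasks.txt")
    # One submission instead of 1,000 (printed here, not executed).
    print(shlex.join(array_submit_args(n, "run_task.sh")))
```

The loop that used to call qsub once per job now only appends a line to a file, so the scheduler receives a single submission no matter how many tasks the file contains.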

If you submit a large number of jobs through a script, please review the "Array jobs: with and without a task file" wiki page and convert your submission scripts to submit a task file as an array job. This will make job submission much faster for you and help keep the system responsive for everyone else. We can help with the conversion and answer any questions you may have.