
Using the Basic Job Submission and Advanced Job Submission pages as a reference, how would one submit jobs taking HT into account? For single-process, high-throughput jobs it probably does not matter; just request one slot per job. For multithreaded or MPI jobs, request one job slot per thread or process. So if your application runs best with 4 threads, then request something like the following.
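A sketch of such a request, assuming the smp parallel environment used elsewhere on this page, with the slot count doubled so the HT siblings of each core stay free (job.sh and my_app are hypothetical names):

```shell
# request 8 slots (2x the 4 ranks) so the 4 MPI ranks land on physical cores
qsub -pe smp 8 job.sh

# inside job.sh, launch only 4 ranks:
mpirun -np 4 ./my_app
```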

...

That would run the 4 MPI ranks on physical cores and not HT cores. To help ensure the binding, you can set the SGE core binding strategy to linear. This strategy binds each newly launched process to the next available core. Since HT cores are mapped after all physical cores, this fills the actual cores first. Once the physical cores are used, as they will be because the number of slots is 2x the number of cores, the HT cores are effectively blocked.
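As a sketch, the linear strategy can be requested with qsub's -binding option (available in SGE 6.2u5 and later; the counts match the 4-rank example discussed above, and job.sh is a hypothetical script):

```shell
# request 8 slots and ask SGE to bind 4 cores linearly,
# starting from the first free physical core
qsub -pe smp 8 -binding linear:4 job.sh
```

Because core binding is a soft request (see below), this is a hint rather than a guarantee.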

Info

If the MPI job will run across multiple nodes then add the --map-by node flag to mpirun to distribute the processes evenly across nodes.

mpirun --map-by node ...

Alternatively, specify a more explicit mapping/binding strategy.
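For example, Open MPI's --map-by and --bind-to options allow an explicit layout (a sketch; ppr:2:socket means two processes per socket, and ./my_app is a placeholder):

```shell
# place 2 ranks on each socket and bind each rank to a physical core
mpirun --map-by ppr:2:socket --bind-to core ./my_app
```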

Note that this will work for non-MPI jobs as well. If you have a non-threaded process that you want to ensure runs on an actual core, you could use the same 2x slot request.

No Format
qsub -pe smp 2

Note that if you do not use the above strategy then it is possible that your job process will share cores with other job processes. That may be okay, and preferred for high throughput jobs, but is something to keep in mind. It is especially important to keep this in mind when using the orte parallel environment. There is more discussion on the orte parallel environment on the Advanced Job Submission page. In short, that parallel environment is used in node sharing scenarios, which implies potential core sharing as well. For MPI jobs, that is probably not what you want. As on previous systems, there is a parallel environment (56cpn) for requesting entire nodes. This is especially useful for MPI jobs to ensure the best performance.
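For instance, a whole-node request through the 56cpn parallel environment might look like the following (a sketch; the slot count of 56 is an assumption based on the environment's name, i.e. 56 hardware threads per node, and job.sh is a hypothetical script):

```shell
# request an entire node's worth of slots via the 56cpn parallel environment
qsub -pe 56cpn 56 job.sh
```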

Note that core binding is a soft request in SGE. If the binding cannot be done, the job will still run, provided it otherwise has the resources. This is particularly true on machines where jobs share nodes, since all of the actual cores can be bound while slots remain available. The only way to guarantee binding is with dedicated nodes. However, core binding in and of itself may not boost performance much. Generally speaking, if you want to minimize contention with hardware threads, simply request twice the number of slots as cores your job will use. Even if the processes are not bound to cores, the OS scheduler will do a good job of minimizing contention.

...


For MPI jobs, the system-provided openmpi will not bind processes to cores by default, as would otherwise be the normal openmpi default. It is set this way to avoid inadvertently oversubscribing cores. In addition, the system openmpi settings will treat the HT cores as processors, which may be important to note, and will map processes by socket. This should give a good process distribution in all cases. However, if you wish to run hybrid MPI/OpenMP threaded jobs,

...

use fewer than 28 processes per node in an MPI job then you may want to map by node to get the most even distribution of processes across nodes. You can do that with the --map-by node flag to mpirun.

No Format
mpirun --mca hwloc_base_use_hwthreads_as_cpus false

or set the equivalent environment variable in the job script

No Format
export OMPI_MCA_hwloc_base_use_hwthreads_as_cpus=false

...

-map-by node ...

If you wish to control mapping and binding in a more fine-grained manner, the default mapping and binding parameters can be overridden with parameters to mpirun. Openmpi provides many options for fine-grained control of process layout. The options that are set by default should be good in most cases but can be overridden with the openmpi options for
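For instance, Open MPI's --report-bindings flag prints the resulting layout on stderr, which is useful when experimenting with overrides (a sketch; ./my_app is a placeholder):

```shell
# map one rank per physical core, bind each rank to its core,
# and print the resulting bindings so the layout can be verified
mpirun --map-by core --bind-to core --report-bindings ./my_app
```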

...

If you set your own binding, for instance --bind-to core, be aware that the number of cores is half the number of total HT processors. Note that core binding in and of itself may not boost performance much. Generally speaking, if you want to minimize contention with hardware threads, simply request twice the number of slots as cores your job will use. Even if the processes are not bound to cores, the OS scheduler will do a good job of minimizing contention.
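The 2x rule above can be written directly into a job script (a sketch; CORES is a hypothetical variable holding the number of physical cores your job will actually use):

```shell
# physical cores the job will actually use
CORES=4
# request double that many slots so the matching HT siblings stay free
SLOTS=$((2 * CORES))
echo "requesting $SLOTS slots for $CORES cores"
```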

If your job does not use the system openmpi, or does not use MPI, then any desired core binding will need to be set up with whatever mechanism the software uses; otherwise, there will be no core binding. Again, that may not be a major issue. If your job does not work well with HT, then run on a number of cores equal to half the number of slots requested and the OS scheduler will minimize contention.
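If the software has no binding mechanism of its own, a generic Linux tool such as taskset can pin a non-MPI process to specific cores (a sketch; the core list and ./my_app are placeholders, and core numbering assumes physical cores come before HT siblings as described above):

```shell
# pin the process to physical cores 0-3 (their HT siblings stay free)
taskset -c 0-3 ./my_app
```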

New SGE Utilities

While SoGE is very similar to previous versions of SGE, there are some new utilities that people may find of interest. There are manual pages for each of these.

...