High throughput jobs, that is, computing work consisting of many small jobs, can put a strain on Helium's home directory (NFS) server. In some cases this results in failed jobs for the high throughput user and slow performance for other Helium users. To better inform high throughput users and ensure the best experience for all Helium users, the Helium systems administrators have created this best practices document.
Description of the problem:
The home directories are stored on a single NFS server. Typically, storing home directories on an NFS server does not present any problems; however, with certain high throughput workloads, it can. The issue arises when many different compute nodes try to create and access files located on the home directory server simultaneously. When the number of files becomes large enough, the home directory server falls behind, access becomes slow for all users, and jobs for the high throughput user start to fail. The job failure errors typically state either that a file or directory does not exist or that access to a file or directory was denied, even though the file/directory does exist and the user has the appropriate permissions to access it.
An example of the "does not exist" error a user received when a high throughput job failed:
failed changing into working directory: 05/02/2012 01:02:47 [21548:26444]:
error: can't chdir to /Users/username/subdirectory
Most common case:
The default behavior of the job scheduler (Sun Grid Engine) is to place redirected output files in the user's home directory (files ending in .onnnn and .ennnn, where nnnn is the job number). When a user starts many small jobs simultaneously on Helium, a large number of compute nodes attempt to create these .o and .e files and log the output and error streams of each job. The many NFS operations this entails place a very high load on the NFS server. This is the most common case, but not the only one that can trigger the problem.
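As a concrete illustration, here is a hypothetical array job submitted with no output options (the job name and task count are made up). Each task causes SGE to create a .o and a .e file in the home directory, so a large array job creates thousands of small NFS files:

```shell
#!/bin/sh
# Hypothetical SGE array job. With no -o/-e options, the scheduler
# writes myjob.o<jobid>.<taskid> and myjob.e<jobid>.<taskid> into
# the home directory -- 20,000 small NFS files for 10,000 tasks.
#$ -N myjob
#$ -t 1-10000
# Anything printed here ends up in those .o files:
msg="task ${SGE_TASK_ID:-1} on $(hostname)"
echo "$msg"
```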
Other cases:
Other cases involve jobs that access (read or write) many thousands of files in the home directory at the start or end of each job.
Mitigation strategy:
The problem can be mitigated by reducing the number of requests the compute nodes send to the home directory server. We have worked with several users to implement the following mitigation steps, with successful results.
Redirect standard out and standard error to either /dev/null or to /localscratch. If you don't care about the standard out and standard error streams, use /dev/null. If you do care about them, use /localscratch.
- To redirect to /dev/null, use the options -j y -o /dev/null in your job script(s). (If you already have the -o option defined, you will need to replace it.)
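The same options can also be given as embedded directives at the top of the job script. A minimal sketch (the directive lines are read by qsub; the job's real commands would follow them):

```shell
#!/bin/sh
# SGE embedded directives: merge stderr into stdout (-j y) and
# discard the merged stream, so no .o/.e files are created in
# the home directory.
#$ -j y
#$ -o /dev/null

# ... the job's actual commands go here ...
```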
- To redirect to /localscratch, use the option -o /localscratch. Your .o and .e files are written to /localscratch while the job is running; copy them back to your home directory once the job has finished.
If you are merging the output (with the -j y option), you only need to move one file back to your home directory at the end of the job. Add the line below to the end of your script. In the example below, all output is collected in a subdirectory of the home directory called "output", but you can move the files to whatever folder you like.
with merged output (-j y):
mv $SGE_STDOUT_PATH ~/output
If you are not merging output (you are not specifying the -j y option), also pass -e /localscratch so the .e file is written locally, and move both the .o and .e files back:
for non-merged output:
mv $SGE_STDOUT_PATH ~/output
mv $SGE_STDERR_PATH ~/output
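Outside a running job, /localscratch and $SGE_STDOUT_PATH are not available, but the end-of-job move can be sketched with temporary stand-ins (all paths here are hypothetical; on Helium, SGE sets $SGE_STDOUT_PATH itself and it would point under /localscratch):

```shell
# Stand-ins for the real locations:
outdir=$(mktemp -d)          # plays the role of ~/output
SGE_STDOUT_PATH=$(mktemp)    # plays the role of the merged .o file
echo "job log" > "$SGE_STDOUT_PATH"

# The end-of-job line from the step above:
mv "$SGE_STDOUT_PATH" "$outdir/"
```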
- If you are reading or writing a large number of files during your high throughput jobs, add code to your script to first copy them under /localscratch, work with them there, and then copy any results back to your home directory when the job completes. The details will be job-dependent; if you need help, let us know.
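That staging pattern can be sketched as follows, using mktemp stand-ins so the sketch runs anywhere (on Helium the scratch directory would live under /localscratch, and the file names are purely illustrative):

```shell
# Hypothetical stand-ins for home and node-local scratch:
home_in=$(mktemp -d)    # input files in the home directory
home_out=$(mktemp -d)   # results directory in the home directory
scratch=$(mktemp -d)    # on Helium: a directory under /localscratch
echo "input data" > "$home_in/input.txt"

cp "$home_in"/*.txt "$scratch/"        # one bulk copy from NFS at job start
cd "$scratch"
tr a-z A-Z < input.txt > result.txt    # heavy file I/O stays node-local
cp result.txt "$home_out/"             # one bulk copy back at job end
```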
Why this works:
The /localscratch area is space that is local to each node. Therefore, the load of constantly accessing and changing these files is distributed across all the compute nodes you are using instead of being focused on a single server.
/localscratch accessibility
Because /localscratch is local to each compute node, files in one compute node's /localscratch area will not be accessible by another node.
If you have any questions:
Please let us know if you have any questions or if you have suggestions on how we can improve this document.