
High throughput jobs, that is, computing work which consists of many small jobs, can put a strain on the home directory (NFS) servers. In some cases, this results in failed jobs for the high throughput user and slow performance for other users.

Description of the problem

Home directories are stored on separate file servers and connected via network to login and compute nodes using NFS. This design works well for most cases, but when many compute nodes simultaneously try to create, read, or delete files in home directories, access becomes slow for all users, and some jobs start to fail. The job failure errors typically state either that a file or directory does not exist or that access to a file or directory was denied, even though the file/directory does exist and the user has the appropriate permissions to access it; for example:

Info

title	An example of the "does not exist" error a user received when his high-throughput job failed

failed changing into working directory:05/02/2012 01:02:47 [21548:26444]:
error: can't chdir to /Users/username/subdirectory

Some high throughput workloads can generate enough traffic to significantly increase the likelihood of this condition, or even cause it by themselves. Also, because a high throughput workload runs so many jobs, it has a relatively higher incidence of its own jobs failing when the condition occurs.

Most common case

By default, the job scheduler (Sun Grid Engine) places redirected output files in the user's home directory (files ending in .onnnn and .ennnn, where nnnn is the job number). When a user starts many small jobs simultaneously, a large number of compute nodes will attempt to create .o and .e files and log the output and error streams from each job. The many NFS operations this entails create a very high load on the NFS server. This is the most common case, but not the only one which can trigger the problem.

Other cases

Other cases involve reading or writing many files (sometimes thousands) at the start or end of each job, or even leaving many files open inside each job so that they are cleaned up suddenly at the same time when the function (or program, or entire job) ends. A high throughput workload containing many such jobs can multiply these effects to problematic proportions very easily.

Mitigation strategy

The problem can be mitigated by reducing the number of requests sent by compute nodes to the home directory server. We have worked with several people to implement the following mitigation options, with successful outcomes.

Redirect the standard output and standard error streams as follows:
1. If you don't want to examine the streams, you can discard them both: join error into output and send output to /dev/null by using -j y -o /dev/null in your job script(s). (If you already have the -o option defined in your job script, use this instead.)
2. If you do want to examine one or both streams, first redirect the corresponding files into /localscratch while the job is running. Then at the end of the job, move the file(s) to network storage accessible outside your job for later review or processing.
  To keep standard error, redirect the .e file to /localscratch using -e /localscratch. Otherwise join it into the .o file using -j y or discard it using -e /dev/null.
  To keep standard output, redirect the .o file to /localscratch using -o /localscratch. Otherwise discard it using -o /dev/null.
  At the end of the job, move whatever files you didn't discard off the compute node's /localscratch drive into a location you can access later by adding lines like the ones below to the end of your script. In this example, I am collecting the .o and .e files into a subdirectory under my home directory called "output", but you can move them wherever you prefer:
  Code Block
  title for non-merged output
  mkdir -p ~/output
  mv "$SGE_STDOUT_PATH" ~/output/
  mv "$SGE_STDERR_PATH" ~/output/
  Note: If you merge the error stream into the output stream (that is, you specified the -j y option), there will be no .e file, so you only need to move the .o file at the end of the job.
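As an illustration of the stream redirection options above, here is a minimal sketch of a job script that keeps both streams on /localscratch and collects them when the job ends. The helper name collect_streams and the ~/output destination are assumptions made for this example; the -o and -e options and the SGE_STDOUT_PATH and SGE_STDERR_PATH variables are the ones described above, and JOB_ID is the job number variable the scheduler sets inside a running job.

```shell
#!/bin/bash
#$ -o /localscratch
#$ -e /localscratch

# collect_streams DEST (hypothetical helper): move this job's .o and .e
# files from node-local scratch into the directory DEST.
collect_streams() {
    local dest="$1"
    local f
    mkdir -p "$dest" || return 1
    for f in "${SGE_STDOUT_PATH:-}" "${SGE_STDERR_PATH:-}"; do
        # Skip unset paths, streams discarded to /dev/null (not a regular
        # file), and the .e file that does not exist when -j y is used.
        if [ -n "$f" ] && [ -f "$f" ]; then
            mv "$f" "$dest"/
        fi
    done
}

# Only act when running under the scheduler, which sets JOB_ID.
if [ -n "${JOB_ID:-}" ]; then
    # ... the job's real commands go here ...
    collect_streams "$HOME/output"   # assumed destination, as in the example above
fi
```

Because the helper checks each path before moving it, the same script should keep working if you later switch to -j y or send one stream to /dev/null.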
If your high throughput jobs need to read or write a large number of files, you should add commands to your job script to first copy any input files to /localscratch, modify accordingly the location where each processing step or program in the job expects input files and writes its intermediate and final result files, and add commands at the end of the job script to clean up and collect result files onto network storage (typically your home directory). The details depend on what your job needs to do, which programs it uses, and generally how it operates. If you would like further advice or assistance, please email us at research-computing@uiowa.edu.
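The staging pattern described above could look roughly like the following sketch. Everything specific in it is an assumption for illustration: the helper names stage_in and stage_out, the per-job directory layout under /localscratch, the ~/myproject paths, and the placeholder my_program. Adapt all of these to your own job.

```shell
#!/bin/bash

# stage_in SRC_DIR WORK_DIR (hypothetical helper): copy input files from
# network storage into a working directory on node-local scratch.
stage_in() {
    [ -d "$1" ] || return 1
    mkdir -p "$2" && cp -R "$1"/. "$2"/
}

# stage_out WORK_DIR DEST_DIR (hypothetical helper): copy result files back
# to network storage, then remove the scratch copy.
stage_out() {
    [ -d "$1" ] || return 1
    mkdir -p "$2" && cp -R "$1"/. "$2"/ && rm -rf "$1"
}

# Only act when running under the scheduler, which sets JOB_ID.
if [ -n "${JOB_ID:-}" ]; then
    work="/localscratch/$USER/$JOB_ID"               # assumed layout
    stage_in "$HOME/myproject/input" "$work/input"   # assumed input location
    mkdir -p "$work/results"
    # Point your program at the scratch copies, for example:
    # my_program "$work/input" "$work/results"       # placeholder command
    stage_out "$work/results" "$HOME/myproject/results"
    rm -rf "$work"                                   # clean up scratch space
fi
```

With this arrangement the home directory server sees one bulk copy at the start and one at the end of each job, rather than every individual file operation the job performs.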

Why this works

The /localscratch area is space that is local to each node. Therefore, the load of constantly accessing and changing these files is distributed onto the compute nodes running your jobs instead of being focused on a single network server.

Tip

title	/localscratch accessibility

Because /localscratch is local to each compute node, files in one compute node's /localscratch area will not be accessible by another node.
