High throughput jobs are high volume computing jobs whose typical computing duration is similar to or even shorter than the time required to receive, schedule, start, and stop them. In addition to the general considerations for high volume jobs, high throughput jobs can strain the capabilities of Argon's home directory servers. In some cases, this results in failed jobs among all high throughput jobs on the system, and slow performance for other users.

...

Home directories are stored on separate file servers and connected via network to login and compute nodes using NFS. This design works well for most cluster usage, but when many compute nodes simultaneously try to write, read, or delete files in home directories, access becomes slow for all users and some jobs fail. The job failure errors typically state either that a file or directory does not exist or that access to a file or directory was denied, even though the file/directory does exist and the user has the appropriate permissions to access it. For example:

An example of the "does not exist" error a user received when his high-throughput job failed:

failed changing into working directory:05/02/2012 01:02:47 [21548:26444]:
error: can't chdir to /Users/username/subdirectory

Some high throughput workloads can generate enough traffic to significantly increase the likelihood of this condition, depending on other users' job activity, or even cause it by themselves. Also, by definition, any high throughput workload sees a relatively higher incidence of its own jobs failing when the condition occurs.

...

Other cases involve reading or writing many files (sometimes thousands) at the start or end of each job, or even leaving many files open inside each job so that they are cleaned up suddenly at the same time when the function (or entire program, or entire job) ends. A high throughput workload containing many such jobs (sometimes thousands) can multiply these effects to problematic proportions very easily.

...

  1. Redirect standard output and standard error streams as follows:

    1. If you don't want to examine the streams, you can discard them both: join error into output and send output to /dev/null by using -j y -o /dev/null. (If your job script already defines the -o option, replace it with this one.)
    2. If you do want to examine one or both streams, first redirect the corresponding files into /localscratch while the job is running. Then at the end of the job, move the file(s) to network storage that is accessible outside your job, for later review or processing.

      To keep standard error, redirect the .e file to /localscratch using -e /localscratch. Otherwise join it into the .o file using -j y or discard it using -e /dev/null.
      To keep standard output, redirect the .o file to /localscratch using -o /localscratch. Otherwise discard it using -o /dev/null.

      At the end of the job, move whatever files you didn't discard off the compute node's /localscratch drive into a location you can access later by adding the line below to the end of your script. In this example, I am collecting the .o and .e files into a subdirectory under my home directory called "output", but you can move them wherever you prefer:

      For non-merged output:
      mv $SGE_STDOUT_PATH ~/output
      mv $SGE_STDERR_PATH ~/output
      

      Note: If you merge the error stream into the output stream (that is, you specified the -j y option), there will be no .e file, so you only need to move the .o at the end of the job.

  2. Do not load environment modules with each job or array task. Take advantage of the environment passing to jobs that is done by default. See Environment modules for HTC/HVC jobs for more information.
  3. If your high throughput jobs need to read or write a large number of files, add commands to your job script that first copy any input files to /localscratch. Then modify the location where each processing step or program in the job expects its input files and writes its intermediate and final result files, and add commands at the end of the job script to collect the result files onto network storage (typically your home directory) and clean up. The details depend on what your job needs to do, which programs it uses, and generally how it operates. If you would like further advice or assistance, please email us at research-computing@uiowa.edu.
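The recommendations above can be combined into a single job script. The following is a minimal sketch, not a definitive template: the directory names ($HOME/input, $HOME/output), the per-job scratch subdirectory naming, the helper functions stage_in, process, and stage_out, and the line-counting placeholder step are all assumptions made for illustration. It joins standard error into standard output, keeps the .o file on /localscratch while the job runs, and touches network storage only at the start and end of the job.

```shell
#!/bin/bash
#$ -j y
#$ -o /localscratch

# Hypothetical locations; adjust these to your own layout.
INPUT_DIR="${INPUT_DIR:-$HOME/input}"
RESULT_DIR="${RESULT_DIR:-$HOME/output}"
SCRATCH_BASE="${SCRATCH_BASE:-/localscratch}"

# Copy input files from network storage into a private scratch directory.
stage_in() {
    local src="$1" work="$2"
    mkdir -p "$work/in" "$work/out"
    cp -r "$src/." "$work/in/"
}

# Placeholder processing step (an assumption): count the lines of each input
# file. Replace this with your real program, reading from "$work/in" and
# writing to "$work/out".
process() {
    local work="$1" f
    for f in "$work"/in/*; do
        [ -f "$f" ] || continue
        wc -l < "$f" > "$work/out/$(basename "$f").count"
    done
}

# Collect result files onto network storage, then clean up scratch.
stage_out() {
    local work="$1" dest="$2"
    mkdir -p "$dest"
    cp -r "$work/out/." "$dest/"
    rm -rf "$work"
}

# Run the steps only inside a scheduled job, where the scheduler sets JOB_ID.
if [ -n "${JOB_ID:-}" ]; then
    work="$SCRATCH_BASE/$USER.$JOB_ID.${SGE_TASK_ID:-0}"
    stage_in "$INPUT_DIR" "$work"
    process "$work"
    stage_out "$work" "$RESULT_DIR"
    # Move the merged .o file off the compute node for later review.
    mv "$SGE_STDOUT_PATH" "$RESULT_DIR"
fi
```

Giving each job or array task its own scratch subdirectory keeps tasks that land on the same compute node from overwriting each other's files. To keep a separate .e file instead of merging, drop -j y, add -e /localscratch, and also move $SGE_STDERR_PATH at the end.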

...