Job metrics with Open XDMoD

An instance of Open XDMoD has been set up for HPC in Research Services, i.e., the Argon HPC cluster. The XDMoD manual can be accessed from within the app and is also linked here: Open XDMoD Manual. Historical data goes back to mid-2020 and represents the current state of Argon. The initial view is a public view that shows summary information. For more detailed information, log in via the "Sign In" button at the top left of the page. You must have an Argon account to access the detailed views. Once logged in, you will be able to apply many queries to the data.

There are a few things to note about XDMoD and how this implementation differs from the defaults.

  1. The default time period is set to the past 30 days as opposed to the previous month.
    This always displays a relatively current snapshot, but note that data is not updated in real time.
    1. Data is updated once per day, near the end of the day, so data is about a day behind.
    2. Data is only pulled in from the accounting file, which, in turn, is only updated when jobs end.
      This means that data for past days will change as jobs complete. For example, a job that takes 10 days to complete will not be reflected in the metrics for days 1-9 until its record is written on day 10.
  2. The "Jobs" dashboard panels and tab have been removed.
    The detailed jobs view is missing critical pieces of information for each job, so it is still necessary to use qacct for details of a completed job. In addition, the volume of jobs on Argon makes collecting detailed job records very slow. Since there is such a large performance penalty for information that is not really useful, the interface elements for individual jobs were removed.
  3. Utilization is calculated as a percentage of CPU hours.
    If you have been using the utilization reports that we generate, you will see a difference in % utilization. Those reports calculate utilization as the ratio of slots used to slots available, regardless of how much CPU time is used. Both metrics are useful, but keep in mind that utilization in XDMoD will tend to be lower than utilization as we have typically defined it. Utilization in terms of CPU hours will generally be lower because:
    1. MPI jobs should use only half of the number of slots (cores) requested.
    2. Jobs that request more slots to get more memory will leave CPU cores idle.
    3. Qlogin jobs tend not to use much CPU time, but they do allocate slots.
    4. XDMoD utilization is based on total CPU hours of the entire system; our in-house utilization reports are based on slots per queue.
  4. The PI (Principal Investigator) has been set equal to the queue names.
    This is mostly, though not entirely, accurate. It was the best way to handle the PI attribute since we have not set account or project attributes on any jobs. If you are the owner or manager of a queue, you can slice the data by the name(s) of the queue(s) you own or manage.
  5. GPU metrics are not captured.
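
To make the difference between the two utilization definitions in note 3 concrete, here is a minimal Python sketch. It follows the description above rather than XDMoD's internal implementation: the `Job` record, its field names, and the sample numbers are all hypothetical and chosen only for illustration. It also treats the system as a single pool of slots, whereas the in-house reports are computed per queue.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; the fields are illustrative, not XDMoD's schema."""
    slots: int          # slots (cores) allocated to the job
    wall_hours: float   # wall-clock duration in hours
    cpu_hours: float    # CPU time actually consumed, in hours


def slot_utilization(jobs, total_slots, period_hours):
    """Slot-hours allocated divided by slot-hours available (in-house style)."""
    allocated = sum(j.slots * j.wall_hours for j in jobs)
    return allocated / (total_slots * period_hours)


def cpu_hour_utilization(jobs, total_slots, period_hours):
    """CPU hours consumed divided by CPU hours available (as described for XDMoD)."""
    used = sum(j.cpu_hours for j in jobs)
    return used / (total_slots * period_hours)


# Made-up workload on an assumed 100-slot system over one 24-hour day.
example_jobs = [
    Job(slots=40, wall_hours=24.0, cpu_hours=480.0),  # MPI job using half its slots
    Job(slots=8, wall_hours=10.0, cpu_hours=10.0),    # extra slots requested for memory
    Job(slots=1, wall_hours=8.0, cpu_hours=0.5),      # mostly idle qlogin session
]

slot_pct = slot_utilization(example_jobs, total_slots=100, period_hours=24.0)
cpu_pct = cpu_hour_utilization(example_jobs, total_slots=100, period_hours=24.0)

print(f"slot-based utilization: {slot_pct:.1%}")      # 43.7%
print(f"CPU-hour-based utilization: {cpu_pct:.1%}")   # 20.4%
```

With this made-up workload, the same jobs produce a noticeably lower CPU-hour figure than slot figure, which is the pattern to expect when comparing XDMoD with the in-house reports.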