This page presents a list of Frequently Asked Questions (and answers) about using the HPC cluster system at the University of Iowa.

  1. Why is my home account full even after I delete data from it?

    This occurs because the cluster takes snapshots of your home account to protect against accidental deletion or corruption. As a result, the space used by a deleted file is not actually freed while a snapshot still references it.

    Info

    Note that snapshots are not backups as they will expire.

    The snapshots are not an issue for most users but can become problematic if you are running an application that writes and deletes a lot of data in your home account. There are a few different potential solutions to this problem.

    1. If you think this is a problem that won't occur frequently, contact the sysadmin team and ask them to delete the offending snapshots.
    2. For applications that read and write a lot of data in a directory, it is preferable to use the scratch file systems (/nfsscratch). The scratch file systems are not backed up, but they are faster than your home account. Using them is also highly preferable from a system citizenship perspective, because less bandwidth is available to the home accounts than to the scratch file systems.
    3. If a scratch volume does not work for your job for some reason, another option is /localscratch, which is the local hard drive of a compute node. This will likely also be faster than your home account and does not have snapshots. Note: This will only work if the data does not need to be available to multiple nodes at once.

    4. If you can't migrate to one of the cluster scratch volumes or /localscratch, the sysadmin team can decrease the frequency of the snapshots on your home account. This can mitigate the problem if you have an idea of the write intervals of temporary data for your jobs. For instance, if all of your jobs finish in a few hours, we can snapshot your directory only once per day, at a time that makes sense for your typical utilization patterns.
    5. If none of the above are viable, the sysadmin team can turn off snapshots completely on your home account.

      Warning

      This means that accidental deletion or corruption of files WILL NOT be recoverable by the sysadmin team!
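
As a rough illustration of options 2 and 3, a job script can create a private working directory on a scratch file system and keep its temporary data there instead of in the home account. This is only a sketch: the function name make_workdir and the per-user layout /nfsscratch/$USER are assumptions made for the example, not something the cluster provides or guarantees.

```shell
# Hypothetical helper (not a cluster-provided command).
# Creates a unique working directory under a scratch base directory and
# prints its path. The default base, /nfsscratch/$USER, is an assumed layout;
# pass /localscratch (or another path) as the first argument to override it.
make_workdir() {
    mw_base=${1:-/nfsscratch/$USER}
    mkdir -p "$mw_base" || return 1
    # mktemp -d creates the directory and prints the generated name
    mktemp -d "$mw_base/job.XXXXXX"
}

# Typical use inside a job script (left commented out here):
#   WORKDIR=$(make_workdir) || exit 1
#   cd "$WORKDIR"
#   ... run the application, then copy the final results back to $HOME ...
#   rm -rf "$WORKDIR"
```

Only the final results are copied back to the home account, so the temporary files never appear in a snapshot.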


  2. How do I access snapshots of my home account? 

    Snapshots are accessible in ~/.zfs/snapshot. To restore a file, copy it with the cp command from a dated subdirectory of the snapshot folder back to your home account. Here is an example of what you might expect to see during the restore process.

    [brogers@login-0-1:~]$ cd ~/.zfs/snapshot
    
    [brogers@login-0-1:snapshot]$ ls
    zfs-auto-snap.daily-2011-02-10-00.00   zfs-auto-snap.monthly-2010-10-25-08.58  zfs-auto-snap.weekly-2011-01-22-00.00
    zfs-auto-snap.daily-2011-02-11-00.00   zfs-auto-snap.monthly-2010-11-01-00.00  zfs-auto-snap.weekly-2011-01-29-00.00
    zfs-auto-snap.daily-2011-02-12-00.00   zfs-auto-snap.monthly-2010-12-01-00.00  zfs-auto-snap.weekly-2011-02-01-00.00
    zfs-auto-snap.hourly-2011-02-11-16.00  zfs-auto-snap.monthly-2011-01-01-00.00  zfs-auto-snap.weekly-2011-02-08-00.00
    zfs-auto-snap.hourly-2011-02-11-20.00  zfs-auto-snap.monthly-2011-02-01-00.00
    zfs-auto-snap.hourly-2011-02-12-00.00  zfs-auto-snap.weekly-2011-01-15-00.00
    
    [brogers@login-0-1:snapshot]$ cd zfs-auto-snap.monthly-2011-01-01-00.00/
    
    [brogers@login-0-1:zfs-auto-snap.monthly-2011-01-01-00.00]$ cp computeilos ~/
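
The same restore can be wrapped in a small shell function. The sketch below is illustrative only: restore_from_snapshot is a hypothetical name, and writing the copy to a ".restored" file name (so the current version is not overwritten) is a choice made for this example rather than a cluster convention.

```shell
# Hypothetical helper (not a cluster-provided command).
# Usage: restore_from_snapshot SNAPSHOT_NAME RELATIVE_PATH [SNAPSHOT_ROOT] [DEST_ROOT]
# Copies one file out of a snapshot and prints the path of the restored copy.
restore_from_snapshot() {
    rs_snap=$1
    rs_rel=$2
    rs_root=${3:-$HOME/.zfs/snapshot}
    rs_dest_root=${4:-$HOME}
    rs_src="$rs_root/$rs_snap/$rs_rel"
    if [ ! -f "$rs_src" ]; then
        echo "not found in snapshot: $rs_src" >&2
        return 1
    fi
    # Restore next to the original under a new name to avoid overwriting it
    rs_dest="$rs_dest_root/$rs_rel.restored"
    mkdir -p "$(dirname "$rs_dest")" || return 1
    cp -p "$rs_src" "$rs_dest" || return 1   # -p keeps timestamps and permissions
    echo "$rs_dest"
}

# Example call, using a snapshot name from the listing above (commented out):
#   restore_from_snapshot zfs-auto-snap.monthly-2011-01-01-00.00 computeilos
```

After checking the restored copy, rename it over the current file if it is the version you want.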


  3. How do I see the status of just my jobs with qstat? 

    The qstat command defaults to showing the status of all jobs. To view the status of just your own or another user's jobs, pass the '-u' flag to qstat. For example, to see the status of jobs submitted by user jdoe:

    qstat -u jdoe
    


  4. Why do I get "cannot connect to the bus session" with FastX version 2 on the Argon cluster?

    When connecting to Argon with FastX version 2 to open a desktop session such as MATE, you sometimes get an error saying "cannot connect to the bus session". This happens particularly if you have installed Anaconda to work with Jupyter on the cluster. Anaconda changes the PATH setting in your .bashrc file, which causes the problem in the first place. There is a fix available for this particular problem.

    Change your PATH variable in .bashrc 

    FROM:

    export PATH="/Users/YOURHAWKID/anaconda2/bin:$PATH"

    TO:

    export PATH="$PATH:/Users/YOURHAWKID/anaconda2/bin"

    The most important change is to move the $PATH variable from the end of the PATH setting to the beginning, so that the existing directories are searched before the Anaconda directory.
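
A plain export line like the one above adds the directory again every time .bashrc is sourced. If that is a concern, the append can be guarded so it happens only once. The function name path_append below is hypothetical and used only for this sketch; it would take the place of the export line in .bashrc.

```shell
# Hypothetical helper for .bashrc: append a directory to the END of PATH,
# but only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH unchanged
        *) PATH="$PATH:$1" ;;       # otherwise add it after the existing entries
    esac
}

# Replace YOURHAWKID with your own HawkID, as in the lines above.
path_append "/Users/YOURHAWKID/anaconda2/bin"
export PATH
```

Because the Anaconda directory comes last, the system directories are still searched first, which is the point of the fix described above.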

  5. Why do I see jobs pending in my queue from people who do not have access?

    There are two possible reasons for this. The first is that the person submitted to the wrong queue. The job will not run in your queue, and such jobs are generally cleared out quickly because the person wants to get their job running. The second is a bug in SGE that makes the queue listings for array jobs with dependency holds inaccurate. If a predecessor job has not started because it also has a hold, the dependent jobs show up in ALL queues in the hqw state. SGE does not yet know where the array job belongs, so it lists the job as queued in every queue. Because these are held jobs waiting on other held jobs, they can be displayed in all of the queues for quite some time. They will only launch in a queue that the job owner has access to. This erroneous display is a bug, or perhaps a poor implementation of status reporting for array jobs, but there is not much to be done about it since SGE is not actively developed.
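
If the held array jobs clutter the listing, one workaround is to filter them out of the qstat output. The sketch below assumes the default qstat layout, in which the first two lines are a header and the job state is the fifth column. The function name hide_held is hypothetical.

```shell
# Hypothetical filter: read qstat output on standard input and drop jobs
# whose state column (assumed to be column 5) is exactly "hqw".
# The first two lines (column header and separator) are always kept.
hide_held() {
    awk 'NR <= 2 || $5 != "hqw"'
}

# Example use on the cluster (commented out; requires qstat):
#   qstat -u '*' | hide_held
```

This only changes what is displayed. The held jobs are still in the scheduler and will launch only in a queue that their owner has access to.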