This page presents a list of Frequently Asked Questions (and answers) about using the HPC cluster systems at the University of Iowa.

...

How do I access snapshots of my home account?

...

...

  1. If you think this is a problem that won't occur frequently, contact the sysadmin team and ask them to delete the offending snapshots.
  2. For applications that do a lot of reading and writing in a directory, it is preferable to run on the scratch file systems (/nfsscratch). The scratch file systems are not backed up, and they are much faster than your home account. Using them is also better system citizenship, because the bandwidth to the home accounts is lower than the bandwidth available to the scratch file systems.
  3. Another option, if a scratch volume does not work for your job for some reason, is to use /localscratch, which is the local hard drive of a compute node. This is a symlink to /state/partition1/localscratch, in case your jobs do not handle symlinks well. It will likely also be faster than your home account, and it does not have snapshot backups. Note: This will only work if the data does not need to be available to multiple nodes at once.

    Info

    The amount of /localscratch available on Helium is about 500GB. On Neon, each node has about 2.5TB available for /localscratch.

  4. If you can't migrate to one of the cluster scratch volumes or /localscratch, the sysadmin team can decrease the frequency of the snapshots on your home account. This can mitigate the problem if you have an idea of the write intervals of temporary data for your jobs. For instance, if all of your jobs finish in a few hours, we can set things up so that we snapshot your directory only once per day, at a time that makes sense for your typical utilization patterns.
  5. If none of the above are viable, the sysadmin team can turn off snapshots completely on your home account.

    Warning

    This means that accidental deletion or corruption of files WILL NOT be recoverable by the sysadmin team!
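The scratch options in the list above can be sketched as a small bash helper. This is a minimal sketch, not a site-provided tool: the function name run_in_scratch and its arguments are hypothetical, and it assumes the scratch base directory (for example /localscratch or /nfsscratch) already exists and is writable.

```shell
# run_in_scratch BASE DEST COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND inside a private directory under BASE,
# then copies everything it produced to DEST.
run_in_scratch() {
    local base="$1" dest="$2"
    shift 2
    local workdir status
    # Create a unique working directory so concurrent jobs do not collide.
    workdir=$(mktemp -d "$base/${USER:-job}.XXXXXX") || return 1
    # Run the command with the scratch directory as its working directory.
    ( cd "$workdir" && "$@" )
    status=$?
    # Copy results back; only remove the scratch copy if the copy succeeded.
    if mkdir -p "$dest" && cp -r "$workdir"/. "$dest"/; then
        rm -rf "$workdir"
    else
        echo "copy failed; results left in $workdir" >&2
    fi
    return "$status"
}
```

A job script might call it as `run_in_scratch /localscratch ~/results ./my_program`, where ./my_program and ~/results stand in for your own program and output location.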

...

Snapshots are available as read-only directories in ~/.zfs/snapshot, each representing the state of your home directory at the time the snapshot was created. To restore a file or directory, use the cp command to copy it from a snapshot directory under the snapshot folder to a writable location in your home account (or elsewhere if you prefer). Here is an example of what you might expect to see during the restore process:

No Format
[brogers@login-0-1:~]$ cd ~/.zfs/snapshot
[brogers@login-0-1:snapshot]$ ls
zfs-auto-snap.daily-2011-02-10-00.00   zfs-auto-snap.monthly-2010-10-25-08.58  zfs-auto-snap.weekly-2011-01-22-00.00
zfs-auto-snap.daily-2011-02-11-00.00   zfs-auto-snap.monthly-2010-11-01-00.00  zfs-auto-snap.weekly-2011-01-29-00.00
zfs-auto-snap.daily-2011-02-12-00.00   zfs-auto-snap.monthly-2010-12-01-00.00  zfs-auto-snap.weekly-2011-02-01-00.00
zfs-auto-snap.hourly-2011-02-11-16.00  zfs-auto-snap.monthly-2011-01-01-00.00  zfs-auto-snap.weekly-2011-02-08-00.00
zfs-auto-snap.hourly-2011-02-11-20.00  zfs-auto-snap.monthly-2011-02-01-00.00
zfs-auto-snap.hourly-2011-02-12-00.00  zfs-auto-snap.weekly-2011-01-15-00.00
[brogers@login-0-1:snapshot]$ cd zfs-auto-snap.monthly-2011-01-01-00.00/
[brogers@login-0-1:zfs-auto-snap.monthly-2011-01-01-00.00]$ cp computeilos ~/
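If you are not sure which snapshot still holds the file you want, a short loop can check each one. The following is a sketch only; the function name list_snapshots_with is hypothetical, and it assumes the snapshot layout shown in the example above.

```shell
# list_snapshots_with SNAPDIR RELPATH
# Hypothetical helper: prints the full path of RELPATH inside every
# snapshot directory under SNAPDIR that contains it.
list_snapshots_with() {
    local snapdir="$1" relpath="$2" snap
    for snap in "$snapdir"/*/; do
        # -e is true for regular files and directories alike.
        if [ -e "$snap$relpath" ]; then
            printf '%s\n' "$snap$relpath"
        fi
    done
}
```

For the file in the example above, `list_snapshots_with ~/.zfs/snapshot computeilos` would print one path per snapshot that contains it, and any of those paths can then be passed to cp.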


How do I see the status of just my jobs with qstat? 

The qstat command defaults to showing the status of all jobs. However, to view the status of just your own or another user's jobs, one can pass the '-u' flag to qstat. So, to see the status of jobs submitted by user jdoe:

No Format
qstat -u jdoe
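To avoid typing your own user name each time, the same flag can be wrapped in a small shell function. This is a sketch; the name myjobs is hypothetical, and it assumes the USER environment variable holds your login name.

```shell
# myjobs [USERNAME]
# Hypothetical wrapper: shows jobs for USERNAME, or for the current
# user when no argument is given.
myjobs() {
    qstat -u "${1:-$USER}"
}
```

Placed in ~/.bashrc, this lets you run `myjobs` for your own jobs or `myjobs jdoe` for another user's.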

Why is my ~/.bashrc not being sourced?

User accounts created before May 2, 2011 did not have the template ~/.bash_profile installed at account creation time. This is the file that contains the statement to source the ~/.bashrc file if it exists. A standard ~/.bash_profile file was installed on May 2, 2011 in user accounts that did not already have one. If you had already created your own ~/.bash_profile file and did not include a statement to source ~/.bashrc then it will not be sourced. To fix this, add the following to your ~/.bash_profile file.


...

~/.bash_profile

...
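The site's own snippet is not reproduced in this copy of the page. A common form of such a statement, given here as an assumption rather than as the exact template that was installed, is:

```shell
# Source ~/.bashrc from ~/.bash_profile if the file exists.
# Assumed, conventional form; the installed template may differ.
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
```

Login shells read ~/.bash_profile but not ~/.bashrc, which is why the explicit source statement is needed.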

"Could not connect to session bus" when connecting to Argon using FastX version 2:

When connecting to Argon with FastX version 2, you may see an error saying "Could not connect to session bus: Failed to connect to socket" while starting a graphical desktop such as MATE. The most common cause of this issue on Argon is that you have installed Anaconda using its default settings. Anaconda's installer configures your ~/.bashrc file to automatically activate Anaconda during the login process. But the installer also gives priority to Anaconda software, and because Anaconda includes software which interferes with graphical logins, its presence causes them to fail with this error.

In older versions of Anaconda, the installer simply adds its own path at the start of the PATH variable, so you can work around the problem by moving its path to the end, thus giving Anaconda software lower priority. That is, edit your ~/.bashrc to change the definition like so:

FROM:

No Format
export PATH="/Users/YOURHAWKID/anaconda2/bin:$PATH"

TO:

No Format
export PATH="$PATH:/Users/YOURHAWKID/anaconda2/bin"

More recent versions of Anaconda configure activation in your ~/.bashrc using a very different mechanism, and a fix analogous to the above is less convenient. In this case, you can leave that configuration in place so that Anaconda itself becomes active during login, but reconfigure Anaconda so that the default "base" environment is not automatically activated during login. You can use the standard conda commands in your shell session or job script to activate any environment when you need to use it.
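One way to make that change is with conda's own configuration command. The sketch below wraps it in a hypothetical function name; the auto_activate_base setting is the one documented for conda 4.6 and later, so check your installed version's documentation in case the setting has since been renamed.

```shell
# Hypothetical wrapper around the conda setting that controls whether
# the "base" environment is activated in every new shell.
# The change is written to your ~/.condarc file.
disable_conda_base_autoactivate() {
    conda config --set auto_activate_base false
}
```

After running it once and logging in again, an environment can still be activated on demand with `conda activate myenv`, where myenv stands in for one of your own environment names.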

I see jobs pending in my queue from people who do not have access.

There are two possible reasons for this. The first is that the person submitted to the wrong queue. The job will not run in your queue, and such jobs are generally cleared out quickly because the submitter wants to get the job running. The second is a bug in SGE: the queue listings for array jobs with dependency holds are not accurate. If a predecessor job has not started because it also has a hold, the dependent jobs show up in ALL queues in the hqw state. In effect, SGE does not yet know where the array job belongs, so it lists it as queued in every queue. Because these are held jobs waiting on other held jobs, they can be displayed in all of the queues for quite some time. They will only ever launch in a queue that the job owner has access to. This erroneous display is a bug, or perhaps a poor implementation of status reporting for array jobs, but there is not much to be done since SGE is not actively developed.

Filezilla shows the following error when connecting: 

No Format
Error:        	Server sent an additional login prompt. You need to use the interactive login type.
Error:        	Critical error: Could not connect to server

This is because the Duo prompt cannot be shown. In the "Site Manager", select "Interactive" from the drop-down selector for "Logon Type". That will cause a dialog box to pop up where the Duo information can be entered.

Filezilla keeps trying to reconnect

This is the default behavior for Filezilla. Set the timeout to 0 in the settings.

System based programs fail to load after loading environment modules

The environment modules set up the environment for the respective applications. While most library paths are baked in, many packages still need a hint about where their libraries are. For that reason, the LD_LIBRARY_PATH variable is set to ensure that module packages find the correct libraries. Unfortunately, that can cause issues when trying to launch non-module programs, i.e., programs that are available on the system without using environment modules. If you see error messages related to libraries when launching a system program, you will have to unset LD_LIBRARY_PATH. There are two options:

  1. Launch programs from a session without any modules loaded. The environment can be reset with

    No Format
    module reset


  2. Unset LD_LIBRARY_PATH as part of the command. For example

    No Format
    LD_LIBRARY_PATH='' gedit
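If you launch system programs often while modules are loaded, the second option can be wrapped in a function. This is a sketch with a hypothetical function name. Unlike the example above, which sets the variable to an empty string, it removes the variable entirely for the launched program.

```shell
# run_without_ld_path COMMAND [ARGS...]
# Hypothetical helper: runs an external COMMAND with LD_LIBRARY_PATH
# removed from its environment. The subshell keeps the change from
# leaking into the calling session, so loaded modules keep working.
run_without_ld_path() {
    ( unset LD_LIBRARY_PATH; exec "$@" )
}
```

For example, `run_without_ld_path gedit` starts the system gedit while leaving LD_LIBRARY_PATH intact in your shell.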