Turing FAQ

Welcome to the Turing G5 Cluster Frequently Asked Questions page. We hope the following advice helps you get set up and running on the cluster. Please let us know if something is unclear, does not solve your problem, or if you find something new that should be included here. We will share your experiences with other users by updating these pages. Please mail questions, comments, and suggestions to turing-help@cse.uiuc.edu.


Unable to open GM port

The most common cause of this error is not using "qsub" and/or "rjq" as directed. Make sure you have specified "ppn=2", never "ppn=1". Also, always use "rjq", never "mpirun". Occasionally, there really is a problem with one of the nodes on which you are trying to run. If you are sure you are using qsub and rjq properly and are still getting this error, please report the problem, along with the relevant job stdio files, to turing-help@cse.uiuc.edu as soon as possible. We will be able to clean it up fairly quickly or remove the errant node from the batch pool. Note that it is essential that you include the stdout for the job.

Permission denied -or- Host key verification failed

This error is commonly reported in your batch output or interactive tool output when your known_hosts file has not been updated as described in the tutorial.  To fix this problem, issue the following commands:
rm ~/.ssh/known_hosts
ln -s /dev/null ~/.ssh/known_hosts

Warning: no access to tty (Bad file descriptor), Thus no job control in this shell

Unfortunately, this annoying message is something we live with on this platform.   Fortunately, the message is harmless and can be ignored.  If you are interested, the crux of the problem lies in the implementation of openpty() on OS X, and how the TORQUE resource manager uses it.

Qdel woes:

Of all cluster problems, qdel problems are by far the most common, and unfortunately the most annoying. TORQUE is not without its share of problems on all platforms, and in particular, it's beta software on OS X. However, there aren't a lot of choices out there for resource managers that work on OS X. On the bright side, many people are working on TORQUE, including Cluster Management Resources, and it is quickly coming up to speed on the platform. Here is qdel's variety of errors:

Fails to remove job

This is a common one. Qdel will return without complaining, yet the job in question keeps running happily. Sometimes jobs can take up to 20 minutes to finally exit after being qdel'd. If there is still no reaction after this time, a qsig -s TERM <jobid> command will sometimes force the job to make its exit. Failing that, send mail to turing-help@cse.uiuc.edu, and we'll clear the job for you.

Unable to connect to MOM

This usually happens when the job (or sometimes the mom) fails or crashes on the root node of the job. You will definitely need help clearing this one; send mail to turing-help@cse.uiuc.edu with the jobid.

Hangs forever

This is the result of a serious problem we haven't yet tracked down completely. The queue is on the brink of hanging, the scheduler can no longer contact the queue, and multiple PBS server processes are spawned. Basically, it's PBS anarchy. The PBS server has to be brought down and back up again, so contact turing-help@cse.uiuc.edu immediately. It doesn't take long to fix this problem, but no jobs will be scheduled until it comes to our attention that this is happening.


Where is my output?

In PBS, the stdout and stderr files are kept local to the root node of the job until the job completes.  Once your job exits, the spooled output and error will be copied to the file you specified in your batch script with "#PBS -o", etc.  This copying process usually takes a few seconds. You can override the output spooling behavior by redirecting the stdout and stderr on the command line:
rjq my_app.exe >& my_output.txt

However, be aware that this may affect your application performance. See "Unusually poor performance" below. You may also use the qpeek command to have a look at your output as it is collected on the root node. The syntax is:
qpeek <jobid>

Just type qpeek -help to obtain a usage guide.  There are some useful options.
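
For reference, a batch script along the lines described above might look like the following minimal sketch. The job name, node count, walltime, and file names are illustrative assumptions; only "ppn=2", "#PBS -o", and rjq come from this page. The sketch writes the script to a file so that you can inspect it before submitting.

```shell
#!/bin/sh
# Write a minimal example batch script to turing_job.pbs (hypothetical name).
# All values below are placeholders; adjust them for your own job.
cat > turing_job.pbs <<'EOF'
#PBS -N my_job
#PBS -l nodes=4:ppn=2
#PBS -l walltime=00:10:00
#PBS -o my_job.out
#PBS -e my_job.err

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch with rjq, never mpirun (see "Unable to open GM port").
rjq my_app.exe
EOF

echo "wrote turing_job.pbs; submit it with: qsub turing_job.pbs"
```

With a script like this, stdout and stderr should appear in my_job.out and my_job.err once the job exits, and qpeek can be used in the meantime.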


Path problems

Most of the time, the complaints go something like:
xxxx: command not found.
Where xxxx is the name of some command, usually rjq or rj. This problem occurs because we're still creating the optimum node images for the cluster. The default shell settings have not made it into the node image yet, but they are coming soon. If you are using the default shell, tcsh, you can fix this problem by copying (or merging, if you have customized) the default shell settings file /etc/csh.cshrc to ~/.tcshrc. Be aware that if you do this, the defaults in /etc/csh.cshrc will be overridden by your copy. You should probably check /etc/csh.cshrc occasionally to see if there are any changes that you would like to incorporate into your settings.


Unusually poor performance

Every once in a while, we see anomalously poor performance. We haven't been able to fully track down the root of this problem, though we have several suspects. It does not seem to be associated with any particular nodes, but is intermittent and strikes seemingly at random. We are actively investigating this phenomenon, so if you experience it, we would very much like to hear about it. It will help if you can provide a list of hosts on which the problem was experienced, as well as a timeframe. Possible things to watch out for:

Output redirection to an NFS filesystem. Our NFS runs over regular 100 Mb Ethernet from each node to a 2 Gb uplink to the NFS server. It's possible that other jobs on the same rack could help saturate this network, so that applications end up blocking during writes.

Runaway/leftover processes on the nodes.  It's possible that misuse of the cluster or failed jobs will cause some processes to be on the node taking up significant amounts of cycles without PBS or Moab being aware of them.  This would most certainly cause performance issues.

Application I/O consisting of many small write requests (< 100 bytes). NFS performs quite poorly under these conditions.
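
One way to sidestep the two NFS issues above is to capture output on local disk and copy it to its final location in a single step. The following is only a sketch: it assumes the node has writable local disk at /tmp (or $TMPDIR), which is not confirmed anywhere on this page, and stage_output is a hypothetical helper name, not a cluster command.

```shell
#!/bin/sh
# Hypothetical helper: run a command with its output captured on local
# disk, then copy the result to its final location in one step.
# $1 = final output file (e.g. on NFS); remaining args = command to run.
stage_output() {
    dest=$1
    shift
    # Assumption: /tmp (or $TMPDIR) is local to the node and writable.
    local_out=$(mktemp "${TMPDIR:-/tmp}/stage_output.XXXXXX") || return 1
    "$@" > "$local_out" 2>&1
    status=$?
    cp "$local_out" "$dest" || return 1
    rm -f "$local_out"
    # Report the exit status of the command itself.
    return "$status"
}

# Usage: stage_output my_output.txt rjq my_app.exe
```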

Update:
Since we've gone through extensive cluster shakedown, part replacement, and node tuning, we have not seen any strange performance problems. Bad CPUs and DIMMs appear to have caused these problems. Please do let us know if you still experience anything strange with respect to performance.

Job sits in queue forever -or- qsub -I hangs

There are many reasons why a job might not get scheduled immediately. This can be especially irritating for interactive jobs because you are actively waiting for the job to be scheduled.

The short answer is that your job either violates some policy (such as job limits), lacks sufficient resources, or is somehow in error. Here are a few tips for gathering the information you need to determine why your job is not being scheduled, and/or how to plan your job so that it gets scheduled promptly.


First, make sure your job does not exceed job limits. Current job limits can be reviewed with:

news job.limits

Second, plan your job around the current state of the machine. Check the output of the command:
pbsstate
Pbsstate will display the current queue, what jobs are running and on which partition. From this information, you can usually get a very good idea about how busy the machine is, and on which partition a job with a given number of processors is most likely to run.

Another very useful command is:
showbf
Here's a sample of what showbf might tell you:
Partition     Procs  Nodes   StartOffset      Duration       StartDate
---------     -----  -----  ------------  ------------  --------------
minor           128     64      00:00:00      00:11:10  09:19:12_03/02
minor           130     65      00:11:10       2:00:34  09:30:22_03/02
In this particular case, there are 64 nodes free on the minor partition, and another one will become free in 11 minutes and 10 seconds. There is actually a lot of information here. Large jobs have priority over smaller ones. The scheduler schedules larger jobs at the earliest possible time, and then uses the "backfill" time to schedule smaller jobs only if they will not cause the large job to start late. In this case, if you queue a 128-processor job (64 nodes) and request a wallclock time of more than 11 minutes, another job requesting 130 processors (65 nodes) will take precedence over yours (your job will not run), because if your job ran, the 65-node job would have to wait.
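
The backfill arithmetic can be checked mechanically. The following is a minimal sketch, assuming durations are written as HH:MM:SS (or H:MM:SS) the way the showbf sample prints them; to_seconds and fits_window are hypothetical helpers, not TORQUE or Moab commands.

```shell
#!/bin/sh
# Hypothetical helpers, not part of TORQUE or Moab.

# Convert an H:MM:SS duration (as in the showbf sample) to seconds.
to_seconds() {
    # awk reads the fields as decimal numbers, so "08" and "09" are
    # handled correctly (shell arithmetic could treat them as octal).
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# Succeed if the requested walltime fits inside the backfill window.
# $1 = requested walltime, $2 = Duration column from showbf
fits_window() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

# Example with the first row of the sample (window of 00:11:10):
if fits_window 00:10:00 00:11:10; then
    echo "a 10 minute request fits the window"
fi
if ! fits_window 00:30:00 00:11:10; then
    echo "a 30 minute request does not fit"
fi
```

A request that fits the window is a candidate for backfill; whether it actually starts is still up to the scheduler.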

Okay, so you've planned your job, and it still sits in the queue. There are 4 possibilities:
1) Your job may be preempted by a larger job. See the showbf command above.
2) There just aren't enough free nodes to run your job. Clearly, if there are no nodes available to run your job, it will remain queued until it can be scheduled. You can obtain an estimate of when your job will run by using the command:
showstart <jobid>
3) The job has some problem preventing it from running. Many times, your job will be queued and held, or deferred until an administrator can look at it and either release the hold or notify the user that something is wrong with the job.  There are various commands to obtain more information on what's going on with your job:
showq
checkjob <jobid>
qstat -f <jobid>

A good resource for learning about the various commands you can use to check your job is the Moab manual page. Also, one of the most common reasons a job is held is the "#PBS -l ncpus=xxxxx" line in the batch script. The "ncpus" resource is not supported on this cluster. If you are using this line, remove it and re-submit your job.

4) Some system error has sabotaged your job. Sometimes, there are just errors that you can't do anything about. If you have a queued job and you suspect some error with it, you can use the command:
checkjob <jobid>
If you read something suspicious there, especially something indicating some error you don't understand, you will probably need help fixing the job. There are a few other clues that might help you determine a problem with a job...

If you see differing states between qstat and showq, this generally indicates an error that is a bit more complicated. It usually means the job tried to start, and then one of its nodes went down, or its PBS status changed during job startup. These jobs stay in the queue, because it's possible to clear the job and get it running without resubmission.

If you cannot determine why your job is held or not running, contact us at turing-help@cse.uiuc.edu and we will assist you in getting your job running.
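
Since an "ncpus" request is one of the most common reasons for a held job, it can be worth checking a batch script before submitting it. A minimal sketch, assuming the script uses ordinary "#PBS" directive lines; check_ncpus is a hypothetical helper name, not a cluster command.

```shell
#!/bin/sh
# Hypothetical pre-submission check, not a cluster command.
# Prints (with line numbers) any #PBS line that requests ncpus and
# returns non-zero if one is found.
check_ncpus() {
    if grep -n '^#PBS[[:space:]].*ncpus=' "$1"; then
        echo "remove the ncpus request from $1 before running qsub" >&2
        return 1
    fi
    return 0
}

# Usage: check_ncpus my_job.pbs && qsub my_job.pbs
```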

CPP not behaving as expected

Try:
cpp3 -traditional -P

How to acknowledge the use of Turing


The Computational Science and Engineering Program provides the Turing cluster at no cost to individual University of Illinois researchers. We respectfully request that researchers and authors use the following acknowledgement in all publications resulting in whole or in part from research conducted using the Turing resource:
The author(s) gratefully acknowledge(s) the use of the Turing 
cluster maintained and operated by the Computational Science 
and Engineering Program at the University of Illinois.  Turing 
is a 1536-processor Apple G5 X-serve cluster devoted to high 
performance computing in engineering and science.


Why no Xwindows Support for Emacs?

Mac OS X does not come with Xwindows, but it does come with emacs.  So, by default, emacs is not built with Xwindows support.  On Turing, we use xemacs for windowed, x-enabled emacs.  You can find it in /usr/local/bin, which should now be in the default path on all head nodes.

Prompted for passwords

This is commonly reported by users who have not set up their passphraseless RSA ssh keys. Please see the tutorial for how to do this.

Application just hangs

Sorry, can't be of too much help here without a good deal of specifics, but here's some advice....

It's way easier to debug your code in the interactive mode than it is to debug it through batch.  In fact, if your application hangs frequently, it's probably causing problems for the queuing system and it would be much better to get it working interactively before trying in batch again.  Drop us an email, we are more than happy to work with you to help you debug your program and provide the resources you need for an adequate interactive debugging session.


Where is my core file?

On OS X, all core files are written out to the system directory, /cores. Core files have the naming convention core.pid. Another very useful place to find out about your applications that have crashed is in ~/Library/Logs/CrashReporter. Here, you will find a detailed description of what was going on when your app took a dive. The naming convention goes like: <appname>.crash.log.
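
To locate these files for a particular run, the naming conventions above can be turned into paths. A minimal sketch; the helper names are hypothetical, and the pid and application name are whatever your own job reported.

```shell
#!/bin/sh
# Hypothetical helpers that only build paths from the conventions above.

# Path of the core file for process id $1.
core_path() {
    echo "/cores/core.$1"
}

# Path of the CrashReporter log for application name $1.
crash_log_path() {
    echo "$HOME/Library/Logs/CrashReporter/$1.crash.log"
}

# Usage: ls -l "$(core_path 12345)" "$(crash_log_path my_app)"
```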

How do I find open nodes?

The short answer is, you don't. The Turing G5 cluster uses the TORQUE resource manager and the MOAB workload manager to find open nodes and run your job for you.

If you are looking for information helpful in choosing a partition, such as open nodes, jobs currently in the queue, and so forth, you can use the command:
    pbsstate
This will pipe the text state file through less to your terminal. This information should help you determine which network to request, how many open nodes there are, and the current queue size.


How do I use Altivec: FFTs, LAPACK, BLAS, etc?

There are several good sources of information on this stuff. The best place to start looking is:
   man Accelerate
The man page will get you started in determining how to use the Accelerate framework and the Altivec libraries. It's impossible to include all of the variations of what people like to do, but for those with no patience for man pages, here's how you use/link in the simple case:

In C/C++:
#include <Accelerate/Accelerate.h>
and link:
cc -faltivec -framework Accelerate

In Fortran, link like so:
xlf90 -Wl,-framework -Wl,vecLib

The vecLib supports two symbol types for calling from Fortran. They are of the form:
_XXXXX
and
_xxxxx_

The vanilla IBM Fortran compiler (xlf90) will build compatible symbols by passing in the
-qextname
option in the compile. The mpif90 compiler adds this option automatically, so you won't need to specify it if using the MPI compilers.

The man page cites some nice references if you need more information.


What is Turing PM?

The Turing cluster has weekly preventive maintenance (PM) sessions, which are currently scheduled on Wednesdays from 09:00 CST. During PM, user access is restricted. PMs requiring system reboots will necessarily kill running jobs, but should not affect queued jobs.
Often during PM, we are able to fit in jobs on the cluster that are difficult or impossible to schedule otherwise. If you need to run a job that is very large or exceeds imposed limits, please let us know. It's possible we can run your job during or immediately after a PM session.


How do I change my shell to my favorite one?

Only the default, tcsh, is officially supported on Turing. We have seen cases where jobs do not work properly if the default shell is changed from tcsh.
You *can* change your shell by exec'ing it from inside your .tcshrc. For example:

exec /bin/bash


So far we have not heard complaints from people who are doing this.


Turing Filesystems

/turing/home: 30 GB/user quotas, no group quotas.

This filesystem is where your home directories reside. Your most important data (thesis, source code, final results, etc) should be placed here.

It is being backed up nightly, and deleted files may be recovered within 30 days of deletion. We will be making a weekly off-site backup of /turing/home after we have made arrangements for safe and secure off-site storage.


/turing/projects: 500 GB/user quotas, 4 TB/group quotas.

This filesystem is for different project groups to be able to share data and space. Generally this is Beckman Institute and CSAR users. Large data sets that are needed for runs but which have copies stored at other supercomputing sites should go here, as well as intermediary files from large processing runs.

It is being backed up nightly, and deleted files may be recovered within 30 days of deletion. We will not be making an off-site copy of this filesystem, so if something destroyed the disks and the tape library next to them, the contents of the filesystem would be lost.


/turing/scratch: no quotas on user or group.

This filesystem should be only used as temporary space -- think of it like a large /scratch or /tmp that is shared on the network.

It is not backed up. If something goes wrong here, all data on /turing/scratch is lost. In order to keep /turing/scratch from filling up, we will be implementing old file deletion policies. Any file older than (some date, probably 2 weeks) will be automatically deleted.

As scratch is used, the date for deletion will be modified as needed.
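
To see which of your files would be affected by such a policy, you can list anything older than a given age. A minimal sketch, assuming a two week cutoff (the policy above only says "probably 2 weeks") and that you pass your own directory under /turing/scratch; list_old_files is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under directory $1 that are
# older than $2 days (default 14, an assumed cutoff).
# find counts age in whole 24-hour periods, so +14 matches files that
# were last modified at least 15 days ago.
list_old_files() {
    find "$1" -type f -mtime +"${2:-14}" -print
}

# Usage: list_old_files /turing/scratch/your_directory 14
```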


I followed the tutorial, but passwordless ssh still doesn't work.

Usually this is due to having lax permissions on the .ssh directory. Try the following command:
chmod -R 700 ~/.ssh

If you continue to have trouble, let us know.
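
To confirm that permissions were the problem, you can list the entries that are still open to group or other. A minimal sketch; lax_entries is a hypothetical helper name, and it defaults to ~/.ssh when no directory is given.

```shell
#!/bin/sh
# Hypothetical helper: print entries under directory $1 (default ~/.ssh)
# that grant any permission to group or other.
# Symbolic links are skipped, since their own mode bits are not
# meaningful here (for example a known_hosts link to /dev/null).
lax_entries() {
    find "${1:-$HOME/.ssh}" ! -type l \( -perm -040 -o -perm -020 \
        -o -perm -010 -o -perm -004 -o -perm -002 -o -perm -001 \) -print
}

# Usage: run lax_entries before and after the chmod above; after it,
# nothing should be printed.
```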


I forgot my password!

You will need your UIUC NetID and NetID password to reset your Turing password from the Turing password management site:
Reset Turing password