Turing FAQ
Welcome to the Turing G5 Cluster Frequently Asked Questions
page. We hope the following tidbits of advice will be
helpful to you along your way to getting set up and running on the
cluster. Please let us know if something is unclear, does not
solve your problem, or if you find something new that should be
included here. We will share your experiences with other users through
the enhancement of these pages. Please mail questions, comments,
and suggestions to turing-help@cse.uiuc.edu.
Unable to open GM port
The most common cause of this error is that "qsub" and/or "rjq" are not used as
directed. Make sure you have specified "ppn=2", never "ppn=1". Also, always
use "rjq", never "mpirun". Occasionally, there is indeed a problem with one of
the nodes upon which you are trying to run. If you are sure you are using
qsub and rjq properly and are still getting this error, please report the
problem with the relevant job stdout and stderr files to
turing-help@cse.uiuc.edu
as soon as possible. We will be able to clean it up pretty
quickly or remove the errant node from the batch pool. Note that it is
essential that you include the stdout for the job.
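The rules above lend themselves to a quick check before you submit. The following is a minimal sketch, not a tool provided on Turing: check_batch_script is a hypothetical name, and the patterns assume that ppn appears in the script itself (not only on the qsub command line) and that any ncpus request, which this FAQ notes elsewhere is unsupported, appears on a #PBS line.

```sh
#!/bin/sh
# check_batch_script FILE
# Prints a warning for each rule from this FAQ that FILE appears to break
# and returns non-zero if any warning was printed.
# Hypothetical helper: the name and messages are made up for illustration.
check_batch_script() {
    script=$1
    problems=0

    # Collect every "ppn=N" and flag any value other than 2.
    if grep -Eo 'ppn=[0-9]+' "$script" | grep -qvx 'ppn=2'; then
        echo "warning: found a ppn value other than 2"
        problems=$((problems + 1))
    fi

    # Look for mpirun as a whole word on lines that are not comments.
    if grep -Ev '^[[:space:]]*#' "$script" |
        grep -Eq '(^|[^[:alnum:]_])mpirun([^[:alnum:]_]|$)'; then
        echo "warning: mpirun is used; use rjq instead"
        problems=$((problems + 1))
    fi

    # The ncpus resource is not supported on this cluster.
    if grep -Eq '^#PBS.*ncpus=' "$script"; then
        echo "warning: remove the ncpus resource request"
        problems=$((problems + 1))
    fi

    [ "$problems" -eq 0 ]
}
```

For example, "check_batch_script my_job.pbs" (my_job.pbs being a placeholder file name) prints nothing and returns success for a script that requests ppn=2 and launches with rjq.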
Permission denied -or- Host key verification failed
This error is commonly reported in your batch output or interactive tool
output when your known_hosts file has not been updated as described in the
tutorial.
To fix this problem, issue the following commands:
rm ~/.ssh/known_hosts
ln -s /dev/null ~/.ssh/known_hosts
Warning: no access to tty (Bad file descriptor),
Thus no job control in this shell
Unfortunately, this annoying message is something we live with on this
platform. Fortunately, the message is harmless and can be
ignored. If you are interested, the crux of the problem lies in
the implementation of openpty() on OS X, and how the TORQUE resource
manager uses it.
Qdel woes:
Of all cluster problems, qdel problems
are by far the most common, and unfortunately the most annoying.
TORQUE is not without its share of problems on all platforms, and in
particular, it's beta software on OS X. However, there aren't a
lot of choices out there for resource managers that work on OS X.
On the bright side, many people are working on TORQUE, including
Cluster Management Resources,
and it is quickly coming up to speed on the platform. Here's
qdel's variety of errors:
Fails to remove job
This is a common one. Qdel will
return without complaining, yet the job in question keeps running
happily. Sometimes jobs can take up to 20 minutes to finally
exit after being qdel'd. If, after this time, there is no
reaction, sometimes a
qsig -s TERM <jobid>
command will force the job to make its exit. Failing that, drop
a mail to
turing-help@cse.uiuc.edu,
and we'll clear the job for you.
Unable to connect to MOM
This usually happens when the job (or
sometimes the mom) fails or crashes on the root node of the job.
You will definitely need help clearing this one; drop a mail to
turing-help@cse.uiuc.edu
with the jobid.
Hangs forever
This is the result of a serious problem
we haven't yet tracked down completely. The queue is on the brink
of hanging, and the scheduler can no longer contact the queue.
Multiple PBS server processes are spawned. Basically, it's PBS
anarchy. The PBS server has to be brought down and back up again
- contact
turing-help@cse.uiuc.edu
immediately. It doesn't take too awfully long to fix this
problem, but no jobs will be scheduled until it comes to our attention
that this is happening.
Where is my output?
In PBS, the stdout and stderr files are
kept local to the root node of the job until the job completes.
Once your job exits, the spooled output and error will be copied to the
file you specified in your batch script with "#PBS -o", etc. This
copying process usually takes a few seconds. You can override the
output spooling behavior by redirecting the stdout and stderr on the
command line:
rjq my_app.exe >& my_output.txt
However, be aware that this may affect your application
performance. See the
performance
section. You may also use the
qpeek
command to have a look at your output as it is collected on the root
node. The syntax is:
qpeek <jobid>
Just type qpeek -help
to obtain a usage guide. There are some useful options.
Path problems
Most of the time, the complaints go
something like:
xxxx: command not found.
where xxxx is the name of some command, usually rjq or rj.
This problem occurs because we're still creating the optimum node images
for the cluster. The default shell settings have not made it into the
node image yet, but they are coming soon. If you are using the
default shell,
tcsh, you can fix this
problem by copying (or merging if you have customized) the default
shell setting file
/etc/csh.cshrc to
~/.tcshrc.
Be aware that if you do this, the defaults in /etc/csh.cshrc will be
overridden. You should probably check /etc/csh.cshrc occasionally
to see if there are any changes that you would like to incorporate into
your settings.
Unusually poor performance
Every once in a while, we see
anomalously poor performance. We haven't been able to fully track
down the root of this problem, though we have several suspects.
It does not seem to be associated with any particular nodes, but is
intermittent, and strikes seemingly at random. We are
actively investigating this phenomenon, so if you experience it, we
would very much like to hear about it. It will help if you can
provide a list of hosts on which the problem was experienced, as well
as a timeframe. Possible things to watch out for:
Output redirection to NFS filesystem. Our NFS is over regular
100Mb Ethernet from each node to a 2Gb uplink to the NFS server.
It's possible that other jobs on the same rack could help saturate this
network, and applications end up blocking during writes.
Runaway/leftover processes on the nodes. It's possible that
misuse of the cluster or failed jobs will cause some processes to be on
the node taking up significant amounts of cycles without PBS or Moab
being aware of them. This would most certainly cause performance
issues.
Application I/O consisting of many small write requests (< 100 bytes). NFS
performs quite poorly under these conditions.
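If redirected output or many small writes to NFS are slowing a job down, one possible workaround is to collect the output on local disk and copy it to NFS once at the end. This is a sketch under assumptions: run_with_local_staging is a hypothetical helper, and it assumes that ${TMPDIR:-/tmp} is on node-local disk, which you should confirm before relying on it.

```sh
#!/bin/sh
# run_with_local_staging DEST COMMAND [ARGS...]
# Runs COMMAND with stdout and stderr captured in a temporary file,
# then copies that file to DEST in a single pass.
# Assumption: ${TMPDIR:-/tmp} lives on local disk rather than NFS.
run_with_local_staging() {
    dest=$1
    shift
    staging=$(mktemp "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1

    # Every small write from the command lands in the staging file.
    "$@" >"$staging" 2>&1
    status=$?

    # One sequential copy to the final destination.
    cp "$staging" "$dest" || return 1
    rm -f "$staging"

    # Report the exit status of the command itself.
    return "$status"
}
```

Usage might look like "run_with_local_staging my_output.txt rjq my_app.exe". The trade-off is that nothing appears in my_output.txt until the command finishes, and a job that is killed leaves its staging file behind on the node.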
Update:
Since completing extensive cluster shakedown, part replacement, and
node tuning, we have not seen any strange performance problems. Bad CPUs and
DIMMs appear to have caused these problems. Please do let us know if you
still experience anything strange with respect to performance.
Job sits in queue forever -or- qsub -I hangs
There are many reasons why a job might not get
scheduled immediately. This can be especially irritating for interactive jobs
because you are actively waiting for the job to be scheduled.
The short answer is that your job either violates some policy (such as job
limits), lacks sufficient resources, or is somehow in error. Here are a few
tips for gathering the information you need to determine why your job is not
being scheduled, and/or how to plan your job so that it gets scheduled
promptly.
First, make sure your job does not exceed job limits. Current job limits can
be reviewed with:
news job.limits
Second, plan your job around the current state of the machine.
Check the output of the command:
pbsstate
Pbsstate will display the current queue, what jobs are running and on which
partition. From this information, you can usually get a very good idea about
how busy the machine is, and on which partition a job with a given number of
processors is most likely to run.
Another very useful command is:
showbf
Here's a sample of what showbf might tell you:
Partition  Procs  Nodes  StartOffset  Duration  StartDate
---------  -----  -----  -----------  --------  --------------
minor        128     64     00:00:00  00:11:10  09:19:12_03/02
minor        130     65     00:11:10   2:00:34  09:30:22_03/02
In this particular case, there are 64 nodes free on the minor partition and
another one will become free in 11:10. There is actually a lot of information
here. Large jobs have priority over smaller ones. The scheduler schedules
larger jobs at the earliest possible time, and then uses the "backfill" time
to schedule smaller jobs IF they will not cause the large job to get a late
start. In this case, if you queue a 128 processor job (64 nodes) and request
a wallclock time of more than 11 minutes, another job requesting 130 processors
(65 nodes) will preempt your job (cause your job to not run) because if your
job ran, the 65 node job would have to wait.
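The reasoning above can be turned into a small filter. The sketch below reads showbf output in the format of the sample and prints the windows that have enough processors for long enough. fits_backfill is a hypothetical helper; it assumes six whitespace-separated columns and durations written as MM:SS or HH:MM:SS, so any other format showbf may emit is not handled.

```sh
#!/bin/sh
# fits_backfill PROCS WALLTIME
# Reads showbf-style output on stdin and prints each window in which a job
# needing PROCS processors for WALLTIME (MM:SS or HH:MM:SS) would fit.
# Hypothetical helper; column meanings are taken from the sample output.
fits_backfill() {
    awk -v procs="$1" -v wall="$2" '
        # Convert a colon-separated time to seconds.
        function secs(t,    n, p, i, s) {
            n = split(t, p, ":")
            s = 0
            for (i = 1; i <= n; i++) s = s * 60 + p[i]
            return s
        }
        # Data rows have six fields and a numeric processor count,
        # which skips the header and the row of dashes.
        NF == 6 && $2 ~ /^[0-9]+$/ {
            if ($2 >= procs + 0 && secs($5) >= secs(wall))
                print $1, $2, "procs, starts in", $4
        }
    '
}
```

A typical call is "showbf | fits_backfill 128 00:10:00". With the sample above, a 128 processor job asking for 10 minutes fits both windows, while one asking for 30 minutes fits only the window that opens in 00:11:10.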
Okay, so you've planned your job, and it still sits in the queue. There are 4
possibilities:
1) Your job may be preempted by a larger job. See the showbf command above.
2) There just aren't enough free nodes to run your job. Clearly, if there are
no nodes available to run your job, it will remain queued until it can be
scheduled. You can obtain an estimate of when your job will run by using the
command:
showstart <jobid>
3) The job has some problem preventing it from running. Many times, your job
will be queued and held, or deferred until an administrator can look at it
and either release the hold or notify the user that something is wrong with
the job. There are various commands to obtain more information on what's
going on with your job:
showq
checkjob <jobid>
qstat -f <jobid>
A good resource for learning about the various commands you can use to check
your job is the
Moab manual page. Also, one of the
most common reasons a job is held is the "#PBS -l ncpus=xxxxx" line in
the batch script. The "ncpus" resource is not supported on this cluster.
If you are using this line, remove it, and re-submit your job.
4) Some system error has sabotaged your job. Sometimes, there are just errors
that you can't do anything about. If you have a queued job and you suspect
some error with it, you can use the command:
checkjob <jobid>
If you read something suspicious there, especially something indicating some
error you don't understand, you will probably need help fixing the job. There
are a few other clues that might help you determine a problem with a job...
If you see differing states between qstat and showq, this generally indicates
an error that is a bit more complicated. It usually means the job
tried to start, and then one of its nodes went down, or its PBS
status changed during job startup. These jobs stay in the queue,
because it's possible to clear the job and get it running without
resubmission.
If you cannot determine why your job is held or not running, contact us
at
turing-help@cse.uiuc.edu
and we will assist you in getting your job running.
CPP not behaving as expected
How to acknowledge the use of Turing
The Computational Science and Engineering Program provides the Turing
cluster at no cost to individual University of Illinois researchers. We
respectfully request that researchers and authors use the following acknowledgement
in all publications resulting in whole or in part from research conducted using the
Turing resource:
The author(s) gratefully acknowledge(s) the use of the Turing
cluster maintained and operated by the Computational Science
and Engineering Program at the University of Illinois. Turing
is a 1536-processor Apple G5 X-serve cluster devoted to high
performance computing in engineering and science.
Why no Xwindows Support for Emacs?
Mac OS X does not come with Xwindows,
but it does come with emacs. So, by default, emacs is not built
with Xwindows support. On Turing, we use xemacs for windowed,
x-enabled emacs. You can find it in /usr/local/bin, which should
now be in the default path on all head nodes.
Prompted for passwords
This prompt is commonly reported by
folks who have not set up their passphraseless RSA ssh keys.
Please see the
tutorial for how to do this.
Application just hangs
Sorry, can't be of too much help
here without a good deal of specifics, but here's some advice....
It's way easier to debug your code in
the interactive mode than it is to debug it through batch. In
fact, if your application hangs frequently, it's probably causing
problems for the queuing system and it would be much better to get it
working interactively before trying in batch again. Drop us an
email; we are more than happy to work with you to help you debug your
program and provide the resources you need for an adequate interactive
debugging session.
Where is my core file?
On OS X, all core files are written out to the system directory,
/cores. Core files have the naming convention core.pid. Another very
useful place to find out about your applications that have crashed is
in ~/Library/Logs/CrashReporter. Here, you will find a detailed
description of what was going on when your app took a dive. The naming
convention goes like: <appname>.crash.log.
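As a convenience, the two naming conventions can be wrapped in a few shell functions. This is only a sketch: the function names are hypothetical, and the paths are built exactly as described above.

```sh
#!/bin/sh
# Hypothetical helpers built from the naming conventions in this FAQ.

# core_file_for_pid PID: path where the core file for PID would be written.
core_file_for_pid() {
    echo "/cores/core.$1"
}

# crash_log_for_app APPNAME: path of the CrashReporter log for APPNAME.
crash_log_for_app() {
    echo "$HOME/Library/Logs/CrashReporter/$1.crash.log"
}

# report_crash_artifacts PID APPNAME: print whichever of the two files exist.
report_crash_artifacts() {
    for f in "$(core_file_for_pid "$1")" "$(crash_log_for_app "$2")"; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}
```

For instance, "report_crash_artifacts 1234 my_app" lists the core file and crash log for process 1234 of my_app if either is present (both values are placeholders).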
How do I find open nodes?
The short answer is, you don't. The Turing G5 cluster uses the
TORQUE resource
manager and the
MOAB
workload manager to find open nodes and run your job for you.
If you are looking for information helpful in choosing a partition, such
as open nodes, jobs currently in the queue, and so forth, you can use the
command:
pbsstate
This will pipe the text state file through less to your terminal. This
information should help you determine which network to
request, how many open nodes there are, and the current queue size.
How do I use Altivec: FFTs, LAPACK, BLAS, etc?
There are several good sources of information on this stuff. The best place to
start looking is:
man Accelerate
The man page will get you started in determining how to use the Accelerate framework
and the Altivec libraries. It's impossible to include all of the variations of what
people like to do, but for those with no patience for man pages, here's how you use/link
in the simple case:
In C/C++:
#include <Accelerate/Accelerate.h>
and link:
cc -faltivec -framework Accelerate
In Fortran, link like so:
xlf90 -Wl,-framework -Wl,vecLib
The vecLib supports two symbol types for calling from Fortran. They are of the form:
_XXXXX
and
_xxxxx_
The vanilla IBM Fortran compiler (xlf90) will build compatible symbols if you pass the
-qextname
option when compiling. The mpif90 compiler adds this option automatically, so you won't need to specify it if using the MPI compilers.
The man page cites some nice references if you need more information.
What is Turing PM?
The Turing cluster has weekly preventive maintenance (PM) sessions which are
currently scheduled on Wednesdays from 09:00 CST. During PM, user
access is restricted. PMs requiring system reboots will necessarily kill
running jobs, but should not affect queued jobs.
Often during PM, we are able to fit in jobs on the cluster that are difficult
or impossible to schedule otherwise. If you have a need to run a job that is
very large, or exceeds imposed limits, please let us know; it's possible we
can run your job during or immediately after a PM session.
How do I change my shell to my favorite one?
Only the default, tcsh, is officially supported on Turing. We have seen cases where jobs
do not work properly if the default shell is changed from tcsh.
You *can* change your shell by exec'ing it from inside your .tcshrc. For example:
exec /bin/bash
So far we have not heard complaints from people who are doing this.
Turing Filesystems
/turing/home: 30 GB/user quotas, no group quotas.
This filesystem is where your home directories reside. Your most
important data (thesis, source code, final results, etc) should be
placed here.
It is being backed up nightly, and deleted files may be recovered within
30 days of deletion. We will be making a weekly off-site backup of
/turing/home after we have made arrangements for safe and secure
off-site storage.
/turing/projects: 500 GB/user quotas, 4 TB/group quotas.
This filesystem is for different project groups to be able to share data
and space. Generally this is Beckman Institute and CSAR users. Large
data sets that are needed for runs but which have copies stored at other
supercomputing sites should go here, as well as intermediary files from
large processing runs.
It is being backed up nightly, and deleted files may be recovered within
30 days of deletion. We will not be making an off-site copy of this
filesystem, so if something destroyed the disks and the tape library
next to them, the contents of the filesystem would be lost.
/turing/scratch: no quotas on user or group.
This filesystem should be only used as temporary space -- think of it
like a large /scratch or /tmp that is shared on the network.
It is not backed up. If something goes wrong here, all data on
/turing/scratch is lost. In order to keep /turing/scratch from filling
up, we will be implementing old file deletion policies. Any file older
than (some date, probably 2 weeks) will be automatically deleted.
As scratch is used, the date for deletion will be modified as needed.
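Until the policy is final, you can get a rough idea of which files would be affected by listing those that have not been modified recently. A minimal sketch, assuming the 2 week figure above (the real cutoff may differ) and a hypothetical helper name, list_old_files:

```sh
#!/bin/sh
# list_old_files DIR [DAYS]
# Prints regular files under DIR whose last modification is more than DAYS
# days old (find counts age in whole 24-hour periods).
# DAYS defaults to 14, matching the "probably 2 weeks" estimate; the
# actual deletion policy may use a different cutoff.
list_old_files() {
    find "$1" -type f -mtime +"${2:-14}" -print
}
```

Point it at your own directory on /turing/scratch, for example "list_old_files /path/to/your/scratch/directory 14", where the path is a placeholder.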
I followed the tutorial, but passwordless ssh still doesn't work.
Usually this is due to having lax permissions on the .ssh directory. Try the following command:
chmod -R 700 ~/.ssh
If you continue to have trouble, let us know.
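To check whether permissions are the culprit before changing anything, a small test such as the following may help. It is a sketch, with ssh_dir_is_private as a hypothetical helper name; it simply inspects the mode string printed by "ls -ld".

```sh
#!/bin/sh
# ssh_dir_is_private DIR
# Succeeds when DIR grants no permissions to group or others.
# Hypothetical helper for illustration.
ssh_dir_is_private() {
    # Characters 5-10 of the ls mode string are the group and other bits.
    bits=$(ls -ld "$1" | cut -c5-10)
    [ "$bits" = "------" ]
}
```

One way to use it is "ssh_dir_is_private ~/.ssh || chmod -R 700 ~/.ssh", which only changes permissions when they are too open.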
I forgot my password!
You will need your UIUC NetID and NetID password to reset your Turing password from the Turing password management site:
Reset Turing password