Tutorial for Using Turing

Slides from "Turing Tips" Presentation

The following tutorial is intended to help you set up and use the Turing cluster. All Turing users should read this document. Even experienced Turing users should glance back here from time to time for updated information. Questions or requests for help should be addressed to turing-help@cse.uiuc.edu.

This tutorial assumes no prior knowledge of basic UNIX commands or MPICH, and will guide you step by step through setting up your environment, compiling, and running an MPI parallel job with PBS. If for some reason you wish to use the 100Mb Ethernet network instead of the high-speed Myrinet network, replace each instance of mpich-gm below with mpich-eth. Please note that users are NOT ALLOWED to use any communications packages on the Turing cluster other than those provided by the system. You must use either the MPICH or the Charm++ installation provided by the system.

Note that when following the tutorial below, it is best to cut and paste the commands directly from the webpage to your shell. This ensures no transcription errors. Sometimes the html'd command lines can be misleading.


Turing Environment Setup

All users interact with UNIX-like systems through a shell. Your shell provides your prompt, keeps track of your command history, and, among other things, gives you a means to manage your computing environment. Turing users will have tcsh by default. We highly recommend setting up a fresh environment on Turing. Do not import your shell environment from other machines. If you have changed your shell or your defaults, then the commands described here may not be appropriate, and you should follow along using the commands that are appropriate for your shell and its settings.

To determine which shell you use, issue the following command at your prompt:

echo $SHELL
The result should be /bin/tcsh. If you do not have /bin/tcsh, and you have not explicitly changed your shell, then please contact turing-help@cse.uiuc.edu.
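
The same check can be wrapped in a small helper that prints a warning when the login shell is not tcsh. This is only a sketch: the function name check_shell is made up for illustration, and it is written in Bourne shell syntax, so it should be saved to a file and run with sh rather than pasted at a tcsh prompt.

```shell
#!/bin/sh
# check_shell (hypothetical helper): report whether a shell path names tcsh.
check_shell() {
    case "$1" in
        */tcsh) echo "ok: login shell is tcsh" ;;
        *)      echo "warning: login shell is '$1', not tcsh" ;;
    esac
}

# SHELL is set at login; fall back to "unknown" if it is unset.
check_shell "${SHELL:-unknown}"
```

If the warning appears and you did not change your shell yourself, the advice above still applies: contact turing-help@cse.uiuc.edu.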

Setting up SSH keys for the Cluster

In order to run jobs on the cluster, you need to set up passwordless login for internal cluster connections only. DO NOT share the keys that you produce in this step with other hosts, and do not copy your keys from other hosts to this cluster.

To generate your keys, use the following command:

   ssh-keygen -t rsa
Choose the default answers for everything by just hitting enter until the command returns you to the shell prompt. Finally, copy the keys:
  cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

NOTE: Passwordless keys require user-only permissions on the .ssh directory. To ensure this is the case, use the following command:

    chmod -R 700 ~/.ssh
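
To confirm that the key setup took effect, a short Bourne-shell sketch along the following lines may help. The function name check_ssh_dir is hypothetical, and the sketch only inspects the directory mode and the presence of authorized_keys; it does not test an actual login. Run it with sh, since the syntax is not valid tcsh.

```shell
#!/bin/sh
# check_ssh_dir (hypothetical helper): confirm that DIR is accessible by its
# owner only (mode 700) and that it contains an authorized_keys file.
check_ssh_dir() {
    dir=$1
    # The first ten characters of "ls -ld" are the file type and mode bits.
    perms=$(ls -ld "$dir" 2>/dev/null | cut -c1-10)
    if [ "$perms" != "drwx------" ]; then
        echo "unexpected permissions on $dir: '$perms'"
        return 1
    fi
    if [ ! -f "$dir/authorized_keys" ]; then
        echo "missing $dir/authorized_keys"
        return 1
    fi
    echo "ssh directory looks fine"
}

check_ssh_dir "$HOME/.ssh" || echo "re-run the setup commands above"
```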

Setting up your SSH known hosts file

Remove any known_hosts file that may exist:
 
  rm ~/.ssh/known_hosts

Link your known_hosts file to /dev/null to disable host checking:
  ln -s /dev/null ~/.ssh/known_hosts

Setting up for MPICH

To set up your environment for MPI jobs under MPICH, we may need to change your .tcshrc file. You will find your .tcshrc file in your home directory. When you first log into a UNIX machine, you are automatically working in your home directory. You can always return to your home directory by issuing the command:

   cd
You can always find out what your current directory is by issuing the command:
   pwd -L
If you are in your home directory, this command should return with /turing/home/${LOGIN}, where ${LOGIN} is your user name. Please make sure you are in your home directory before continuing.

Now we will make sure your path is set up properly.  Your path is a list of all the directories in which your shell searches for the commands that you type at the command line.  To determine where a command is located, you can use the which utility.  Try it now for the MPICH C compiler:

   which mpicc
This command should return with /turing/software/mpich-gm/bin/mpicc. If it does, then you can skip ahead to the compiling section.  Otherwise, you need to add MPICH to your path with the command:
   setenv PATH "/turing/software/mpich-gm/bin:${PATH}"
Now try the above which command again. If you get the expected result, /turing/software/mpich-gm/bin/mpicc, then you should add the PATH modification command to your .tcshrc file so that the change persists across logins. To see whether you already have a .tcshrc file, issue the following command:
   cat .tcshrc 
The result of this command will either be the contents of your .tcshrc, or something along the lines of:
   cat: .tcshrc: No such file or directory
If you see the "No such file or directory" message, you can create a .tcshrc with the following command:
   echo 'setenv PATH "/turing/software/mpich-gm/bin:${PATH}"' > .tcshrc
Once you have done this, you can move on to the next section on compiling.

Otherwise, you already had a .tcshrc file, and you will need to edit it to set the appropriate path. Use your favorite editor to insert the above setenv command into your .tcshrc file.
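
Both cases (no .tcshrc yet, or an existing one) can be handled by appending the setenv line only when it is not already there. The following is a minimal Bourne-shell sketch of that idea, not a system-provided tool: the function name add_mpich_path is hypothetical, and it must be run with sh even though the line it writes is tcsh syntax.

```shell
#!/bin/sh
# add_mpich_path (hypothetical helper): append the tutorial's setenv line to
# FILE unless that exact line is already present. Creates FILE if needed.
add_mpich_path() {
    rcfile=$1
    line='setenv PATH "/turing/software/mpich-gm/bin:${PATH}"'
    if [ -f "$rcfile" ] && grep -qF "$line" "$rcfile"; then
        echo "already present in $rcfile"
    else
        printf '%s\n' "$line" >> "$rcfile"
        echo "added to $rcfile"
    fi
}

# Example call, left commented out so that nothing is modified by accident:
# add_mpich_path "$HOME/.tcshrc"
```

Because the line is appended rather than written with ">", an existing .tcshrc keeps its other settings, and running the helper twice does not duplicate the entry.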

If, after following these instructions, you cannot get the proper returned results and you can find no mistakes in your procedure, please send a help request email to turing-help@cse.uiuc.edu.


Compiling MPICH programs

You should now be ready to compile an MPI program. The MPICH implementation provides wrapper compilers for compiling and linking C, C++, Fortran 77, and Fortran 90 programs. The commands are mpicc, mpicxx, mpif77, and mpif90, respectively. For a description of the options available with these commands, refer to the appropriate man pages, or use the "-help" command line option.
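
As a quick reference, the language-to-wrapper mapping can be written as a lookup on the source file extension. This is a Bourne-shell sketch (run with sh); the function name wrapper_for and the choice of extensions for each language are assumptions made for illustration, not something defined by MPICH.

```shell
#!/bin/sh
# wrapper_for (hypothetical helper): print the MPICH wrapper compiler for a
# source file, based on its extension. The extension mapping is an assumption.
wrapper_for() {
    case "$1" in
        *.c)              echo mpicc  ;;
        *.cc|*.cpp|*.cxx) echo mpicxx ;;
        *.f|*.f77)        echo mpif77 ;;
        *.f90)            echo mpif90 ;;
        *)                echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

wrapper_for hello++.cc    # prints: mpicxx
wrapper_for pi3f90.f90    # prints: mpif90
```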

We will now compile a sample program. First, make a directory in which to work and cd into it by issuing the following command:

mkdir mpich_tutorial;cd mpich_tutorial
You will find quite a few examples of MPI codes in the /turing/software/mpich-gm/examples directory. In this tutorial, we will compile two of these examples: an MPI version of "Hello World!" in C++, and a parallel computation of PI in Fortran 90.

Copy the example programs to your location by issuing the following command:

cp /turing/software/mpich-gm/examples/{hello++.cc,pi3f90.f90} .
The MPICH wrapper compilers are just scripts that invoke the system's compilers with the appropriate options to build MPI programs. On Turing, the C and C++ compilers are GNU GCC 3.3, and the Fortran compiler is IBM XLF. Using the MPICH wrappers frees the user from needing to construct a complex compile command to build MPI programs.

To compile hello++.cc, issue the following command:

mpicxx -o hello++ hello++.cc
The mpicxx command is the MPICH wrapper for the GNU g++ compiler. The -o option specifies the name of the executable output file, which should now be present in your current directory. Check that this is so by listing the contents of the directory:
ls
If you do not have a file named hello++ in your current directory, then something has gone wrong, and you may need to retrace your steps to find the problem. If you can't determine the problem, send a help request to turing-help@cse.uiuc.edu.

We will run the "Hello World!" program in the next section, but for now, let's compile the Fortran 90 example. Compile it with the following command:

mpif90 -o pi -qsuffix=f=f90 pi3f90.f90
Again, check your current directory for the pi file. If it is not there and you cannot determine why, please ask for help before continuing.

That just about covers compiling MPI programs. There really is not a whole lot to it.


Running Jobs

Before running batch jobs on Turing, it is important to understand a couple of points about the cluster itself. First, Turing is an experimental platform. On rare occasions, your jobs may be interrupted, killed, or may unexpectedly die. We, like you, are learning by doing. As we discover the many nuances of running large scientific computing clusters, policies and job submission mechanisms may change. We do our best to keep the system up to date and its users informed of current and upcoming changes.

Second, Turing is a shared resource. You must respect other users as well as try to be patient with those who unwittingly or accidentally do something to impose upon your activities on the cluster. Please help us help you by sharing your knowledge, resources, and tools for using the cluster with other less knowledgeable users as well as with us. Never run intensive compute jobs on the Turing frontend array. Short, non-intensive interactive jobs are fine - but use discretion so that you do not interfere with the work of other users.

There are two ways to run on Turing. For most normal computing jobs, you will want to submit a batch job which will find open processors for a requested amount of time and run your job for you automatically when processors become available. For development work and debugging, users may also run short, interactive jobs on a limited number of nodes. Both procedures are explained below.

Submitting Batch Jobs

Turing uses the Torque resource manager, which is itself a variant of OpenPBS. Torque uses the qsub mechanism for placing jobs into the batch queue. Typically, one will use a batch script to submit jobs. The batch script is just a shell script that includes directives that Torque uses to run your job. Turing now also uses the Moab Cluster Management Suite.

Copy the example batch script to your current working directory:

    cp /turing/software/mpich-gm/examples/test.qsub .
Examine the script (less test.qsub) to become familiar with which #PBS directives are required and what each of them does. The qsub manual is a good place to start for understanding this script and learning to write your own.


#
# Sample Batch Queue Script
#
# Option: -q, which batch queue to run in
#PBS -q batch
#
# Option: -l nodes, how many nodes to run on, note
# that the ":ppn=2" directive is important. It
# must always be used.
#PBS -l nodes=8:ppn=2
#
# Option: -l ncpus, how many processors to run on
# is not supported on Turing, request nodes only
##PBS -l ncpus=16
#
#
# Option: -l walltime, request this amount of wall
# clock time for your job. Format is hh:mm:ss
# The default is 1 hour, the limit is 3 days.
# The following requests 5 hours:
#PBS -l walltime=5:00:00
#
#
# Option: -j oe, combines stdout and stderr into a
# single file.
#PBS -j oe
#
# Option: -o, specifies the name of the output file
# which will be found in ${PBS_O_WORKDIR} after your
# job completes.
#PBS -o test.out
#
# Option: -N, names the job
#PBS -N test_hello
#
# Change directories to the directory where the qsub
# command was issued.
cd ${PBS_O_WORKDIR}
#
# Copy in a fresh version of the hello++.
cp /turing/software/mpich-gm/examples/hello++ .
#
# We now insist users use the rjq command for running
# all MPI or Charm jobs whether in batch or interactive.
rjq ./hello++

Please note the usage of rjq:

    rjq <your_app> <app_arguments>
The rjq utility will automatically use the correct communication libraries (MPI or Charm++), the correct number of processors (always 2xNumber_of_Nodes), and the right network interface, to launch your job.
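
To make the processor count concrete: with ":ppn=2" on every request, the number of processes launched is the node count times two. The sketch below works this out from a request string of the form used with "-l nodes". It is a Bourne-shell illustration only (run with sh); the function name total_procs is hypothetical and is not part of rjq or Torque.

```shell
#!/bin/sh
# total_procs (hypothetical helper): given a resource request of the form
# "nodes=N:ppn=P", print N * P, the number of processes that will be started.
total_procs() {
    spec=$1
    nodes=${spec#nodes=}     # drop the leading "nodes="
    nodes=${nodes%%:*}       # keep everything before the first ":"
    ppn=${spec##*ppn=}       # keep everything after "ppn="
    echo $((nodes * ppn))
}

total_procs nodes=8:ppn=2    # prints: 16
total_procs nodes=2:ppn=2    # prints: 4
```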

Now you are ready to submit your job to the batch queue.

    qsub test.qsub
The output of this command should look something like:
    39.turing-2.turing.uiuc.edu
Here, 39 is the job id and the rest is the submission host. Running qstat should now show you the status of your job:

    [turing-2:~] mtcampbe% qstat
    Job id         Name           User        Time Use  S Queue
    -------------- -------------- ----------- --------- - -----
    39.turing-2    test.qsub      mtcampbe    0         R batch
The status column (the S column) will tell you whether your job is queued, running, or exiting with a Q, R, or E, respectively.  Once the job is finished running, your output should be in the file test.out.  Examine this file now:

    less test.out
The contents of this file (i.e. the output of your job) should look something like:

Hello World! I am 2 of 4
Hello World! I am 3 of 4
Hello World! I am 1 of 4
Hello World! I am 0 of 4

Your output may differ from this - or may even be mangled. This is because the output is not synchronized. As long as there are no errors reported, and your output vaguely resembles the above, then it has worked.
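
One rough way to check unsynchronized output is to count the greetings rather than compare lines, since the count should equal the number of processes even when lines run together. The following Bourne-shell sketch does this on a small made-up sample; the function name count_greetings is hypothetical, and badly interleaved output could still defeat the count.

```shell
#!/bin/sh
# count_greetings (hypothetical helper): count occurrences of "Hello World!"
# in a job output file, even when several greetings share one line.
# Note: "grep -o" is a common extension (GNU and BSD grep), not strict POSIX.
count_greetings() {
    grep -o 'Hello World!' "$1" | wc -l | tr -d ' '
}

# Demonstration on a sample in which two greetings ran together on one line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Hello World! I am 2 of 4
Hello World! I am 3 of 4Hello World! I am 1 of 4
Hello World! I am 0 of 4
EOF
count_greetings "$sample"    # prints: 4
rm -f "$sample"
```

On the cluster, the same function would be applied to test.out after the job completes.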

To remove one of your jobs from the queue, use the qdel command:

      qdel <job id>
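
Since qsub prints the full identifier (job id plus submission host), it can be convenient to extract the numeric part for later use. The sketch below shows one way to do that in Bourne shell (run with sh). The function name job_number is hypothetical, and the commented qsub and qdel lines assume the commands behave as described in this tutorial.

```shell
#!/bin/sh
# job_number (hypothetical helper): strip the host part from a full job
# identifier such as "39.turing-2.turing.uiuc.edu", leaving the numeric id.
job_number() {
    echo "${1%%.*}"
}

job_number 39.turing-2.turing.uiuc.edu    # prints: 39

# On the cluster, the identifier printed by qsub could be captured and reused:
#   full_id=$(qsub test.qsub)
#   qdel "$(job_number "$full_id")"
```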

Please keep a loose watch on your jobs. If it appears that your job has terminated abnormally, or refuses to exit from the queue, please let us know by sending an email to turing-help@cse.uiuc.edu. If you have such problems, please test your application interactively and make sure it functions properly outside of the batch system before attempting to submit another similar job to the queue.

PBS, Torque, and Moab Resources

You can obtain manual pages on most PBS commands on Turing by using the standard UNIX man command:
    man qsub
    man pbs
Here are some useful links for learning about basic and advanced features of vanilla PBS, Torque, and Moab:

PBS
OpenPBS Public Home
Torque
Torque job submission guide
Moab Documentation
Manual page for qsub

Running Interactive Jobs

Turing users are currently allowed to run interactive jobs on any set of nodes in the cluster. Interactive jobs are very useful for debugging and developing your application, particularly when you need to make many test runs corresponding to source code changes. Currently, the same limits that apply to batch jobs also apply to interactive jobs. Clearly, this can be abused. We ask only that if you are running an interactive job, it be truly interactive: you should be actively watching the job at all times. Be as swift and light as possible when running interactive jobs. Do not request large sets of nodes for long periods of time and let them idle. Use common sense. Be nice. One or two incidents of abuse of this feature will result in its removal.
 
To run interactive jobs on Turing, users must request nodes and time just as for any batch job.

We will run the F90 pi example that we compiled in the previous section as an example of how to run interactive jobs. If you didn't follow the compiling tutorial, don't worry, this example will still work for you.

Begin by requesting a set of nodes and getting an interactive shell:
   qsub -I -l nodes=2:ppn=2

This command will block (not return to a prompt) until your nodes have been allocated. Once it does return, you will be presented with an interactive shell prompt on the root node of your allocation.

Note that, by default, you will begin your interactive session in your home directory. Change back to your working directory with:
   cd ${PBS_O_WORKDIR}
Now we can proceed to running the interactive program:
   rjq ./pi
You should see output that resembles the following:
   Process 1 of 4 is alive
   Process 2 of 4 is alive
   Process 3 of 4 is alive
   Process 0 of 4 is alive
   Enter the number of intervals: (0 quits)
Notice that the processes are numbered from 0 to 3 for a total of 4. Now you can enter a number and let the MPI parallel application calculate PI. When you are done toying with it, enter 0 to let the program terminate normally.

The rjq utility will now automatically clean the nodes of your processes.

You will end your interactive session when you exit from the interactive shell:
  exit

Getting Help and Support on Turing

If you have trouble with this tutorial, setting up your environment, building your application, or job submission, send a help request to turing-help@cse.uiuc.edu.

If you have trouble building a parallel application, please pass the -v option to the compilers and include the entire compiling and linking output in your help request email.

If you have trouble running a parallel application, please pass the -v flag to rjq and include the output of your job in your help request email.

It is almost always helpful to also see the output of the env command for help requests regarding building or executing applications.
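
The environment details can be gathered into a single file before writing a help request. The following Bourne-shell sketch (run with sh) is one possible way to do so; the function name collect_report and the report file name are made up for illustration. Review the file before sending it, since env output can include values you may not want to share.

```shell
#!/bin/sh
# collect_report (hypothetical helper): gather basic environment details into
# FILE so they can be attached to a help request.
collect_report() {
    out=$1
    {
        echo "== login shell =="
        echo "${SHELL:-unknown}"
        echo "== mpicc location =="
        command -v mpicc || echo "mpicc not found on PATH"
        echo "== environment =="
        env
    } > "$out"
    echo "wrote $out"
}

collect_report "${TMPDIR:-/tmp}/turing-report.txt"
```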

There is now a Turing Users Discussion/Help list for the Turing user community. To subscribe, see CSE-TURING-L.