FAQ for the Altix and ASC Shared clusters

Before Starting

About this FAQ page

If you have loaded this page using the index page you will have frames with a table of contents on the left and the content on the right. The page will also work fine without frames but will not be as easy to navigate.

Getting Started

Where can I find information on known problems?

"Known Problems" are listed per system. Please refer to the Solar, Altix, Data Store, SamFS and Cluster user guides for known problems.

How do I use the iSupport system?

For supporting solar users, the HPCCC uses the iSupport incident management system. Here is some local documentation on using the iSupport system.

How do I use the RT system?

CSIRO ASC use the RT request tracking system to manage user problems. Here is some local documentation on using the RT system.

How do CSIRO users login from outside CSIRO?

The ASC/HPCCC systems aren't reachable directly from outside the CSIRO network. As a CSIRO employee you can use a VPN client to connect to the CSIRO network from a different network. If you haven't done so already, you need to fill in the Remote User declaration form, and then download and install the VPN client. There is more information at: More information ...

If you have problems using the VPN client, we can create ssh keys for you to connect to burnet via a gateway machine, but normally the VPN client gives much better performance.

How do I work with ssh?

In general, you need to use ssh to access ASC/HPCCC systems. For general information about ssh and how to use it, there is much information available on the internet and in books. We recommend our partner's NCI pages on ssh.

For working with ASC/HPCCC and partner systems we have some additional notes on high-performance ssh, passwordless ssh and setting up a .ssh/config file to simplify using ssh. The FAQ item on fast file transfer also relies on ssh for the underlying connection in most cases.

Why is my environment broken?

Most commonly because you have broken things in your login scripts. Try setting them aside and replacing them with new files from /etc/skel (note that these files start with '.', so you must use

cp /etc/skel/.* ~
to get them). Then reinstate only the customizations that you actually need, checking that each one works.
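
The reset described above can be sketched as a small shell function. The function name reset_dotfiles, the backup directory name and the argument defaults are illustrative assumptions; only /etc/skel comes from the text. Copying file by file avoids the '.' and '..' entries that the '.*' pattern also matches.

```shell
# Sketch: set aside existing login scripts and restore defaults from a
# skeleton directory. reset_dotfiles and the backup name are hypothetical.
reset_dotfiles() {
    skel=${1:-/etc/skel}
    home=${2:-$HOME}
    backup="$home/dotfiles.bak.$$"
    mkdir -p "$backup" || return 1
    for f in "$skel"/.[!.]*; do
        [ -f "$f" ] || continue          # skip directories and an unmatched pattern
        name=$(basename "$f")
        # Keep the old copy so customizations can be reinstated one at a time
        if [ -f "$home/$name" ]; then
            mv "$home/$name" "$backup/$name" || return 1
        fi
        cp "$f" "$home/$name" || return 1
    done
    echo "$backup"                       # report where the old files went
}
```

Running reset_dotfiles with no arguments acts on /etc/skel and your home directory. Log in again afterwards and copy settings back from the backup directory one at a time.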

How does shell startup work?

Here are some extended notes on shell startup.

What do I need in my .rhosts file?

Your .rhosts file lists the hosts (and userids) which can use rsh/rlogin to log in to the host where the .rhosts file resides (where rsh is allowed). It can be tricky getting the entries right.

Here is some more extended advice on setting up .rhosts.

Why are my files which I edited with MSwindows causing problems?

Unix, MSwindows and MacOS use different conventions to mark the end of lines in text files. This can cause numerous problems when working with files across systems.

Here are some more extended problem symptoms and solutions.
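
As a minimal sketch of one common case, the functions below detect and remove the carriage returns that MSwindows editors add before each newline. The names has_crlf and strip_cr are hypothetical. This does not handle old MacOS files, which use a carriage return alone as the line ending.

```shell
# Sketch: detect and strip MSwindows carriage returns (hypothetical helpers).
has_crlf() {
    # Succeeds if any line of file $1 contains a carriage return
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

strip_cr() {
    # Write a copy of $1 to $2 with every carriage return removed
    tr -d '\r' < "$1" > "$2"
}
```

For example, 'has_crlf job.q && strip_cr job.q job.unix.q' converts a job script only if it needs it (job.q is a placeholder name).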

How do I use modules to set up my environment?

The HPCCC uses the modules approach to managing the user environment for different software versions on all NEC Self and cross platform machines. The distinct advantage of the modules approach is that the user is no longer required to explicitly specify paths and environment variables for software version instances. With the modules approach, users simply “load” and “unload” modules to control their environment.

For more information see our Environment Modules page.

How do I change my login shell?

solar:

Log in to solar1.bom.gov.au and use the chsh command as usual. When prompted, enter your new login shell path in full. It will take up to 1 hour for the change to be propagated to other nodes.

For available shells execute the command 'chsh -l'

cherax:

cherax% chsh -s <myshell>

# For available shells execute the command 'chsh -l'

burnet users: Please contact ASC

Graphical User Interfaces

For linux applications that require a GUI you have two main choices, both of which involve running an X server.

  1. Run a VNC server on the cluster and connect to it with a VNC client such as RealVNC viewer.
  2. Run an X server (such as Xming) on your desktop machine and then use ssh/PuTTY to connect to remote systems with options for tunnelling X connections.

How do I choose between VNC and a local X display?

VNC is a remote access program that allows you to connect, view and interact with a graphical desktop session on a remote computer. VNC sets up a persistent 'desktop' X display session that you can reconnect to from any PC. You might choose to use VNC because:

A local X server uses the graphics capability of your own PC to display content from applications that use the X display system (run on remote systems). You might choose to use a local X server because:

How do I set up a VNC session?

Setup is a little complicated, so we provide a vnc helper script to simplify the process of setting up a VNC server on a cluster head node. Here are some notes on using the vnc script.

How do I set up an X display on my desktop PC?

There are two freely available X server software packages we recommend for Windows PCs: Cygwin/X, which relies on cygwin and is therefore more complicated to set up but is part of a more general and powerful environment, and Xming, which is simpler to install and set up.

Portability

Portability of programs and data from the SX to the Sun Constellation

Data representation and File structure

The data representation and file structure of the Sun Constellation are different from those on the SX-6. In particular, the Sun Constellation (and other Intel-based systems including PCs and the TX-7s) uses little-endian byte order, while the SX-6 uses big-endian byte order. Depending on the compile options used on each system, the sizes of data types may also differ.

The NEC SX-6 at BOM has a non-standard, site default option -ew set. This promotes all REAL and INTEGER data to 8-byte units. If you were not explicitly disabling this option, it will have had the following effects on your SX-6 models:

Both the Intel and Sun compilers support a variety of options for reading and writing non-native files - see the Sun and Intel compiler manuals for details. (Useful keywords are F_UFMTENDIAN and FORT_CONVERT (Intel) and -xfilebyteorder (Sun).)

If you were using F_EXPRCW or F_PARTRCW to read or write files on the NEC SX-6, the file structure will not be readable on the Sun Constellation and you will need to convert your files before the SX-6 is decommissioned.

NetCDF files are portable and do not need any conversion.

do_tx7 command

If your scripts on the NEC SX-6 used the do_tx7 command for transferring files to external systems, you can use the following command as a drop in replacement:

qrsh -q dm
For more detail see the File Transfer section in the solar userguide.

If you find any problems with replacing do_tx7 with qrsh -q dm or qrsh -q dmop, then please contact the HPCCC and provide the full details of your problem.

$KEEPDIR and $WORKDIR

$KEEPDIR is no longer supported as the /bm/keep filesystem is no longer used. The $HOME file system should be used for $KEEPDIR purposes.

$WORKDIR is no longer supported. Please use the $FLUSHDIR file system for working files.
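
For older scripts that still refer to the withdrawn variables, a transitional shim along the following lines may help until the scripts are edited. This is a sketch under the assumption that $FLUSHDIR is already set in your environment; the function name map_legacy_dirs is hypothetical and not an HPCCC-provided tool.

```shell
# Sketch: give legacy scripts working values for the withdrawn variables.
map_legacy_dirs() {
    # $HOME stands in for $KEEPDIR and $FLUSHDIR for $WORKDIR,
    # but only when the old variable is unset or empty
    KEEPDIR=${KEEPDIR:-$HOME}
    WORKDIR=${WORKDIR:-$FLUSHDIR}
    export KEEPDIR WORKDIR
}
```

Calling map_legacy_dirs near the top of a script leaves the rest of the script unchanged. Replacing the old names in the script itself is the cleaner long-term fix.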

Monitoring

How do I make top behave nicely on cherax?

You can get process information on cherax from 'top', and it is better with some customization.

top shows a list of running processes (see 'man top').

If you type 'h' you get a help page.

You probably want to type 't' to toggle the summary information (to off).

If you type 'f' you get to choose the fields that will be displayed. Y or ] will show which cpu a process is running on, so you can monitor whether it moves about, or which cpu it is bound to if you are using dplace.

Other useful settings are:

Batch Use

How do I manage my jobs as a group?

The qselect command allows you to list jobs matching set criteria. For instance if you had hundreds of jobs and wanted to cancel them all, you might run a command like:

qselect -u myuserid | xargs -n 1 qdel
or
for JOBID in $(qselect -u myuserid); do
   qdel $JOBID
done
You could add other criteria, for example to match only jobs with a given name. Another action you might like to take for a set of queued jobs is to run qalter to change the amount of resources you requested. See the qselect man page for more information.
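
As a sketch of those two ideas, the helper below applies one command to every job of a user that has a given job name. The name for_named_jobs is hypothetical; it relies only on the -u and -N options of qselect.

```shell
# Sketch: run a command once per matching job id (hypothetical helper).
for_named_jobs() {
    user=$1; name=$2; shift 2
    # qselect prints one job id per line for jobs owned by $user and named $name
    for jobid in $(qselect -u "$user" -N "$name"); do
        "$@" "$jobid"
    done
}
```

For example, 'for_named_jobs myuserid run5 qdel' cancels only the jobs named run5, and 'for_named_jobs myuserid run5 qalter -l walltime=2:00:00' changes their requested walltime (run5 and the walltime value are placeholders).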

I have a chain of jobs - How do I create job dependencies in Torque?

There is a utility on cherax and burnet that will queue jobs with dependencies (for torque). To use it for a simple sequence of jobs (job1.q, job2.q and job3.q) do:

  qdep.pl job1.q job2.q job3.q

It does a few other neat things as well.

wil240@cherax:~> qdep.pl -h
usage: qdep.pl [options] [script] [script ...]
     submits jobscripts from commandline or stdin (one filename
     per line)
     Sets up dependencies between jobs to run in sequence
     unless the -f and/or -l options are specified.
     Dependency syntax is compatible with torque qsub.
     All other qsub options must be in directives in the scripts.
     options: -f first_script (run "first_script" first, then the rest
              of the scripts run in parallel on successful completion)
                    -l last_script (run "last_script" after completion
                       of all the other scripts - but see option -t)
                    -t only run last_script if all the others succeed
                       or each script in sequence if the preceding
                       one succeeds
                    -q (be quiet)
                    -h (this message)
     examples: qdep.pl 1.q 2.q 3.q
                      qdep.pl -t 1.q 2.q 3.q
                      qdep.pl -f prep.q p1.q p2.q p3.q
                      qdep.pl p1.q p2.q p3.q -l cleanup.q
                      qdep.pl p1.q p2.q p3.q -t -l sum.q
     version: 1.0
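
If you want to set up a simple chain by hand, the following sketch shows the underlying torque mechanism. It is not the qdep.pl source; the function name submit_chain is hypothetical. It uses the qsub option -W depend=afterok:jobid, which holds a job until the named job finishes successfully (similar in spirit to the -t sequence case above). Using afterany instead of afterok would start each job regardless of how the previous one ended.

```shell
# Sketch: submit scripts so each starts only if the previous one succeeded.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job has no dependency; qsub prints its job id
            prev=$(qsub "$script") || return 1
        else
            prev=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$script -> $prev"
    done
}
```

Usage: submit_chain job1.q job2.q job3.q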

How do I run a set of jobs in parallel?

Running independent jobs in parallel is what a batch system does! You can prepare a small number of scripts by hand (one for each job) and submit them one-by-one to the batch system.

However, if there are lots of scripts/jobs with a pattern, it will save a lot of effort and be less error-prone if you automate creating the files and submitting the jobs. This is particularly suited to 'ensemble' jobs or 'parameter sweeps' where many very similar jobs are run to generate results over a defined parameter space.

Managing a lot of scripts in separate files can be awkward and error-prone (though it has some benefits), so it is good to have a single script that you can submit many times with different parameters to divide up the work.

There is more information and extended examples including information on array jobs in our article on running a set of jobs in parallel.
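
A minimal sketch of the single-script approach follows. The function name submit_sweep and the variable name PARAM are hypothetical. It uses the qsub options -v (pass an environment variable to the job) and -N (set the job name).

```shell
# Sketch: submit one job script once per parameter value.
submit_sweep() {
    script=$1; shift
    for value in "$@"; do
        # Inside the job script the value is available as $PARAM
        qsub -N "sweep_$value" -v PARAM="$value" "$script" || return 1
    done
}
```

For example, 'submit_sweep run.q 0.1 0.2 0.5' queues three jobs from the same run.q, each seeing a different $PARAM. Keep the values short and free of spaces, since they are also used in the job name.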

I have a large set of jobs - How can I make only a limited number run at once (in Torque)?

Well, you could use job dependencies (see above) or...

There is a utility on cherax and burnet that will periodically check for jobs in a directory and 'top-up' the queue if there are too few jobs running or queued. To use it to run up to 10 jobs matching *.q in directory $HOME/qspool for up to 24 hr, run:

  qpool

Submitting lots of jobs is usually fine, but if you want to add extra jobs (say for development tasks) they will have to wait for all of your own jobs already queued to run (unless you take steps to hold them). This could be a good reason for you to use qpool.

It has quite a few options to customize its behaviour, including a 'match' option to restrict which jobs are counted when querying how many jobs are running with qstat.

wil240@cherax:~> qpool -h
qpool -y | -k | [options] [directory]
  A script to periodically check for batch scripts in "directory"
  and submit them using qsub if there are less than a maximum 
  number of queued jobs (from the user).
  The directory ~/qspool will be used if a directory is not specified.
  Job scripts are moved to submitted/ in "directory" and the jobs are 
  submitted from the directory where qpool was run.
  While qpool is running, be careful only to ever have valid jobscripts
  in "directory" (eg. edit the files elsewhere).
  Jobs are only counted if qstat output matches the -m option (eg. job name)
options:
  --verbose|v: be verbose (will be to log unless in forground), also -vv
  --pattern|p=pattern: specify pattern (glob) for scripts ('*.q')
  --qsub|q=qsub_command: specify name of qsub command (qsub)
  --options|o=qsub_options: specify qsub options ("")
  --qstat|a=qstat_command: specify name of qstat command (qstat)
  --number|n=N: maximum number of jobs to queue at once (10)
  --match|m=match: regular expression for matching qstat output to count jobs
  --separation|s=N: number of seconds to wait (600)
  --timeout|t=N: maximum number of seconds to run (1 day)
  --restart|r: whether to restart at timeout
  --foreground|f: whether to stay attached to the terminal
  --query|y: simple check for existing qpool instances
  --kill|k: kill off existing qpool instances
  --help|h|?: this message
eg. qpool -p 'proj1.*.job' -n 5 -t 3600 ./my_jobs/
  every ten minutes for one hour queue up to 5 jobs from my_jobs/proj1*.job
    qpool -n 20 -r
  indefinitely top up number of queued jobs to 20 from ~/qspool/*.q

Why do I get: 'Warning: no access to tty'?

Users whose login shell is tcsh should ignore the messages:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
This is a well known 'feature' of tcsh which occurs when a login tcsh is started in batch. The warning would be relatively easy to suppress (by modifying tcsh and recompiling) but the additional system maintenance is not considered worthwhile.

Why does my batch job fail - files not found?

Batch jobs start running in your home directory, not the directory from which you submit the job.

You need to 'cd' to the directory where you want to run early in the script. eg.:

cd $HOME/project1/run5
The name of the directory from which you submit a job is available to the job in the variable PBS_O_WORKDIR so you might want to:
cd $PBS_O_WORKDIR
or if you also want to run the script interactively (nb. sh syntax):
cd ${PBS_O_WORKDIR:-.}
In this case, if PBS_O_WORKDIR is undefined the command becomes 'cd .'
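
A slightly more defensive variant is sketched below: it stops the job if the directory cannot be entered, rather than carrying on in the wrong place. The function name goto_submit_dir is hypothetical.

```shell
# Sketch: change to the submission directory, or stay put when run interactively.
goto_submit_dir() {
    cd "${PBS_O_WORKDIR:-.}" || {
        echo "cannot cd to ${PBS_O_WORKDIR:-.}" >&2
        return 1
    }
}

# Near the top of the job script (sh syntax):
goto_submit_dir || exit 1
```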

Where is my batch job output?

The batch job output is usually spooled to a local system directory during the run and at the end of the job, copied back to the place the job was submitted.

There are a number of things that can go wrong with returning the output. Torque/pbs uses rcp/scp/cp. There is some extended advice on submitting jobs from cluster compute nodes (focusing on the torque batch system).

What are the resource options?

For torque, see "man pbs_resources".

Resources have differing importance on different machine architectures.

For a distributed cluster, nodes and walltime determine how long you have exclusive access to nodes.

For a shared-memory machine, cputime and memory determine which jobs can run concurrently.

What are the queue limits?

Run the command "qstat -Qf" to inspect queue limits.

How do I handle commands failing (in batch scripts)?

You can test for the success of commands and take action if a command fails. Note that some commands are more likely to fail than others (eg. commands that access resources which may be unavailable).

Here are some notes on handling errors in (bourne) shell scripts.
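
As a minimal sketch of the idea (not taken from the linked notes), the two helpers below stop a script when an essential command fails and retry a command that may fail transiently. The names die and retry are hypothetical.

```shell
# Sketch: basic error handling helpers for bourne shell batch scripts.
die() {
    # Print a message to standard error and stop the script
    echo "ERROR: $*" >&2
    exit 1
}

retry() {
    # retry N DELAY command [args...]: run command up to N times,
    # waiting DELAY seconds between attempts
    attempts=$1; delay=$2; shift 2
    i=1
    while ! "$@"; do
        if [ "$i" -ge "$attempts" ]; then
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, 'retry 3 60 cp input.dat "$FLUSHDIR"/ || die "could not stage input.dat"' tries the copy three times, a minute apart, before giving up (input.dat is a placeholder).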

How do I determine the number of cpus I have allocated in a batch job?

For cherax users: A script called 'getnumcpus.pl' is available that will print to standard output the number of cpus available to your batch job (or indeed interactive shell). This is useful where you need to specify the number of cpus in multiple places within your script, eg. for both OpenMP and MPI use.
Here are some notes on Using getnumcpus.pl
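
A sketch of one way to use it: query the count once and reuse it. The function name set_thread_count and the NCPUS_CMD override (useful for trying the script away from cherax) are assumptions; the only behaviour relied on is that getnumcpus.pl prints the cpu count to standard output.

```shell
# Sketch: set OMP_NUM_THREADS from the allocated cpu count.
set_thread_count() {
    # NCPUS_CMD is a hypothetical override; by default getnumcpus.pl is run
    ncpus=$(${NCPUS_CMD:-getnumcpus.pl}) || return 1
    OMP_NUM_THREADS=$ncpus
    export OMP_NUM_THREADS
}
```

In a job script, 'set_thread_count || exit 1' can be followed by your OpenMP program, and the same $ncpus value can be passed to an MPI launcher.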

Why isn't my job starting?

Scheduling is an incredibly complicated process and a lot is going on behind the scenes that users don't see. The scheduler takes a number of important factors into consideration when deciding when and where to run a job on a cluster, including: reservations on the cluster, resources, and job priority.

Reservations are when a portion of nodes are set aside for a particular purpose. Certain users or jobs might not be able to access these nodes within that time period. A reservation might be in place for system maintenance, to accommodate a large long running job, or to only allow jobs of less than 2 hours to run on particular nodes (for example).

Resources are requested by a job and include things like the number of cores, wall time, and virtual memory required to run a job. There might not be enough resources for a particular job to start straight away, or for it to run on particular nodes.

Jobs are also started based on a priority, which factors in a job's time in the queue, the number of jobs queued by a user, and the parallelism or overall utilisation achieved on the cluster.

Common reasons for jobs not starting are:

The following commands can help diagnose why your job might not be starting:

Please check the man pages of these commands for more information.

For more information on scheduling, please click here.

Performance Tuning

I have a code that can use OpenMP or MPI. Which should I use?

If you intend to run within a machine with shared memory (including within a multicpu cluster node), either OpenMP or MPI should be OK in principle. Here are a few things to consider in making your choice between OpenMP and MPI.

How should I choose between using OpenMP and MPI to parallelize code?

In short, moving to MPI is much more difficult but the potential pay-off is much greater. Hardware availability is an over-riding factor as the model for parallelism is closely related to the underlying hardware. Otherwise the human factors are probably the most significant and you should consider the cost of development, debugging and associated support. Here are some more notes on making a choice between OpenMP and MPI.

File I/O and transfer

What is the best place to store files?

For CSIRO users, your home directory on cherax is the best place to keep important, large data files. See the user guide for the data store for more about this.

Programs running on cherax can of course access this data store directly. Programs running on burnet can also access the data store via the NFS-mounted $STOREDIR but should use rcp or scp to move data to and from the data store - see the burnet user guide.

What is the best way to transfer files?

In brief, use GridFTP (with sshftp or gsiftp), scp/sftp or rsync. With sshftp, scp/sftp and rsync, ssh is used to make the connection and globus-url-copy, scp/sftp or rsync manages the transfer of files over the connection (for GridFTP, the data is transferred on additional connections). The 'standard' openssh implementation does not have very good buffering and this limits the rate of transfer that can be achieved (1MB/s). To achieve 20-30 MB/s or more, you can use GridFTP or the High Performance ssh (hpn-ssh) patched version of openssh. You may need to set up passwordless ssh, or may benefit from a customized .ssh/config, to simplify commands for sshftp/scp/sftp/rsync.

With ssh, the choice of cipher for encryption used on the data transfer (-c option) can also influence the transfer rate. hpn-ssh supports null encryption on the data-stream - a feature usually disabled with openssh. See the Pittsburgh Supercomputing Center High Performance ssh FAQ.

You are likely to get the best raw transfer speed from GridFTP, but depending on the size and number of files, and the ease of use and availability of the software, you may be better off using scp/sftp or rsync.

If you are transferring many small files ("small" == less than about 10 MB) you will get the best performance by using rsync (with hpn-ssh), or with GridFTP with pipelining and/or concurrency options (globus-url-copy -pp -cc 2).

If you are transferring a few large files (100's of MB or more) you will generally get the best performance from GridFTP.

There are examples of using scp in each of the userguides for ASC systems.

Also, GridFTP supports parallel data streams for file transfers, as does iRODS (and the ARCS data fabric) for the icommands - command line clients.

Finally, here are some (old) notes with more detail on rcp, scp, ftp, jumbo frames and problems related to transferring large files.

What if the files are on my windows desktop PC?

We recommend the winscp client which uses scp (or sftp) to transfer files.

Some may prefer filezilla which is a graphical sftp and ftp client which can be run on mac, linux or windows. You may need to use port 22 and check 'Bypass Proxy' tickbox when having connection issues.

How can I synchronize a directory structure?

We recommend the use of rsync for transferring files between systems - it has many advantages, including not transferring files to the destination if they are already there, and many other capabilities.

globus-url-copy with the -sync option may be a useful (though not so full featured) alternative.
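
A minimal rsync sketch follows. The wrapper name sync_tree is hypothetical; the rsync options used are -a (preserve permissions and times, recurse), -v (list files) and -z (compress in transit).

```shell
# Sketch: mirror the contents of one directory tree into another location.
sync_tree() {
    src=${1%/}; dest=$2; shift 2
    # The trailing slash on the source copies its contents rather than the
    # directory itself; any extra arguments are passed to rsync as options
    rsync -avz "$@" "$src/" "$dest"
}
```

For example, 'sync_tree ~/project1 cherax:project1 --dry-run' lists what would be transferred without copying anything. Drop --dry-run to do the transfer, and add --delete only if files removed locally should also be removed at the destination.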

What is the best way to improve I/O performance in my program?

  1. Don't do it! That is, minimise the amount of I/O you do.
  2. Don't do it! (OK... well as little as necessary).
  3. If you must do I/O then do it in large chunks - ie. write out a whole array at once rather than iterating over the elements (especially of multiple arrays).
  4. Use binary rather than formatted I/O.
  5. Use memory rather than disk where appropriate/possible.
  6. Use fast disk.
  7. Buffering can have a profound influence on I/O performance. Fortran mostly hides the buffering from you (usually good) but also usually allows environment variables to tune the buffering.

Tuning buffering on cherax.

The SGI Propack for Altix systems includes a facility called FFIO. This provides a mechanism for improving I/O performance without the need to change source code or rebuild your application. The FFIO feature can be enabled and tuned for specific files by setting environment variables.

FFIO provides additional I/O buffer cache that bypasses Linux kernel I/O cache, giving more control over I/O characteristics and the potential to significantly improve performance.

Recently (Jan 2011), an application that reads and processes a 13 GByte netCDF file was accelerated by 25% by using FFIO. Additionally the system time (spent in Linux kernel) for this application was reduced by 50%, freeing up kernel resources for other applications.

For more details about FFIO see Chapter 7 of SGI's Linux Application Tuning Guide (CSIRO ASC internal site).

solar

Idle login sessions

Any interactive sessions on login nodes that have been idle for 3 hours or more are terminated. This ensures that login nodes do not become overloaded with too many sessions.

Note: Exemptions can be made for those currently using software that cannot be run in any other way than interactively on the login nodes for long periods.

linuxgpu (CSIRO's GPU cluster)

Resource Requests and Scheduling

When submitting a job to the cluster with 'qsub' it is necessary to include a resource request.

This can be done on the command line with a lower case L argument
>qsub -l resourcelist

or as part of the job script with
#PBS -l resourcelist

The resourcelist is formed with comma separated entries for example:
>qsub -l vmem=4GB,walltime=1:00:00

Requesting CPU cores

There are two ways to request CPU cores; procs=P and nodes=N:ppn=M

To request a single core anywhere in the cluster you could use
>qsub -l procs=1

this would be identical to
>qsub -l nodes=1:ppn=1

If you request multiple cores using procs=P the scheduler may be able to start your job sooner than if you were using the other syntax. However there is no guarantee that cores requested with procs=P will be allocated on the same node. For MPI jobs procs=P is usually fine but for jobs requiring shared memory (such as OpenMP) you will need to use the nodes=N:ppn=M format. In fact some MPI codes will actually run faster when all processes are within a single node making it a good choice there too.

When in doubt use
>qsub -l nodes=N:ppn=M

Requesting GPUs in addition to CPU cores

Requesting an entire compute node will give you exclusive access to both GPUs on that node so no additional syntax is required
>qsub -l nodes=N:ppn=8

If you don't need a whole node, a single GPU can be reserved with the addition of gres=gpu to your request
>qsub -l nodes=N:ppn=1,gres=gpu

The gres=gpu feature is applied for each CPU core requested, so it can be used with at most ppn=2 (as there are only two GPUs per compute node).
>qsub -l nodes=N:ppn=2,gres=gpu

You can also request multiple GPUs per CPU core
>qsub -l nodes=N:ppn=1,gres=gpu:2

Requesting gres=gpu:3, or gres=gpu:2 with ppn > 1, will fail as it would require more than 2 GPUs per compute node.

It is not possible using the currently available syntax to request other combinations such as 2 CPU cores and 1 GPU.

In such cases where you need more CPU cores than GPUs, simply request the whole node.

Finally, as two jobs in a single node can be allocated cores on a single CPU, it is possible to get bandwidth contention to the GPUs. Poorly written codes can even use the wrong GPU, so if this is a concern use any of the options above that request both GPUs or the whole node.
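
The rules above can be summarised in a small helper that builds the resource string for a given number of nodes, cores per node and GPUs per node. This is a sketch only: the function name gpu_request is hypothetical, and the figure of 8 cores per node is inferred from the whole-node request shown earlier (ppn=8).

```shell
# Sketch: build a -l resource string following the GPU request rules above.
gpu_request() {
    nodes=$1; cores=$2; gpus=$3
    if [ "$gpus" -gt 2 ] || [ "$cores" -gt 8 ]; then
        echo "at most 8 cores and 2 GPUs per node" >&2
        return 1
    fi
    case "$cores:$gpus" in
        1:1|2:2) echo "nodes=$nodes:ppn=$cores,gres=gpu" ;;    # one GPU per core
        1:2)     echo "nodes=$nodes:ppn=1,gres=gpu:2" ;;       # two GPUs, one core
        *:0)     echo "nodes=$nodes:ppn=$cores" ;;             # no GPU needed
        *)       echo "nodes=$nodes:ppn=8" ;;                  # otherwise take whole nodes
    esac
}
```

For example, 'qsub -l "$(gpu_request 1 1 2),walltime=1:00:00" job.q' requests one core and both GPUs on a single node (job.q and the walltime are placeholders).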

Can I use OpenGL on the Linux GPU cluster?

The short answer is that OpenGL is not officially supported on Tesla GPUs.

This means any Cuda SDK demos using OpenGL will probably fail.

There are workarounds where you instruct OpenGL to render to an off-screen buffer but Nvidia does not provide technical support for it. Instead they recommend buying Quadro hardware for situations where you need to do visualisation. At some point visualisation nodes (with Quadros) could be added to the cluster for this purpose if there is a demand for it.

The longer answer is that the Cuda SDK demos are taking advantage of the Cuda-OpenGL interoperability libraries by keeping the data that is generated by Cuda in the GPU's device memory and then rendering it directly using OpenGL. Because the Tesla does not really support OpenGL it is necessary to copy that data from Tesla device memory to the device memory of an OpenGL-capable GPU and render it from there. The SDK demos use the interop libraries so they do not have to expose that sort of complexity (and because it's faster). In a real application you might a) render to an off-screen buffer, b) use a Quadro, or c) simply write out the data to disk then copy and render it on a local workstation after the job is complete.

However, it is still possible to do local rendering for OpenGL applications through an X session. Basically, when you are NOT using a VNC server, all the OpenGL commands are being redirected to the X-server's GPU (as well as copies of all the vertex buffers, etc required to generate the image). This can be very network intensive depending on the nature of the visualisation. So it may be possible to use the OpenGL-capable GPU on the cluster head node to render graphics through X after generating data with a Tesla on one of the compute nodes. The process would involve running a vncserver on the head node, then starting an X-forwarding login (ssh -X or qsub -X) to a compute node, and would still not work with the Cuda-OpenGL Interop libraries.

Can I use OpenGL on the Windows GPU cluster?

As of mid-2009, Tesla GPUs are not detected as Display Adapters under Windows so you cannot use them for DirectX rendering (OpenGL under Linux is possible but not officially supported). This means you also need a regular display adapter in the Windows machine and it must be an Nvidia card because Windows does not support multiple display device drivers (ie. more than one vendor at a time).

Additionally, Remote Desktop replaces the normal video driver with a special driver designed to make capturing screen updates more efficient. This has the effect of making CUDA stop working when Remote Desktop is active as it relies on the NVIDIA driver. The solution is to use VNC, which does not replace the video driver.