Before Starting
About this Guide
This document contains a list of known problems and a change log which can be checked for a summary of recent updates.
If you have loaded this guide using the index page you will have frames with a table of contents on the left and the userguide on the right. The guide(s) will also work fine without frames but will not be as easy to navigate.
The guides are intended to be introductory in nature though they will provide references to core documentation and detailed site specific information and FAQs.
More formal vendor supplied documentation is referenced here.
Quick Start
Registering to use the Altix and ASC Shared clusters
All users need to be registered to use the ASC shared systems. To use systems residing in ASC Docklands (and ASC software), accounts can be requested by completing the online application form. Only CSIRO users will be registered on the CSIRO systems by default. CAWCR staff from the BoM will be registered on CSIRO systems if they need and request access. To access other CSIRO ASC shared systems (those not at Docklands, such as the GPU cluster), CSIRO staff can ask for access at hpchelp@csiro.au.
The front-end hostnames of the computational hosts are:
CSIRO or BoM collaborators will need to get a CSIRO ident as a collaborator via their research contacts, who would then request access.
The BoM Sun Constellation system solar is described separately.
Connecting to the Altix and ASC Shared clusters
In general you must use secure shell (ssh) to connect to the Altix and ASC Shared clusters. You may also need VPN software to get into the csiro.au or bom.gov.au network first if you need to connect from outside. See our ssh FAQ item for more detail. An ssh connection can be used in conjunction with X server software for applications that require a graphical interface, or in many cases VNC can be used with an X server running on a cluster head node. VNC can provide a persistent desktop-like session which you can reconnect to after disconnecting and often performs better than other options as the X server is close to the application.
All of the ASC systems use the CSIRO NEXUS (staff) identifiers as usernames, and NEXUS authentication. Your password will be the same as on standard CSIRO systems using the NEXUS user identifier. If you change your NEXUS password on a Windows system, the change will propagate to the ASC systems, with some time delay. Please don't try changing your password on the ASC systems.
For CSIRO users the system uses group names which are typically an acronym for a CSIRO business unit.
We can create other groupnames for projects crossing boundaries, and for projects within business units.
Interactive and Batch
The operating system on all of the systems is Unix. For those who are not familiar with Unix or its concept of shells, processes, etc., you may find the University of Edinburgh Unixhelp System useful.
The systems provide users with a configurable login shell which is used to interpret commands to the system. The shell (and other programs) run in an "environment" which is partly inherited from its parent process and (for shells) reconfigured during the shell startup via a series of files which contain shell commands. Different specifically named files and command syntax are used with different types of shell.
Your account will be set up with an initial environment via a default .cshrc file, and an equivalent .profile or .bash_profile
Most people use a combination of interactive and batch access to the systems to get their work done.
Most computationally intensive work is done as batch jobs. The work is broken up into separate tasks, and each task corresponds to a "shell script", which is a sequence of commands written in a text file. These "batch scripts" (along with a resource requirement) are submitted to the batch system, which can then schedule resources to run a mixture of jobs efficiently. Typically a batch system avoids contention between jobs by not over-allocating resources. This results in good overall system utilization, at the cost of individual jobs waiting in a queue when the system is busy.
Interactive access to the systems is also allowed, for managing and monitoring of batch jobs, file management, development and debugging. Where possible we encourage these activities to be conducted on other platforms, but recognize there are many legitimate reasons for interactive access. Limits are placed on interactive sessions to avoid uncontrolled contention disrupting everybody. Interactive access to batch sessions with dedicated resources is also possible, including use of graphical interfaces.
To customize both interactive and batch shells, edit your .cshrc or .profile files to add whatever features you prefer, but please retain the active lines from the template files in /etc/skel in these respective files. These are there to allow your environment to be kept up to date with system changes.
On each system, any process you run has limits imposed on it via the shell. These include a time limit and a memory use limit. To see what these limits are first check which shell you are using. Enter the command:
These limits apply to both interactive and batch processes. Batch jobs may have additional limits which are monitored by the batch system. The limits are not published here as they are liable to change, and it is also possible to vary these limits on an 'as needs' basis by project or user.
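As a minimal sketch of inspecting the limits (not necessarily the command this guide originally listed), the following works in sh or bash; csh and tcsh users would use the built-in limit command instead of ulimit. The variable names cpu_limit and mem_limit are illustrative only.

```shell
# Show which login shell is recorded for this session
echo "login shell: ${SHELL:-unknown}"

# List every per-process limit known to this shell (sh/bash syntax;
# the csh/tcsh equivalent is the built-in command: limit)
ulimit -a

# Capture two of the limits: CPU time (seconds) and virtual memory (kbytes)
cpu_limit=$(ulimit -t)
mem_limit=$(ulimit -v)
echo "cpu time limit: $cpu_limit"
echo "virtual memory limit: $mem_limit"
```

A value of "unlimited" means the shell itself imposes no limit; the batch system may still apply its own.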
Getting Help with a problem or query
Before contacting HPCCC/CSIRO ASC please confirm that your problem and workaround is not listed in the known problems.
Users experiencing any problems with CSIRO ASC systems should contact the HELPLINE - 03 8601 3800 (external) / 93 3800 (CSIRO internal) - or
Problems reported via a web browser or email will be entered in our RT (Request Tracker) system, and then can be made visible to the staff most able to solve the problem. You can use the web interface to check progress of a problem and/or follow up the request by replying to email sent to you about the request (in plain text - no html email please). Please only include immediately relevant history in your reply as otherwise the information is duplicated in the system and the problem becomes difficult to follow. We have also written some further guidelines in the faq on using the RT system.
If there is an urgent query out of hours, please contact 0428 108 333 for assistance.
Getting Started
About the Altix ia64 NUMA Cluster
The Altix system provides three functions:
The Altix runs the Linux operating system. It has compilers for C, C++ and Fortran 95 and a batch system to manage data-intensive computing.
The Altix has many software packages installed - we will generally install any Open Source package users request, and have acquired some commercial software.
Connecting to the Altix ia64 NUMA Cluster
The Altix system's full network name is: cherax.hpsc.csiro.au. Users should connect using CSIRO NEXUS user ids using ssh from outside the HPCCC. Access using rsh is generally available for access between machines within the HPCCC.
Access for registered users at locations outside CSIRO can be arranged (using a combination of password and ssh key-based authentication and a gateway machine in the CSIRO external services network).
Environment
Users have a home directory on cherax which is shared with some other systems. This means that settings in dot files (.profile .login .bashrc etc.), must be compatible with multiple operating systems. There are skeleton 'dot' files in /etc/skel which have template content for ensuring that both interactive and batch environments are customized. The .bashrc and .cshrc files also enable non-login/non-interactive environment customization for bash and (t)csh. If you have environment problems it is a good idea to set aside your 'dot' files and copy new ones from /etc/skel ("cp /etc/skel/.??* $HOME") to see if your customizations are breaking something.
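The "set aside and copy" step above can be sketched as a small shell function. The function name reset_dotfiles and the backup directory name are hypothetical, not site tools; the sketch only assumes that the skeleton files live in the directory given as the first argument.

```shell
# reset_dotfiles SKEL_DIR HOME_DIR
# Hypothetical helper: moves existing dot files that have a skeleton
# counterpart into a backup directory, then copies the skeleton versions in.
reset_dotfiles() {
    skel=$1
    home=$2
    backup="$home/dotfiles.saved.$$"
    mkdir -p "$backup" || return 1
    for f in "$skel"/.??*; do
        # Skip the unexpanded pattern when the skeleton directory is empty
        [ -e "$f" ] || continue
        name=$(basename "$f")
        # Set aside the current copy, if there is one
        if [ -e "$home/$name" ]; then
            mv "$home/$name" "$backup/$name"
        fi
        cp -r "$f" "$home/$name"
    done
    # Report where the old files went
    echo "$backup"
}

# Typical use on the cluster (left commented out so nothing changes by accident):
# reset_dotfiles /etc/skel "$HOME"
```

If the problem disappears with the fresh files, customizations can be moved back from the backup directory a few lines at a time to find the one that breaks the environment.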
Users may wish to separate ia64 and i686/i386/ia32 executables by having directories, $HOME/bin/ia64 and $HOME/bin/i686 which can be added to your path as $HOME/bin/$HOSTTYPE (or if you prefer $HOME/$HOSTTYPE/bin) (see HOSTTYPE below). System independent executables could go in $HOME/bin.
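A minimal sketch of the path layout described above, for a .profile or .bashrc. The fallback to uname -m is an assumption for shells where HOSTTYPE has not already been set by the startup scripts.

```shell
# Use the existing HOSTTYPE if set, otherwise derive it from the machine type
HOSTTYPE=${HOSTTYPE:-$(uname -m)}

# Architecture-specific executables first, then system independent ones
PATH="$HOME/bin/$HOSTTYPE:$HOME/bin:$PATH"
export HOSTTYPE PATH

echo "searching $HOME/bin/$HOSTTYPE before $HOME/bin"
```

Because the home directory is shared, the same file then selects $HOME/bin/ia64 on cherax and a different directory on hosts of another architecture.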
The Altix system uses locally developed shell startup scripts. The aim is for users to need minimal extra customization in their own .profile or .login and to simplify HPCCC system use by having similar users' environments. If you think that the default environment has particular short-comings, please contact HPCCC so that we can improve things.
The scripts set up the environment variables referring to filesystems and attempt to set reasonable terminal settings and some defaults for OMP_NUM_THREADS=1, GROUP, PBS_QUEUE and PATH.
HOST is set to `hostname -s`, HOSTNAME is set to `hostname` and HOSTTYPE is set to `uname -m` which is ia64. You can read the startup scripts in /usr/local/etc/startup to see examples of testing for host type, batch invocation and whether a shell is interactive (attached to a terminal) or not.
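The three tests mentioned above can be sketched as follows. This is not the content of the scripts in /usr/local/etc/startup; it is an illustration that assumes the batch system sets PBS_ENVIRONMENT inside jobs, as PBS and Torque normally do. The variable names arch_note, run_mode and term_mode are illustrative.

```shell
# Host type: ia64 on the Altix, something else on other hosts
case "$(uname -m)" in
    ia64) arch_note="Altix (ia64)" ;;
    *)    arch_note="other architecture" ;;
esac

# Batch invocation: PBS_ENVIRONMENT is only set inside a batch job
if [ -n "${PBS_ENVIRONMENT:-}" ]; then
    run_mode="batch"
else
    run_mode="not batch"
fi

# Interactive: standard input is attached to a terminal
if [ -t 0 ]; then
    term_mode="interactive"
else
    term_mode="non-interactive"
fi

echo "$arch_note, $run_mode, $term_mode"
```

Guarding terminal-only commands (such as stty) with the last test avoids errors when the same dot file is read by rsh, rcp or a batch job.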
There are different amounts of shell initialization done depending on the shell and method of invocation. Full global environment setup and logout processing is only performed for login shells. This includes PBS batch jobs (with no -S option). Partial processing is done for all bash and (t)csh shells so that environment variables pointing to static directory names are set and available to rcp and rsh commands. You can test your non-login environment via "rsh cherax env" (or ssh).
Software Setup
To customize your environment to use particular software packages the Environment modules utility is available.
Run module avail to see what software packages are available and module load package to set up your environment for a particular software package.
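A short sketch of a typical session with the module command. The package name intel-fc/9.1.033 is taken from elsewhere in this guide and may not be installed on every host; the guard and the modules_found variable are additions so that the lines do nothing harmful where the module command is not defined.

```shell
# Only proceed where the module command (usually a shell function) exists
if command -v module >/dev/null 2>&1; then
    # List the packages that can be loaded
    module avail
    # Set up the environment for one package
    module load intel-fc/9.1.033
    # Confirm what is currently loaded
    module list
    modules_found=yes
else
    echo "module command not found in this shell" >&2
    modules_found=no
fi
```

In a batch script the same module load line is normally placed before the command that needs the package.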
File Transfer
Transfers from external hosts to/from HPCCC
For transfer of files to or from the HPCCC machines from outside the HPCCC, we recommend the use of the commands scp and sftp. These provide encryption of passwords and data.
Transfers between HPCCC hosts
For transfers between machines at the HPCCC, we recommend rcp, because passwordless transfers are possible, and for selected hosts there is a jumbo frames network to enable faster data transfers. ftp can also be used between some machines.
For synchronization of files, rsync is available and can be used with ssh or rsh.
High speed transfers to/from HPCCC
For transferring large amounts of data to cherax on long and high bandwidth network links we recommend using hpn-ssh.
hpn-ssh is a patched version of ssh which has improved performance essential to high speed scp transfers. We are running a hpn-ssh sshd on port 22000 on cherax, and it is configured to only allow access with key-based authentication. The ASC firewall has been configured to allow access to hpn-ssh from APAC-NF and iVEC. The end of the ssh connection receiving the data is the one where performance/buffering matters most, so the following examples are for data transfers to cherax.
Initiated from cherax:
cherax% module load hpn-ssh
cherax% scp -rp me123@ac.apac.edu.au:forcherax/. fromapac/.
cherax% rsync --rsh=ssh -av me123@ac.apac.edu.au:forcherax/. fromapac/.
Initiated from apac:
ac% scp -i ~/keys/my_cherax_key -P 22000 -rp forcherax/. \
abc123@cherax.hpsc.csiro.au:fromapac/.
ac% rsync --rsh='ssh -i ~/keys/my_cherax_key -p 22000' -av forcherax/. \
abc123@cherax.hpsc.csiro.au:fromapac/.
Related systems
Cherax also hosts the CSIRO Data Store.
The datastore is shared to the blade clusters burnet and nelson and the utility PC farrer/portal which has a common home directory for most users.
Accounting
Process accounting is available on the Altix system. See man csa. The main user tool is ja.
Monitoring Resource Usage
Each interactive and batch process you run has a time limit and a memory use limit imposed on it. Interactive sessions are limited to 30 Gbyte of memory, and individual processes to 30 minutes CPU time. These limit values may be revised in future.
You can get information about limits and resource usage using the following commands:
Reports on the Data Store usage can be seen at
CSIRO Data Store usage reports
Batch Use
All of the ASC systems (except the BoM solar system) use the Torque batch system, which is derived from the Portable Batch System (OpenPBS). Other batch systems (SGE, NQSII, ...) provide similar functionality but with different specific commands and options.
Most jobs require greater resources than are available to an interactive session. A batch job is really a type of shell script containing a set of commands which are executed for you without any "terminal" interaction. Such job scripts must be submitted to the batch job system with the qsub command. The batch system manages efficient scheduling of running the submitted jobs on the available resources. The batch system also allows an interactive mode.
A shell script can be as simple as a sequence of commands written in a file or it can include more sophisticated use of flow control, variable substitution and error recovery. Here are some hints on error recovery in batch scripts.
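One common error recovery pattern is to check the exit status of each step and choose a fallback when a step fails. The sketch below is an illustration under stated assumptions, not the site's own recommendation: run_step is a hypothetical helper, and true and false stand in for real commands.

```shell
#!/bin/sh
# run_step COMMAND [ARGS...]
# Hypothetical helper: runs a command and reports on stderr if it fails
run_step() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "step failed (exit $status): $*" >&2
    fi
    return "$status"
}

# Start in the submission directory when run under PBS, else stay put
workdir=${PBS_O_WORKDIR:-$PWD}
cd "$workdir" || exit 1

# A step that succeeds ("true" stands in for a real command)
if run_step true; then
    result="ok"
else
    result="failed"
fi

# A step that fails ("false" stands in for a real command); take a fallback
if run_step false; then
    fallback="not needed"
else
    fallback="taken"
fi

echo "first step: $result, fallback: $fallback"
```

In a real job the fallback branch might copy partial results somewhere safe before exiting with a non-zero status.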
You submit jobs to the Torque batch system using the command qsub specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources). The batch system runs the job when the resources are available, subject to constraints on maximum resource usage.
Interactive Batch Jobs
The qsub -I option will result in an interactive shell being started out on the allocated cpu once your job starts. A submission script is not used in this mode - you must provide all qsub options on the command line.
Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it may be accounted for on the basis of walltime, since you may have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session to avoid leaving cpus idle on the machine, and unavailable to others.
Interactive batch jobs are likely to be used for testing or debugging large or parallel programs, but may also be used to run software that needs interaction to operate effectively or for work that is best done in an interactive mode. Since you want interactive response, it may be necessary to use a high priority queue (shorter jobs) to run promptly.
To use an X display in an interactive batch job, use ssh -X (or -Y) to login to the front-end machine or use a VNC session on the login node (do not change the DISPLAY variable ssh or VNC provides) and then submit your job with at least the following options to make the current DISPLAY environment variable be set in the batch job:
% qsub -I -v DISPLAY
You will usually need to request some resources to get enough time and memory dedicated for your interactive task.
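As a sketch of combining the display option with a resource request, the lines below build the command and only run it where qsub exists and a display has been provided. The resource values (one cpu, one hour, 1GB) are illustrative assumptions, not site defaults.

```shell
# Illustrative resource request: one cpu, one hour, 1GB of virtual memory
resources="nodes=1:ppn=1,walltime=1:00:00,vmem=1GB"
cmd="qsub -I -v DISPLAY -l $resources"
echo "$cmd"

# Submit only where the batch system is present and ssh -X or VNC set DISPLAY
if command -v qsub >/dev/null 2>&1 && [ -n "${DISPLAY:-}" ]; then
    $cmd
fi
```

Remember to exit the interactive shell when finished so that the reserved cpu is released.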
Basic commands
The basic PBS commands are:
% qsub -l walltime=20:00:00,vmem=300MB
cd my_dir
./a.out
^D (that is control-D)
or
% qsub -l walltime=20:00,vmem=300MB jobscript
where jobscript is an ascii file containing the shell script to run your commands (not the compiled executable which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
% cat jobscript
#!/bin/sh
#PBS -l walltime=20:00:00,vmem=300MB
cd $PBS_O_WORKDIR
./a.out
You submit this script for execution by PBS using the command:
% qsub jobscript
Notice that the PBS commands are all at the start of the script, that there are no blank lines between them, and there are no other non-PBS commands until after all PBS resources are described. The variable, PBS_O_WORKDIR will be defined in the job as the directory from which qsub was run. This may or may not be where you want to "cd" to.
qsub options of note:
Batch resources
Cherax is a shared memory system but users are being allocated dedicated cpus (and shared memory). This means that the resources that are most important to specify are much the same as on a distributed cluster system where users can get exclusive access to compute nodes. The most important qsub resource options for cherax are:
Please don't use -l ncpus=N anymore, use -l nodes=1:ppn=N
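A minimal sketch of a job script using the nodes=1:ppn=N form. The script is written to a temporary file and syntax-checked locally; the cpu count, walltime and vmem values are illustrative assumptions, and the qsub line is left commented out.

```shell
# Write an example job script to a temporary file
jobscript=$(mktemp)
cat > "$jobscript" <<'EOF'
#!/bin/sh
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00,vmem=2GB
cd "$PBS_O_WORKDIR" || exit 1
./a.out
EOF

# Check the shell syntax without running anything
if sh -n "$jobscript"; then
    echo "job script syntax ok: $jobscript"
fi

# Submit on the cluster with:
# qsub "$jobscript"
```

Here ppn=4 asks for four cpus on the single shared-memory node, replacing the older ncpus=4 form.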
Queues and Scheduling
Queue Structure
cherax has three execution queues:
Scheduling
The scheduling aims are to:
The batch job scheduler being used is Maui. Detailed knowledge of the scheduler is not necessary to run jobs but can help users understand what governs the order in which jobs are run.
From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime and memory). Otherwise your job may be queued or suspended longer than necessary. Of course, make sure you ask for sufficient resources - a little monitoring of resources used by past jobs might help.
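One way to monitor what a job actually used is to filter the resources_used lines from the full job status. The sketch below assumes the usual Torque "qstat -f" output layout; used_resources is a hypothetical helper name, and the job must still be known to the batch server.

```shell
# used_resources: hypothetical filter that reads "qstat -f" output on stdin
# and keeps only the resources_used.* lines, with leading whitespace removed
used_resources() {
    sed -n 's/^[[:space:]]*\(resources_used\.[^ ]* = .*\)$/\1/p'
}

# Typical use on the cluster (job id is a placeholder):
# qstat -f "$jobid" | used_resources
```

Comparing these figures with the walltime and memory you requested shows how far the next request can safely be reduced.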
Currently when a large job has been waiting in the queue for long enough, the system will attempt to get it to start by not starting other work or "draining". When the system is busy this is the most common reason for small jobs not starting promptly.
Maui works by assigning jobs to "reservations" which consist of a block of cpus for a period of time (defined by the requested walltime). The highest priority idle, queued job will get allocated a reservation in the future (and you can tell what time it will start by) and smaller/shorter lower priority jobs may be "backfilled" around the reservations.
Maui also allows "standing" reservations which can be used to reserve a block of resources for jobs which have particular attributes (eg. short jobs, jobs in a specific queue). Some cpus may be assigned to standing reservations. This is why sometimes queued jobs will not start, even if there are idle cpus, as the jobs do not have the right attributes to run in the reservation(s). This mechanism is used to prevent cherax being monopolised by long running jobs and to enable better turnaround for (shorter) development work.
There are a number of Maui commands that add a scheduler-aware interface to the batch system. These commands do not have man pages, but do have a -h help option and include:
File Systems
A number of file systems are available, each with a different purpose. Variables are defined at login to refer to the parts of the filesystems available to users.
In the table below, 'properties' denotes the management attributes of the underlying filesystem: back-up (b), quota (q), local disk (l), job-temporary (j), flush (f), by arrangement (a) and/or migrated (m).
| Variable name | properties | purpose |
|---|---|---|
| $HOME | m, q, b | login settings and persistent backed up large capacity storage |
| $FLUSHDIR | q, f | working files (semi-)persistent between sessions. Ensure that critical files left here are backed up elsewhere |
| $DATADIR | q | persistent files for use in multiple jobs |
| $TMPDIR | q, j | job-temporary files - automatic cleanup |
Flushing is implemented on the $FLUSHDIR area based on necessity but with a minimum lifetime of 7 days. Files newer than the minimum lifetime will never be (automatically) flushed.
$DATADIR can be used to hold persistent files which will not be migrated, eg. this would be useful for user installed software or data files which are to be used repeatedly over a period during which they would otherwise be repeatedly migrated and restored. Ensure that critical files left here are backed up.
For information about how the migrating file system works, please see the companion guide to the Data Store.
Use the quota command to see your usage and the limits on each file system. Note that there are quotas on both the space occupied and the number of inodes (loosely, the number of files).
Compiling
Compilers and Options
There are multiple versions of the compilers available and you can set up your environment for a particular version using Modules,
eg. module load intel-fc/9.1.033
Compiling hints
Using MPI
MPI is a parallel program interface for explicitly passing messages between parallel processes - you must have added message passing constructs to your program. Then to enable your programs to use MPI, you must include the MPI header file in your source and link to the MPI libraries when you compile.
Compiling and linking
The only option that is needed on the Altix to compile MPI programs is the link option -lmpi.
Running MPI jobs
Use the mpirun command to start an MPI executable, both when running within a batch job and when running small interactive test jobs.
To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter the command:
% mpirun -np 4 ./a.out
For larger jobs and production use, submit a job to the batch system
with a command like
% qsub
cd $PBS_O_WORKDIR
mpirun -np 4 ./a.out
^D
%
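To keep the mpirun process count in step with the cpus the batch system allocated, a job script can count the lines in the node file instead of hard-coding the number. This sketch assumes the Torque convention that PBS_NODEFILE names a file with one line per allocated cpu; count_procs is a hypothetical helper name.

```shell
# count_procs [NODEFILE]
# Hypothetical helper: prints the number of lines in the node file
# (defaulting to $PBS_NODEFILE), or 1 when no node file is readable
count_procs() {
    nodefile=${1:-${PBS_NODEFILE:-}}
    if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
        wc -l < "$nodefile" | tr -d ' '
    else
        echo 1
    fi
}

nprocs=$(count_procs)
echo "would run: mpirun -np $nprocs ./a.out"
```

Inside a job submitted with -l nodes=1:ppn=4 this would normally give 4, so the script does not need editing when the request changes.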
Further mpirun details
You can use dplace with mpirun to make the worker mpi processes bound to cpus - and increase the efficiency of memory access. The example becomes:
% mpirun -np 4 dplace -x1 ./a.out
The -x1 option skips the second MPI process, which is a 'shepherd' process that contributes little to the numerical work and need not be bound.
Controlling MPI execution
A number of environment variables are available to control the behaviour
of MPI jobs. These are described in the MPI man page.
Common problems
The amount of memory that MPI executables on cherax set aside can be large. See the point above for information on environment variables to control this.
Using OpenMP
OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables.
Compiling and linking
Fortran and C with OpenMP directives are compiled as:
% ifort -openmp myprog.f -o myprog.exe
% icc -openmp myprog.c -o myprog.exe
Compiling hints
Running OpenMP jobs
To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable then run the executable:
% env OMP_NUM_THREADS=4 ./a.out
For larger jobs and production use, submit a job to the PBS batch system with something like
% qsub -l nodes=1:ppn=4,walltime=30:00
#!/bin/sh
OMP_NUM_THREADS=4
export OMP_NUM_THREADS
cd $PBS_O_WORKDIR
dplace -x2 ./a.out
^D
%
Note: dplace is used to make the spawned threads bound to cpus and increase the efficiency of memory access. The -x2 option skips the 2nd thread which is a 'helper' thread which contributes little to the numerical work and need not be bound.
Common problems
Two of the most common problems encountered in parallelizing code in shared memory (or porting parallelized code) are stack issues due to the multi-threaded parallel execution model and data scoping issues which may manifest as uninitialized or over-written variables.
Autoparallelizing with the compilers
The Intel compilers can automatically parallelize code at the level of loops, to run in shared memory. The results can be good - but only for a relatively small class of codes. In general, parallelization is most effective when applied at the highest possible level and a better result (than automatic parallelization) can be achieved by adding OpenMP directives and 'helping' the compiler to identify parallelizable code. You can try automatic parallelization and test if you get any speedup. Also you can use the information from the compiler as a start for adding OpenMP directives and to identify code that inhibits parallelism. The Intel compiler options are:
% ifort -parallel prog.f
or
% icc -parallel myprog.c
By default this reports to the screen which loops were parallelized. More
information can be obtained by using the -par_report options. To
run on 4 processors do
% env OMP_NUM_THREADS=4 time ./a.out
The time output will show the cpu and elapsed execution time.
See the Intel Fortran manual for more details on use.
Code Development
Debugging
The idb and gdb debuggers are provided on the Altix system for C, C++ and Fortran. They support the debugging of simple programs and core files, and code with multiple threads. Using idb and gdb is similar:
% icc -g prog.c
% gdb ./a.out
(gdb) run
(gdb) print var
(gdb) quit
csh syntax: limit coredumpsize unlimited
bash syntax: ulimit -c unlimited
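The core file settings above can be combined with gdb as in the following sh/bash sketch. The core file name "core" is an assumption (some systems append the process id), and raising the limit can fail where the hard limit is lower.

```shell
# Try to lift the core file size limit for this shell and its children
if ulimit -c unlimited 2>/dev/null; then
    echo "core dumps enabled for this shell"
else
    echo "could not raise the core file size limit" >&2
fi
core_limit=$(ulimit -c)
echo "core file size limit: $core_limit"

# After a crash, load the executable together with its core file
if command -v gdb >/dev/null 2>&1 && [ -f ./a.out ] && [ -f ./core ]; then
    gdb ./a.out ./core
fi
```

Within gdb, the where command then shows the call stack at the point of failure, which is most useful if the code was compiled with -g.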
The totalview debugger is also available and provides a rich
set of debugging features via an interactive GUI (and command-line).
This tool is effective for more complex debugging and in parallel codes.
Debugging Parallel programs
For debugging OpenMP and MPI codes we recommend using the Totalview
debugger. To use the Totalview debugger, compile your code with -g
then in an interactive batch job load the totalview module:
% ifort -g prog.f
% module load totalview
% totalview ./a.out -a arg1 arg2
and for MPI codes:
% ifort -g mpiprog.f -lmpi
% module load totalview
% totalview mpirun -a -np 1 ./a.out arg1 arg2
Profiling
There are a few profiling tools of varying levels of sophistication available on the Altix system. Suggestions for their use follow.
% ifort -p -o prog.exe prog.f
% ./prog.exe
% gprof ./prog.exe gmon.out > prog.gprof
Profiling MPI Code
We can work on installing MPE if there is demand or purchasing Vampir or Blade.
Other Documentation
The SGI Altix documentation is visible to CSIRO and authenticated users. There is an SGI guide for linux application tuning.
Man pages are available. Software often has documentation included in the distribution and installation. You can look in the install tree on the system.
Known Problems
We have upgraded the default Intel Fortran and C/C++ Compilers to 10.x. There was a change between 9.x and 10.x where the function "?0_memcopyA" was removed from the shared libraries and is now statically linked.
If you get the following error please rebuild your code with 10.x:
a.out : symbol lookup error: a.out: undefined symbol: ?0_memcopyA
Alternatively, as a temporary work-around until you re-build your code, load the 9.x module(s) in your batch script:
module load intel-fc/9.1.033 intel-cc/9.1.038
Changelog
Here is a list of recent updates in this userguide for quick reference for users returning to this guide.
To Do
Here is a list of pending updates to this userguide.