Altix ia64 NUMA Cluster Local Userguide

Before Starting

About this Guide

This document contains a list of known problems and a change log which can be checked for a summary of recent updates.

If you have loaded this guide using the index page you will have frames with a table of contents on the left and the userguide on the right. The guide(s) will also work fine without frames but will not be as easy to navigate.

The guides are intended to be introductory in nature though they will provide references to core documentation and detailed site specific information and FAQs.

More formal vendor supplied documentation is referenced here.

Quick Start

Registering to use the Altix and ASC Shared clusters

All users need to be registered to use the ASC shared systems. To use systems residing in ASC Docklands (and ASC software), accounts can be requested by completing the online application form. Only CSIRO users will be registered on the CSIRO systems by default. CAWCR staff from the BoM will be registered on CSIRO systems if they need and request access. To access other CSIRO ASC shared systems (not at Docklands, such as the GPU cluster), CSIRO staff can just ask for access at hpchelp@csiro.au.

The front-end hostnames of the computational hosts are:

cherax.hpsc.csiro.au
CSIRO Datastore Host, Large Shared Memory Multiprocessor
burnet.hpsc.csiro.au
CSIRO Capacity Compute Cluster
linuxgpu.csiro.au
CSIRO GPU cluster

CSIRO or BoM collaborators will need to get a CSIRO ident as a collaborator via their research contacts, who would then request access.

The BoM Sun Constellation system solar is described separately.

Connecting to the Altix and ASC Shared clusters

In general you must use secure shell (ssh) to connect to the Altix and ASC Shared clusters. You may also need VPN software to get into the csiro.au or bom.gov.au network first if you need to connect from outside. See our ssh FAQ item for more detail. An ssh connection can be used in conjunction with X server software for applications that require a graphical interface, or in many cases VNC can be used with an X server running on a cluster head node. VNC can provide a persistent desktop-like session which you can reconnect to after disconnecting and often performs better than other options as the X server is close to the application.

All of the ASC systems use the CSIRO NEXUS (staff) identifiers as usernames, and NEXUS authentication. Your password will be the same as on standard CSIRO systems using the NEXUS user identifier. If you change your NEXUS password on a Windows system, the change will propagate to the ASC systems, with some time delay. Please don't try changing your password on the ASC systems.

For CSIRO users the system uses group names which are typically an acronym for a CSIRO business unit.

We can create other groupnames for projects crossing boundaries, and for projects within business units.

Interactive and Batch

The operating system on all of the systems is Unix. For those who are not familiar with Unix or its concept of shells, processes, etc., you may find the University of Edinburgh Unixhelp System useful.

The systems provide users with a configurable login shell which is used to interpret commands to the system. The shell (and other programs) run in an "environment" which is partly inherited from its parent process and (for shells) reconfigured during the shell startup via a series of files which contain shell commands. Different specifically named files and command syntax are used with different types of shell.

Your account will be set up with an initial environment via a default .cshrc file, and an equivalent .profile or .bash_profile

Most people use a combination of interactive and batch access to the systems to get their work done.

Most computationally intensive work is done as batch jobs. The work is broken up into separate tasks, and each task corresponds to a "shell script", which is a sequence of commands written in a text file. These "batch scripts" (along with a resource requirement) are submitted to the batch system, which can then manage scheduling of resources to run a mixture of jobs efficiently. Typically a batch system avoids contention between jobs by not over-allocating resources. This results in good overall system utilization, at the expense of individual jobs waiting in a queue when the system is busy.

Interactive access to the systems is also allowed, for managing and monitoring of batch jobs, file management, development and debugging. Where possible we encourage these activities to be conducted on other platforms, but recognize there are many legitimate reasons for interactive access. Limits are placed on interactive sessions to avoid uncontrolled contention disrupting everybody. Interactive access to batch sessions with dedicated resources is also possible, including use of graphical interfaces.

To customize both interactive and batch shells, edit your .cshrc or .profile files to add whatever features you prefer, but please retain the active lines from the template files in /etc/skel in these respective files. These are there to allow your environment to be kept up to date with system changes.

On each system, any process you run has limits imposed on it via the shell. These include a time limit and a memory use limit. To see what these limits are first check which shell you are using. Enter the command:

ulimit -a
for sh/ksh/bash users.
limit
for csh/tcsh users.

These limits apply to both interactive and batch processes. Batch jobs may have additional limits which are monitored by the batch system. The limits are not published here as they are liable to change, and it is also possible to vary these limits on an 'as needs' basis by project or user.
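As a rough illustration of the shell check described above, the sketch below maps a login shell to the matching limits command. The function name limits_command is made up for this example and is not part of the system startup scripts.

```shell
# Hypothetical helper: print the limits command for a given login shell.
limits_command() {
    case "${1##*/}" in              # keep only the last path component
        csh|tcsh)    echo "limit" ;;
        sh|ksh|bash) echo "ulimit -a" ;;
        *)           echo "unknown" ;;
    esac
}

echo "Login shell: ${SHELL:-unknown}"
echo "Command to list limits: $(limits_command "${SHELL:-}")"
```

Run the command it names in your own session to see the limits that currently apply to you.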

Getting Help with a problem or query

Before contacting HPCCC/CSIRO ASC please confirm that your problem, and a workaround for it, are not already listed in the known problems.

CSIRO ASC Systems - cherax, burnet, gpu cluster

Users experiencing any problems with CSIRO ASC systems should contact the HELPLINE - 03 8601 3800 (external) / 93 3800 (CSIRO internal) - or

Problems reported via a web browser or email will be entered in our RT (Request Tracker) system, and then can be made visible to the staff most able to solve the problem. You can use the web interface to check progress of a problem and/or follow up the request by replying to email sent to you about the request (in plain text - no html email please). Please only include immediately relevant history in your reply as otherwise the information is duplicated in the system and the problem becomes difficult to follow. We have also written some further guidelines in the faq on using the RT system.

If there is an urgent query out of hours, please contact 0428 108 333 for assistance.

Getting Started

About the Altix ia64 NUMA Cluster

The Altix system provides three functions:

The Altix runs the Linux operating system. It has compilers for C, C++ and Fortran 95 and a batch system to manage data-intensive computing.

The Altix has many software packages installed - we will generally install any Open Source package users request, and have acquired some commercial software.

Connecting to the Altix ia64 NUMA Cluster

The Altix system's full network name is: cherax.hpsc.csiro.au. Users should connect using CSIRO NEXUS user ids using ssh from outside the HPCCC. Access using rsh is generally available for access between machines within the HPCCC.

Access for registered users at locations outside CSIRO can be arranged (using a combination of password and ssh key-based authentication and a gateway machine in the CSIRO external services network).

Environment

Users have a home directory on cherax which is shared with some other systems. This means that settings in dot files (.profile .login .bashrc etc.), must be compatible with multiple operating systems. There are skeleton 'dot' files in /etc/skel which have template content for ensuring that both interactive and batch environments are customized. The .bashrc and .cshrc files also enable non-login/non-interactive environment customization for bash and (t)csh. If you have environment problems it is a good idea to set aside your 'dot' files and copy new ones from /etc/skel ("cp /etc/skel/.??* $HOME") to see if your customizations are breaking something.

Users may wish to separate ia64 and i686/i386/ia32 executables by having directories, $HOME/bin/ia64 and $HOME/bin/i686 which can be added to your path as $HOME/bin/$HOSTTYPE (or if you prefer $HOME/$HOSTTYPE/bin) (see HOSTTYPE below). System independent executables could go in $HOME/bin.
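A minimal sketch of the path setup suggested above, suitable for a .profile used by sh, ksh or bash. The function name add_user_bin_dirs is illustrative only, and the fallback to `uname -m` is an assumption for hosts where HOSTTYPE has not been set by the startup scripts.

```shell
# Hypothetical helper: put $HOME/bin/$HOSTTYPE and $HOME/bin on the PATH.
add_user_bin_dirs() {
    arch=${HOSTTYPE:-$(uname -m)}
    # Prepend in reverse order of preference so that the
    # architecture-specific directory ends up first in PATH.
    for d in "$HOME/bin" "$HOME/bin/$arch"; do
        [ -d "$d" ] || continue          # skip directories that do not exist
        case ":$PATH:" in
            *":$d:"*) ;;                 # already on the PATH
            *) PATH="$d:$PATH" ;;
        esac
    done
    export PATH
}

add_user_bin_dirs
```

Because the directory name is computed at login, the same dot file works on the ia64 and i686 hosts that share your home directory.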

The Altix system uses locally developed shell startup scripts. The aim is for users to need minimal extra customization in their own .profile or .login and to simplify HPCCC system use by having similar users' environments. If you think that the default environment has particular short-comings, please contact HPCCC so that we can improve things.

The scripts set up the environment variables referring to filesystems and attempt to set reasonable terminal settings and some defaults for OMP_NUM_THREADS=1, GROUP, PBS_QUEUE and PATH.

HOST is set to `hostname -s`, HOSTNAME is set to `hostname` and HOSTTYPE is set to `uname -m` which is ia64. You can read the startup scripts in /usr/local/etc/startup to see examples of testing for host type, batch invocation and whether a shell is interactive (attached to a terminal) or not.
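The sketch below shows one way such tests could look in your own dot files. It is not copied from /usr/local/etc/startup. It assumes the batch system sets PBS_ENVIRONMENT to PBS_BATCH or PBS_INTERACTIVE inside jobs, as Torque normally does, and the function name shell_kind is hypothetical.

```shell
# Hypothetical helper: classify how the current shell was started.
shell_kind() {
    if [ "${PBS_ENVIRONMENT:-}" = "PBS_BATCH" ]; then
        echo "batch"
    elif [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ]; then
        echo "interactive-batch"
    elif [ -t 0 ]; then                  # standard input is a terminal
        echo "interactive"
    else
        echo "non-interactive"
    fi
}

# Example: only print a greeting when a person is at a terminal.
case "$(shell_kind)" in
    interactive*) echo "Logged in to ${HOST:-$(hostname)}" ;;
esac
```

Keeping terminal-only commands behind a test like this avoids stray output that can confuse rcp, rsh and batch jobs.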

There are different amounts of shell initialization done depending on the shell and method of invocation. Full global environment setup and logout processing is only performed for login shells. This includes PBS batch jobs (with no -S option). Partial processing is done for all bash and (t)csh shells so that environment variables pointing to static directory names are set and available to rcp and rsh commands. You can test your non-login environment via "rsh cherax env" (or ssh).

Software Setup

To customize your environment to use particular software packages the Environment modules utility is available.

Run module avail to see what software packages are available and module load package to set up your environment for a particular software package.

File Transfer

Transfers from external hosts to/from HPCCC

For transfer of files to or from the HPCCC machines from outside the HPCCC, we recommend the use of the commands scp and sftp. These provide encryption of passwords and data.

Transfers between HPCCC hosts

For transfers between machines at the HPCCC, we recommend rcp, because passwordless transfers are possible, and for selected hosts there is a jumbo frames network to enable faster data transfers. ftp can also be used between some machines.

For synchronization of files, rsync is available and can be used with ssh or rsh.

High speed transfers to/from HPCCC

For transferring large amounts of data to cherax on long and high bandwidth network links we recommend using hpn-ssh.

hpn-ssh is a patched version of ssh which has improved performance essential to high speed scp transfers. We are running a hpn-ssh sshd on port 22000 on cherax, and it is configured to only allow access with key-based authentication. The ASC firewall has been configured to allow access to hpn-ssh from APAC-NF and iVEC. The end of the ssh connection receiving the data is the one where performance/buffering matters most, so the following examples are for data transfers to cherax.

cherax% module load hpn-ssh
cherax% scp -rp me123@ac.apac.edu.au:forcherax/. fromapac/.
cherax% rsync --rsh=ssh -av me123@ac.apac.edu.au:forcherax/. fromapac/.

Initiated from apac:

ac% scp -i ~/keys/my_cherax_key -P 22000 -rp forcherax/. \
           abc123@cherax.hpsc.csiro.au:fromapac/.
ac% rsync --rsh='ssh -i ~/keys/my_cherax_key -p 22000' -av forcherax/. \
           abc123@cherax.hpsc.csiro.au:fromapac/.

Related systems

Cherax also hosts the CSIRO Data Store.

The datastore is shared to the blade clusters burnet and nelson and the utility PC farrer/portal which has a common home directory for most users.

Accounting

Process accounting is available on the Altix system. See man csa. The main user tool is ja.

Monitoring Resource Usage

Each interactive and batch process you run has a time limit and a memory use limit imposed on it. Interactive sessions are limited to 30 Gbyte of memory, and individual processes to 30 minutes CPU time. These limit values may be revised in future.

You can get information about limits and resource usage using the following commands:

Reports on the Data Store usage can be seen at

CSIRO Data Store usage reports

Batch Use

All of the ASC systems (except the BoM solar system) use the Torque batch system, which is derived from the Portable Batch System (OpenPBS). Other batch systems (SGE, NQSII, ...) provide similar functionality but with different specific commands and options.

Most jobs require greater resources than are available to an interactive session. A batch job is really a type of shell script containing a set of commands which are executed for you without any "terminal" interaction. Such job scripts must be submitted to the batch job system with the qsub command. The batch system manages efficient scheduling of running the submitted jobs on the available resources. The batch system also allows an interactive mode.

A shell script can be as simple as a sequence of commands written in a file or it can include more sophisticated use of flow control, variable substitution and error recovery. Here are some hints on error recovery in batch scripts.

You submit jobs to the Torque batch system using the command qsub specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources). The batch system runs the job when the resources are available, subject to constraints on maximum resource usage.

Interactive Batch Jobs

The qsub -I option will result in an interactive shell being started out on the allocated cpu once your job starts. A submission script is not used in this mode - you must provide all qsub options on the command line.

Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it may be accounted for on the basis of walltime, since you may have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session to avoid leaving cpus idle on the machine, and unavailable to others.

Interactive batch jobs are likely to be used for testing or debugging large or parallel programs, but may also be used to run software that needs interaction to operate effectively or for work that is best done in an interactive mode. Since you want interactive response, it may be necessary to use a high priority queue (shorter jobs) to run promptly.

To use an X display in an interactive batch job, use ssh -X (or -Y) to login to the front-end machine or use a VNC session on the login node (do not change the DISPLAY variable ssh or VNC provides) and then submit your job with at least the following options to make the current DISPLAY environment variable be set in the batch job:

    % qsub -I -v DISPLAY

You will usually need to request some resources to get enough time and memory dedicated for your interactive task.

Basic commands

The basic PBS commands are:

qstat
Standard queue status command. See man qstat for details of options.
qdel jobid
Delete your unwanted jobs from the queues. The jobid is returned by qsub at job submission time, and is also displayed in the qstat output.
qsub
Submit jobs to the queues. The simplest use of the qsub command is typified by the following (PBS) example (Note that the job starts in your home directory so you must "cd" to a sensible directory. Also there is a carriage-return after ./a.out):
   % qsub -l walltime=20:00:00,vmem=300MB
   cd my_dir
   ./a.out
   ^D     (that is control-D)
or
   % qsub -l walltime=20:00,vmem=300MB jobscript
where jobscript is an ascii file containing the shell script to run your commands (not the compiled executable which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
   % cat jobscript
   #!/bin/sh
   #PBS -l walltime=20:00:00,vmem=300MB 
   cd $PBS_O_WORKDIR
   ./a.out
You submit this script for execution by PBS using the command:
   % qsub jobscript

Notice that the PBS commands are all at the start of the script, that there are no blank lines between them, and there are no other non-PBS commands until after all PBS resources are described. The variable, PBS_O_WORKDIR will be defined in the job as the directory from which qsub was run. This may or may not be where you want to "cd" to.

qsub options of note:

-j oe
Combine the standard output and standard error from the job into one file.
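Putting the pieces above together, here is a sketch of a job script that combines output streams with -j oe and stops at the first failing step. The resource values and the program name a.out are placeholders taken from the examples in this guide, and the guard around qsub is only there so the snippet is harmless on a machine without the batch system.

```shell
# Write a small job script. The quoted 'EOF' keeps $PBS_O_WORKDIR
# unexpanded until the job actually runs.
cat > jobscript <<'EOF'
#!/bin/sh
#PBS -l walltime=20:00,vmem=300MB
#PBS -j oe
# Stop early if the working directory or the program is a problem.
cd "$PBS_O_WORKDIR" || { echo "cannot cd to $PBS_O_WORKDIR" >&2; exit 1; }
./a.out || { echo "a.out failed with status $?" >&2; exit 1; }
EOF

# Submit only where the batch system is installed.
if command -v qsub >/dev/null 2>&1; then
    qsub jobscript
else
    echo "qsub not found; jobscript written but not submitted"
fi
```

With -j oe the error messages written to standard error end up in the same output file as the program's normal output, which makes failed jobs easier to diagnose.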

Batch resources

Cherax is a shared memory system but users are being allocated dedicated cpus (and shared memory). This means that the resources that are most important to specify are much the same as on a distributed cluster system where users can get exclusive access to compute nodes. The most important qsub resource options for cherax are:

-l walltime=20:00:00
The total wall time limit for the job. Time is expressed in seconds as an integer, or in the form: [[hours:]minutes:]seconds[.milliseconds]
-l vmem=??mb
The total memory limit for the job - can be specified with units of "mb" or "gb" but only integer values can be given. There is a small default value.
Your job will only run if there is sufficient free memory, so making a sensible memory request will allow your jobs to run sooner. A little trial and error may be required to find how much memory your jobs are using - qstat -f and the job stdout footer list the job's actual usage.
-l nodes=nodespec
The nodespec specifies number and/or type of nodes needed by the job - but for cherax there is only one node. The nodespec should then be an integer specifying the number of nodes (must be 1) and an optional ppn=N (where N is an integer, default 1) specifying how many processors to request. For example, a parallel job needing 4 processes would specify nodes=1:ppn=4. No nodespec is required for serial jobs.
-l software=PACKAGE
The software corresponding to PACKAGE is required. Some software has limited numbers of licenses available so you need to tell the scheduler, so the jobs will not be started inappropriately. Currently managed software that needs such scheduling includes:
  • mathematica - one ASC served license is scheduled
Note that -l options may be combined as a comma separated list with no spaces, eg. -lvmem=500mb,cput=20:00.
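To make the time format above concrete, the following sketch converts a value in the form [[hours:]minutes:]seconds[.milliseconds] to whole seconds. The function walltime_to_seconds is a hypothetical helper written for this guide, not a PBS command.

```shell
# Hypothetical helper: convert [[hours:]minutes:]seconds[.ms] to seconds.
walltime_to_seconds() {
    t=${1%%.*}              # drop an optional .milliseconds suffix
    total=0
    old_ifs=$IFS
    IFS=:
    set -- $t               # split on ':' into positional parameters
    IFS=$old_ifs
    for field in "$@"; do
        # Strip leading zeros so values like 08 are not read as octal.
        while [ "${#field}" -gt 1 ] && [ "${field#0}" != "$field" ]; do
            field=${field#0}
        done
        total=$(( total * 60 + ${field:-0} ))
    done
    echo "$total"
}

walltime_to_seconds 20:00:00    # 20 hours
walltime_to_seconds 20:00       # 20 minutes
```

The two calls print 72000 and 1200, which shows why walltime=20:00:00 and walltime=20:00 are very different requests.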

Please don't use -l ncpus=N anymore, use -l nodes=1:ppn=N

Queues and Scheduling

Queue Structure

cherax has three execution queues:

Jobs submitted without a queue being specified will be routed to either the short or the batch queue, depending on the limits requested.

Scheduling

The scheduling aims are to:

The batch job scheduler being used is Maui. Detailed knowledge of the scheduler is not necessary to run jobs but can help users understand what governs the order in which jobs are run.

From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime and memory). Otherwise your job may be queued or suspended longer than necessary. Of course, make sure you ask for sufficient resources - a little monitoring of resources used by past jobs might help.

Currently when a large job has been waiting in the queue for long enough, the system will attempt to get it to start by not starting other work or "draining". When the system is busy this is the most common reason for small jobs not starting promptly.

Maui works by assigning jobs to "reservations" which consist of a block of cpus for a period of time (defined by the requested walltime). The highest priority idle, queued job will get allocated a reservation in the future (and you can tell what time it will start by) and smaller/shorter lower priority jobs may be "backfilled" around the reservations.

Maui also allows "standing" reservations which can be used to reserve a block of resources for jobs which have particular attributes (eg. short jobs, jobs in a specific queue). Some cpus may be assigned to standing reservations. This is why sometimes queued jobs will not start, even if there are idle cpus, as the jobs do not have the right attributes to run in the reservation(s). This mechanism is used to prevent cherax being monopolised by long running jobs and to enable better turnaround for (shorter) development work.

There are a number of maui commands that add a scheduler aware interface to the batch system. These commands do not have man pages, but do have a -h help option and include:

showq
display queued jobs including priority order
showres
display reservations, including standing reservations
diagnose
display detailed information on the scheduler's state
checkjob
display information about a job, including why it is not running

File Systems

A number of file systems are available, each with a different purpose. Variables are defined at login to refer to the parts of the filesystems available to users.

In the table below: 'properties' denotes the Management attributes of the underlying filesystem: back-up (b), quota (q), local disk (l), job-temporary (j), flush (f), by arrangement (a) and/or migrated (m).

Variable name   Properties   Purpose
$HOME           m, q, b      login settings and persistent backed up large capacity storage
$FLUSHDIR       q, f         working files (semi-)persistent between sessions; ensure that critical files left here are backed up elsewhere
$DATADIR        q            persistent files for use in multiple jobs
$TMPDIR         q, j         job-temporary files - automatic cleanup

Flushing is implemented on the $FLUSHDIR area based on necessity but with a minimum lifetime of 7 days. Files newer than the minimum lifetime will never be (automatically) flushed.
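If you want a rough idea of which files have passed the minimum lifetime, a sketch like the one below can help. It only looks at modification time, which is an assumption, since the exact criteria used by the flushing process are not described here. The function name list_old_files is hypothetical.

```shell
# Hypothetical helper: list regular files last modified more than N days
# ago (default 7). With find, -mtime +7 matches files whose age in whole
# days is greater than 7.
list_old_files() {
    dir=${1:-${FLUSHDIR:-.}}
    days=${2:-7}
    find "$dir" -type f -mtime +"$days" -print
}

list_old_files "${FLUSHDIR:-.}" 7
```

Anything listed that you still need should be copied to $HOME or $DATADIR.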

$DATADIR can be used to hold persistent files which will not be migrated, eg. this would be useful for user installed software or data files which are to be used repeatedly over a period during which they would otherwise be repeatedly migrated and restored. Ensure that critical files left here are backed up.

For information about how the migrating file system works, please see the companion guide to the Data Store.

Use the quota command to see your usage and the limits on each file system. Note that there are quotas on both the space occupied and the number of inodes (loosely, the number of files).

Compiling

Compilers and Options

  1. The Intel Fortran compiler is ifort.
  2. The Intel C compiler is icc.
  3. The Intel C++ compiler is icpc or just icc.

There are multiple versions of the compilers available and you can set up your environment for a particular version using Modules,

eg. module load intel-fc/9.1.033

Compiling hints

  1. The default architecture options are good - so you do not need to specify any.

Using MPI

MPI is a parallel program interface for explicitly passing messages between parallel processes - you must have added message passing constructs to your program. Then to enable your programs to use MPI, you must include the MPI header file in your source and link to the MPI libraries when you compile.

Compiling and linking

The only option that is needed on the Altix to compile MPI programs is the link option -lmpi.

Running MPI jobs

Use the mpirun command to start an MPI executable, both when running within a batch job and when running small interactive test jobs.

To run a small test with 4 processes (or tasks) where the MPI executable is called a.out, enter the command:

    % mpirun -np 4 ./a.out
For larger jobs and production use, submit a job to the batch system with a command like
    % qsub
    cd $PBS_O_WORKDIR
    mpirun -np 4 ./a.out
    ^D
    %
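For production runs it is usually more convenient to keep the resource request in the script. The sketch below assumes that the batch system provides PBS_NODEFILE with one line per allocated processor, as Torque normally does. The vmem value and the helper name count_procs are placeholders for illustration.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=4,walltime=30:00,vmem=2gb
#PBS -j oe

# Hypothetical helper: count the processors listed in a node file.
count_procs() {
    # $(( ... )) strips any padding that wc puts around the number
    echo $(( $(wc -l < "$1") ))
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # Start as many MPI processes as processors were allocated.
    mpirun -np "$(count_procs "$PBS_NODEFILE")" ./a.out
else
    echo "not running under the batch system; nothing started"
fi
```

Deriving the process count from the allocation means the ppn value only has to be changed in one place.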

Further mpirun details

You can use dplace with mpirun to make the worker mpi processes bound to cpus - and increase the efficiency of memory access. The example becomes:

   % mpirun -np 4 dplace -x1 ./a.out
   
The -x1 option skips the second MPI process, which is a 'shepherd' process that contributes little to the numerical work and need not be bound.

Controlling MPI execution

A number of environment variables are available to control the behaviour of MPI jobs. These are described in the MPI man page.

Common problems

The amount of memory that MPI executables on cherax set aside can be large. See the point above for information on environment variables to control this.

Using OpenMP

OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables.

Compiling and linking

Fortran and C with OpenMP directives are compiled as:

    % ifort -openmp myprog.f -o myprog.exe
    % icc -openmp myprog.c -o myprog.exe

Compiling hints

  1. Static linking to OpenMP libraries is not recommended, as it can cause performance problems when more than one copy of the library is linked in. If using -fast, replace it with -ipo -O3.
  2. If you have problems try increasing your thread stack size by setting KMP_STACKSIZE in the environment (see the Intel compiler documentation).

Running OpenMP jobs

To run the OpenMP job interactively, first set the OMP_NUM_THREADS environment variable then run the executable:

    % env OMP_NUM_THREADS=4 ./a.out

For larger jobs and production use, submit a job to the PBS batch system with something like

    % qsub -l nodes=1:ppn=4,walltime=30:00
    #!/bin/sh
    OMP_NUM_THREADS=4
    export OMP_NUM_THREADS
    cd $PBS_O_WORKDIR
    dplace -x2 ./a.out
    ^D
    %

Note: dplace is used to make the spawned threads bound to cpus and increase the efficiency of memory access. The -x2 option skips the 2nd thread which is a 'helper' thread which contributes little to the numerical work and need not be bound.

Common problems

Two of the most common problems encountered in parallelizing code in shared memory (or porting parallelized code) are stack issues due to the multi-threaded parallel execution model and data scoping issues which may manifest as uninitialized or over-written variables.

Autoparallelizing with the compilers

The Intel compilers can automatically parallelize code at the level of loops, to run in shared-memory. The results can be good - but only for a relatively small class of codes. In general, parallelization is most effective when applied at the highest possible level and a better result (than automatic parallelization) can be achieved by adding OpenMP directives and 'helping' the compiler to identify parallelizable code. You can try automatic parallelization and test if you get any speedup. Also you can use the information from the compiler as a start for adding OpenMP directives and to identify code that inhibits parallelism. The Intel compiler options are:

     % ifort -parallel prog.f
or
     % icc -parallel myprog.c
By default this reports to the screen which loops were parallelized. More information can be obtained by using the -par_report options. To run on 4 processors do
     % env OMP_NUM_THREADS=4 time ./a.out
The time output will show the cpu and elapsed execution time.
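To check whether you get any speedup, you can repeat the run with several thread counts. The sketch below is one way to do that. The function name scan_threads and the thread counts 1, 2 and 4 are arbitrary choices for illustration.

```shell
# Hypothetical helper: run a command once for each thread count.
scan_threads() {
    for n in 1 2 4; do
        echo "threads=$n"
        # The assignment applies only to this one command.
        OMP_NUM_THREADS=$n "$@"
    done
}

# Demonstration with a stand-in command; replace it with your program,
# for example: scan_threads ./a.out
scan_threads sh -c 'echo "OMP_NUM_THREADS is $OMP_NUM_THREADS"'
```

Comparing the elapsed times of the runs shows whether the automatically parallelized loops are paying off.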

See the Intel Fortran manual for more details on use.

Code Development

Debugging

The idb and gdb debuggers are provided on the Altix system for C, C++ and Fortran. They support the debugging of simple programs and core files, and code with multiple threads. Using idb and gdb is similar:

  1. Compile and link your program using the -g switch e.g.
    	% icc -g prog.c
    	
  2. Start the debugger
    	% gdb ./a.out
    	
  3. Enter commands such as
    	(gdb) run
    	(gdb) print var
    	(gdb) quit
    	
Note: To debug using a core file after a program has crashed, you will need to change the core file limit for your shell prior to running your program:
	csh syntax: limit coredumpsize unlimited
	bash syntax: ulimit -c unlimited
	

The totalview debugger is also available and provides a rich set of debugging features via an interactive GUI (and command-line). This tool is effective for more complex debugging and in parallel codes.

Debugging Parallel programs

For debugging OpenMP and MPI codes we recommend using the Totalview debugger. To use the Totalview debugger, compile your code with -g then in an interactive batch job load the totalview module:

        % ifort -g prog.f
        % module load totalview
        % totalview ./a.out -a arg1 arg2
        
and for MPI codes:
        % ifort -g mpiprog.f -lmpi
        % module load totalview
        % totalview mpirun -a -np 1 ./a.out arg1 arg2
        

Profiling

There are a few profiling tools of varying levels of sophistication available on the Altix system. Suggestions for their use follow.

  1. To find which routines are the most time-consuming and where they are called from compile with -p and run the gprof profiler to create a program profile including call tree in prog.gprof with profiling data stored in gmon.out:
    	% ifort -p -o prog.exe prog.f
    	% ./prog.exe
    	% gprof ./prog.exe gmon.out > prog.gprof
    	

Profiling MPI Code

We can work on installing MPE if there is demand or purchasing Vampir or Blade.

Other Documentation

The SGI Altix documentation is visible to CSIRO and authenticated users. There is an SGI guide for Linux application tuning.

Man pages are available. Software often has documentation included in the distribution and installation. You can look in the install tree on the system.

Known Problems

Changelog

Here is a list of recent updates in this userguide for quick reference for users returning to this guide.

To Do

Here is a list of pending updates to this userguide.