Before Starting
About this Guide
This document contains a list of known problems and a change log which can be checked for a summary of recent updates.
If you have loaded this guide using the index page you will have frames with a table of contents on the left and the userguide on the right. The guide(s) will also work fine without frames but will not be as easy to navigate.
The guides are intended to be introductory in nature though they will provide references to core documentation and detailed site specific information and FAQs.
More formal vendor supplied documentation is referenced here.
Registering to use the SUN Constellation
All users need to be registered to use the HPCCC systems. Registration is available to Bureau staff, and to CSIRO staff within CAWCR or the HPCCC.
Users can request an account by completing the online application form.
Getting Help with a problem or query
Before contacting the HPCCC, please check that your problem and a workaround are not already listed in the known problems.
System Advisory and Change Notices for solar are published on the solar wiki (Note: this link only works from within the BOM network). The solar wiki also contains other useful transient information about the system.
Users experiencing any problems with HPCCC systems should log an incident using iSupport.
Problems reported via a web browser or email are entered in the iSupport incident tracking system and can then be made visible to the staff most able to solve them. You can use the web interface to check progress on a problem, or follow up by replying to the email sent to you about the request. Please include only immediately relevant history in your reply; otherwise the information is duplicated in the system and the problem becomes difficult to follow. We have also written some further guidelines in the FAQ on using the iSupport system.
CSIRO staff may also request help via the usual CSIRO ASC support channels.
Getting Started
About the SUN Constellation
The SUN Constellation consists of 576 nodes, each with 2 quad-core Intel 64-bit Xeon processors (code named Nehalem), totalling 4608 CPU cores.
Each node has 24 Gbytes of main memory and 24 Gbytes of flash memory instead of local disc. All of the nodes are connected by a dual-rail Infiniband network, with data rates of 40 Gbit/s per connection.
The system runs the CentOS distribution of Linux, uses Sun Grid Engine for job management, and uses the Lustre global file system, comprising 115 TB of disk space.
In addition, there are 4 user login nodes, and 6 data-mover nodes.
Connecting to the SUN Constellation
Bureau users can log directly into the SUN Constellation login nodes from their desktop or other Bureau systems. However, the SUN Constellation system is not directly visible from the entire CSIRO network, so CSIRO users will need to log in to cherax first.
The front-end hostnames of the SUN constellation are:
CSIRO users should use the unqualified names "solar" or "solar-dm" from cherax and connect back to "cherax" as necessary.
You must use secure shell (ssh) to connect to the SUN Constellation. Follow the links in the FAQ to learn more about using ssh.
Warning: Login nodes should not be used for compute or memory intensive processes as it locks out other users and can result in a reboot. Please use the Sun Grid Engine (SGE) batch queuing system.
For applications that require a graphical interface, ssh can be used in conjunction with X server software. X forwarding needs to be enabled at each step from your X server to the SUN Constellation, so from cherax connect with "ssh -X solar".
BOM users will use standard BOM names as usernames.
CSIRO staff should use the CSIRO NEXUS (staff) identifiers as usernames. Your initial password will be provided when your account is created, and should be changed when you first log in. Note: The password is different from your NEXUS password.
To change your password or login shell, log in to solar1.bom.gov.au and use the passwd and chsh commands as usual. It will take up to 1 hour for the change to propagate to other nodes.
For CSIRO users the system uses group names of the form csxyz, where xyz is typically an acronym for a CSIRO business unit.
Users without CSIRO or BOM usernames will be given other usernames and a password upon registration.
We can create other group names for projects crossing boundaries, and for projects within business units.
File Transfer
All external file transfers to/from solar should be done on the data mover nodes using the SGE batch system. A queue dm is available for general access and a queue dmop for operational systems.
rsync is the preferred way to transfer files in and out of the SUN Constellation as well as between different Lustre file systems, in particular for:
scp is recommended when transferring a small number of files interactively to another computer, in particular for:
Note: scp is NOT recommended for transferring files within the same system, e.g. transferring files between Lustre file systems on solar.
For interactive transfers on the login nodes you should always use qrsh to run scp or rsync.
For example, to copy files from solar to flurry:
solar> qrsh -q dm scp /full/path/to/myfile flurry-bm:destdir
To copy files from flurry to solar:
solar> qrsh -q dm scp flurry-bm:myfile /full/path/to/destdir
You can simplify the command for copying from solar to flurry to:
solar> cd /full/path/to
solar> qrsh -cwd scp myfile flurry-bm:destdir
and to copy files from flurry to solar:
solar> cd /full/path/to
solar> qrsh -cwd scp flurry-bm:/full/path/to/myfile destdir
For batch job transfers you can use qrsh -q dm or qrsh -q dmop (operational only). This will generate a subsidiary SGE job on a datamover node to run the scp or rsync command. When this job finishes the calling job continues. This datamover job is visible using qstat.
Warning: Do not run the qrsh command in the background with &, as the qrsh dies immediately after submission to the dm queue!
If the file or the destination of your transfer does not exist, the qrsh command returns an exit code of 1; if the transfer succeeds, it returns an exit code of 0. Test the result using standard shell script exit codes.
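The exit-code test can be sketched as a small wrapper for a ksh or bash job script. This is a minimal sketch only: the function name checked_transfer is hypothetical and not part of SGE. It runs whatever command it is given in the foreground and reports on the exit code.

```shell
# Hypothetical helper (not part of SGE): run a transfer command in the
# foreground and act on its exit code.
checked_transfer() {
    "$@"                      # run the command exactly as given, without "&"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "transfer failed with exit code $rc: $*" >&2
        return "$rc"
    fi
    echo "transfer succeeded: $*"
}
```

In a job script it might be called as checked_transfer qrsh -q dm scp /full/path/to/myfile flurry-bm:destdir || exit 1, so that the job stops if the transfer fails.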
For bulk transfers and backups, rsync is recommended with the following flags:
-av --stats
For example:
solar> qrsh rsync -av --stats /g/sc/data1/user/foo user@flurry1:/flurry/home/user
This will copy all the directories recursively to flurry. It will also create a directory called foo in /flurry/home/user
This example syncs a directory on the SUN Constellation to a directory on flurry, mirroring on the other system all changes made on the SUN Constellation. New files created on the source are picked up on each subsequent run.
solar> qrsh rsync -av --delete --delete-after --stats /g/sc/data1/user/foo user@flurry1:/flurry/home/user
Warning: This command is intended to be used only when you need to create and maintain a mirror of a directory on a remote computer, because --delete removes files from the destination that no longer exist in the source.
This example recursively transfers data from a home directory to a flush directory on solar.
solar> qrsh rsync -avzi --stats /home/user/foo /g/sc/flush/user/bar
If the directory has *lots* of files and subdirectories (in the order of thousands) then add the -S option, which makes rsync handle sparse files efficiently.
solar> qrsh rsync -avzi -S --stats /home/user/foo /g/sc/flush/user/bar
Redirecting the statistics (both standard error and standard output) to a file is also recommended:
solar> qrsh rsync -avzi -S --stats /home/user/foo /g/sc/flush/user/bar >& rsync-report.log
The SUN Constellation is not directly visible from the entire CSIRO network, so CSIRO users will need to transfer files via cherax.
To transfer files from your host to the SUN Constellation:
myhost> scp ./my_file cherax:
myhost> ssh cherax
cherax> scp ./my_file solardm.bom.gov.au:
For large numbers of smaller files, rsync may be a better option - for more information about this and setting up ssh on ASC and partner systems read the FAQ.
It is possible, albeit highly difficult, to set up an ssh configuration that chains ssh connections between your host and the SUN Constellation. The O'Reilly SSH book is recommended reading for advanced users who wish to do things like this, but in general we recommend manually transferring files via cherax.
Interactive and Batch
The operating system on the SUN Constellation is Linux. For information about setting up and using your Linux/Unix environment please read the ASC FAQ.
Almost all work on the SUN Constellation should be done via the Sun Grid Engine (SGE) batch queuing system. Debugging and other large-scale interactive work can be done interactively on the compute nodes with an interactive batch job.
Compilation however should be done interactively on the front-end nodes - compilers are not available from the compute nodes or via the batch queues.
On the login nodes the default stack limit (set by the shell) is about 10MB. If you are getting mysterious segmentation faults you may be hitting the stack size limit, in which case check your current shell limits and, if necessary, modify them.
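In Bourne-type shells such as ksh and bash, limits are inspected and changed with the standard ulimit builtin. The following is a minimal sketch, assuming the hard limit on the node permits the increase; it raises the soft stack limit only as far as the hard limit allows.

```shell
# Show the current (soft) and maximum (hard) stack size limits.
# Values are in kilobytes, or the word "unlimited".
echo "soft stack limit: $(ulimit -S -s)"
echo "hard stack limit: $(ulimit -H -s)"

# Raise the soft limit as far as the hard limit allows, for this shell
# and the programs started from it.
ulimit -S -s "$(ulimit -H -s)"
```

The change applies only to the current shell and its children, so in a batch job the ulimit line belongs in the job script itself. csh and tcsh users have the limit builtin instead.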
The startup scripts are processed in the following order:
Software Packages
To customize your environment to use particular software packages the Environment Modules utility is available. For a summary of useful commands see the Environment Modules page.
Run module avail to see what software packages are available and module load package to set up your environment for a particular software package.
Modules Available (default version in brackets):
A number of third-party software packages are maintained by NMOC, and can be accessed with:
solar> module use /g/sc/ophome/nmoc_share/local/NMOC_modules
To see the list of available NMOC software use the command:
solar> module avail
Software available includes:
For CAWCR users, ACCESS software is available under /access
File Systems
Lustre Filesystem
Lustre is a high performance parallel file system and is used for all user file systems. A number of file systems are available, each with a different purpose; the appropriate file system should be used whenever possible. All Lustre file systems are visible to all compute and data mover nodes.
For applications that have significant I/O requirements it may be necessary to tune Lustre, for example through striping your data across disks, to obtain improved performance. Please contact solarhelp@bom.gov.au for assistance.
Users are advised to specify environment variables and logical paths in their scripts, and NOT the physical paths.
In the table 'properties' denotes the intended management attributes of the underlying filesystem, back-up (b), quota (q), flush (f).
| Variable name/symlink | properties | logical path | mount | purpose |
|---|---|---|---|---|
| $HOME | q, b | /g/sc/home/$USER | /g/sc/home1 | login settings and persistent backed up storage |
| $FLUSHDIR | f | /g/sc/flush/$USER | /g/sc/flush | working files, (semi-)persistent between sessions; ensure that critical files left here are backed up |
| $DATADIR | q | /g/sc/data/$USER | /g/sc/data1 | persistent files for use in multiple jobs |
| $CWSHARE | b | /g/sc/home/cawcr_share | /g/sc/home1 | CAWCR shared software |
| $NMOCKEEP | b | /g/sc/ophome/nmoc_share | /g/sc/home2 | NMOC software |
Warning: Do NOT use $TMPDIR, $TMP or /tmp in your scripts. These are reserved for system purposes. Using these may result in jobs crashing!!!
Note: Quotas are not currently active on the $HOME and $DATADIR file systems and when introduced will be on a group basis.
Flushing starts when usage of the flush filesystem is greater than 87% and attempts to reduce the filesystem usage to below 80%.
On the first run it deletes files older than 40 days and then checks whether usage has dropped to 80%. If not, the second run deletes files older than 30 days. If 80% is still not reached, the third run deletes files older than 25 days.
If 80% is not reached after the three runs, it exits and emails root-user that fair usage has not been reached.
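To see which of your own files could be removed by a flush, the age thresholds can be applied with the standard find command. This is a sketch only: the function name list_flush_candidates is hypothetical, and it assumes that the flush judges age by modification time, which this guide does not state.

```shell
# Hypothetical helper: list regular files under a directory that were last
# modified more than a given number of days ago (find counts whole days).
# ASSUMPTION: the flush judges file age by modification time.
list_flush_candidates() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_flush_candidates "$FLUSHDIR" 25 lists files that would be old enough to be deleted on the third run.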
Note: There is no quota on $FLUSHDIR at present. It is expected that quota will be 100GB. Please clean up after your use of $FLUSHDIR and remove unwanted data - this will increase the longevity of files in $FLUSHDIR, to everyone's benefit.
Note: The directory solar:/opt/sun/logs/flush_disks contains a daily record of what has been flushed for $FLUSHDIR.
File Systems and Backups
The tape backup schedule for solar is as follows:
rsync backups for solar are run twice daily and after specific operational tasks are complete.
| mount | logical path | size (TBytes) | tape backup schedule | rsync backup area | rsync backup schedule |
|---|---|---|---|---|---|
| /g/sc/home1 | /g/sc/home | 12 | daily incremental, weekly full | /g/sc/ophome/home_bkp | twice daily, and after specific operational jobs |
| /g/sc/home2 | /g/sc/ophome | 12 | daily incremental, weekly full | /g/sc/home/ophome_bkp | twice daily, and after specific operational jobs |
| /g/sc/data1 | /g/sc/data | 23 | No backup | ||
| /g/sc/data2 | /g/sc/opdata | 23 | daily incremental, weekly full* (*see note below) | /g/sc/data/opdata_bkp | twice daily, and after specific operational jobs |
| /g/sc/flush | /g/sc/flush | 35 | No backup | ||
Note: /g/sc/opdata tape backup is only for 4 specific directories:
Compiling and Running Programs
Compilers and Options
Important: Load a compiler module before loading an mpi module!
Compiling: tips and tricks
If you don't actually need 8-byte precision for real and integer data, you will generally get better performance and portability by not using it.
Optimized Math Libraries
MKL is Intel's math library optimised for the Intel compiler. It includes the BLAS, Sparse BLAS, LAPACK, PBLAS, and ScaLAPACK routines. The MKL library also includes FFT routines, some sparse solvers and a vector statistical library which provides random number generators.
To use MKL, first load the module:
module load libs/mkl
then link against the MKL libraries, for example:
ifort test.f90 -L$MKL_LIB -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
Note: The correct set of libraries for MKL varies depending on compiler options. The Intel developer site has a link line advisor tool that shows which libraries should be used for your software environment.
For full details refer to the HPCCC User Documentation page.
The Sun Performance Library contains enhanced versions of the following standard libraries:
The sunperf library is provided by the SUN compiler module. To use sunperf, first load the module:
module load compiler/sun
then link against the sunperf library, for example:
f90 test.f90 -lsunperf
For full details refer to the HPCCC User Documentation page.
Running Programs on the Sun Constellation
Programs should be run on the compute nodes through the SGE Batch System.
Interactive work such as debugging can be run via the batch system with an interactive batch job.
Portability of programs and data from the SX to the Sun Constellation
There are notes in the FAQ about data and program portability between the SX-6 and the Sun Constellation.
SGE Batch Use
Most use of the SUN Constellation should be through the Sun Grid Engine (SGE) batch system with jobs submitted from login node(s) or from within other batch jobs.
You submit jobs to SGE specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources).
Note: Batch jobs using bash or ksh do not read /etc/profile so module commands are not in the default path. To access module or SGE commands add:
. /etc/profile
to your job script.
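A job script that may also be run interactively can source the profile only when the module command is missing. The following is a minimal sketch; the function name ensure_module_command is hypothetical, and its optional argument (an alternative profile path) exists only to make the function easy to try out.

```shell
# Hypothetical helper: make sure the "module" command is defined, sourcing
# the system profile only when it is missing. The profile path defaults to
# /etc/profile; the optional first argument overrides it for testing.
ensure_module_command() {
    profile=${1:-/etc/profile}
    if ! type module >/dev/null 2>&1; then
        . "$profile"
    fi
    # Succeed only if "module" is available after sourcing.
    type module >/dev/null 2>&1
}
```

In a job script this might be used as ensure_module_command || exit 1 before the first module load line.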
SGE Projects
The use of project names in your SGE jobs on solar will become compulsory before the end of the year. If you do not specify a project name, your job will be assigned to the 'general' project and will be limited to the default minimum resources.
For a list of available projects and their permitted users, see the Project Spreadsheet on the solar wiki, and email solarhelp@bom.gov.au for any changes or problems.
For details on how to specify a project in a batch job, see Options for qsub.
Basic commands
The basic SGE commands are described in their man pages.
Use qsub to submit jobs to the queues. The job starts in your home directory or, if the -cwd option is given, in the current working directory. The simplest use of the qsub command is typified by the following single core example job:
% qsub -l h_rt=00:10:00 -cwd
./a.out
^D (that is control-D)
or
% qsub -l h_rt=00:10:00 jobscript
where jobscript is a text file containing the shell script to run your commands (not the compiled executable which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
% cat jobscript
#!/bin/ksh
#$ -V # exports all env. variables
#$ -cwd
#$ -N test
#$ -o stdout.$JOB_ID
#$ -e stderr.$JOB_ID
#$ -q normal
#$ -l h_rt=00:10:00
./a.out
You submit this script for execution by SGE using the command:
% qsub jobscript
Queue Structure
There are a relatively small number of entry point queues. The names for these queues are as follows:
Options for qsub
qsub options should be placed at the beginning of the script, each prefixed with the sentinel #$. Options can also be specified on the command line, where they take precedence over those embedded in a script.
Default values can be specified in a default file, named .sge_request
If more than one of these files is available, the files are merged into one default request, with the following order of precedence:
Note: Options embedded in a script and options on the qsub command line have higher precedence than the default request files. Thus, options embedded in the script override default request file settings, and qsub command line options override those in turn.
Note: To discard any previous settings, use the -clear option in a default request file, in embedded script commands, or on the qsub command line.
Some useful, common options for a qsub script are:
Warning: Use of this option (-V, which exports all environment variables) in jobs that re-submit themselves and initialize their environment by appending strings to existing variables may lead to some variables exceeding the maximum length that the shell used in the script can handle. This may cause transient failures of your jobs that would otherwise run correctly if submitted manually.
Specifically, if the length of PATH exceeds the max length, the shell may not be able to find standard Linux commands (ls, mkdir, etc.) which would cause your job script to fail.
To avoid this problem you can use the -v var=val,... option (see below) to specify exactly which variables are exported.
Note: This option is lowercase v, as opposed to capital V, which exports all variables!
Note: To add a user default project for all batch jobs, put an entry in your $HOME/.sge_request file
Also, you can create a .sge_request file in the current working directory where you issue the qsub command. This file will have precedence over $HOME/.sge_request
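A default request file is a plain text file with one or more qsub options per line; lines starting with # are comments. The sketch below writes such a file into a chosen directory. It assumes that the standard qsub -P option is the one used to name a project; the function name write_sge_request and the project name passed to it are hypothetical placeholders.

```shell
# Hypothetical helper: write a default request file into the given directory.
# ASSUMPTION: "-P" is the standard qsub option for naming a project; the
# project name is supplied by the caller and is not checked here.
write_sge_request() {
    dir=$1
    project=$2
    if [ -e "$dir/.sge_request" ]; then
        echo "$dir/.sge_request already exists, not overwriting" >&2
        return 1
    fi
    cat > "$dir/.sge_request" <<EOF
# Default qsub options read from this directory
-P $project
EOF
}
```

For example, write_sge_request "$HOME" myproject would set a default for all jobs, while write_sge_request . myproject would apply only to jobs submitted from the current directory.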
Run the job in a specific parallel environment. There is only one parallel environment defined on solar: -pe mpi numcores. You must use this for MPI or OpenMP jobs.
Note: Do NOT use user@host.domain addresses. All users must have a $HOME/.forward containing their email address.
Alternatively, mail can be sent within your batch job using mail or mailx, but you need to ensure that the recipient list contains usernames and NOT email addresses. Warning: If your mail command is near the end of your batch script, put a "sleep 5" after the mail command to ensure it sends the mail before the job terminates.
Note about stdout/stderr:
Some useful, but less common options for a qsub script are:
Example Batch Job
An example single core job script:
#!/bin/ksh
#$ -V # export all environment
#$ -N test
#$ -o stdout.$JOB_ID
#$ -e stderr.$JOB_ID
#$ -q normal
#$ -l h_rt=00:10:00
. /etc/profile
module load compiler/intel
./a.out
Interactive Batch Jobs
Debugging and other large-scale interactive work can be done interactively on the compute nodes with:
solar> qrsh -q int -pty n
Note: The -pty n option is required, otherwise SGE will consume extra CPU resources.
Options for qstat
Some useful options for qstat are:
MPI, OpenMP and ensemble/background Jobs
To use multiple processors you must specify a parallel environment with the -pe qsub option. Only one parallel environment, mpi, has been defined, and it should be used for OpenMP as well as MPI jobs:
#!/bin/ksh
#$ -V
#$ -cwd
#$ -N test
#$ -q normal
#$ -pe mpi 4
. /etc/profile
module load compiler/intel
export OMP_NUM_THREADS=4
./a.out
and an example MPI job using 16 cores:
#!/bin/ksh
...
#$ -pe mpi 16
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
/opt/bom/bin/mprun.py ./a.out
Jobs running multiple simultaneous models should also use the parallel environment to ensure each process gets its own core:
#!/bin/ksh
...
#$ -pe mpi 4
./a.out &
./a.out &
./a.out &
./a.out &
wait
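A plain wait does not report whether any of the background processes failed. As a sketch of one way to check, the hypothetical function run_ensemble below starts a given number of copies of a command, waits for each by process ID, and counts the failures. It is an illustration, not part of SGE.

```shell
# Hypothetical helper: start N copies of a command in the background,
# wait for each one, and count how many exited with a non-zero status.
run_ensemble() {
    n=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &
        pids="$pids $!"       # remember the process ID of each member
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed member(s) failed"
    [ "$failed" -eq 0 ]       # return success only if every member succeeded
}
```

In a job script requesting -pe mpi 4, run_ensemble 4 ./a.out || exit 1 would start four copies and stop the job if any of them failed.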
Scheduling issues
From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime and memory). Otherwise your job may be queued longer than necessary. Of course, make sure you ask for sufficient resources - a little monitoring of resources used by past jobs might help.
Submitting Jobs from other hosts
Jobs can be submitted to solar from external hosts. For example, to submit solar jobs from flurry execute the following command:
flurry> env SGE_CELL=solar qsub jobscript
and to check your job with qstat execute the following command:
flurry> env SGE_CELL=solar qstat
Using MPI
MPI is a parallel program interface for explicitly passing messages between parallel processes - you must have added message passing constructs to your program and include the MPI header file in your source. Then compile and link with an environment set up to make MPI available.
Compiling and linking
OpenMPI is the recommended and standard MPI library to use on solar. All operational and operational development systems MUST use OpenMPI.
Simply typing module load compiler mpi will load the SunStudio modules.
For Fortran, use this include directive in the source (in either fixed- or free-form) of any program unit using MPI:
INCLUDE 'mpif.h'
and compile with a command like:
% mpif90 myprog.f -o myprog.exe
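or, if you are using Intel MPI:
% mpiifort myprog.f -o myprog.exe
For C, use this include directive in the source:
#include <mpi.h>
and compile with a command similar to:
% mpicc myprog.c -o myprog.exe
or, if you are using Intel MPI:
% mpiicc myprog.c -o myprog.exe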
Intel MPI
Anyone requiring access to Intel MPI must first contact the HPCCC and provide sufficient reason. If access is granted, any such use will be on a limited set of nodes for a short period of time.
The default Intel MPI (3.2.2) on solar cannot use the dual-rail Infiniband network. The test version (4.0 beta) supports dual rail but is not yet official. The use of Intel MPI on solar is experimental only until the new version is officially released and its installation and testing are complete.
Note: With Intel MPI, compile Fortran with mpiifort, NOT mpif90, which would use the GNU Fortran compiler.
Running MPI jobs
Use the mprun.py command to start an MPI executable when running within a batch job.
Note: By default mprun.py is not in your $PATH, so you must either add it to your $PATH or run with the full pathname, as in the example below.
Submit a job to the batch system with a command like:
% cat myjob.q
#!/bin/ksh
#$ -pe mpi 8
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
/opt/bom/bin/mprun.py ./a.out
% qsub myjob.q
In the above example one process will be started on each requested core.
Some useful parameters to mprun.py are:
The -ppn option is valuable if you wish to distribute work evenly over nodes. For example, to distribute a 16-process MPI job evenly across 2 nodes, use:
#!/bin/ksh
#$ -pe mpi 16
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
/opt/bom/bin/mprun.py -ppn 8 ./a.out
This requests 2 full nodes (-pe mpi 16 and -l excl=true) and starts 8 MPI processes on each.
For an asymmetric allocation of MPI processes to nodes use the -h <logical node number> -n <number of processes> options. For example, for a 2 node job that requires 4 processes on one node and 7 on another node, use:
#!/bin/ksh
#$ -pe mpi 16
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
/opt/bom/bin/mprun.py -h 0 -n 4 -h 1 -n 7 ./a.out
For a job that requires whole nodes but not all cores on each node, you can use the -ppn option. For example, for a 2 node job with 6 processes on each node for a total of 12 processes, use:
#!/bin/ksh
...
#$ -pe mpi 16 # Request 16/8=2 nodes
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
/opt/bom/bin/mprun.py -ppn 6 ./a.out
Hybrid MPI and OpenMP
If your job uses OpenMP within each MPI task, specify the total number of cores needed with #$ -pe mpi, use OMP_NUM_THREADS for the number of OpenMP threads per process and the -ppn option to mprun.py to start the appropriate number of MPI tasks:
#!/bin/ksh
#$ -pe mpi 16
#$ -l excl=true
. /etc/profile
module load compiler/intel mpi/sun
export OMP_NUM_THREADS=4
/opt/bom/bin/mprun.py -ppn 2 ./a.out
Warning: If OMP_NUM_THREADS multiplied by -ppn is greater than -pe mpi, the threads will interfere with each other as they contend for the insufficient number of allocated cores.
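The arithmetic behind this warning can be checked before submitting. The sketch below is hypothetical and makes two assumptions: each node has 8 cores (2 quad-core processors, as described in About the SUN Constellation), and whole nodes are requested with -l excl=true. It applies the check per node, which is stricter than the wording of the warning: it multiplies the number of nodes by -ppn and by OMP_NUM_THREADS and compares the result with the -pe mpi value.

```shell
# Hypothetical helper: check a hybrid MPI/OpenMP layout before submitting.
# ASSUMPTIONS: 8 cores per node and whole nodes requested (-l excl=true).
check_hybrid_layout() {
    slots=$1        # value given to "-pe mpi"
    ppn=$2          # value given to "mprun.py -ppn"
    threads=$3      # value of OMP_NUM_THREADS
    cores_per_node=8
    # Round up to whole nodes, then count the threads that will be started.
    nodes=$(( (slots + cores_per_node - 1) / cores_per_node ))
    needed=$(( nodes * ppn * threads ))
    if [ "$needed" -gt "$slots" ]; then
        echo "oversubscribed: $needed threads on $slots cores"
        return 1
    fi
    echo "ok: $needed threads on $slots cores"
}
```

For the job above, check_hybrid_layout 16 2 4 reports 16 threads on 16 cores, while raising -ppn to 4 would be reported as oversubscribed.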
Using OpenMP
OpenMP is an extension to standard Fortran, C and C++ to support shared memory parallel execution. Directives have to be added to your source code to parallelize loops and specify certain properties of variables.
Compiling and linking
The C/C++ and Fortran compilers from both Intel and Sun accept the -openmp and -xopenmp options respectively to honour OpenMP directives and generate parallel code.
Intel Compiler:
% $(COMPILER) -openmp ...
SUN Compiler:
% $(COMPILER) -xopenmp -L/opt/sun/sunstudio12.1/rtlibs/amd64 ...
Running OpenMP jobs
To run an OpenMP job, first set the OMP_NUM_THREADS environment variable and then run the executable. See MPI, OpenMP and ensemble/background Jobs for notes and examples of running an OpenMP job under SGE.
Performance may be improved by pinning OpenMP threads to processors - this ensures that the same thread always executes on the same CPU, thus ensuring that the local cache retains relevant data for the thread.
To enable thread pinning (also called processor affinity) under Intel OpenMP, set the environment variable KMP_AFFINITY to either compact (neighbouring threads will occupy neighbouring processors) or scatter (spreads the threads as widely as possible; specify this if you are requesting more cores than you will use, to give each thread a larger amount of dedicated local cache).
Using SunStudio, set SUNW_MP_PROCBIND to TRUE to use thread pinning.
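In a ksh or bash job script these settings are exported before the executable is started. The following is a minimal sketch; the thread count of 4 and the choice of compact are illustrative only.

```shell
# Settings for a 4-thread OpenMP run built with the Intel compiler.
export OMP_NUM_THREADS=4
export KMP_AFFINITY=compact      # or: scatter

# Corresponding setting for an executable built with SunStudio.
export SUNW_MP_PROCBIND=TRUE
```

Only the variable that matches the compiler used to build the executable has any effect, so a job script would normally set just one of the two pinning variables.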
Common problems
One of the most common problems encountered after parallelizing a code is the generation of floating point exceptions or segmentation violations that were not occurring before. This is usually due to uninitialized variables - check your code very carefully.
Other Documentation
More detailed SUN Constellation documentation is available from the HPCCC User Documentation page.
Known Problems
A bug in the lustre file system can cause a FORTRAN program to report a 'file not found' when opening a file even if the file exists.
This happens in the following circumstances:
The bug is already fixed in the current lustre development trunk, but it will take some time before an updated lustre version is installed on solar.
So for now users have to use one of the following workarounds:
This is the preferred solution, since it will increase the readability of the code (it documents that this file is read and not written), and is slightly faster (it uses fewer system calls to open the file, although the difference will not be measurable).
Other options would be to make the file writeable, or to read only from one MPI process and distribute the contents via MPI calls, but these solutions appear to be inferior to the ones listed above.
Changelog
Here is a list of recent updates in this userguide for quick reference for users returning to this guide.
To Do
Here is a list of pending updates to this userguide.