Bulletin 120 - 2004 Jul 21

  1. Moving of HPCCC staff and offices
  2. New cluster systems
  3. CSIRO Applications Software Initiative
  4. SYSTEM CHANGE NOTICE 2004-C001 Separation of CSIRO multi-CPU jobs
  5. SGI Altix (cherax) updates
  6. SYSTEM CHANGE NOTICE 2004-C003 Upgrade to CSIRO SGI Altix Programming Environment
  7. SYSTEM CHANGE NOTICE 2004-C002 Local Corrections and tempd
  8. SYSTEM CHANGE NOTICE 2004-B002 SXCross structure change
  9. Interactive use of the SX-6 nodes

1. Moving of HPCCC staff and offices

The HPCCC offices and staff are scheduled to move from 150 Lonsdale Street to 700 Collins Street this week.

Please continue to send in reports of problems or requests for help, but during the move we may not see some requests for a while, and will be slower to respond.

We expect 'phone numbers and e-mail addresses to remain the same, but CSIRO staff members may have subsequent 'phone number changes to enable access to the CSIROTel network. The postal address will stay the same, but the street address will be:

HPCCC
Level 11
700 Collins Street
Docklands Vic 3008.

The fax may not work for some time, and the number might change from 03 9669 8112 - please send any pending registration requests as soon as possible, and then please call us first if you need to send a fax.

2. New cluster systems

CSIRO has taken delivery of two IBM eServer Cluster 1350 systems.

The first system is for CSIRO Marine Research and CSIRO Atmospheric Research for an external project, and the second system has been acquired by the HPSC.

Each system is built from nodes with dual 3.2 GHz Xeon processors, and consists of a head node plus computational nodes: 20 computational nodes in cluster one ("nelson") and 32 in cluster two ("burnet").

The total peak speed is 691.2 Gflop/s.

Each system is capable of expansion to 140 processors in a single rack, or 1.8 Tflop/s peak total.
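
As a rough check of the figures above, the following sketch reproduces the peak speeds. It assumes 2 floating-point operations per clock cycle per processor, which the bulletin does not state, and that the head nodes are counted in the total; with those two assumptions the arithmetic matches 691.2 Gflop/s and 1.8 Tflop/s.

```python
# Assumption (not stated in the bulletin): each Xeon processor completes
# 2 floating-point operations per clock cycle.
FLOPS_PER_CYCLE = 2
CLOCK_GHZ = 3.2
PROCESSORS_PER_NODE = 2


def peak_gflops(processors: int) -> float:
    """Peak speed in Gflop/s for the given number of processors."""
    return processors * CLOCK_GHZ * FLOPS_PER_CYCLE


# Assumption: each cluster's head node is included in the quoted total.
nelson_nodes = 1 + 20
burnet_nodes = 1 + 32
installed = peak_gflops((nelson_nodes + burnet_nodes) * PROCESSORS_PER_NODE)

# Both systems expanded to 140 processors each.
expanded = 2 * peak_gflops(140)

print(f"installed: {installed:.1f} Gflop/s")
print(f"expanded:  {expanded / 1000:.1f} Tflop/s")
```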

The HPSC will make the second cluster available to targeted users, particularly for those applications which have not previously been ported to HPC systems.

3. CSIRO Applications Software Initiative

CSIRO HPSC has announced an initiative to make a large range of third-party software available to users.

Initially, Gaussian, GaussView and a large range of 'engineering' software from MSC (e.g. Nastran) will be available. The licensing for the MSC software is based on tokens: the full range of software is installed, but only a limited number of tokens is available at any one time to run it. The software can be run on local machines as well as on the HPSC systems.

Because of the cost of the licences, CSIRO HPSC is proposing a software access agreement and a flexible means for users and groups to contribute to the costs.

Please contact Len Makin on 03 9669 8109, Len.Makin@csiro.au for further information.

4. SYSTEM CHANGE NOTICE 2004-C001 Separation of CSIRO multi-CPU jobs

TARGET DATE OF CHANGE: 2004-07-22 10:30 AEST but may be delayed

SYSTEMS AFFECTED: sx600 sx601 sx602 sx603 sx604

SOFTWARE AFFECTED: NQS II job servers

IMPACT: At frequent intervals, the CSIRO workload is badly distributed across the five nodes. The Ocean development multi-node job runs on nodes sx600, sx601 and sx602, taking 7 CPUs on these nodes for most of the time. Other multi-CPU jobs on these three nodes get CPU time only briefly, while the nodes sx603 and sx604 are often under-utilised.

Until the SYNC patch is installed to improve file system reliability, we will not invoke job migration. As an interim measure, we will stop jobs in the queue csml from using nodes sx600, sx601 and sx602. This will give better turnaround for these jobs.

DETAILS: The csml queue will be stopped, and csml jobs drained from nodes sx600, sx601 and sx602. (The ocean job will be suspended during this process to speed up the draining.) The nodes sx600, sx601 and sx602 will then be removed from the csml queue.

CONTACT: Rob. Bell.

NOTICE APPROVAL: Rob. Bell.

5. SGI Altix (cherax) updates

5.1 Retrieval problem

There is a current problem on the Altix where attempts to execute a migrated file fail. A message of the form:

 ./a.out: Input/output error.

is issued when the retrieval fails (usually after several minutes, as the system tries to retrieve from the disc cache and two tape types).

We will get the problem addressed by SGI. In the interim, please issue a dmget command before trying to execute a migrated file.
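
The workaround above (recall first, then execute) could be wrapped in a small helper along the following lines. This is only a sketch: the function names are hypothetical, and it assumes that dmget is on the PATH and is called with just the file name.

```python
import subprocess
from typing import List


def recall_then_run_commands(program: str, *args: str) -> List[List[str]]:
    """Return the two commands to run, in order: recall, then execute."""
    return [["dmget", program], [program, *args]]


def recall_then_run(program: str, *args: str) -> int:
    """Recall a migrated executable with dmget, then run it.

    Returns the exit status of the first command that fails, or of the
    program itself if the recall succeeds.
    """
    exit_code = 0
    for command in recall_then_run_commands(program, *args):
        # check=False: report the exit status instead of raising.
        exit_code = subprocess.run(command, check=False).returncode
        if exit_code != 0:
            break
    return exit_code
```

For example, recall_then_run("./a.out") would issue dmget ./a.out and, if that succeeds, run ./a.out.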

5.2 Locality problems

Although the Altix provides a globally addressable memory, the memory is in fact physically divided into separate memories local to each node. Users may notice variation in the execution time of programs. This can be caused by locality issues: when the code is executed on one processor, but the code and data are actually stored at a far-off node, the execution time can be around twice as long.

There are mechanisms for controlling this (cpusets), and we are investigating them in conjunction with some users.

5.3 Slow compilation with Intel Version 8 compilers

We noticed some long compilation times with the Intel version 8 compiler, and investigated. It was an issue with access to our new licence server, and is now resolved.

5.4 Batch jobs, interactive usage and limits

In order to preserve good responsiveness for many users on cherax, it is important that major tasks be run in the batch system.

To help this process, we would like to implement some interactive limits on Saturday 2004-07-31.

We suggest initial limits of 30 CPU minutes per process.

Interactive work will fall under a bootcpuset (system + interactive) to provide better control.

See also the following item.

6. SYSTEM CHANGE NOTICE 2004-C003 Upgrade to CSIRO SGI Altix Programming Environment

TARGET DATE OF CHANGE: Saturday 2004-07-31 10:00 - 18:00

SYSTEMS AFFECTED: cherax

SOFTWARE AFFECTED: SGI ProPack

IMPACT: cherax will be unavailable from 10:00 to 18:00 (eight hours) on Saturday 2004-07-31. This will affect other systems which cross-mount file systems from/to cherax, and SX-6 users who are expecting to transfer files to or from cherax as part of a job to be run during this time. Potentially all CSIRO HPSC users and some Bureau users will be affected.

DETAILS: SGI ProPack 3.x provides enhancements designed for the Altix.

For Users:
Application performance measuring tools - VTune, pfmon;
NUMA tools - dlook, dplace, etc.;
Performance Co-Pilot - performance monitoring and management;
runon - enables running a command on a particular CPU or set of CPUs.

For System Administrators:
Array Services - tools with kernel support that simplify the management of systems and parallel applications;
CpuMemSets - provide kernel support and infrastructure for implementing processor and memory placement;
Cpuset System - used to create a division of CPUs within a larger system;
FLEXlm - floating licence service, run-time environment;
kdb - kernel debugging;
Kernel partitioning - support for a partitioned system, including cross-partition communication support;
LKCD - system crash dump analysis tools.

Some additional locality testing may also be carried out depending on time available.

The outage may take less than 4 hours, in which case the system may be returned to service earlier than expected.

Interactive limits will be imposed.

CONTACT: Jeroen van den Muyzenberg

NOTICE APPROVAL: Rob. Bell.

7. SYSTEM CHANGE NOTICE 2004-C002 Local Corrections and tempd

TARGET DATE OF CHANGE: 2004-07-27 10:30 AEST

SYSTEMS AFFECTED: SX600 SX601 SX602 SX603 SX604

SOFTWARE AFFECTED: Korn shell, mv, cp, and tempd.

IMPACT: CSIRO users executing ksh, mv or cp at the time of the change may have problems. We will attempt to hold all running jobs. Any jobs that cannot be held may fail.

DETAILS: The local correction LC310082 replaces the cp and mv commands and enables the -b option. LC310083 replaces ksh and is a bug fix. The tempd daemon will be enabled and will start monitoring and cleaning up /ltmp. These changes have been made to SX611 and have been in place for 1 week already with no problems reported.

CONTACT: J.Stern

NOTICE APPROVAL: Rob. Bell.

8. SYSTEM CHANGE NOTICE 2004-B002 SXCross structure change - reminder to users

TARGET DATE OF CHANGE: 2004-07-26 10:30 AEST

SYSTEMS AFFECTED: Gale

SOFTWARE AFFECTED: Software in /SX/SUPERUXR131 and /SX/SUPERUXR121

IMPACT: Users who have not started using the new sxcross structure may lose settings or have incorrect settings, e.g. paths, environment variables, etc.

DETAILS: Users have been urged to use the new cross environment as per previous notices (see hpcbull 118), in particular System Change Notice 2004-02.

The old cross structure used non-standard locations for libraries, includes, and other related software. This is the final stage in the standardization of the cross environment.

As previously advised, the directories /SX/SUPERUXR121 and /SX/SUPERUXR131 will be made unavailable as they are not part of the new structure.

Users are referred to http://www.hpsc.csiro.au/hpccc/userdocs/ug-sx.shtml section "Using sxcross" and also hpcbull 118.

Users should check their scripts etc. and make sure they have updated their environments as described in hpcbull 118 and the user documentation referenced above.
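
One way to carry out such a check is sketched below: it walks a directory tree and reports every line that still mentions either of the two directories being withdrawn. The function name and the directory passed to it are hypothetical; only the two /SX paths come from this notice.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Directories named in this notice as being made unavailable.
OLD_PATHS = ("/SX/SUPERUXR121", "/SX/SUPERUXR131")


def find_old_references(root: Path) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line mentioning an old path."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps binary or oddly encoded files from
            # stopping the scan.
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if any(old in line for old in OLD_PATHS):
                yield path, number, line.strip()
```

For example, looping over find_old_references(Path.home() / "scripts") and printing each result lists the scripts that still need updating; the "scripts" directory here is only an illustration.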

CONTACT: J.Stern

NOTICE APPROVAL: HPCCC Manager

9. Interactive use of the SX-6 nodes

We recently had a case where a user did some tests on an SX-6 node by logging in and running tasks interactively.

Unfortunately, upon logout the user left behind some processes, including one which continued to increase in size. Eventually, it occupied nearly all the node memory, and terminated, but not before it caused other users' jobs to crash. Then another process with similar characteristics started again!

Please do not use the SX-6 nodes interactively for tests.

Interactive testing has resulted, and can again result, in SX-6 operational problems caused by various programming errors. Any potential system impact can be easily avoided through use of the batch facilities as outlined in the Local Userguide for the SX-6 Cluster at http://intra.hpsc.csiro.au/userguides/sx/

Everyone's cooperation is solicited to use the SX-6 only as a batch system, except when interactive debugging with TotalView, pdbx, sdb, etc. is necessary. Note that a "quick test" is not an acceptable interactive debugging session.



For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/
