Bulletin 117 - 2004 Jun 22

  1. Upgrade to SUPER-UX ksh, checkpointing and job migration
  2. Upgrade to GFS
  3. Throughput and Performance on the SX-6 system: reminders
  4. SX-6 MPI and MPI2
  5. Upgrade to the SGI Altix (cherax)
  6. New batch system on the SGI Altix (cherax)
  7. Sam/SamFS outage

1. Upgrade to SUPER-UX ksh, checkpointing and job migration

An upgrade to the Korn shell was recently installed on all the SX-6 nodes.

This upgrade corrected a problem that occurred when jobs resumed from a checkpoint: the job would fork, and the shell would continue even though preceding tasks had not completed.

As a result, checkpointing is now more reliable.

The HPCCC will now consider invoking job migration in the scheduler, so that jobs on busy nodes can be checkpointed and moved to less busy nodes automatically.

2. Upgrade to GFS

NEC has supplied a kernel patch to allow GFS file systems to be mounted with an option to force synchronous writes to GFS disc. This will provide greater integrity and higher resiliency for the GFS file systems.

The installation will require a reboot of the SX-6 nodes. Implementation is expected in one to two weeks from now.

Tests showed that, with the synchronous option, GFS I/O may slow down by between 0 and 5%. The HPCCC regards this slowdown as acceptable for the gain in reliability.
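
As a rough guide to what this means for a whole job, the following sketch estimates the worst case for a hypothetical job. The job figures (3000 s elapsed, of which 600 s is GFS I/O) are invented for illustration, and the slowdown is read here as a proportional increase in I/O time, which is an assumption about how the test figure was expressed.

```python
def io_time_with_slowdown(baseline_io_seconds, slowdown_fraction):
    """I/O time after a slowdown, treated as a proportional increase in I/O time."""
    return baseline_io_seconds * (1.0 + slowdown_fraction)

# Hypothetical job, for illustration only (not a measurement from the SX-6).
total_seconds = 3000.0   # elapsed time before the change
io_seconds = 600.0       # portion of that time spent in GFS I/O

# Worst case quoted in the tests: 5% slower GFS I/O.
worst_io = io_time_with_slowdown(io_seconds, 0.05)
worst_total = total_seconds - io_seconds + worst_io
extra_seconds = worst_total - total_seconds
extra_percent = 100.0 * extra_seconds / total_seconds

print(f"I/O time: {io_seconds:.0f} s -> {worst_io:.0f} s")
print(f"Elapsed:  {total_seconds:.0f} s -> {worst_total:.0f} s (+{extra_percent:.1f}%)")
```

Under these assumptions the effect on elapsed time is smaller than the I/O slowdown itself, because only the I/O portion of the job is affected. Jobs dominated by GFS I/O would see an increase closer to the full 5%.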

3. Throughput and Performance on the SX-6 system: reminders

The SX-6 is a cluster supercomputer. As on any cluster, the performance and throughput of parallel jobs depend as much on system scheduling, and on "fitting" the application to the system, as on traditional optimisations. GFS I/O performance is complex in its own right, and certain system-level actions can improve I/O on a case-by-case basis.

All users experiencing performance or throughput problems, and all users with new production (especially real-time) applications, are strongly encouraged to:

  1. consult with HPCCC staff at the design stage to discuss processors, I/O, application configuration, and scheduling issues that may be relevant
  2. arrange/conduct monitoring and performance data collection for new applications pre- and post-deployment, so that consultative tuning can be done to best fit your application onto the system
  3. always remember that system decisions can have as much impact on application performance as your own optimisation work, or more; it is important to keep the big picture in mind and to recognise that decisions which limit or even reduce an application's performance may provide better overall outcomes because of system and scheduling tradeoffs

Key performance issues today are: system scheduling (HPCCC), I/O (HPCCC and programmers), MPI sophistication (programmers), and finally application code optimisations (programmers). The only truly useful metric is elapsed time to completion or, for multiple jobs, throughput. All programmer optimisations should be assessed by their effect on elapsed time, with specifics such as Gflop/s being secondary. For example, elapsed time and Gflop/s can both go down at the same time.
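
The last point can seem counter-intuitive, so here is a small worked illustration. The operation counts and timings are hypothetical, chosen only to show the arithmetic, and are not SX-6 measurements. Version B stands for a job that uses a better algorithm needing far fewer floating-point operations.

```python
def gflops_rate(flop_count, elapsed_seconds):
    """Sustained rate in Gflop/s for a given operation count and elapsed time."""
    return flop_count / elapsed_seconds / 1e9

# Hypothetical figures for illustration only; not SX-6 measurements.
# Version A: a simple algorithm that performs many floating-point operations.
flops_a, elapsed_a = 4.0e12, 1000.0
# Version B: a better algorithm that needs far fewer operations overall.
flops_b, elapsed_b = 1.2e12, 600.0

rate_a = gflops_rate(flops_a, elapsed_a)
rate_b = gflops_rate(flops_b, elapsed_b)

print(f"A: {elapsed_a:.0f} s elapsed, {rate_a:.1f} Gflop/s")
print(f"B: {elapsed_b:.0f} s elapsed, {rate_b:.1f} Gflop/s")
```

With these figures, version B finishes in 600 s instead of 1000 s while its rate drops from 4 Gflop/s to 2 Gflop/s. Judged by Gflop/s alone it looks worse, yet it is the better job by the metric that matters.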

4. SX-6 MPI and MPI2

All users with MPI-coded applications should ensure that all tasks are linked using the R13.1 system and MPI libraries. MPI tasks created with previous versions are known to produce various errors and problems.

Pre-R13.1 MPI and system libraries are not always compatible with R13.1. If you have an MPI operational problem, please ensure that you have used the new system for all routines and tasks before reporting it.

This same situation can occur with any system or MPI/MPI2 level upgrade. Please note that when the SX-6 operating system is upgraded, or a new level of MPI/MPI2 is announced, you may again have to relink your MPI application for it to function properly.

Typical problems will be MPI initialisation failures, task stalls, or newly discovered communications failures on applications that previously functioned properly.

5. Upgrade to the SGI Altix (cherax)

cherax is scheduled to be upgraded later this week to 64 processors and 120 Gbyte of memory to support expanding demand.

This will involve one or two lengthy interruptions, since the system has to be moved from its present location and joined to another cabinet.

The first of these outages is planned for tomorrow, Wednesday 23rd June, starting at 12:00 local time for the remainder of the day.

Users will be notified via messages of the day and wall messages of any further interruptions.

Please note also that systems such as farrer (the portal), which rely on cherax for file services, will also be unavailable.

Tasks running on the TX-7s will not be able to send outputs to cherax during the interruption.

We hope the gain will be worth the pain.

6. New batch system on the SGI Altix (cherax)

As advised in HPCbull 116, item 4, a new batch system has been installed on cherax to support the planned 64 processors.

Please switch to the new batch system by entering the command:

 pkgenv torque

The older PBS Pro system will not be available after the upgrade because of licence restrictions. All jobs running or queued in the old system at the time of the upgrade will be lost.

7. Sam/SamFS outage

System Change Notice: 2004-0706.08
Date and Time of Change: 06/07/2004 from 08:30 to 11:00 AET
Systems Affected: Samsrv2, sam
Software Affected: -
User Action Required: Sam will be unavailable.

During the outage, an additional T9940B tape drive will be integrated into Sam and the filesystems will be checked.

During the last outage(s), 13th-18th June, two additional T9840A tape drives were added to Sam. Currently Sam has access to 5 T9940B tape drives and 4 T9840A tape drives.



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

Comments to:


© Copyright 2010, CSIRO Australia