Bulletin 191 - 2009 September 17

  1. Preparation for departure of SX-6
  2. SUN Constellation System at BoM
  3. SUN Supercomputer at NCI National Facility
  4. CSIRO Data Store
  5. CSIRO Data Store - Over-writing files
  6. CSIRO cluster - burnet and nelson storage reconfiguration: downtime
  7. CSIRO ASC Software Upgrades and Default Changes

1. Preparation for departure of SX-6

As mentioned in the previous HPCbull, the SX-6 will be replaced with SUN Constellation systems at BoM and at NCI-NF.

1.1 CSIRO users - migration must be completed by early December

Migration to NCI-NF, ASC shared systems (burnet, cherax) or other partner facilities should be completed by early December. After this time, the CSIRO SX-6 nodes (sx600-sx604) will no longer be available.

This will reduce CCF power consumption and hence leave sufficient power to run production codes in parallel on both the SUN Constellation and the SX-6.

As mentioned in item 3 (below), the new SUN Supercomputer at NCI-NF is expected to be available before the end of this month.

1.2 BoM/CAWCR users - migration to flurry and NCI-NF

The flurry cluster at BoM has a very similar environment to the BoM SUN Constellation system. The operating system is very similar and the same batch scheduler (SGE), compiler suites (SUN and Intel) and MPI (OpenMPI) libraries are available.

The NCI-NF's XE and SUN Constellation systems are available to CAWCR users through the CSIRO partner share. See item 3 (below) for details on how to get access.

Please contact the helpdesk (hpchelp@csiro.au or hpchelp@hpccc.gov.au) ASAP if you require assistance, and we will help you to get prepared for the departure of the SX-6.

2. SUN Constellation System at BoM

Migration of BoM production codes (NMOC Operational systems) to the SUN supercomputer is well underway on the Exemplar porting system (a small-scale version of the new supercomputer) and on flurry.

Most elements of most systems have been tested and checked. However, the testing by NMOC and CAWCR staff cannot be exhaustive because of the restrictions imposed by the small-scale systems available.

3. SUN Supercomputer at NCI National Facility

The following announcement was recently made to all NCI National Facility users:

"As previously emailed, the NF's SGI Altix (AC) will end its production service in a couple of weeks to be replaced by the first stage of the new Sun service. The approximate timeline (within a day or two) for this transition is:

  • Sept 18: Users to get access to the new system
  • Sept 22: AC compute service terminates
  • Oct 5: AC filesystems are no longer available

We apologise that these times are a bit imprecise at the moment - we will contact you again when the schedule is clearer and with login details etc. Unfortunately, we are progressing on a very tight timeline which will allow very little overlap of the old and new service. Although the user environment of the new system will be similar to AC, the architecture of the two systems is quite different. They have different CPUs and instruction sets, interconnects, MPI libraries, global filesystems etc. But as has been mentioned previously, the new system is, architecturally, very similar to the XE system. If you have tested your applications on XE, you can be confident of making a smooth transition.

We will transfer your AC home directory (minus all executables and object files) to a subdirectory of your home directory on the new system. We will endeavour to transfer /short files as well. If this is not feasible, there will be a period of almost two weeks where users can transfer their own data. Our preference is that you only transfer what is needed for running on the new system and archive to MDSS or your home institution other files. Any cleanup of /short deleting unnecessary files before the transition would be appreciated.

The scaling of allocation SUs is yet to be determined but the current plan is to set 1 SU = 1 walltime cpu hour on the new system and lower the charging rate on XE."

CSIRO (including CAWCR) access to NCI facilities at the Australian National University can be obtained through either the CSIRO partner share or the NCI Merit Allocation Scheme (MAS).

Applications for CSIRO Partner time are approved by Rob Bell. The application form can be accessed at https://nf.nci.org.au/accounts/projects_new/partner_user.php

For more information about the NCI Merit Allocation Scheme please read https://nf.nci.org.au/accounts/projects_new/MAC.php

4. CSIRO Data Store

We have been experiencing several problems with cherax, the CSIRO Data Store host, in recent weeks.

Firstly, the system has frozen on several occasions, with the number of processes waiting for I/O exploding. We have had to re-boot the system twice to recover. We have not yet identified the root cause of this issue, but are monitoring the system closely.

Secondly, there has been a huge increase in the amount of data being ingested into the system, culminating in 20 Tbyte being ingested in 3 days over the weekend 4th-7th September. This swamped the main 6.6 Tbyte /cs/datastore area, and resulted in the on-line copies of most files being removed.

On Monday morning 7th September, nearly everyone found that hardly any of their migratable files (>64 kbyte) were on-line any more, and there were thousands of recalls. This then overloaded the 12 Tbyte of cache disc, which averaged over 90% utilisation from about 07:00 to 14:00.

We have asked some of the bigger users to moderate their usage. We are developing tools to better identify the heaviest usage.

We are also working on plans to increase the capability of the storage system to meet demand.

For several years we have arranged for the main disc to have about 3 Tbyte of free space each business day morning, in the hope that this would be sufficient to last the day. This is no longer the case, so tape writes need to occur during the day, which slows down tape recalls.
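As a rough illustration of why 3 Tbyte of free space no longer lasts the day, the sketch below works through the figures quoted above (20 Tbyte ingested in 3 days, 3 Tbyte free each morning). It assumes, purely for illustration, that ingest is spread evenly over time and that the weekend rate would also apply on a business day; neither assumption comes from measured data.

```python
# Figures quoted in this item.
ingest_tbyte = 20.0   # data ingested over the weekend 4th-7th September
ingest_days = 3.0     # period over which it arrived
free_tbyte = 3.0      # free space arranged each business day morning

# Assumption: ingest is uniform over the period (illustrative only).
rate_per_day = ingest_tbyte / ingest_days

# Time for that ingest rate to consume the morning's free space.
hours_to_fill = free_tbyte / rate_per_day * 24

print(f"average ingest: {rate_per_day:.1f} Tbyte/day")
print(f"free space consumed after about {hours_to_fill:.1f} hours")
```

At that average rate, the daily ingest is roughly the size of the whole 6.6 Tbyte /cs/datastore area, and the morning's free space would be used up well before the end of a 24-hour day.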

The Data Store Userguide at http://intra.hpsc.csiro.au/userguides/ds/ has been updated to include recent changes, such as the useful dmget -a option.

5. CSIRO Data Store - Over-writing files

To add to the DMF recall load, we have found in recent weeks an increasing number of file recalls where a file is recalled and subsequently over-written, thus wasting the recall. We reported previously (HPCbull 189, item 8) this happening with utilities like scp and rcp, but it appears more widespread. System calls appear to open the file and then truncate its length to zero in several circumstances.

When designing processing of large numbers of files, or large quantities of data, please arrange to delete and recreate files rather than over-write them.
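A minimal sketch of the delete-and-recreate approach is given below in Python. The function name replace_file is hypothetical and used only for illustration. The idea is to remove the old file before writing, so that no call ever opens the existing (possibly off-line) file and truncates it.

```python
import os

def replace_file(path, data):
    """Write data (bytes) to path by deleting and recreating the file.

    Hypothetical helper: the old file is removed first, so it is never
    opened for writing and truncated in place.
    """
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # nothing to delete; this is the first write
    # O_EXCL makes the open fail rather than re-use a file that has
    # reappeared between the remove and the create.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "wb") as f:
        f.write(data)

# Usage (the path is illustrative only):
#   replace_file("results/run01.dat", b"new contents")
```

By contrast, open(path, "wb") on an existing file opens it and truncates it to zero length, which is the pattern described above.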

6. CSIRO cluster - burnet and nelson storage reconfiguration: downtime

We have experienced problems with the storage services on the burnet and nelson clusters. These have led to several interruptions to service, with the storage nodes having to be re-booted to recover.

We plan to re-configure the storage system, moving from GPFS to NFS to provide storage services. In addition, we will use a new disc subsystem, which should provide greater and more consistent performance on the $HOME and $DATADIR areas.

There will be outages on the Saturdays 26th September and 3rd October to allow the changes to be made, and to move all the files to the new disc system. A further outage may be needed on Saturday 10th October. Reservations will be put into the batch system in advance, so that no jobs will be running at the time of the outages.

In order to reduce the amount of data to be moved, we reserve the right to flush the $WORKDIR area and remove files not accessed or modified in the last 28 days at the time of the moves. Any other files that you can safely remove from $HOME, $WORKDIR and $DATADIR will help to shorten the down-times.
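To see in advance which of your files fall under the 28-day rule, something like the following Python sketch could be used. It only lists files and deletes nothing. The function name stale_files is hypothetical, and the sketch assumes that "not accessed or modified" means that both the access time and the modification time are older than 28 days.

```python
import os
import time

def stale_files(root, days=28, now=None):
    """Return sorted paths under root with atime and mtime both older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symbolic links
            except OSError:
                continue  # file vanished or cannot be examined; skip it
            # Assumption: a file is kept if EITHER timestamp is recent.
            if max(st.st_atime, st.st_mtime) < cutoff:
                stale.append(path)
    return sorted(stale)

# Usage, assuming $WORKDIR is set in your environment:
#   for path in stale_files(os.environ["WORKDIR"]):
#       print(path)
```

Reviewing the list before the outage lets you copy anything still needed elsewhere and remove the rest yourself.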

With NFS, there may be some changes in the behaviour seen by applications (similar to that seen under GFS on the SX-6/TX7 system). In particular, file operations are not strictly synchronised between nodes. For example, after you write to a file from one node, the change may not be immediately visible from other nodes.

7. CSIRO ASC Software Upgrades and Default Changes

The following have been recently installed:

  • Intel Math Kernel Library 10.2.1.017 for Linux (burnet, cherax)

    New features include:

    • LAPACK 3.2 support (238 new LAPACK functions)
    • Usability/Interface improvements
    • Performance improvements, including additional multi-threading support
    • FFTW3 interface now integrated directly into the main libraries

    For more information and usage instructions please see the software map - http://nf.nci.org.au/facilities/software/index.php?site=CSIRO

The following software will have the default version changed on or after the 5th of October:

  • R - default upgraded to 2.8.1 on cherax (was 2.4.0) and burnet (was 2.5.1)
  • Octave - default upgraded to 3.0.0 on cherax (was 2.1.71)

The default version of software can be loaded by specifying the software name without the version number in the 'module load' command.

For example:
  module load R

BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia