Bulletin 177 - 2008 June 10

  1. Supercomputer RFT
  2. Large Scale Data Storage System
  3. CSIRO Datastore 1000 Tbyte milestone
  4. SX Cross Environment Modules
  5. BOM SX scheduling change
  6. Long-running jobs: re-runs and file backups
  7. CSIRO Cluster Merge and Refresh
  8. CSIRO Altix (cherax) upgrade
  9. CSIRO Web page updates
  10. CSIRO New Clayton Cluster System
  11. CSIRO New Software on cherax

Note: "CSIRO" items can apply to BoM users of cherax and burnet


1. Supercomputer RFT

The Bureau of Meteorology through the HPCCC and the Australian National University, in conjunction with CSIRO, have issued a tender for new supercomputer systems. This initiative is a part of the National Computational Infrastructure (NCI).

The tender closed on 29th May.

Details can be obtained through the Australian Government Tender Website.




2. Large Scale Data Storage System

The Bureau of Meteorology has signed a contract with Sun for the delivery of a Large Scale Data Storage System. The contract provides for the delivery of equipment and services.

A new 10,000 slot StorageTek SL8500 tape library has been installed in the Central Computer Facility (CCF) at 700 Collins St, with T10000 and LTO4 tape drives. Disc units have also been delivered.

CSIRO Advanced Scientific Computing (ASC) has also ordered T10000 tape drives, media and tape library, to continue to support the expansion of the CSIRO Data Store at the HPCCC.

Below is a comparison of existing and new tape drives.

DRIVE     CAPACITY (Gbyte)   TRANSFER RATE (Mbyte/s)   AVERAGE ACCESS TIME
T9940B    200                30                        59 sec
T9840C    40                 30                        12 sec
T10000    500                130                       62 sec
LTO4      800                120                       57 sec

Capacities and transfer rates are for native (uncompressed) data.
Access time is load time plus time to search from beginning of tape to midpoint.
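As a rough illustration of what these figures mean in practice, the sketch below estimates how long each drive would take to stream one full tape at its native transfer rate. It assumes 1 Gbyte = 1000 Mbyte and ignores load and search time. The function name fill_minutes is ours, used only for this example.

```shell
# Minutes to stream one full tape at the native transfer rate.
# Assumes 1 Gbyte = 1000 Mbyte; ignores load and search time.
fill_minutes() {
    # $1 = capacity in Gbyte, $2 = transfer rate in Mbyte/s
    awk -v cap="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", cap * 1000 / rate / 60 }'
}

# Drive name, capacity and rate, taken from the table above.
for drive in "T9940B 200 30" "T9840C 40 30" "T10000 500 130" "LTO4 800 120"; do
    set -- $drive    # deliberately unquoted: split into name, capacity, rate
    printf '%-7s %4s min\n' "$1" "$(fill_minutes "$2" "$3")"
done
```

On these assumptions, a T10000 fills a tape in roughly an hour, while the larger LTO4 tape takes nearly two hours at its slightly lower rate.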




3. CSIRO Datastore 1000 Tbyte milestone

On Monday 19th May, the total holdings in the CSIRO Data Store reached 1000 Tbyte for the first time. Four years ago, only 48 Tbyte was being managed. The amount of data stored has more than doubled each year.

The Bureau holds about 1500 Tbytes.




4. SX Cross Environment Modules

New installations of software for the SX cross environment will be accessible *only* via Environment Modules. The pkgenv and sxcross utilities will continue to work for existing software, but will not be supported for newly installed software. For more information about configuring your SX cross environment, read the Cross-systems section of the SX-6 user guide at http://www.hpccc.gov.au/hpccc/userguides/sx/




5. BOM SX scheduling change

As advised in HPCbull 175, we wish to trial a new way of allocating jobs to CPUs on the SX-6 nodes, as already trialled on the CSIRO nodes for several months.

To enable the trial to be carried out, please immediately reduce the maximum number of CPUs per node requested by jobs from 8 to 7.

        #PBS -l cpunum_prc=7
        #PBS -l cpunum_job=7

Please also stop requesting more than one CPU for jobs going to the bm queue.

Trials are then scheduled to be carried out from Tuesday 1st July.
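One way to find job scripts that still request more than 7 CPUs per node is sketched below. The function name find_big_cpu_requests and the file name myjob.sh are hypothetical, and the pattern assumes that each request appears on a #PBS line in the form shown above.

```shell
# Print any "#PBS" line whose cpunum_prc or cpunum_job request exceeds 7.
# Reads the job script named in $1; output format is "line-number: line".
find_big_cpu_requests() {
    awk '/^#PBS.*cpunum_(prc|job)=[0-9]+/ {
             n = $0
             sub(/.*cpunum_(prc|job)=/, "", n)   # keep only the text after "="
             if (n + 0 > 7) print NR ": " $0     # n + 0 reads the leading number
         }' "$1"
}
```

For example, "find_big_cpu_requests myjob.sh" prints nothing if the script already complies.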




6. Long-running jobs: re-runs and file backups

A recent problem on one of the HPCCC/ASC systems caused a batch job to abort. By default, the batch system restarts such a job from the beginning. In this case, the re-run overwrote files from the previous run that could still have been useful.

To avoid this behaviour, specify the parameter #PBS -r n

A further issue was revealed in this incident: files which are open to jobs are not caught by most of the backup processes.

The user asked us to restore files from before the rerun, but even though the files were in a backed-up area ($HOME), there were no backups.

It is often better to run jobs in smaller sub-runs, to put output files in $TMPDIR or $WORKDIR areas, to break output files up into small chunks and close the files after they are written to, and to copy the files to a backed-up or migrating file system when the job is complete.

Since files in $TMPDIR are removed at the end of each run, special measures (such as shell traps) are needed to save files from there if a job fails.
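A minimal sketch of such a trap in a Bourne shell job script follows. SAVE_DIR, RUN_DIR and the function save_outputs are hypothetical names chosen for this example, and the echo line stands in for the real work. A trap cannot run if the node itself fails or the job is killed outright, so copying results out between sub-runs remains worthwhile.

```shell
#!/bin/sh
# Ask the batch system not to re-run this job from the start after a failure.
#PBS -r n

# SAVE_DIR is a hypothetical destination; point it at a backed-up or
# migrating file system. RUN_DIR is a per-job scratch directory.
SAVE_DIR="${SAVE_DIR:-$HOME/job_output}"
RUN_DIR="${TMPDIR:-/tmp}/run.$$"

save_outputs() {
    # Copy whatever output exists to SAVE_DIR, then remove the scratch area.
    mkdir -p "$SAVE_DIR" && cp -p "$RUN_DIR"/* "$SAVE_DIR"/ 2>/dev/null
    rm -rf "$RUN_DIR"
}

# Run save_outputs whenever the script exits; turn common signals into an
# exit so that the EXIT trap also fires for them.
trap save_outputs EXIT
trap 'exit 1' HUP INT TERM

mkdir -p "$RUN_DIR"
cd "$RUN_DIR" || exit 1

# Placeholder for the real work: write output in small files, closing each.
echo "step 1 complete" > out.001
```

Because the copy happens in the EXIT trap, the same code path saves the files whether the job finishes normally or stops on an error.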




7. CSIRO Cluster Merge and Refresh

A new system is nearly ready for initial user access. It is based on a merge and refresh of the existing cluster systems at Docklands (burnet and nelson), and will provide larger, higher-performance and more reliable storage.

  1. Hardware Environment

    The integration of the new hardware into our existing infrastructure is well under way. The new management and storage nodes are configured and functioning, and we are ready to start migrating nodes from the existing clusters.

  2. Software Environment

    The migration and porting of user applications to the new SLES 10 operating system is progressing well. See the burnet-usr section of: http://intra.hpsc.csiro.au/user/pkginfoweb/

More information will follow soon about the migration of users.




8. CSIRO Altix (cherax) upgrade

CSIRO ASC has ordered a replacement for the Altix system (cherax): a new SGI Altix 4700 system, with faster processors with bigger cache, an upgrade from 244 Gbyte to 512 Gbyte of memory, and a faster interconnect. The new system will occupy only one rack instead of the current four, and will use about half the power.

There will be several interruptions to service to bring the new system on-line. These interruptions will start with a down-time on cherax on Wednesday 18th June, when the existing cherax system will be reduced to only 64 processors, to make room for the new system.

Successive down-times over subsequent Wednesday evenings will take place until the change-over is complete.




9. CSIRO Web page updates

The CSIRO HPSC WWW pages have been re-badged with the new name - ASC (Advanced Scientific Computing). Relevant existing information has been retained. New information is presented on the CSIRO ASC Computational Bioinformatics Facility and the CSIRO ASC Condor infrastructure for cycle harvesting of desktop PC resources. Information about other facilities available has also been updated.

The information on software has been restructured to make it easier to find, particularly information about charging for software (we don't) and using ASC licensed software on computers elsewhere in CSIRO. The main software information page is at http://intra.hpsc.csiro.au/software/

Please see the WWW pages for more information.




10. CSIRO New Clayton Cluster System

CSIRO ASC has recently completed the installation of a new cluster system at Clayton to meet the immediate needs of research groups located at CSIRO Minerals, and also for use by the wider CSIRO research community. In addition, a new storage system has been ordered that will provide 20 Tbyte of disc space for multiple hosts.




11. CSIRO New Software on cherax
  • Octave, an alternative to Matlab, has been installed on cherax. The current versions of Octave support large array access (64-bit addressing).

    To access this version of Octave do:

       module load octave/2.9.12-64
    
  • CDAT 5.0

    Climate Data Analysis Tools (CDAT) has been installed on cherax.

    To access this version of CDAT do:

       module load cdat/5.0.0.alpha1
    





BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

Comments to:


© Copyright 2010, CSIRO Australia