Bulletin 152 - 2006 February 24

  1. SX-6/TX7 Systems interruptions
  2. Modules package
  3. Extension of the NEC SX-GFS file system
  4. Seminar - GFS
  5. SX-6 shutdowns
  6. HPCCC Users' Liaison Meeting
  7. New NEC SX manuals
  8. Incident logs
  9. SX cross-compilation and cross-kits
  10. Storage growth
  11. Queue limits on the CSIRO IBM cluster burnet
  12. CSIRO HPSC staff changes

1. SX-6/TX7 Systems interruptions

There will be interruptions to service on the TX7 systems on 27th February (eccles) and 6th March (mawson) to upgrade the operating systems. See Change notice 2006-A003 at http://www.hpccc.gov.au/hpccc/user_news_advice/system_change_notices/


2. Modules package

In HPCbull 147.11, the HPCCC announced that it was investigating with NEC the feasibility of using the modules package to enable users to select software packages (similar to the pkgenv facility in use on some HPCCC/HPSC systems).

Since then, the HPCCC has received confirmation that NEC will support the use of the modules package for software selection.

There is a WWW page at http://www.hpccc.gov.au/hpccc/userguides/faq/env_modules.php giving preliminary information, and further details will be forthcoming.


3. Extension of the NEC SX-GFS file system

An SX-GFS client has recently been installed on gale.

This allows gale to access a large number of the SX-6/TX7 file systems directly.

NEC has provided versions of some of the standard utility programs which have been configured to use large block sizes, so that the I/O is done directly to the FC-connected discs, rather than via NFS over Ethernet links. Special library versions are also provided to give applications fast access.

The following item announces a seminar on the use of the SX-GFS client. We also plan to produce documentation.

A key advantage of direct access to the SX-6/TX7 file systems is visibility. One application used to transfer about 20 Gbyte per run to gale for possible inspection. This transfer has been removed, and files are now inspected as needed directly on gale, with perhaps only a few hundred Mbyte looked at. This is a large saving in load, duplicated storage and user inconvenience.

NEC has announced the availability of a similar SX-GFS package for Altix systems. cherax users who would like direct access to SX-6/TX7 file systems from cherax are asked to contact HPCCC staff to register their interest.


4. Seminar - GFS

"Performance Computing and working with uGFS on gale"

By John Stern & Ramesh C. Balgovind

10.00 am Wed 8 March 2006 BMRC Lecture Room, 9E

See seminars: http://www.hpccc.gov.au/seminars/


5. SX-6 shutdowns

Heat shedding procedures were adopted as a precautionary measure during the Christmas break to account for reduced support staff, and had to be employed in January when the local temperatures exceeded 40C over consecutive days.


6. HPCCC Users' Liaison Meeting

The most recent meeting of the HPCCC Users' Liaison committee was held on 9th February. Bureau and CSIRO staff can see minutes at http://www.hpccc.gov.au/hpccc/meeting/user_liaison/

The recent meeting brought an update on some of the HPCCC plans and goals for 2006. We invite comment from a wider range of users on these plans.


7. New NEC SX manuals

NEC has recently provided three further new manuals under the manual improvement project for the HPCCC.

The newest manuals are:

  • Performance Tuning Guide
  • NQSII User's Guide
  • C++/SX Programmer's Guide

in addition to the previously provided

  • FORTRAN90/SX Programmer's Guide

and are accessible from

http://www.hpccc.gov.au/hpccc/userdocs/index_user.shtml

The search indexes have been updated to reflect the additional manuals.

Please browse these manuals, and send feedback to the HPCCC.


8. Incident logs

The trial incident log for cherax at http://intra.hpsc.csiro.au/user/incident_reports/ax/ is proving useful to support staff, and will be the primary place for now for gaining updates on incidents on cherax.

A new log, at http://intra.hpsc.csiro.au/user/incident_reports/ax/, has been set up for the CSIRO clusters.


9. SX cross-compilation and cross-kits

In past years, CSIRO provided copies of the SX compilers and cross-kits for installation at remote sites: several sites requested and used the cross-kits in this way.

However, there have been difficulties in maintaining the software at the current levels at multiple sites, and the front-end services at the HPCCC have advanced in performance. CSIRO HPSC is interested in ending support for remote installations of the cross-kit. One site is using the facility by mirroring the version kept on the portal (farrer).

Remote users who wish to continue using the cross-kit at their own site should contact the HPCCC.


10. Storage growth

There is now more than 1 Petabyte stored in the data stores managed by the HPCCC - the Bureau's SAM-FS and MARS, and CSIRO's DMF Data Stores. (This counts all the copies - so over 500 Tbyte of primary data.)

In 2005, Bureau users added 120 Tbyte primary to bring the total to 320 Tbyte at the end of December, and CSIRO users added 110 Tbyte primary to bring the total to 150 Tbyte.

In the two years since the CSIRO Data store was moved from Lonsdale St to Collins St, primary data has grown from 17 Tbyte to 180 Tbyte. In recent weeks, there have been several days when over 1 Tbyte primary has been added to the store.
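
The growth figures quoted above can be turned into growth factors with a few lines of arithmetic. The following is a minimal sketch only: the function name growth_factor is hypothetical, the start-of-2005 totals are inferred by subtracting the quoted additions from the quoted end-of-year totals, and the per-year figure assumes steady compound growth, which the bulletin does not claim.

```python
def growth_factor(start_tb, end_tb, years=1):
    """Average per-year growth factor, assuming steady compound growth."""
    return (end_tb / start_tb) ** (1.0 / years)

# Bureau primary data in 2005: 320 Tbyte at year end, 120 Tbyte added,
# so an inferred 200 Tbyte at the start of the year.
bureau_2005 = growth_factor(320 - 120, 320)

# CSIRO primary data in 2005: 150 Tbyte at year end, 110 Tbyte added,
# so an inferred 40 Tbyte at the start of the year.
csiro_2005 = growth_factor(150 - 110, 150)

# CSIRO Data Store over the two years since the move: 17 to 180 Tbyte.
csiro_two_year = growth_factor(17, 180, years=2)

print(f"Bureau 2005: x{bureau_2005:.2f}")
print(f"CSIRO 2005: x{csiro_2005:.2f}")
print(f"CSIRO per year since the move: x{csiro_two_year:.2f}")
```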

There are significant costs in storing all that data, and in general, the more that is stored, the slower the responses to retrievals, as time and resources are needed to write data, and to do housekeeping - e.g. backups, sparsing and merging of tape contents.

Users are asked to review their holdings, and consider moderating their use and/or removing unwanted data - but please be careful with deletions! Summaries of holdings can be seen from http://www.hpccc.gov.au/hpccc/system_stats/

Could we please ask users to be careful with consolidating many files into tar archives or similar - files of 80 Gbyte or more can strain resources, and lock up tape drives for long periods. On the other hand reducing the number of files helps to maintain performance - the time to do backups depends heavily on the number of files, and with 6 million inodes in the CSIRO Data Store, full backups are not completing overnight.
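
One way to balance the two concerns above (fewer files, but no very large archives) is to plan archive contents before running tar. The following is a minimal sketch, not an HPCCC tool: the function name plan_archives, the file names and the 50 Gbyte cap are all illustrative assumptions. The bulletin only notes that files of 80 Gbyte or more can strain resources.

```python
def plan_archives(files, max_bytes):
    """Group (name, size_in_bytes) pairs, in order, into batches.

    Each batch is intended to become one tar archive. A batch is closed
    as soon as adding the next file would push it over max_bytes. A single
    file larger than max_bytes still gets a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


GBYTE = 10**9

# Hypothetical file list and cap, for illustration only.
example_files = [
    ("run01.dat", 30 * GBYTE),
    ("run02.dat", 15 * GBYTE),
    ("run03.dat", 20 * GBYTE),
    ("run04.dat", 60 * GBYTE),
]
for batch in plan_archives(example_files, max_bytes=50 * GBYTE):
    print(batch)
```

Each printed batch could then be passed to tar as the member list for one archive. The cap should be chosen to suit the store in question.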


11. Queue limits on the CSIRO IBM cluster burnet

There are few absolute limits in the queue system on burnet. The queue structure there is mainly to assist in identifying different types of jobs. Most jobs will be sent to the world (default) queue and routed to the first queue they are eligible for:

express  - < 10 min, < 8 nodes - gets a priority boost
short    - < 2 hr
longlow  - < 1001 Mbyte memory per process - fits into any node
longhigh - no limits (recent change) - may need high memory node
seekhelp - no limits, stopped queue - irrelevant while longhigh has no limits...

More information can be found from the "qstat -Qf" command.
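
The "first eligible queue" routing described above can be modelled as an ordered list of rules. This is an illustrative sketch only: the real routing is done by the batch system on burnet, the Job fields and the route function are hypothetical names, and treating each limit as a strict "less than" follows the "<" signs in the list rather than any confirmed queue configuration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    walltime_min: float     # requested wall time, in minutes
    nodes: int              # requested number of nodes
    mem_per_proc_mb: float  # requested memory per process, in Mbyte


# Rules in the order listed in the bulletin; the first match wins.
QUEUE_RULES = [
    ("express", lambda j: j.walltime_min < 10 and j.nodes < 8),
    ("short", lambda j: j.walltime_min < 120),
    ("longlow", lambda j: j.mem_per_proc_mb < 1001),
    ("longhigh", lambda j: True),  # no limits
]


def route(job):
    """Return the name of the first queue whose rule the job satisfies."""
    for name, eligible in QUEUE_RULES:
        if eligible(job):
            return name
    # Not reachable while longhigh has no limits.
    return "seekhelp"


print(route(Job(walltime_min=5, nodes=4, mem_per_proc_mb=500)))
print(route(Job(walltime_min=600, nodes=1, mem_per_proc_mb=2000)))
```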

The queues all have default limits on vmem and walltime (and nodes=1) so you must specify limits for these if the defaults are unsuitable.

In practice, the limits are actually determined by the scheduler (maui) which will only start a queued job if/when a suitable set of nodes is available. The maui configuration currently sets aside some nodes to be eligible only to shorter jobs and it is possible to submit jobs which will queue but not start for an indefinite amount of time (or never start). The relevant parameters from /usr/local/maui/maui.cfg are:

# set aside some nodes for short turnaround in the day
 SRCFG[hour2]   STARTTIME=8:00:00 ENDTIME=17:00:00 TASKCOUNT=2 \
  MAXTIME=2:00:00
# set aside some nodes _not_ for very long jobs
 SRCFG[hour6] TASKCOUNT=9 MAXTIME=6:00:00 PERIOD=INFINITE
# block very long jobs from many of the highmem nodes
 SRCFG[highmem]  RESOURCES=PROC:-1;MEM:4000 TASKCOUNT=7 \
  MAXTIME=12:00:00 PERIOD=INFINITE

The upshot is that jobs which specify less than 2, 6 or 12 hours potentially have access to more resources than longer jobs.
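
As a rough illustration of that upshot, the sketch below lists which of the three reservations a job of a given requested walltime could draw on. It is a simplified model, not a description of how maui evaluates reservations: the function name is hypothetical, the strict "less than" comparison follows the wording above, and the 8:00-17:00 window on hour2 and the memory requirement on highmem are ignored.

```python
# (reservation name, MAXTIME in hours, TASKCOUNT), copied from the
# maui configuration excerpt in the bulletin.
RESERVATIONS = [
    ("hour2", 2.0, 2),
    ("hour6", 6.0, 9),
    ("highmem", 12.0, 7),
]


def eligible_reservations(walltime_hours):
    """Names of reservations whose MAXTIME exceeds the requested walltime."""
    return [
        name
        for name, max_hours, _taskcount in RESERVATIONS
        if walltime_hours < max_hours
    ]


for hours in (1.5, 4, 8, 24):
    print(hours, eligible_reservations(hours))
```

Under these assumptions, a job requesting 24 hours matches none of the reservations and so competes only for the nodes outside them.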


12. CSIRO HPSC staff changes

CSIRO HPSC and the HPCCC welcome:

  • John Giovannis, who commenced on 7th February. John will be providing cluster support for CSIRO HPSC and CSIRO Human Nutrition at Parkville.
  • Daniel Smith, who commenced on 20th February. Daniel will be providing systems administration support, particularly in the areas of security and networking.

CSIRO farewelled Len Makin who retired in December. Bob Smart is on extended Long-Service leave. Finally, Erika Stojanovic left in early January for a position with CSIRO in Clayton.

Applications are being considered for an applications support specialist to replace Len Makin.



For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/
