Bulletin 134 - 2005 Mar 11

  1. HPCCC - SX-6/TX7 upgrade acceptance
  2. HPCCC - NQS II and ERS II upgrades, scheduler and batch changes
  3. HPCCC SX-6/TX7 GFS interruptions
  4. HPCCC - SX-6 - more accounting information
  5. CSIRO - service levels - processing and data
  6. cherax updates
  7. Scheduled shutdowns - cherax and clusters
  8. New CSIRO HPSC staff member
  9. Speeding up applications
  10. Catching errors in shell scripts - avoiding disasters

1. HPCCC - SX-6/TX7 upgrade acceptance

On Monday 21st February, the upgraded SX-6/TX7 system was accepted.

The upgraded system is now an SX-6/28 with 224 processors, a peak speed of 1792 Gflop/s, total processor to memory bandwidth of 7168 Gbyte/s and total interconnect bandwidth of 448 Gbyte/s.

Although the acceptance test is complete, a number of operational and support projects are still in progress and continue to require significant HPCCC staff attention.


2. HPCCC - NQS II and ERS II upgrades, scheduler and batch changes

2.1 New versions of NQS II and ERS II

New versions of NQS II and ERS II will be installed on the SX-6/TX7 system and front-ends between 07:30 and 10:30 on Tuesday 22nd March. The new versions support many new features, including the "-l cpunum_job=ncpus" parameter.

Details of the changes can be found at http://www.hpccc.gov.au/hpccc/userdocs/NQS+ERS.upgrade.2005-03.shtml

The new version of the NQSII User's Guide (for NQS V2.20) is on the hpccc.gov.au WWW page, under "User Documentation". (If you need access to the old manual, please email hpchelp@bom.gov.au.)

2.2 "-l cpunum_job=ncpus" parameter

After the upgrade, users will need to add the "-l cpunum_job=ncpus" parameter to their jobs by 11th April. This parameter will be a core part of the scheduling for all jobs, because it enables scheduling based on the number of processors each job needs.

We will set a default value for this parameter for each queue, which will be visible in the output of the qstat -Q -f queue-name command.

(Note that for multi-node jobs, the value requests the number of processors on each node, not for the entire request.)
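
As an illustration of the per-node rule, the following minimal sketch works out the value to request for a multi-node job. The helper name per_node_cpus and the script name job.sh are hypothetical, and the sketch assumes the processors are spread evenly across the nodes; check the NQS II User's Guide for the full multi-node request syntax.

```shell
# per_node_cpus TOTAL NODES
# Hypothetical helper: print the number of processors to request on each
# node when TOTAL processors are spread evenly over NODES nodes.
per_node_cpus() {
    total=$1
    nodes=$2
    # Round up so that nodes * result is never less than the total.
    echo $(( (total + nodes - 1) / nodes ))
}

# Example use (job.sh is a placeholder name):
#   ncpus=$(per_node_cpus 16 2)
#   qsub -l cpunum_job=$ncpus job.sh
```

For 16 processors over 2 nodes the value requested is 8, not 16.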

The new qsub wrapper has replaced the previous version, so users no longer need to use the name qsubnew. The name qsubnew will be removed on or after 22nd March.

2.3 Changes in -S behaviour for NQS II.

With the new version of NQS II, the shell selected with the -S parameter will now act as a login shell. The qsub wrapper currently rejects this parameter. Users who need this facility should contact us.

2.4 Batch job time and memory limits

We are concerned about jobs running for very long periods and then being interrupted. In the near future, batch job time and memory limits will be considered for the non-operational queues on the SX-6/TX7 system.

A maximum job time limit of 1 week is proposed.

As well, default values will be set, so that jobs which omit values for the job time and job memory limits will not default to the maxima, and so will not block other work or waste resources as runaway jobs.

(Because the scheduler is set to favour short development jobs, it consequently disfavours long jobs until they have made significant progress. Users with jobs with very long time limits will get low priority for their jobs.)


3. HPCCC SX-6/TX7 GFS interruptions

On 16th February, one of the SX-6/TX7 disc units failed, and two file systems were unavailable for about an hour.

The existing discs and controllers on the SX-6/TX7 system need firmware upgrades to correct this problem. There may be some short interruptions in the coming weeks to do this upgrade. These will be in the early mornings.

There was a system interruption on the SX-6/TX7 system from 10:50 on 2nd March for about 10 minutes - GFS file systems were not accessible during this period. A mistake was made in a system patch installation. Our apologies. Measures are being taken to prevent recurrences.


4. HPCCC - SX-6 - more accounting information

At the next reboot of SX-6 nodes, a change will be made to allow the collection of more accounting data - the full command line will be collected for each process.

This will then enhance the "acctcom" command with the "-L" option to display the command line image, and help in diagnosing problems such as that referred to in item 10.


5. CSIRO - service levels - processing and data

CSIRO HPSC is assessing its service levels and continuity, in order to meet the needs of the users.

Would CSIRO users who have activities dependent on the HPCCC/CSIRO HPSC services, which have a time-critical commercial or national importance, please contact us and outline their needs.

We plan to ensure that such processing is given special status by running in real-time queues, and that it is protected during events like a loss of power to the facilities, when there might be a reduced number of SX-6 nodes left running, or only part of a cluster.

As well, CSIRO users should make an assessment of the vulnerability of their data. Only some file systems are backed up: the home areas on the SX-6/TX7 systems, on cherax and on the cluster head nodes.

Users are notified that there is no longer an off-site backup of the data in the CSIRO Data Store. The volume of data and the lack of network bandwidth currently preclude this. Key information, such as source code, should have copies at multiple sites.

CSIRO HPSC plans to make available soon a facility for users of the CSIRO Data Store to specify whether one or two copies of data need to be held in the Data Store: for example, for bulk data copied from other sites, only one copy should be needed, and this will save on tape media costs. These costs for the next year, at current growth rates, are estimated at $0.25M.

The facility would also allow users to specify disc residency for selected files.


6. cherax updates

An upgrade to the SGI Propack package, a patch, and a new version of the torque batch system were installed on cherax prior to 10:00 on Wednesday 23rd February.

In the near future, batch job time and memory limits will be imposed on the cherax queues.

A maximum job time limit of 1 week is proposed.

As well, standard or default values will be set, so that jobs which omit values for the job time and job memory limits will not default to the maxima, and cause problems.

There is no checkpoint capability on cherax, so we cannot resume running jobs across shutdowns. Users are asked to ensure that any long-running jobs save enough information as they progress so that interrupted long runs are not wasted. HPCCC staff can assist.
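
One simple way for a job to save its own progress is to record the last completed step in a file and to read that file on start-up. The following is a minimal sketch only: the function name run_steps and the checkpoint file layout are hypothetical, and a real application would also need to save its own state (arrays, partial results) at each step.

```shell
# run_steps CKPT_FILE LAST_STEP
# Hypothetical sketch: run steps 1..LAST_STEP, resuming after the step
# recorded in CKPT_FILE if a previous run was interrupted.
run_steps() {
    ckpt=$1
    last=$2
    step=0
    # Resume from the last completed step, if one was recorded.
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
    fi
    while [ "$step" -lt "$last" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        # Write to a temporary file, then rename, so an interruption
        # cannot leave a half-written checkpoint file behind.
        echo "$step" > "$ckpt.tmp"
        mv "$ckpt.tmp" "$ckpt"
    done
    echo "$step"
}
```

If the job is resubmitted after a shutdown, it continues from the recorded step rather than starting again from step 1.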


7. Scheduled shutdowns - cherax and clusters

To allow HPCCC systems administration staff windows to test and install new software, reconfigure file systems, etc, we will from now on schedule shutdowns of cherax and the two clusters on the second Saturday of each month, for up to 6 hours.

The first of these will be this Saturday 12th March, but for the clusters only. The outage for cherax will be deferred to a following Saturday, probably 19th March, for a Service Pack upgrade to fix some outstanding problems.


8. New CSIRO HPSC staff member

CSIRO HPSC welcomes Neil Killeen, who started on 28th February on secondment from the Australia Telescope National Facility.

Neil is initially working on some data access problems - in particular, providing WWW access securely to the CSIRO Data Store, and investigating the use of the Storage Resource Broker.


9. Speeding up applications

The HPCCC has been monitoring the performance of jobs through the systems, to look for applications that are not making good use of the machines. For example, we recently detected a daily job that was taking 70-90 minutes of elapsed time, but using only a few hundred seconds of processor time.

The user swapped the application to another file system, and saved 20 minutes. He then analysed the application, and found that it was calling the awk utility thousands of times when a single call could do the work. The application finished up taking only 12 minutes.
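
The details of that application are not given here, but the general pattern can be sketched as follows. The example is illustrative only: the function names sum_slow and sum_fast are hypothetical, and the task (summing the second column of a file of whole numbers) is an assumed stand-in for the real work.

```shell
# Slow pattern: starts one awk process for every input line.
sum_slow() {
    total=0
    while read -r line; do
        value=$(printf '%s\n' "$line" | awk '{ print $2 }')
        total=$((total + ${value:-0}))
    done < "$1"
    echo "$total"
}

# Fast pattern: a single awk process reads the whole file.
sum_fast() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}
```

Both functions give the same answer, but the first pays the cost of starting a new process for every line, which dominates the run time for large files.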

As a final optimisation, the job would probably run more quickly on a non-SX-6 platform, and that will be examined.

In another gain, a daily task that used to take about 4 hours has been reduced to about 3 hours - mainly by selecting tailored values of the netCDF buffer size for each type of file used.


10. Catching errors in shell scripts - avoiding disasters

Recently, a user had an error in a job in a Bourne/Korn shell command sequence like:

 cd $subdir
 rm -rf *

There are two potential problems with this.

If the variable 'subdir' is not defined, the cd command sees no argument and the directory is changed to the user's HOME directory - then all files and sub-directories are recursively removed (except dot files and directories).

If the variable 'subdir' is defined but does not specify a valid directory, the cd will fail and the directory will not be changed - then all files, etc, in the current directory will be deleted.
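
Both cases can be guarded against by checking the variable and the result of the cd before anything is deleted. The following is a minimal sketch under stated assumptions, not an official HPCCC recommendation; the helper name clean_subdir is hypothetical.

```shell
# clean_subdir DIR
# Hypothetical helper: remove the contents of DIR, and nothing else.
# The body runs in a subshell, so the caller's directory is unchanged.
clean_subdir() (
    # Stop if the argument is missing or empty (the undefined variable case).
    if [ -z "${1:-}" ]; then
        echo "clean_subdir: no directory given" >&2
        exit 1
    fi
    # Stop if the cd fails (the invalid directory case).
    cd "$1" || exit 1
    # The ./ prefix stops names beginning with "-" being read as options.
    rm -rf ./*
)

# Example use in a job script:
#   clean_subdir "$subdir" || exit 1
```

The rm is reached only when the cd has succeeded. As with the original sequence, dot files and dot directories in the target directory are left in place.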

Fortunately for the user, HPCCC staff were able to recover the lost files from a backup. But remember, only a few of the file systems on the SX-6/TX7 system have backups.

HPCCC staff have been working on recommendations for safer scripting, and will provide these at a later date.



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

Comments to:


© Copyright 2010, CSIRO Australia