Bulletin 158 - 2006 July 06

  1. New brochure - HPCCC and BLUElink
  2. SX-6 scheduling changes: stalled jobs, and less churning
  3. Interactive use of cluster nodes (SX-6 and other)
  4. cherax upgrade to SUSE operating system
  5. CSIRO Data Store - loss of files
  6. SX-6 Fortran: Workaround for incorrect MPIPROGINF output
  7. Burnet Head node outage: Tuesday 11th July 9:00am-11:00am

1. New brochure - HPCCC and BLUElink

A new HPCCC Brochure featuring the BLUElink project has been released.

A PDF download is available from www.hpccc.gov.au/about, and hard copies are available on request.

Quantities have been dispatched to BMRC, CMAR Aspendale, CMAR Hobart, and CSIRO corporate.



2. SX-6 scheduling changes: stalled jobs, and less churning

Several jobs have recently stalled for long periods on the SX-6s, tying up resources. Mechanisms to detect such stalls have been strengthened.

We need to allow for some job stalls while data is being retrieved from data stores (although it is best to run such jobs on the TX7s rather than the SX-6s).

The Enhanced Resource Scheduler (ERS) has a feature to hold jobs that have made no progress for some time.

We have set ERS to hold jobs that have made no progress for 3 hours or more (the time was previously set to 8 days).
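The hold rule above can be illustrated with a minimal sketch. This is not ERS code: the function name should_hold and the idea of tracking a "last progress" timestamp are assumptions made for illustration only; the bulletin does not say how ERS measures progress.

```python
from datetime import datetime, timedelta

# Threshold from the bulletin: hold after 3 hours or more without progress.
NO_PROGRESS_LIMIT = timedelta(hours=3)


def should_hold(last_progress: datetime, now: datetime,
                limit: timedelta = NO_PROGRESS_LIMIT) -> bool:
    """Return True when a job has made no progress for `limit` or longer.

    Hypothetical helper: `last_progress` is assumed to be the time at
    which the job was last seen to advance.
    """
    return now - last_progress >= limit
```

Under the earlier 8-day setting, the same check with limit=timedelta(days=8) would leave a job that had been stalled for a full day unheld.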

We have also changed the ERS HOLDJOBRankingDifference parameter from 2 to 50, so that new jobs must have a much higher priority to displace existing jobs of the same ERS class (SPECIAL or NORMAL). This should reduce the job churning seen recently. The erstatj command shows job priorities. See man erstatj for options; -aRfn shows maximum information.
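As a rough illustration of why a larger ranking difference reduces churning, the sketch below models displacement within a single ERS class. The function may_displace, the integer priority scale, and the use of "at least" rather than "strictly greater than" are all assumptions; the actual ERS comparison is not described here.

```python
# Value of HOLDJOBRankingDifference after the change (it was 2 before).
RANKING_DIFFERENCE = 50


def may_displace(new_priority: int, running_priority: int,
                 ranking_difference: int = RANKING_DIFFERENCE) -> bool:
    """Assumed rule for two jobs of the same ERS class.

    The new job displaces the running one only if its priority exceeds
    the running job's priority by at least `ranking_difference`.
    """
    return new_priority - running_priority >= ranking_difference
```

In this model, a new job with priority 105 would displace a running job with priority 100 under the old setting of 2, but not under the new setting of 50, so small priority differences no longer cause jobs to be swapped in and out.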



3. Interactive use of cluster nodes (SX-6 and other)

We have discouraged interactive use of SX-6 nodes, since such usage can ruin the performance of tightly coupled multi-CPU and multi-node jobs.

However, interactive use is sometimes needed for debugging, and for rapid development.

The best way to access SX-6 nodes is to use a batch job to set up an X-terminal window. That way, the scheduler knows about the intended resource usage, can make decisions about where to place the request, and can make allowances in its resource allocations.

The SX-6 cluster userguide at http://www.hpccc.gov.au/hpccc/userguides/sx/ has a section on "Interactive NQSII Jobs" which describes use of the local utility xterm_batch from the TX7s. First make sure that X is working from the TX7s with a command such as xclock, xlogo or xterm. If it is, then xterm_batch should give you an SX-6 xterm window when the resulting queue job runs.

Please use this utility instead of logging into SX-6 nodes.

On the Altix and IBM clusters, users can use the command qsub -I to request an interactive job.



4. cherax upgrade to SUSE operating system

A test partition on the Altix has been set up to run the SUSE operating system.

This test partition will be made available to users from Monday 17th July - simply log in to cherax-1 instead of cherax.

Your $HOME file system from cherax will be visible using NFS, but DMF functionality like dmgets will not be available.

We plan to use the modules facility to provide access to software on cherax-1.

Users are asked to test their applications on cherax-1, and report any problems.

We plan to cut over production to SUSE on the weekend of 4th-7th August. cherax will be unavailable for much of the weekend while the upgrade is in progress.



5. CSIRO Data Store - loss of files

At the beginning of May, the /cs/datastore filesystem failed after a memory error was detected, and a reboot of cherax was required. A few days later, for reasons still under investigation, DMF recorded some files as written to tape when they had not been, and subsequently removed the data from online disk.

Out of more than 6 million files on the system, 217 were lost.

Users whose files were lost have been notified.



6. SX-6 Fortran: Workaround for incorrect MPIPROGINF output

When the option -Wf"-pvctl res=whole" is specified, there is a restriction in the current release that MPIPROGINF cannot be used. Adding "!cdir release" (or #pragma cdir release) just before the call to MPI_FINALIZE will prevent this problem.



7. Burnet Head node outage: Tuesday 11th July 9:00am-11:00am

Burnet's head node will be unavailable on Tuesday 11th July from 9:00am to 11:00am, for security and other upgrades and for filesystem maintenance. Jobs that need to communicate with the head node may be lost and may need to be resubmitted once the outage is over. For more details contact Polly Morgan, (03) 9669 8171, polly.morgan@csiro.au.




For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia