Bulletin 151 - 2005 December 08

  1. SX-6/TX7 batch system problem
  2. Use of /tmp file systems
  3. Re-running of jobs
  4. SX-6/TX7 Systems interruptions
  5. CSIRO HPSC job vacancies
  6. Quotas on the CSIRO IBM cluster burnet
  7. Queues on the CSIRO IBM cluster burnet
  8. Mathematica 5.2 (CSIRO HPSC only)

1. SX-6/TX7 batch system problem

On the evening of Wednesday 7 December, a file system filled up on the host machine for the NQSII batch server. This prevented jobs from being submitted for execution, and some job output may have been lost. Please re-submit any affected jobs.


2. Use of /tmp file systems

In general, it is not advisable for users to make use of the /tmp area on HPCCC and CSIRO HPSC systems.

On the SX-6/TX7 system, /tmp is two orders of magnitude smaller than the available $WORKDIR and $TMPDIR areas.

/tmp also has world read and write permissions, which system utilities need. Files placed there could therefore be removed inadvertently, unless their permissions prevent it.

Hence, in almost all cases, it is preferable to use the areas referenced by $TMPDIR (for job and session temporary files) and $WORKDIR.

Where you are aware of utilities using /tmp, HPCCC recommends using alternatives or reconfiguring them to use other areas.
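
As an illustration of the advice above, the following is a minimal sketch of a script that keeps its scratch files under $TMPDIR rather than /tmp. The file names are hypothetical, and the fallback to the current directory is an assumption made only so the sketch can run where $TMPDIR is not set.

```shell
#!/bin/sh
# Assumption: $TMPDIR is set for the job or session, as described above.
# The fallback to the current directory is only so this sketch runs elsewhere.
scratch_base="${TMPDIR:-$PWD}"

# Create a private scratch directory (mode 0700) for this run.
scratch=$(mktemp -d "$scratch_base/scratch.XXXXXX") || exit 1

# Remove the scratch directory when the script exits.
trap 'rm -rf "$scratch"' EXIT

# Hypothetical intermediate file, standing in for real job data.
printf '%s\n' 3 1 2 > "$scratch/stage1.dat"

# GNU and BSD sort accept -T to choose where their own temporary files go.
sort -n -T "$scratch" "$scratch/stage1.dat" > "$scratch/stage1.sorted"
```

Many utilities also honour the TMPDIR environment variable directly, so checking a utility's documentation for such a setting is a reasonable first step before looking for an alternative.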


3. Re-running of jobs

For jobs run on the HPCCC and CSIRO HPSC systems, the default behaviour when a system problem affects a job (e.g. a crash or scheduled restart) is to re-run the job from the beginning. This can be unfortunate for some jobs: at best, work already done is repeated; at worst, important files are overwritten.

This behaviour can be overridden with the -r n flag of qsub. The qsub man page includes:

-r y|n Declares whether the job is rerunable. See the qrerun command.

The option argument is a single character, either y or n.

If the argument is "y", the job is rerunable.
If the argument is "n", the job is not rerunable.
The default value is 'y', rerunable.
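
For example, a submission that marks a job as not re-runnable might look like the sketch below. The job script name job.sh is hypothetical; only the -r n flag is taken from the man page excerpt above. The check for qsub is there so the sketch does nothing harmful when run away from the batch systems.

```shell
#!/bin/sh
# Assumption: job.sh is a hypothetical job script name. Only the -r n flag
# comes from the man page excerpt above.
job_script="job.sh"

if command -v qsub >/dev/null 2>&1; then
    # Submit the job and mark it as not re-runnable.
    qsub -r n "$job_script"
    submit_status=$?
else
    # Away from the batch systems, just show what would be run.
    echo "would run: qsub -r n $job_script"
    submit_status=0
fi
```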

Ideally, all jobs would be re-runnable: codes would be set up to write out restart data regularly and to look for the latest restart data to start from. On some systems with checkpointing capability, queues can be set up so that jobs are checkpointed periodically, to guard against loss of work in the event of a system interruption. This can be selected on the SX-6s.

Unfortunately, Linux does not (yet) provide a checkpointing capability that would allow us to take regular checkpoints of jobs and, after an interruption, resume from the last checkpoint. This applies to the TX7s, the Altix and the IBM clusters.
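
The restart approach described above can be sketched as a job script that resumes from the newest restart file if one exists. This is an illustration only: the restart file naming (restart.NNNN), the step counts and the commented-out run_step placeholder are all hypothetical, and a real code would write its own restart data in its own format.

```shell
#!/bin/sh
# Sketch of a re-runnable job. Assumptions: the restart.NNNN naming, the step
# counts and run_step are hypothetical. $WORKDIR falls back to the current
# directory only so the sketch runs anywhere.
run_dir="${WORKDIR:-$PWD}/restart_demo"
mkdir -p "$run_dir"
total_steps=10
restart_every=5

# Find the highest-numbered restart file, if any.
latest=$(ls "$run_dir"/restart.[0-9][0-9][0-9][0-9] 2>/dev/null | sort | tail -n 1)
if [ -n "$latest" ]; then
    start=$(cat "$latest")   # each restart file records the step it was written at
else
    start=0
fi

step=$start
while [ "$step" -lt "$total_steps" ]; do
    step=$((step + 1))
    # run_step "$step"       # placeholder for the real work of one step
    if [ $((step % restart_every)) -eq 0 ]; then
        # Write under a temporary name, then rename, so that an interruption
        # never leaves a partially written restart file behind.
        printf '%s\n' "$step" > "$run_dir/restart.tmp"
        mv "$run_dir/restart.tmp" "$run_dir/$(printf 'restart.%04d' "$step")"
    fi
done
```

If the job is interrupted and re-run from the beginning, the script picks up at the last recorded step instead of step 1, so at most restart_every steps of work are repeated.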


4. SX-6/TX7 Systems interruptions

There will be interruptions to service on the TX7 systems on 13th-14th December, to allow for patch installations and upgrades. It is planned that one or the other of mawson and eccles will be available almost all of the time.


5. CSIRO HPSC job vacancies

CSIRO HPSC is currently advertising internally for three positions - one applications specialist, and two system administrators. Information is available on the CSIRO intranet under 'jobs central'.

If suitable applicants are not found within CSIRO, these positions will be advertised externally.


6. Quotas on the CSIRO IBM cluster burnet

On 22 November, user quotas were imposed on the shared file systems on burnet.

  • /cs/home/group/user (soft quota 20GB, hard quota 25GB)
  • /work/user (soft quota 40GB, hard quota 50GB)
  • /cs/data/user (soft quota 40GB, hard quota 50GB)

/cs/home/group/user is the user's home directory space, and is backed up to the datastore on cherax. It can be accessed using the $HOME environment variable.

/work/user and /cs/data/user are not backed up to the datastore. Users must themselves copy to the datastore any data that needs long-term archiving.

/work/user will also be subject to flushing when it becomes full. /work/user can be accessed using the environment variable $WORKDIR, and /cs/data/user can be accessed using the environment variable $DATADIR.
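
To get a rough idea of how close an area is to its soft quota, something like the following sketch could be used. It is an approximation under stated assumptions: du measures disk usage and may not match what the quota system counts, 1GB is taken to be 1024*1024 KB, and the function name check_usage is hypothetical.

```shell
#!/bin/sh
# check_usage DIR SOFT_GB
# Report the disk usage of DIR against a soft quota given in GB.
# Returns 0 if under the soft quota, 1 if at or over it, 2 if DIR is missing.
# Assumption: du output is only an approximation of quota accounting.
check_usage() {
    dir=$1
    soft_gb=$2
    if [ ! -d "$dir" ]; then
        echo "$dir: not found"
        return 2
    fi
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    soft_kb=$((soft_gb * 1024 * 1024))
    if [ "$used_kb" -ge "$soft_kb" ]; then
        echo "$dir: ${used_kb} KB used, at or over the ${soft_gb}GB soft quota"
        return 1
    fi
    echo "$dir: ${used_kb} KB used, under the ${soft_gb}GB soft quota"
}

# Usage on burnet, with the soft quotas listed above:
#   check_usage "$HOME" 20
#   check_usage "$WORKDIR" 40
#   check_usage "$DATADIR" 40
```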

A user's home directory on the datastore can be accessed using the environment variable $STOREDIR.
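
As an illustration of copying data for long-term archiving, the sketch below bundles a directory into a single tar file in a destination directory and then checks that the archive can be listed. The function name archive_run and the directory name run42 are hypothetical, and bundling into one tar file is a choice made for this sketch rather than a stated datastore requirement.

```shell
#!/bin/sh
# archive_run SRC_DIR DEST_DIR
# Bundle SRC_DIR into DEST_DIR/<name>.tar and verify the archive is readable.
# Assumption: one tar file per directory is this sketch's choice only.
archive_run() {
    src=$1
    dest=$2
    name=$(basename "$src")
    # -C changes to the parent directory so paths in the archive are relative.
    tar -cf "$dest/$name.tar" -C "$(dirname "$src")" "$name" &&
        tar -tf "$dest/$name.tar" >/dev/null
}

# Usage (run42 is a hypothetical directory name):
#   archive_run "$WORKDIR/run42" "$STOREDIR"
```

Only remove the original from /work/user or /cs/data/user once the function has returned successfully.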

For further information, please contact Polly Morgan.


7. Queues on the CSIRO IBM cluster burnet

The queue configuration on burnet has been changed. The plan is to restrict usage of the scarce resources (the high-memory nodes) to shorter periods.

The motivation for the changes is to allow timely access to nodes for jobs with higher resource requirements. Such jobs are otherwise disadvantaged by the scheduler, because they are less likely to fit into free nodes as they become available. To this end there are three main changes:

  1. The aggregate number of CPUs available to jobs that fit into the memory of the lower-specification nodes (such jobs run in the longlow queue) has been limited to somewhat less than the total number of CPUs in the cluster. These jobs do not currently have any walltime limit.
  2. The choice of nodes has been tuned to favour allocation of low-memory nodes first (where possible).
  3. The walltime for high-memory jobs (longhigh queue) has been limited to 12 hours.

We have had occasions when a small number of users have submitted jobs lasting for many weeks to the high-memory nodes. We feel this is not a reasonable amount of time for other users to wait for a scarce resource. Indeed it is likely to be greater than the average time between failures of the cluster.

Note that this change was meant to take effect some time ago, but because of a minor oversight in the longlow queue limits it did not become effective until recently. From now on, you can expect to see high-memory jobs in the longhigh queue unless they are short enough for the express or short queues.

Any new job submitted to 'world' with high memory limits and a walltime that is too long will end up in the 'seekhelp' queue and will not run without further intervention.
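
The routing described in this item can be summarised in a small sketch. It is a simplification under stated assumptions: the function name predict_queue is hypothetical, the express and short queues are ignored because their limits are not given here, and whether a job counts as high-memory is supplied by the caller rather than derived from a memory threshold.

```shell
#!/bin/sh
# predict_queue HIGH_MEM WALLTIME_HOURS
# HIGH_MEM is "yes" or "no"; WALLTIME_HOURS is a whole number of hours.
# Assumption: express and short queues are ignored (limits not stated here).
predict_queue() {
    high_mem=$1
    hours=$2
    if [ "$high_mem" = "yes" ]; then
        if [ "$hours" -le 12 ]; then
            echo longhigh    # high-memory job within the 12 hour limit
        else
            echo seekhelp    # high-memory job with too long a walltime
        fi
    else
        echo longlow         # fits the low-memory nodes; no walltime limit
    fi
}

# Usage:
#   predict_queue yes 8
#   predict_queue yes 48
#   predict_queue no 200
```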


8. Mathematica 5.2 (CSIRO HPSC only)

A new, improved version of Mathematica is now available (as /usr/local/bin/mathematica) on cherax and farrer. On cherax, this version provides large-memory capability and full 64-bit numerical operations.

From the wolfram.com website, a summary of new features includes:

  • Support for 64-bit addressing
  • Multicore support on major platforms
  • Multithreaded numerical linear algebra
  • 64-bit-enhanced arbitrary-precision numerics
  • Vector-based performance enhancements
  • New algorithms for symbolic differential equations
  • Enhanced performance for linear Diophantine systems
  • Enhanced quadratic quantifier elimination
  • Singular-case support for high-level special functions
  • Enhanced statistics charts

Only a single-user network licence is available for CSIRO. Please quit Mathematica when you have finished, so that others can use it.



For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia
Use of this web site and information available from it is subject to our Legal Notice and Disclaimer and Privacy Statement