Bulletin 151 - 2005 December 08
1. SX-6/TX7 batch system problem

On the evening of Wed 7th December, a file system filled up on the host machine for the NQSII batch server. This prevented jobs from being submitted for execution, and some output may have been lost. Please re-submit any affected jobs.

2. Use of /tmp file systems

In general, it is not advisable for users to make use of the /tmp area on HPCCC and CSIRO HPSC systems. On the SX-6/TX7 system, /tmp is two orders of magnitude smaller than the available $WORKDIR and $TMPDIR areas. It has the world read and write permissions that system utilities need, so files placed there could be removed inadvertently if their own permissions do not prevent it. Hence, in almost all cases, it is preferable to use the areas referenced by $TMPDIR (for job- and session-temporary files) and $WORKDIR. Where you are aware of utilities using /tmp, HPCCC recommends using alternatives or reconfiguring them to use other areas.

3. Re-running of jobs

For jobs run on the HPCCC and CSIRO HPSC systems, when a system problem affects a job (e.g. a crash or scheduled restart), the default is to re-run the job from the beginning. This can be unfortunate for some jobs: at best, work already done is re-done; at worst, important files are over-written. The behaviour can be overridden with the -r n flag. The qsub man page includes:

-r y|n  Declares whether the job is rerunable. See the qrerun command. The option argument is a single character, either y or n.
If the argument is "y", the job is rerunable.

Ideally, all jobs would be re-runnable: codes would be set up to write out restart data regularly and to look for the latest restart data to start from. On some systems with checkpointing capability, queues can be set up so that jobs are checkpointed periodically, to guard against loss of work in the event of a system interruption. This can be selected for the SX-6s. Unfortunately, Linux does not (yet) provide a checkpointing capability that would allow us to take regular checkpoints of jobs and, when there are interruptions, resume from the last checkpoint. This is the case for the TX7s, the Altix and the IBM clusters.

4. SX-6/TX7 Systems interruptions

There will be interruptions to service on the TX7 systems on 13th-14th December, to allow for patch installations and upgrades. It is planned that one or other of mawson or eccles will be available almost all the time.

5. CSIRO HPSC job vacancies

CSIRO HPSC is currently advertising internally for three positions: one applications specialist and two system administrators. Information is available on the CSIRO intranet under 'jobs central'. If suitable applicants are not found within CSIRO, these positions will be advertised externally.

6. Quotas on the CSIRO IBM cluster burnet

On 22 November, user quotas were imposed on the shared file systems on burnet.
/cs/home/group/user is the user's home directory space, and is backed up to the datastore on cherax. It can be accessed using the $HOME environment variable. /work/user and /cs/data/user are not backed up to the datastore, so users need to copy any data that needs long-term archiving to the datastore. /work/user will also be subject to flushing when it becomes full. /work/user can be accessed using the environment variable $WORKDIR, and /cs/data/user can be accessed using the environment variable $DATADIR. A user's home directory on the datastore can be accessed using the environment variable $STOREDIR. For further information, please contact Polly Morgan.

7. Queues on the CSIRO IBM cluster burnet

The queue configuration on burnet has been changed. The plan is to restrict usage of the scarce resources (the high-memory nodes) to shorter periods. The motivation for the changes is to allow timely access to nodes for jobs with higher resource requirements, which are otherwise disadvantaged by the scheduler because they are less likely to fit into free nodes as these arise. To this end there are three main changes:
We have had occasions when a small number of users have submitted jobs lasting for many weeks to the high-memory nodes. We feel this is not a reasonable amount of time for other users to wait for a scarce resource. Indeed, it is likely to be greater than the average time between failures of the cluster. Note that this change was meant to have occurred a while ago, but there was a minor oversight in the longlow queue limits and the change did not take effect until recently. From now on, you can expect to see high-memory jobs in the longhigh queue unless they are short enough for the express or short queues. Any new jobs submitted to 'world' with high memory limits and too long a walltime will end up in the 'seekhelp' queue and will not run without further intervention.

8. Mathematica 5.2 (CSIRO HPSC only)

A new, improved version of Mathematica is now available (as /usr/local/bin/mathematica) on cherax and farrer. On cherax this now provides large-memory capability and full 64-bit numerical operations. From the wolfram.com website, a summary of new features includes:
There is only a single user network licence available for CSIRO. Please quit/exit when finished to allow others access.
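
Returning to item 3 (re-running of jobs): the advice there is that codes write restart data regularly and look for the latest restart data when they start. A minimal Python sketch of that pattern follows. The JSON state layout and the function names (load_restart, save_restart, run) are illustrative assumptions and not part of any HPCCC or CSIRO HPSC tooling; the loop body is a stand-in for real work.

```python
import json
import os
import tempfile


def load_restart(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_restart(path, state):
    """Write the state to a temporary file, then rename it over the old one,
    so an interruption during the write leaves the previous file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, save_every=10):
    """Run up to n_steps, resuming from the restart file if one is present."""
    state = load_restart(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # hypothetical stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_restart(path, state)
    save_restart(path, state)  # final state, so a later re-run does nothing more
    return state
```

With this structure, a job that is re-run from the beginning picks up at the last saved step instead of repeating completed work. A job script might keep the restart file under $WORKDIR, for example.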
© Copyright 2010, CSIRO Australia
Use of this web site and information available from it is subject to our Legal Notice and Disclaimer and Privacy Statement.