Bulletin 138 - 2005 May 13
1. Updates to HPCCC and HPSC WWW pages

The new CSIRO HPSC external WWW pages are now at http://www.hpsc.csiro.au/ . For HPSC users, please use the site http://intra.hpsc.csiro.au/ for access to all the user information. The HPCCC site at http://www.hpccc.gov.au/ continues as the primary entry for HPCCC information.

Please look at the new information and form, and give us feedback, particularly if one of your favorite pieces of information is missing. Please update any bookmarks or links - we intend these sites to be the main entry points for information about the HPCCC, the CSIRO HPSC and for information for users.

The locally-written userguides have been updated on the above WWW pages, and are also now available for the first time from http://www.hpccc.gov.au/ from the links to User Documentation and Local Userguides. Note also that the userguides are available from:
There have been significant updates to the SX-6, Altix and ia32 cluster guides. The recent round of SX-6 guide changes has been marked in a special version of the sx guide at: http://intra.hpsc.csiro.au/userguides/sx/localguide.20050405.20050428.html

2. Allowing migration of jobs on the SX-6 system

Job migration for Bureau non-operational jobs will be enabled on Wednesday 18th May on the bm, bmml and bmmn queues. System change notice 2005-B003 will be issued to address the changes.

3. Change of home base for the req (wreq) problem reporting system

The req system was moved to its new location at http://intra.hpsc.csiro.au/cgi-bin/wreq/req?list on 29th April. There were some problems with e-mail to the req system, and these are now mostly solved. If you encounter further problems, please contact HPCCC staff. The req system WWW pages can be accessed from the HPCCC and HPSC users WWW pages.

4. Quotas on the HPCCC SX-6/TX7 file systems

The HPCCC has a policy of imposing quota limits on file systems, to lessen the chance of one user's actions adversely affecting other users. Many of the file systems on the SX-6/TX7 have not had quota limits imposed yet. We plan to impose limits from Wednesday 18th May on the Bureau non-operational user file systems, pointed to by $HOME, $WORKDIR, $DATADIR, $TMPDIR, etc.

Limits will be set so that no user's current usage will exceed the limits at the time of imposition. Typically, default limits will be set at 1/4 of the file system size, and 50000 inodes. The HPCCC reserves the right to change quota limits in the future as circumstances change.

Use the command quota to see usage and limits. Add the option -v to see usage on file systems which do not yet have quota limits in operation. System change notice 2005-B002 was issued addressing these changes.

5. Limits on queues on the HPCCC TX7s

In order to reduce the chances of the job load on the TX7s impacting on the file serving performance, we are revising the limits on jobs.
The following limits will be imposed on the non-operational queues for the TX7s - i.e. on the execution queues txbm0, txbm1, txcs0 and txcs1 which are accessed through the routing queues tx, txbm, and txcs.

    Run limit per queue: 10 for bm, 5 for cs
    User run limit per queue: 4
    Elapsed time per job: maximum 6 hours, default 1 hour
    Processor time per job and per process: maximum 1000 s, default 100 s
    Memory per job and per process: maximum 500 Mbyte, default 100 Mbyte

These limits will prevent one user's jobs from taking all the job slots, and reduce the chances of stalled jobs locking out other jobs. If these limits are too severe for your applications, please contact HPCCC staff. Operational queues will continue to be subject to only the total run limit of 10 per queue.

Users are requested to set limits appropriate to their jobs. For example:

    #!/bin/ksh
    #PBS -j o -r y -q tx
    #PBS -l elapstim_req=600
    #PBS -l cputim_job=80
    #PBS -l cputim_prc=70
    #PBS -l memsz_job=200MB
    #PBS -l memsz_prc=190MB
    #

Although the per-job processor and memory limits are checked at submission time, they are not enforced at run-time on the TX7s. Users are asked nevertheless to make reasonable use of these limits to prepare for the future. System change notice 2005-A013 will be issued addressing these changes.

6. New HPCCC qsub local version

The qsub command for the HPCCC SX-6/TX7 system was updated before 10:00 on 12th May. There was one reported problem with the script, and this was corrected at 11:40.

7. Common file systems on the HPCCC SX-6/TX7 system

The SX-6/TX7 system has two areas for users and systems staff to share files across organisations.
Both of these are GFS filesystems visible over the entire SX-6/TX7 system. The variable $COMMONDIR points to the top level of the /common area, and should be used as a starting point for applications using the area.

HPCCC staff have started to use the /common area for items such as utilities, rather than having to distribute them to local disc on all nodes for areas like /usr/local/bin. The command pkgenv commonbin will add directories like /common/bin to your path, enabling access to new utilities. (There are distinct areas for SX-6 and TX7 utilities which are automatically selected by pkgenv). The cos2f77 utility for converting Cray COS-blocked files is available there.

If you want to use the /common or /bigcommon area for joint projects across organisations, please contact HPCCC staff.

8. cherax and CSIRO Data Store downtimes and upgrades

The CSIRO Data Store will be unavailable from Friday 20th May from 15:00 until late on Saturday 21st May, to install additional disc. (I had previously reported that the disc had already been installed - I meant delivered!)

This upgrade will add about 2 Tbyte of disc to the DMF disc cache area, and add about 1 Tbyte of disc to the /cs/datastore file system. In addition, the $WORKDIR area will be doubled from 200 Gbyte to 400 Gbyte, and a 100 Gbyte $DATADIR area will be created - non-backed up, but not subject to flush.

The /cs/datastore area will be rebuilt when incorporating the extra disc, to provide higher performance with 3-way striping. However, this will require the contents to be dumped, the new discs to be initialised and added, and the file system contents to be restored from dumps. Two copies of the dumps will be made (one to tape, one to disc). In addition, all file data will be migrated prior to the dumps, to reduce the dump time; but when the file system has been reloaded, most of the data will be restored from the cache area to the primary /cs/datastore area.
The upgrade also gives an opportunity to test our ability to recover from a major incident, e.g. the loss of the /cs/datastore file system.

Please ensure that your processing on other systems does not try to access cherax during the above time. Note that some of the HPSC WWW pages will also be unavailable during the above time - since there are demands for serving large files from the WWW pages, some of the services now run using the /cs/datastore file system.

9. Catching errors in shell scripts - avoiding disasters

In HPCbull 134.10, we wrote an article about "Catching errors in shell scripts - avoiding disasters", and foreshadowed some recommendations. These recommendations are now available from the new FAQ area on the above WWW pages - see http://intra.hpsc.csiro.au/userguides/faq/shell_recovery.php which is referenced in http://intra.hpsc.csiro.au/userguides/faq/ and http://intra.hpsc.csiro.au/userguides/ .

10. Use of rsync for file transfer/synchronisation

We recommend the use of rsync whenever possible for transferring files between systems. It has many advantages, including not transferring files that are already present at the destination. More information is contained in a new FAQ - see http://intra.hpsc.csiro.au/userguides/faq/rsync.php which is referenced in http://intra.hpsc.csiro.au/userguides/faq/ .
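
As an illustration of the rsync recommendation in item 10, here is a minimal sketch that copies a directory tree and checks the exit status of rsync, in the spirit of item 9. It is not taken from the FAQ. The file name results.dat and the function name sync_tree are made up for the demonstration, and temporary local directories stand in for the real source and destination. Only the standard -a (archive) option is used.

```shell
#!/bin/sh
# Sketch only: synchronise a directory tree with rsync and check the result.
# All names below (results.dat, sync_tree) are hypothetical.

# Stop quietly if rsync is not installed on this system.
if ! command -v rsync >/dev/null 2>&1; then
    echo "rsync is not installed; nothing to demonstrate" >&2
    exit 0
fi

# Temporary directories stand in for the real source and destination.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "sample data" > "$src/results.dat"

sync_tree() {
    # -a recurses and preserves permissions and modification times.
    # The trailing slash on the source copies its contents,
    # not the directory itself.
    rsync -a "$1/" "$2/"
}

# Test the exit status rather than assuming the transfer worked.
if sync_tree "$src" "$dst"; then
    echo "transfer ok"
else
    echo "rsync failed with status $?" >&2
    exit 1
fi

# Remove the temporary directories when finished, e.g.:
#   rm -rf "$src" "$dst"
```

If sync_tree is run a second time, rsync finds the destination already up to date and transfers no file data. For a transfer between systems, the destination would instead take the form user@host:path; the host names and paths depend on your site.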
© Copyright 2010, CSIRO Australia