Bulletin 166 - 2007 February 14
Note: "CSIRO" items can apply to BoM users of cherax and burnet.

1. Service Feedback Solicited

To enable the HPCCC to assess how we are doing, we ask each user of our systems to complete the anonymous feedback and rating form at: http://www.hpccc.gov.au/hpccc/helpdesk/feedback.shtml

For Bureau users who have previously completed the CCSB questionnaire, we would be grateful if you would also complete this one, which is more focused on aspects of our systems. The results will be shared on the HPCCC web page once we get a meaningful number of responses. Thanks for your assistance.

2. Stuck SX-6 jobs, and elapsed time limits

In recent months, we have had several occurrences of jobs getting stuck for various reasons. Such jobs can continue to hold reserved resources, and thus block other jobs. A patch will be installed on 28th February to overcome one of the reasons: MPI jobs were not terminating cleanly, and were being left stranded. However, there are other reasons, e.g. failed file transfers, that cause job stalls.

The ERS job scheduler will attempt to hold jobs that have stalled for too long (currently set to 3 hours), and send an e-mail to the user. However, if a job is not holdable (for example, if it has stalled with a do_tx7 in progress or a file transfer or r command), then three attempts will be made to hold it, and somewhat cryptic e-mails will be sent about the failure.

From 28th February we intend to introduce and enforce elapsed (or wall) time limits for jobs, with a default of 1 hour and a maximum of 3 days. Jobs that exceed their elapsed time limit will be terminated by NQSII. Elapsed time limits should be specified with a setting like

    -l elapstim_req=3600

on qsub commands, or embedded in jobs:

    #PBS -l elapstim_req=1:30:00

Note that jobs do not accumulate 'elapsed time' when they are checkpointed and held.

3. Use of $WORKDIR file systems

On all of the HPCCC systems (SX-6/TX7, Altix, IBM clusters), a directory called $WORKDIR is defined. This area is available to use for temporary files. There is no backup, there is no migration, there are quotas, and there is flushing.

The areas are managed by running flushing scripts when the underlying file systems are close to filling. These scripts remove old files from oldest to newest, but stop before removing any recent files. The policy is described at http://www.hpccc.gov.au/hpccc/userguides/sx for the SX-6/TX7 system. The most recent log entries show that files as old as 5 months are still kept on cherax, and files 30 or more days old are kept on SX-6, depending on the file system. These windows of "safety" will be reduced as file usage goes up. On some file systems the file flush.status at the top level gives information about recent file flushings.

The default inode quota for /work on cherax has been increased from 50,000 to 200,000.

In general, these areas are somewhat under-utilised, and users with large data flows can make use of them while doing tasks like the analysis of large sets of data. However, please ensure that critical files are put into $HOME directories or archive areas. Using the tar utility or similar to consolidate large numbers of small files into large archive files can improve efficiency, by keeping the number of files in migrating file systems from increasing indefinitely.

4. cherax slowdowns

Systems staff identified that the recent slowdowns on cherax were caused by memory filling and system tasks spawning to release it. A new strategy has since been put into place to reduce the overheads of memory management. If you experience ongoing slowdowns please advise us.

5. cherax upgrade to SLES10

The SLES10 operating system is available for testing on cherax-1.hpsc.csiro.au. For more information on SLES for Altix see http://www.sgi.com/products/software/linux/suse.html, which also has a link to more information at Novell.

We plan to switch cherax to SLES10 on 24th February. cherax will be unavailable from 9am to 3pm that day. The new version of the operating system provides an improved TCP/IP implementation, which should lead to faster file transfers.

6. NQSII qsub wrapper upgrade

The qsub wrapper script will be upgraded on 28th February to a new version, which has greater resiliency and supports some new queues. The new version is now available for testing as qsubnew.
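
As a rough illustration of items 2 and 3 above, the following sketch shows a job script that embeds an elapsed time limit and consolidates many small output files into a single tar archive while working under $WORKDIR. The names run_dir, small_N.dat and results.tar are hypothetical, the loop merely stands in for a real computation, and the fallback to a temporary directory is only there so the script can be tried where $WORKDIR is not defined.

```shell
#!/bin/sh
#PBS -l elapstim_req=1:30:00
# The directive above requests an elapsed (wall) time limit of 1.5 hours,
# in the embedded form shown in item 2.
# Sketch only: run_dir, small_N.dat and results.tar are hypothetical names.

set -eu

# Use $WORKDIR when defined; otherwise fall back to a throwaway directory.
work="${WORKDIR:-$(mktemp -d)}"
run_dir="$work/run_$$"
mkdir -p "$run_dir/output"

# Stand-in for a real computation that writes many small files.
i=1
while [ "$i" -le 20 ]; do
    echo "result $i" > "$run_dir/output/small_$i.dat"
    i=$((i + 1))
done

# Consolidate the small files into one archive file.
tar -cf "$run_dir/results.tar" -C "$run_dir" output

# Count the data files in the archive before removing the originals.
n_archived=$(tar -tf "$run_dir/results.tar" | grep -c 'small_[0-9]*\.dat$')
if [ "$n_archived" -eq 20 ]; then
    rm -r "$run_dir/output"
fi

echo "archived $n_archived files in $run_dir/results.tar"
```

The resulting results.tar could then be copied to $HOME or an archive area as a single file, rather than as many small ones.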
Comments to:
© Copyright 2010, CSIRO Australia. Use of this web site and information available from it is subject to our Legal Notice and Disclaimer and Privacy Statement.