Bulletin 155 - 2006 May 12
1. Planning for the HPCCC

The current HPCCC agreement terminates in March 2008, along with NEC's supply and support contract (if not extended). Because of the long lead-time in major equipment purchases, the HPCCC has commenced planning for 2008 and beyond, and needs to have a business plan completed by mid-2006.

The HPCCC needs an indication from users about their requirements for High Performance Computing and storage from 2008 onwards. The Bureau and CSIRO have already been asked for input through the HPCCC Steering Committee representatives. The Bureau has initiated internal reviews of requirements. CSIRO users are also asked to provide an indication of their likely requirements, to help in the decision-making process for the HPCCC and future equipment.

Research group leaders are encouraged to make their needs known, giving plans for new modelling, increased model resolution, performance requirements, etc. Some indication of the type of equipment best suited to requirements, along with storage demands, would also be helpful. In particular, it would be useful to document any desires for new classes of equipment or new services. We encourage readers to pass this request on to other computational scientists in their research groups, especially those who are not HPCCC users.

We would be grateful to receive responses by 26th May - to Phil Tannenbaum for Bureau users, and to Rob Bell for CSIRO users.

2. SX-6 batch job limits - CHANGE

Per-job maximum and default time limits, and per-job maximum and default memory limits, will be imposed on non-operational SX-6 queues soon. The target time for the change is 09:30 Tuesday 23rd May. The maximum CPU time limit for these queues will be set at one week, and the maximum memory limit at 53800 Mbyte. A new queue, SEEKHELP, will be created to catch jobs with limits exceeding the maxima.
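As a rough illustration of the new maxima, the sketch below compares a job's requested limits with the one-week CPU time limit (604800 s) and the 53800 Mbyte memory limit, and reports whether the request would fit or would be caught by SEEKHELP. The function name check_limits and its output strings are hypothetical; this is not an HPCCC-provided tool, and the actual routing is done by the batch system.

```shell
#!/bin/sh
# Hypothetical pre-submission check; not an HPCCC-provided tool.
MAX_CPU_S=604800     # one week in seconds (7 * 24 * 3600)
MAX_MEM_MB=53800     # maximum per-job memory limit, in Mbyte

# check_limits CPU_SECONDS MEM_MBYTE
# Prints "within-maxima" if both requests fit under the maxima,
# otherwise "SEEKHELP" (a limit exceeds a maximum).
check_limits() {
    cpu_s=$1
    mem_mb=$2
    if [ "$cpu_s" -gt "$MAX_CPU_S" ] || [ "$mem_mb" -gt "$MAX_MEM_MB" ]; then
        echo "SEEKHELP"
    else
        echo "within-maxima"
    fi
}

check_limits 3600 2048      # one hour of CPU time, 2048 Mbyte
check_limits 700000 2048    # CPU time request longer than one week
```

A request exactly equal to a maximum is treated here as acceptable, on the assumption that only limits exceeding the maxima are caught.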
Jobs which do not specify a per-job CPU time limit will get the default of 60 s, and jobs which do not specify a per-job memory limit will get the default of 100 Mbyte.

For details, see the change notice to appear at http://www.hpccc.gov.au/hpccc/user_news_advice/system_change_notices/index.shtml

3. SX-6/TX7 Bureau $WORKDIR flushing down to 20 days - CHANGE

On Thursday 10th May, the file system /bm/flush1 (which underlies the $WORKDIR area for some Bureau users) filled, causing some jobs to fail. At the HPCCC User Liaison meeting on 11th May, BMRC representatives requested that the flush exemption be reduced from 30 to 20 days.

From 09:00 Tuesday 23rd May, the flushing of the /bm/flush* file systems will be changed, so that each morning, files not accessed or modified for 20 days (instead of 30 days) will be flushed. As well, work will be carried out to distribute Bureau users' $WORKDIR directories more evenly across the /bm/flush* file systems.

For details, see the change notice 2006-B010 to appear today at http://www.hpccc.gov.au/hpccc/user_news_advice/system_change_notices/index.shtml

4. SX-6/TX7 file system i/o spreading

In order to increase spreading of the i/o load, we have set up a trial scheme where CSIRO users have multiple $DATADIR and $WORKDIR directories - try env | grep -i dir to see what directories are available. HPCCC staff are also considering ways to allow users to have semi-automatic spreading of files across multiple file systems for greater performance.

5. SX-6 multi-node job limits

In HPCbull 154.7, there was an item about multi-node job limits, with the example:
#PBS -T mpisx -b 6
#PBS -l memsz_job=60gb
It was stated that this requests 6 nodes. This is not strictly true - it depends on the number of processes specified with the cpunum_job and cpunum_prc parameters. For example, a job specifying -b 6 but only one processor per job might have all six jobs placed on one node.

6. SX-6/TX-7 disk utilisation graphs

A new HPCCC web page records the current and past utilisation of the global file systems across the SX-6/TX-7 cluster. The web page also generates a report if a file system is at or over 80% utilisation, specifying the usage by the top 10 top-level directories. The web page can be found at http://www.hpccc.gov.au/hpccc/system_stats/usage/SX-6/performance/disk_util/

It will be useful for people looking for space in the /bm/data* directories.

7. HPCCC Data Stores

Both the Bureau's SAM-FS and CSIRO's DMF systems have recently been under pressure because of increased demands placed on them.

SAM-FS has experienced a high demand for both storage and recall. A facility is being considered to give users the ability to request a limited group of files to be recalled, and thus reduce the number of tape mounts compared with serial recalls. The Bureau silo is experiencing high demand for data storage from both SAM-FS and MARS, and in the near future a large number of tapes will be stored off line. Once this happens, jobs requesting data may be required to check and/or reserve the tapes they require the night before. More details will follow.

CSIRO has started the process to acquire another T9940B tape drive to be attached to cherax. The existing two T9940B drives have been heavily used recently - up to 12 hours per day writing new data and backups, leaving little time for recalls. This load is a consequence of both the increased incoming demand, and the need to contain the number of tape cartridges used - for both cost and silo capacity reasons.
The threshold for deciding whether the first copies of file data will go to fast-access low-capacity tapes or to T9940 tapes has recently been lowered from 300 Mbyte to 30 Mbyte.

All users are asked not to store large numbers of small files in the Data Stores - each file has an overhead, and the Stores are more geared to smaller numbers of large files. Consider the use of the tar utility or similar (or tardir on cherax - see man tardir) to consolidate small files into archives. However, cherax users are asked especially to be very careful when unpacking large archive files - try to unpack such files into $WORKDIR, so that the unpacked contents will not be subject to migration. Also, please do not use the update option on tar files without careful consideration. This requires DMF to abandon the original file and write the entire updated file, leading to a high churn rate for data.

8. cherax - nco - change in default - CHANGE

We plan to change the default version of nco on cherax to 3.1.2 at 09:30 Tuesday 16 May, and to change the method of setting up your environment to access nco. Note that after the change, nco will not be available in your environment unless you do something to set it up - this is already the case on burnet.

The build of version 3.1.2 supports accessing OPeNDAP resources (and has bug fixes relevant to CMAR). The current default version is installed in /usr/local(/bin) and will be removed as part of the process to make the new version the default. A copy of the old default has been installed in /tools/nco-2.9.8

See the nco website at http://nco.sourceforge.net/ for further information about the features of different versions of nco.

To set up your environment to specifically choose version 3.1.2, run in your shell startup or batch job:
pkgenv nco-3.1.2
And to set up your environment to use the new copy of the old default version, 2.9.8, run:
pkgenv nco-2.9.8
Or run
pkgenv nco
if you just want to use whatever version is currently set as the default using the new method (2.9.8 before the change, 3.1.2 after). If you think this might cause you problems, please run some tests before the change.

9. cherax outages

There have been several outages on cherax since the last HPCbull - details are in the incident log at http://intra.hpsc.csiro.au/user/incident_reports/ax/

10. CSIRO HPSC staff changes

Dr Rhys Francis has been asked by DEST to take a secondment from his roles of Director of CSIRO HPSC and CSIRO eScience and APAC GRID Manager, to be the Facilitator for the Platforms for Collaboration capability for the National Collaborative Research Infrastructure Strategy (NCRIS). See http://www.dest.gov.au/sectors/research_sector/policies_issues_reviews/key_issues/ncris/

The National Collaborative Research Infrastructure Strategy is a major initiative under the Government's Backing Australia's Ability - Building our Future through Science and Innovation. It aims to bring greater strategic direction and coordination to national research infrastructure investments. $542 million is available through to 2010/11 to provide researchers with access to major research facilities and the supporting infrastructure and networks necessary to undertake world-class research.

We wish Rhys well in his new role, and thank him for his outstanding leadership of CSIRO HPSC and its involvement in the HPCCC.

Ms Teresa Curcio commenced with CSIRO HPSC on 8th May as a part-time Administrative Support Officer. Teresa previously worked for CSIRO from 1986 to 2002.

11. CSIRO - Science Investment Process (SIP) and costings

CSIRO HPSC has produced a draft document outlining proposed costs for various services to provide information for the SIP. CSIRO staff can access this document from the notes about charging at: http://intra.hpsc.csiro.au/user/general_information/financial_arrangements.shtml

The document outlines:
12. Silicon Graphics files for Chapter 11 protection

Silicon Graphics (or SGI), the supplier of CSIRO's and APAC and partners' Altix systems, has filed for Chapter 11 Protection in the USA. See http://www.sgi.com/company_info/newsroom/press_releases/2006/may/sgi_reorg.html

We have assurances from the company of continued support for our systems and data management software.

13. Seminar - Subversion

"Introduction to SVN"
By NEC Staff
Place: BMRC seminar room (Floor 9, east side).
Time: 11.00am Fri 19 May 2006

"The goal of the Subversion project is to build a version control system that is a compelling replacement for CVS in the open source community." (See http://subversion.tigris.org/)

14. burnet - new compute node configurations - CHANGE

We are working on changing the OS on the compute nodes of burnet to sles9. As part of the transition, some nodes will be available only to a 'test' queue. Also, the set of nodes with native 64-bit capability will be running a 64-bit version of sles9 and will allow compiling and running 64-bit executables, which may have significant performance improvements for some applications. There will be more information in burnet's message of the day as the newly configured nodes become available for testing. Binary compatibility is expected for most applications.

15. TinyURLs

In several places in this bulletin, we have given TinyURLs as alternatives to long URLs, which often cause difficulties with line wrapping, etc. (See http://tinyurl.com/ for more information about TinyURLs.) Because of the lengths of some URLs, the HPCbull will endeavour to use TinyURLs for long external site references. Internal references will continue to be in the traditional long format because of potential security implications. If you go to the indicated TinyURL, you are re-directed to the underlying location, and can see the full URL name.
We would appreciate your comments on this usage, and whether you would be happy to have only the TinyURLs shown for some locations.