Bulletin 120 - 2004 Jul 21
The HPCCC offices and staff are scheduled to move from 150 Lonsdale Street to 700 Collins Street this week. Please continue to send in reports of problems or requests for help, but we will not see some of the requests for a while, and will be slower to respond. We expect 'phone numbers and e-mail addresses to remain the same, but CSIRO staff members may have subsequent 'phone number changes to enable access to the CSIROTel network. The postal address will stay the same, but the street address will be:

HPCCC

The fax may not work for some time, and the number might change from 03 9669 8112. Please send any pending registration requests as soon as possible, and after that please call us first if you need to send a fax.

2. New cluster systems

CSIRO has taken delivery of two IBM eServer Cluster 1350 systems. The first system is for CSIRO Marine Research and CSIRO Atmospheric Research for an external project, and the second system has been acquired by the HPSC. Each system is built from nodes with dual 3.2 GHz Xeon processors, and has one head node plus 20 computational nodes (cluster one, "nelson") or 32 computational nodes (cluster two, "burnet"). The total peak speed is 691.2 Gflop/s (consistent with 6.4 Gflop/s peak per processor across all 108 processors, head nodes included). Each system can be expanded to 140 processors in a single rack, giving about 1.8 Tflop/s peak in total for the two systems.

The HPSC will make the second cluster available to targeted users, particularly for applications which have not previously been ported to HPC systems.

3. CSIRO Applications Software Initiative

CSIRO HPSC has announced an initiative to make a large range of third-party software available to users. Initially, Gaussian, GaussView and a large range of 'engineering' software from MSC (e.g. Nastran) will be available. Licensing for the MSC software is based on tokens: the full range of software is installed, but only a limited number of tokens is available at any one time to run it. The software can be run on local machines as well as on the HPSC systems.
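As a rough illustration of how token-based licensing of this kind behaves, the following Python sketch models a shared pool from which each run checks out tokens and returns them when it finishes. It is only a sketch under stated assumptions: the class name `TokenPool`, the pool size and the per-job token costs are all hypothetical, and are not taken from the MSC licence terms or from any real licence manager.

```python
class TokenPool:
    """Hypothetical model of a shared pool of licence tokens."""

    def __init__(self, total_tokens):
        if total_tokens < 0:
            raise ValueError("total_tokens must be non-negative")
        self.total = total_tokens
        self.in_use = {}  # job id -> number of tokens currently held

    def available(self):
        """Return the number of tokens not currently checked out."""
        return self.total - sum(self.in_use.values())

    def checkout(self, job_id, tokens):
        """Try to start a job needing `tokens`; return True on success."""
        if job_id in self.in_use:
            raise ValueError(f"{job_id} already holds tokens")
        if tokens > self.available():
            return False  # not enough free tokens: the job has to wait
        self.in_use[job_id] = tokens
        return True

    def release(self, job_id):
        """Return a finished job's tokens to the pool."""
        self.in_use.pop(job_id, None)


# All numbers below are invented for illustration only.
pool = TokenPool(total_tokens=10)
started_a = pool.checkout("job-a", 6)        # succeeds, 4 tokens left
started_b = pool.checkout("job-b", 6)        # refused, only 4 tokens free
pool.release("job-a")                        # job-a finishes
started_b_retry = pool.checkout("job-b", 6)  # now succeeds
```

The point of the sketch is that any installed product can be started, but only while enough tokens are free; a refused job simply has to wait until another run releases its tokens.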
Because of the cost of the licences, CSIRO HPSC is proposing a software access agreement and a flexible means for users and groups to contribute to the costs. Please contact Len Makin on 03 9669 8109, Len.Makin@csiro.au for further information.

4. SYSTEM CHANGE NOTICE 2004-C001 Separation of CSIRO multi-CPU jobs

TARGET DATE OF CHANGE: 2004-07-22 10:30 AEST but may be delayed
SYSTEMS AFFECTED: sx600 sx601 sx602 sx603 sx604
SOFTWARE AFFECTED: NQS II job servers
IMPACT: At frequent intervals, the CSIRO workload is badly distributed across the five nodes. The Ocean development multi-node job runs on nodes sx600, sx601 and sx602, taking 7 CPUs on these nodes for most of the time. Other multi-CPU jobs on these three nodes get CPU time only briefly, while the nodes sx603 and sx604 are often under-utilised. Until the SYNC patch is installed to improve file system reliability, we will not invoke job migration. As an interim measure, we will stop jobs in the queue csml from using nodes sx600, sx601 and sx602. This will give better turnaround for these jobs.
DETAILS: The csml queue will be stopped, and csml jobs drained from nodes sx600, sx601 and sx602. (The ocean job will be suspended during this process to speed up the draining.) The nodes sx600, sx601 and sx602 will then be removed from the csml queue.
CONTACT: Rob. Bell.
NOTICE APPROVAL: Rob. Bell.

5.1 Retrieval problem

There is a current problem on the Altix where attempts to execute a migrated file fail. A message of the form:

./a.out: Input/output error.

is issued when the retrieval fails (usually after several minutes, as the system tries to retrieve from the disc cache and two tape types). We will get the problem addressed by SGI. In the interim, please issue a dmget command on the file (for example, dmget a.out) before trying to execute a migrated file.

5.2 Locality problems

Although the Altix provides a globally addressable memory, the memory is in fact physically distributed, with a separate portion attached to each node.
Users may notice variation in the execution time of programs. This can be caused by locality issues: when the code is executed on one processor, but the code and data are actually stored at a far-off node, the execution time can be around twice as long. Mechanisms exist to control this (cpusets), and we are investigating them in conjunction with some users.

5.3 Slow compilation with Intel Version 8 compilers

We noticed some long compilation times with the Intel version 8 compiler, and investigated. The cause was an issue with access to our new licence server, and it is now resolved.

5.4 Batch jobs, interactive usage and limits

In order to preserve good responsiveness for many users on cherax, it is important that major tasks be run in the batch system. To help this process, we would like to implement some interactive limits on Saturday 2004-07-31. We suggest initial limits of 30 CPU minutes per process. Interactive work will fall under a bootcpuset (system + interactive) to provide better control. See also the following item.

6. SYSTEM CHANGE NOTICE 2004-C003 Upgrade to CSIRO SGI Altix Programming Environment

TARGET DATE OF CHANGE: Saturday 2004-07-31 10:00 - 18:00
SYSTEMS AFFECTED: cherax
SOFTWARE AFFECTED: SGI ProPack
IMPACT: cherax will be unavailable from 10:00 to 18:00 (eight hours) on Saturday 2004-07-31. This will affect other systems which cross mount file systems from/to cherax, and SX-6 users who are expecting to transfer files to or from cherax as part of a job to be run during this time. Potentially all CSIRO HPSC and some Bureau users are affected.
DETAILS: SGI ProPack 3.x provides enhancements designed for the Altix.
For Users:
CONTACT: Jeroen van den Muyzenberg
NOTICE APPROVAL: Rob. Bell.

7. SYSTEM CHANGE NOTICE 2004-C002 Local Corrections and tempd

TARGET DATE OF CHANGE: 2004-07-27 10:30 AEST
SYSTEMS AFFECTED: SX600 SX601 SX602 SX603 SX604
SOFTWARE AFFECTED: Korn shell, mv, cp, and tempd.
IMPACT: CSIRO users executing ksh, mv or cp at the time of the change may have problems. We will attempt to hold all running jobs. Any jobs that cannot be held may fail.
DETAILS: The local correction LC310082 replaces the cp and mv commands and enables the -b option. LC310083 replaces ksh and is a bug fix. The tempd daemon will be enabled and will start monitoring and cleaning up /ltmp. These changes have been made to SX611 and have been in place for 1 week already with no problems reported.
CONTACT: J.Stern
NOTICE APPROVAL: Rob. Bell.

8. SYSTEM CHANGE NOTICE 2004-B002 SXCross structure change - reminder to users

TARGET DATE OF CHANGE: 2004-07-26 10:30 AEST
SYSTEMS AFFECTED: Gale
SOFTWARE AFFECTED: Software in /SX/SUPERUXR131 and /SX/SUPERUXR121
IMPACT: Users who have not started using the new sxcross structure may lose settings or have incorrect settings, e.g. paths, environment variables, etc.
DETAILS: Users have been urged to use the new cross environment as per previous notices (see hpcbull 118), in particular System Change Notice 2004-02. The old cross structure used non-standard locations for libraries, includes, and other related software. This is the final stage in the standardization of the cross environment. As previously advised, the directories /SX/SUPERUXR121 and /SX/SUPERUXR131 will be made unavailable as they are not part of the new structure. Users are referred to http://www.hpsc.csiro.au/hpccc/userdocs/ug-sx.shtml section "Using sxcross" and also hpcbull 118. Users should check their scripts etc. and make sure they have updated their environments per the details in the hpcbull and user documentation referenced above.
CONTACT: J.Stern
NOTICE APPROVAL: HPCCC Manager

9. Interactive use of the SX-6 nodes

We recently had a case where a user did some tests on an SX-6 node by logging in and running tasks interactively. Unfortunately, upon logout the user left behind some processes, including one which continued to increase in size.
Eventually, it occupied nearly all the node memory and terminated, but not before it caused other users' jobs to crash. Then another process with similar characteristics started again!

Please do not use the SX-6 nodes interactively for tests. Doing so has resulted, and can again result, in SX-6 operational problems caused by various programming errors. Any potential system impact can easily be avoided through use of the batch facilities, as outlined in the Local Userguide for the SX-6 Cluster at http://intra.hpsc.csiro.au/userguides/sx/

Everyone's cooperation is solicited to use the SX-6 only as a batch system, except when interactive debugging with TotalView, pdbx, sdb, etc. is necessary. Note that a "quick test" is not an acceptable interactive debugging session.