Bulletin 139 - 2005 June 20
1. New HPCCC qsub local version

The local qsub wrapper script will be updated on Tuesday morning, 21st June, at about 08:45. The new version corrects some problems and, in answer to a request from users (see req #5108), also prints the date and time of job submission.

2. HPCCC Feedback

A new feedback and suggestion form has been added to the HPCCC Help Desk menu on http://www.hpccc.gov.au. Submissions are unattributed unless you choose to put in contact details.

3. New FORTRAN90/SX version for SX-6s

The 302 version of the FORTRAN90/SX compiler has been available for some time on the SX-6 front-end and cross systems. It has been evaluated by NEC Australia staff and by interested early adopters. Many items have been fixed or improved, and there have been no significant performance penalties.

All users who have not already tried 302 should do so this week, because it is planned to make 302 the default version on Wednesday 29th June. Users can try the new version by using the commands sxcross latest or sxcross sxf90/rev302 on the cross-systems.

Users who have serious application problems using 302 can continue accessing 285 by entering sxcross sxf90/rev285. Continuing with 285 is a workaround while your problem is being researched and corrected, and will generally not be a viable long-term option. Please report any problems you experience with 302 immediately.

Use the sxenv command to check your SX cross environment. See the description under Cross Systems in the Local Userguides, SX-6 Cluster Userguide under User Documentation at http://www.hpccc.gov.au/hpccc/

4. HPCCC SX-6 system - job migration

Job migration is now working successfully on nodes sx600 to sx617 of the SX-6 system. However, we are still seeing a few job failures, caused by jobs that use local disc on nodes but do not set the flag -J n to specify no migration. When such a job is migrated, it can't find its files because they are local to the node where the job started.
Jobs that cannot be restarted by the ERS scheduler are assigned a LOW priority and left in a Zombie state. They appear in the "erstatj" output with a "Z" flag. Refer to the ERS man pages for further explanation: man erstatj on the TX7s.

The local SX-6 UserGuide at http://www.hpccc.gov.au/hpccc/userdocs/ has been updated to include information on the -J n flag.

Users have a choice - either use local disc and specify no job migration, or don't use local disc but allow job migration. It is a trade-off between potentially faster I/O and faster throughput. (Note that explicit copying of local files is available using the NQS II -G options, but the HPCCC needs to do further experimentation on this facility before making recommendations to users - see req #5043 for some work in progress.)

5. Fortran reminder - implicit typing

We recently had a case where some Fortran code failed to do what the user expected. The problem was in a call to a system function that returns an integer result: because the function was not declared, implicit typing caused the result to be treated as real. Our recommendation is that all Fortran code should include the declaration Implicit None

6. HPCCC Users' Liaison Meeting

Minutes of meetings of the HPCCC Users' Liaison Meeting can be found at http://www.hpccc.gov.au/hpccc/meeting/user_liaison/. The latest minutes contain some brief notes on HPCCC plans and reports. Meetings are to be held on the second Thursday of each month for the remainder of this year. If you work remotely from the HPCCC site, but would like to attend a meeting, please contact Rob. Bell.

7. Data Stores at the HPCCC

The primary holdings in the CSIRO Data Store recently passed the 100 Tbyte mark. The Bureau's SAM-FS system holds 250 Tbyte of primary data. Users are asked to consider removing any data that is of little or no further value, but please be careful when doing so. A good technique for increasing protection against error is to remove write permission from crucial files and directories before deleting anything, e.g.
chmod -R a-w keep*

8. cherax and CSIRO Data Store downtimes and upgrades

The expansion of the disc system underlying the main file systems on cherax has been delayed, and is now likely to happen in July. There will be further downtime. The upgrade to the cache disc went ahead, and the cache area is now 4.6 Tbyte - all small files are now resident there.

9. Upgrade to CSIRO Cluster system

An upgrade of the IBM Cluster system (burnet) commenced on Wednesday 15th June. The system will grow as follows
There will be a total of 84 dual-processor nodes, with a peak performance of 84 * 2 * 3.2 * 2 = 1075.2 Gflop/s, or about 1.075 Tflop/s. The new EM64T processors will allow the use of 64-bit mode computing, while retaining backward compatibility with the ia32 architecture. The Infiniband interconnect will be used for highly-connected problems across multiple nodes, and should provide significantly better bandwidth and lower latency. There will be an additional 1 Tbyte of disc on the management node.

On the 15th June, the following was accomplished: Hardware:
Below is a list of items to be completed at subsequent downtimes, which will be advertised through the message of the day on burnet's management node.
On the cluster, to set up your environment to use the Intel compilers (including shared libraries), please use pkgenv ifort or pkgenv icc, which will set your environment up to use the latest 8.1 versions of these compilers. Other applications set up with pkgenv include the following: lam, mpich, ncarg, netcdf. Type pkgenv to see a complete listing of applications that use pkgenv. If you need pkgenv in ksh and it is not available, type the following to enable it:

. /usr/local/etc/pkgenv/pkgenv.sh

10. OPeNDAP on cherax

OPeNDAP is now deployed on 'cherax' and operating successfully. OPeNDAP is designed to serve public data directly from a visible host. It required some tailoring to fit into the HPCCC environment, where we wish to provide data to authenticated users only, users have access only to their own data, and the server (cherax) sits behind the proxy server (hpscworld). A detailed administration guide is available on the intranet, and a brief end-user guide is available via the APAC software documentation system at http://nf.apac.edu.au/facilities/software/ For more information about setting up OPeNDAP services, contact the HPCCC.

11. Bureau SamFS Outage 2005-H008

Sam will be unavailable between 8:15 am and 11 am on Tuesday, 21 June 2005, for system maintenance and upgrade preparatory work. During this period file archiving and file retrievals will fail.

12. CSIRO Cluster system - batch job memory specification

We need to schedule jobs based on their memory usage on the CSIRO IBM cluster system. The correct parameter to use is -l vmem=100MB or similar. Would all users please start to specify a memory limit on their jobs. To help in this process, we will impose a default limit of 100 Mbyte on Wednesday 29th June. After that date, jobs without a vmem specification may well fail. Note that the maximum memory that can be specified is about "2000MB" on the low-memory nodes and "4000MB" on the high-memory nodes - not "2GB" or "4GB".
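
As an illustration of the memory specification rules in item 12, the following is a minimal sketch of a pre-submission check, not an HPCCC-supplied tool. The function name check_vmem is hypothetical, and the limits of 2000 and 4000 are the approximate figures quoted above, so treat them as assumptions rather than exact scheduler limits. It accepts only a whole number followed by MB and rejects forms such as "2GB".

```shell
#!/bin/sh
# Hypothetical helper (not part of the HPCCC software): check a vmem value
# before using it in a job's "-l vmem=..." specification.
#   $1  the value, e.g. 100MB
#   $2  upper limit in MB (assumed: about 2000 on low-memory nodes,
#       about 4000 on high-memory nodes; defaults to 4000)
check_vmem() {
    spec=$1
    max_mb=${2:-4000}

    # Strip the MB suffix; if nothing was stripped, the suffix was missing.
    mb=${spec%MB}
    if [ "$mb" = "$spec" ] || [ -z "$mb" ]; then
        return 1
    fi

    # The remainder must consist of digits only.
    case $mb in
        *[!0-9]*) return 1 ;;
    esac

    # Finally, the request must not exceed the node limit.
    [ "$mb" -le "$max_mb" ]
}

# Example: validate requests intended for a low-memory node.
for request in 100MB 2000MB 2GB; do
    if check_vmem "$request" 2000; then
        echo "ok:       -l vmem=$request"
    else
        echo "rejected: $request"
    fi
done
```

A check of this kind could be placed in a personal submission script so that a value such as "2GB" is caught before the job is queued.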