Bulletin 129 - 2004 Nov 30
1. HPCCC Global File Systems and the SYNC option

NEC and HPCCC are working towards using a new option with the NEC Global File System (GFS) on the SX-6/TX7 system. This option, called the SYNC option, provides more integrity for write operations. At present, when there is a failure of the TX7 systems or of a channel to the file systems, all jobs on affected file systems are killed to avoid possible corruption. With the change to the SYNC option, there will be no need to kill jobs during a TX7 fail-over (58 jobs were killed in October). Experience with the initial testing of the SYNC option in July has resulted in design improvements, and system patches were made available on 30th November.

With the SYNC option, user jobs that are not using writes with large buffer sizes may experience slowdowns, and may slow down other jobs using the same file system. Please see the next item. Some users with jobs that do large amounts of i/o will be contacted to ensure that their jobs are optimised.

The installation of the new kernels on the SX-6 nodes and TX7s will take place from 30th November, and will require some downtimes, job delays, and interruptions to file access. The TX7 downtimes will be around 45 minutes each. These downtimes will be notified in advance, but are likely to be on the morning of Thursday 2nd December. After the kernel upgrades, nodes and file systems will be progressively moved to use the SYNC option, as testing of application performance continues.

2. HPCCC SX-6 change to the default setting of F_SETBUF

On Monday 29th November, the default value of the environment variable F_SETBUF was changed from 128 kbyte to 1024 kbyte (1 Mbyte) on the SX-6s. This will improve the performance of i/o in Fortran jobs which have not set a specific value for this variable, and also improve the overall system i/o performance.
However, there are some circumstances where the change may not be beneficial, and users should consult HPCCC staff, or see req #4545 for detailed information about the impact of various settings of F_SETBUF and F_HSDIR on the performance of direct-access i/o. The increase in the F_SETBUF value will also cause a small increase in the memory requirements of programs, by at most 1 Mbyte for each open file; see req #4545 for more details.

3. HPCCC TX7 large buffer i/o utilities

To improve the performance of standard UNIX utilities on the TX7s, NEC has provided special versions with large (4 Mbyte) buffers. These utilities are located in /opt/gfsext/bin and /opt/gfsext/usr/bin and include commands such as cp, cat and tar. Man pages can be viewed with:
"man -M /opt/gfsext/man:/opt/gfsext/share/man command"
Please verify that your existing scripts work transparently with these utilities before permanently prepending the directories to your PATH environment variable, e.g.
ksh users
PATH=/opt/gfsext/bin:/opt/gfsext/usr/bin:${PATH}
csh users
setenv PATH /opt/gfsext/bin:/opt/gfsext/usr/bin:${PATH}
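One way to carry out the check suggested above is to run a utility from both locations on the same input and compare the results. The following is a minimal sketch only: the function name same_output and the variable GFSEXT_BIN are illustrative and not part of the NEC-supplied tools, and the approach only suits commands that write their result to standard output (such as cat).

```shell
#!/bin/sh
# Sketch: compare the output of a standard utility with the version of
# the same name in the large-buffer directory.
# GFSEXT_BIN and same_output are hypothetical names used for illustration.
GFSEXT_BIN=${GFSEXT_BIN:-/opt/gfsext/bin}

# same_output CMD FILE
#   returns 0 if both versions of CMD give byte-identical output on FILE,
#   1 if the outputs differ, 2 if the alternative version is not present.
same_output() {
    so_cmd=$1
    so_file=$2
    if [ ! -x "$GFSEXT_BIN/$so_cmd" ]; then
        echo "skip: $GFSEXT_BIN/$so_cmd not found"
        return 2
    fi
    so_tmp=${TMPDIR:-/tmp}/same_output.$$
    "$so_cmd" "$so_file" > "$so_tmp.std"              # version found via PATH
    "$GFSEXT_BIN/$so_cmd" "$so_file" > "$so_tmp.alt"  # large-buffer version
    if cmp -s "$so_tmp.std" "$so_tmp.alt"; then
        echo "ok: $so_cmd"
        so_rc=0
    else
        echo "DIFFERS: $so_cmd"
        so_rc=1
    fi
    rm -f "$so_tmp.std" "$so_tmp.alt"
    return $so_rc
}

# Example call (run it before the directories are added to PATH):
#   same_output cat /etc/hosts
```

Commands such as cp or tar, which write files rather than standard output, would need a similar comparison of the files they produce.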
(Simple tests of cat showed the new version typically used 1/10 of the CPU time, and completed in about 1/3 of the elapsed time.)

Note that several SUPER-UX utilities on the SX-6s have selectable i/o buffer sizes (using [-b bufsize] or similar); their man pages can be viewed using "sxman" on the TX7s: e.g. cat, cp, tar, cpio.

4. HPCCC SX-6 $LOCALDIR quotas, and user limits

From 29th November, quotas of 80 Gbyte were set on the $LOCALDIR (/ltmp) file systems on the SX-6 nodes. As well, interactive sessions were limited to 30 minutes of processor time, and 30 Gbyte of memory. These changes were made to prevent run-away tasks from impacting other users. The actual quota and limit values may be revised in future.

5. HPCCC SYSTEM CHANGE NOTICE: 2004-H030 NQSII qsub wrapper update

TARGET Date of Change: MON-06-12-2004
SYSTEM: SX-6, TX7, cross environments, gale, cherax, farrer
SOFTWARE AFFECTED: NQSII qsub
IMPACT: new features available - no loss of functionality
DETAILS: Reference req #3622

The qsub wrapper script for NQSII has been updated so that it can handle cases where multiple scripts are specified. This is a capability of the NQSII qsub binary (man qsub) which has not been used much at HPCCC to date. Multiple files can be specified, separated by spaces (' ') and colons (':'), to convey information about dependencies between jobs, e.g.
qsub init.sh work1.sh:work2.sh:work3.sh finish.sh
This submits 5 jobs, specifying that init.sh is to run first; on its successful completion, work1.sh, work2.sh and work3.sh are to be run in parallel; and if they all complete successfully, finish.sh will run. The individual scripts can contain directives to specify the job limits and queues.

CONTACT: Gareth.Williams@csiro.au 9669 8114
NOTICE APPROVAL: Ramesh Balgovind

6. cherax - update on problems

Unfortunately, cherax crashed between 13:50 and 14:00 on Monday 29th November. This was the first crash in over 18 days, which is an improvement. The investigation points to a communications failure between two modules, and is most likely hardware related.

HPCCC and SGI staff have identified activities associated with the slowdowns, and SGI is producing special versions of DMF housekeeping commands which will slow down the inode scanning that seems to be the trigger for the slowdowns.

SGI has identified a problem in Linux which relates to the problem of files being given the wrong ownership.

A new version of DMF was installed last weekend, which has some improvements and new features. We believe the core dumps of dmls are provoked when there are problems with the NFS mount of an ancillary disc on cherax (being used for data transfer).

The setting of the variable NLSPATH in logins was changed so that Intel Fortran runtime messages can be displayed correctly.

7. cherax - $WORKDIR flushing

On Tuesday, 23rd November, the $WORKDIR area on cherax filled. We manually flushed some old files. We will be implementing automated flushing of this area soon. The flushing will be carried out when the file system reaches a high usage level, and will remove files and empty directories, from the oldest to the youngest, until the free space is sufficient. In any case, the process will not flush files modified or accessed in the last 7 days. The log of the flushing can be seen at $WORKDIR/../flush.status. The recent flush removed files not accessed since 21 July this year.
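To get a rough idea of which of your files fall outside the 7-day protection window described above, a find command along the following lines may help. This is a sketch under assumptions: the function name flush_candidates is illustrative, and the actual flushing process may apply its age test differently.

```shell
#!/bin/sh
# Sketch: list regular files that have been neither accessed nor modified
# in the last 7 days, i.e. files the 7-day rule would not protect.
# flush_candidates is a hypothetical helper, not an HPCCC-supplied command.

# flush_candidates [DIR]
#   DIR defaults to $WORKDIR, or the current directory if that is unset.
#   find counts whole 24-hour periods, so "+6" selects files last
#   touched 7 or more days ago.
flush_candidates() {
    find "${1:-${WORKDIR:-.}}" -type f -atime +6 -mtime +6 -print
}

# Example call:
#   flush_candidates "$WORKDIR"
```

Files listed this way are only eligible for flushing; whether they are removed depends on how full the file system is, since flushing stops once the free space is sufficient.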
Please consider whether you have squirrelled away files in the $WORKDIR area on cherax which you never thought would be removed, but which are now subject to removal.

8. CSIRO Software update - Gaussian 03

Gaussian 03 is now installed and ready for use on the SX-6 and Altix systems for CSIRO users who have signed the CSIRO HPSC Software Access Agreement - see http://intra.hpsc.csiro.au/forms/HPSC_LicensedSoftware.pdf or contact hpchelp@hpsc.csiro.au.

From the Gaussian website at http://www.gaussian.com/

"Gaussian 03 is the latest in the Gaussian series of electronic structure programs. Gaussian 03 is used by chemists, chemical engineers, biochemists, physicists and others for research in established and emerging areas of chemical interest."

"Starting from the basic laws of quantum mechanics, Gaussian predicts the energies, molecular structures, and vibrational frequencies of molecular systems, along with numerous molecular properties derived from these basic computation types. It can be used to study molecules and reactions under a wide range of conditions, including both stable species and compounds which are difficult or impossible to observe experimentally such as short-lived intermediates and transition structures."

"Traditionally, proteins and other large biological molecules have been out of the reach of electronic structure methods. However, Gaussian 03's ONIOM method overcomes these limitations. ONIOM first appeared in Gaussian 98, and several significant innovations in Gaussian 03 make it applicable to much larger molecules."

"Gaussian 03 brings a substantial maturing of features introduced in earlier versions (such as the ONIOM facility and the program's linear scaling techniques), increasing their applicability and efficiency. It also expands the range of properties that can be predicted, in both the gas phase and in solution.
Finally, it introduces several new capabilities, now ready for initial use (e.g., periodic boundary conditions, ADMP molecular dynamics)."

9. HPCCC SX-6/TX7 system outage - McData switch re-configuration

There is an outage scheduled for the morning of Tuesday 7th December, to update the McData Fibre Channel switch to allow the connection of the new discs which come with the SX-6/TX7 system upgrade. Jobs will be held if possible, and there will be no interactive access to the SX-6/TX7 system.