Bulletin 130 - 2004 Dec 08
1. HPCCC Global File Systems and the SYNC option

The GFS SYNC option will be introduced onto the CSIRO nodes and file systems from 09:00 on Thursday 9th December. (See SYSTEM CHANGE NOTICE: 2004-C009 SYNC Option for CSIRO SX-GFS Filesystems.) Both CSIRO and Bureau users are asked to report immediately any slowdowns in job throughput or reduced I/O performance.

2. HPCCC TX7 failure

There was a TX7 overload condition in the early morning of Tuesday 7th December, and the fail-over to the other TX7 failed. All (51) running SX-6/TX7 jobs were killed. (See SYSTEM EVENT NOTIFICATION: Dual TX7 failure Tue 7-12-2004.) Once the GFS SYNC option is fully implemented, jobs will no longer have to be killed.

3. HPCCC contact

With the move to IP-based 'phones at the HPCCC, many of the staff desktop systems are connected through the 'phone network. If you are unable to reach the usual Help Desk 'phone number, (03) 9669 8103, or get no response to urgent e-mails, then please contact 0428 108 333.

4. WWW page updates

The HPCCC WWW pages at http://www.hpccc.gov.au/ are being filled rapidly. These pages are accessible to all Bureau and CSIRO staff.

Minutes of meetings, such as the User's Liaison Meeting, can be found there from the link "Meeting Minutes". System change notices can be found from the link to "System Status".

Note that the new NEC SX Fortran Programmers Guide is accessible from the "User Documentation" link. It contains useful information about things like the use of the environment variables F_SETBUF and F_HSDIR, currently pertinent to I/O performance on GFS file systems.

Raw SX-6 usage statistics can be seen for now at http://www.hpccc.gov.au/hpccc/system_stats/usage/SX-6/. System-wide aggregated data graphs, thought to be most valuable for users, will be made available through the "system statistics" menu on www.hpccc.gov.au over the coming weeks. This will be a work in progress as we learn which specific metrics are of sufficient interest to include.

5. CSIRO - Mathematica

Mathematica is now available on cherax, and will be restricted to CSIRO users who have signed the CSIRO HPSC Software Access Agreement - see http://intra.hpsc.csiro.au/userguides/forms/HPSC_LicensedSoftware.pdf or contact hpchelp@hpsc.csiro.au.

6. cherax - solutions to problems

6.1 Slowdowns

SGI has provided us with workarounds for DMF that control the file system scanning responsible for the slow response on cherax. These updated DMF utilities were installed on Saturday 4th December. Since then, there have been no significant slowdowns on cherax. SGI has also identified the real cause of the slowdowns, which is associated with caching of inodes in memory. A fix is in the pipeline.

6.2 Crashes

SGI has identified the cause of the three most recent crashes - an error in network drivers when there is an uneven amount of memory in the internal nodes. A fix is in the pipeline.

6.3 Files with wrong ownership

SGI has also identified the causes of the problem of files having the wrong ownership, and a fix is being tested.

6.4 Failing dmls command

We have identified the circumstances under which the dmls command crashes (an NFS mount problem). SGI has provided a version of the command for testing, to provide better diagnosis of the failure.

6.5 Fixes and outages

SGI is working to bring these fixes to cherax over the next two weeks. We may have a series of outages to install the fixes as they arrive, or wait until they have all arrived and install them together. In either case, we will need to upgrade to a new version of Propack, SGI's Linux enhancement package for the Altix. Forthcoming outages will be notified in the messages of the day shown upon login. If you want to run long-running jobs, then please ask us for updates on the expected outages.

The good news is that once these fixes are in place, we expect the Altix to provide a reliable service again.

6.6 Flushing of $WORKDIR

Note that flushing of the $WORKDIR (/work) areas on cherax has started - see the file /work/flush.status for the latest.
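
For users who want to check the flush status from a script before relying on files in $WORKDIR, the following is a minimal sketch only. The path /work/flush.status comes from this bulletin, but the format of that file is not described here, so the sketch simply prints it unchanged. The helper names read_flush_status and files_oldest_first are hypothetical, and listing files oldest-first is an assumption about what is useful to review; the bulletin does not state which files the flush selects.

```python
import os
from pathlib import Path
from typing import List, Optional, Tuple

# Path taken from the bulletin; the file's format is not documented here.
FLUSH_STATUS = Path("/work/flush.status")


def read_flush_status(path: Path = FLUSH_STATUS) -> Optional[str]:
    """Return the text of the flush status file, or None if it cannot be read."""
    try:
        return path.read_text(errors="replace")
    except OSError:
        return None


def files_oldest_first(root: Path, limit: int = 20) -> List[Tuple[float, Path]]:
    """Return up to `limit` (mtime, path) pairs under `root`, oldest first.

    Assumption: older files are the ones most worth reviewing. The actual
    flush criteria are not given in the bulletin.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            try:
                found.append((p.lstat().st_mtime, p))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    found.sort(key=lambda item: item[0])
    return found[:limit]


if __name__ == "__main__":
    status = read_flush_status()
    print(status if status is not None else f"cannot read {FLUSH_STATUS}")
    workdir = os.environ.get("WORKDIR")
    if workdir:
        for mtime, path in files_oldest_first(Path(workdir)):
            print(f"{mtime:.0f}  {path}")
```

Run on cherax, this would print the current status file followed by the oldest files under $WORKDIR with their modification times in seconds since the epoch; files that matter can then be copied elsewhere by whatever means you normally use.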
© Copyright 2010, CSIRO Australia