Bulletin 147 - 2005 October 03
1. HPCCC SX-6/TX7 enhancement to NQSII qstat - special user privilege

The default NQSII qstat command gives access to information about only the calling user's jobs, unless a higher level of privilege is available. At present, the local qstato command gives a controlled higher level of privilege, so that users can see information about all users' jobs. The latest version of qstat has an additional privilege level called "Special user privilege", invoked by including the parameter -P s on the qstat command (or by giving the shell variable NQSII_PRIV the value PRIV_SPU). The "Special user privilege" has now been enabled on the system, which allows users to access information about all jobs with the released NEC qstat command. At some time in the future, the qstato command will no longer be supported.

2. HPCCC new qsub and qsubnew versions

A new version of qsubnew was recently installed. It fixes the behaviour of -R (show revision only and exit) and also displays the qsub wrapper script version (now in RCS). For example:

    [abc123@tx701 ~]$ qsubnew -R
    qsub wrapper version: 1.4
    NQSII CUI Version R02.20 / API Version R02.20 (linux64)

This new version will be installed as the default local qsub wrapper on Wednesday 5 October.

3. HPCCC SX-6/TX7 outages - software upgrades

During October, HPCCC and NEC staff will be upgrading the operating system on the TX7s. The major new features are:
The new version of Linux on the TX7s has been running for about 6 months on a test TX7. The following outages are planned to update the software and bring the new features into service.
Phase 1a Tue 4 Oct. 0830-0930. mawson software upgrade.
Phase 1b Wed 5 Oct. 0830-0930. eccles software upgrade.
Phase 2 Wed 12 Oct. 0700-1100. Jobs to be held from 0700.
Phase 3 Wed 19 Oct. 0700-1100. Jobs to be held from 0700.
Phase 4 Wed 26 Oct. 0730-1100. Jobs to be held from 0730.

During November, there will be a further outage for SX-6 Kernel and NQSII/ERSII and TX7 critical patches.
Tue 8 Nov. 0730-1030. Jobs to be held from 0700.

(All HPCCC announced outages can be postponed if there is a significant weather event or other Bureau operational factor.)

4. TotalView class scheduled

TotalView is the debugger of choice for parallel programming, and especially for MPI. It is available on the SX-6. Joerg Henrichs (NECA) will conduct an SX-6 TotalView debugging seminar on Tuesday 11 October.
10:00-11:30 TotalView Lecture in the BMRC 9E Lecture Room

To register, contact Len.Makin@csiro.au, or 03 9669 8109. Only four places remain.

5. HPCCC C++/SX and Fortran90/SX compiler updates

Tests of the R313 Fortran90/SX compiler have so far been satisfactory. The HPCCC plans to make R313 the default Fortran SX compiler at 09:00 on Wednesday 9 November on the SX-6s, the TX7s, and all the cross platforms. Please advise if there are problems with this, and we will modify plans to suit your needs. At the same time we will also change the C++ compiler default to Rev.067. Older versions of the compilers will remain available: for example, you can access the R302 compiler (the current default) with the command

    sxcross sxf90/rev302

sxcross -h shows all the available versions and options.

6. HPCCC new versions of netCDF and NCO

New versions of netCDF and NCO are to be installed on the SX-6/TX7 system and gale at 09:00 on 9 November. The version to be installed is currently planned to be 3.6.0-p1, but the NCO web site has announced release 3.0.2, which contains a bug fix, and that may be the target version.

7. Linux Cluster problems, and file transfers

There have been many hardware failures on the burnet cluster recently. We are liaising with IBM to seek an improvement in reliability. In particular, burnet's head node crashed multiple times in September, and we have had 5 recent failures of disc drives.

The cluster head nodes are also experiencing overloads, caused principally by NFS traffic from the compute nodes and from cherax. Users running i/o intensive jobs should use local disk space on the compute nodes where possible (instead of writing to their home directory or /work via NFS) and then copy any data back to the head node using scp or rcp. The environment variable $LOCALDIR will let you access scratch space on the local disk of a compute node.
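As a rough illustration of this advice, the following shell sketch checks for a result file in $LOCALDIR and copies it to the head node in a single transfer. The function name stage_out, the COPY_CMD override and the file names are hypothetical, chosen only for illustration; rcp, $LOCALDIR and the head node name mgt are the ones used in this bulletin.

```shell
#!/bin/sh
# Sketch only: stage a result file out of node-local scratch.
# Assumptions (not from the bulletin): the helper name stage_out and the
# COPY_CMD variable are hypothetical conveniences.

# Copy command to use; defaults to rcp, can be overridden (e.g. with scp).
COPY_CMD="${COPY_CMD:-rcp}"

# stage_out FILE DEST
#   Copy $LOCALDIR/FILE to DEST (for example mgt:$WORKDIR).
#   Returns non-zero if LOCALDIR is unset or the file does not exist.
stage_out() {
    file="$1"
    dest="$2"
    if [ -z "${LOCALDIR:-}" ]; then
        echo "LOCALDIR is not set; is this a compute node?" >&2
        return 1
    fi
    if [ ! -f "$LOCALDIR/$file" ]; then
        echo "no such file in scratch space: $LOCALDIR/$file" >&2
        return 1
    fi
    # One copy at the end of the job, instead of many writes over NFS.
    $COPY_CMD "$LOCALDIR/$file" "$dest"
}
```

In a job script this might be called as stage_out output.dat mgt:$WORKDIR once the program has finished writing output.dat into $LOCALDIR.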
For example, on a compute node:

to copy files from a compute node to your home directory on the head node:

    rcp file mgt:

to copy files from a compute node to your work directory on the head node:

    rcp file mgt:$WORKDIR

to copy files from the head node to the scratch space on a compute node:

    scp mgt:file $LOCALDIR

We have set up a direct connection between cherax and the compute nodes on burnet, to minimise traffic on the head node. To copy files from cherax to a compute node, you should now do the following:

    rcp cherax-cluster:file $LOCALDIR

To copy files from the scratch space on a compute node to the CSIRO Data Store on cherax, you should now do the following:

    rcp $LOCALDIR/file cherax-cluster:

More information will be given in the UserGuide at http://intra.hpsc.csiro.au/userguides/ax . If you need specific help with modifying any scripts, please let us know by emailing hpchelp@hpsc.csiro.au.

At some stage, we may reduce or indeed completely remove the NFS services, because of the overload problems they cause.

8. Linux Cluster upgrades and maintenance

The burnet cluster will be down from 17:30 on Friday 7th October for most of the weekend to install a more resilient version of the operating system. Because it will be a new install, all running jobs will be lost, and queued jobs may also be lost. We will start draining the job queue about 24 hours beforehand.

9. Altix system upgrade

CSIRO has taken delivery of an upgrade to cherax, the Altix system: this will take the system from 64 processors and 180 Gbyte of memory to 128 processors and 224 Gbyte of memory. The workload has been building up on the Altix: for example, during June there was an average of 89 processors' worth of work queued or running in the batch system for the 64 available processors. Incorporation of the new hardware is expected to be complete by the end of October.
This upgrade will mean further outages to configure, test, and of course, utilise the new hardware. Details will be provided later.

10. Altix system i/o overloading

In recent months, a class of work running on cherax has been able to swamp the i/o capabilities of the system. We have taken to suspending this work when DMF needs to carry out important housekeeping, because the i/o intensive work has a severe impact: we have seen tape drives capable of 30 Mbyte/s reduced to 1 Mbyte/s. At times, this heavy i/o load has other impacts: for example, the batch system times out before being able to report status. There was no successful backup of the CSIRO datastore for four days last week because of this load. Accidentally deleted files can now only be given some chance of a restore; previously, a restore would have been guaranteed within a 35 day timeframe for any file present at the start of the nightly backup. Staff are investigating causes and remedies.

11. HPCCC - better access to software and versions

HPCCC and NEC staff are investigating the modules package for better control of access to software and multiple versions. See http://modules.sourceforge.net/ and, for the proposed implementation, http://www.hpccc.gov.au/hpccc/userdocs/index_user.shtml . The modules package may allow us to replace our local sxcross and sxenv packages and similar, and provide easier-to-use facilities. Hence, we have stopped enhancements of these local utilities, pending trials and decisions about the modules package.

12. CSIRO HPSC network access failure

The network link to CSIRO HPSC was down for most of the weekend 1-2 October. Service was restored at about 08:00 on 3 October. (A router had failures in both of its power supplies.) Unfortunately, most of the CSIRO work running on the systems at the HPCCC failed over the weekend because of the lack of network connectivity.
We are investigating ways to provide a more reliable link, and to be able to restore service more rapidly.