Bulletin 148 - 2005 October 26

  1. HPCbull formats
  2. Access permissions - all systems
  3. SX-6/TX7 outages - software upgrades
  4. Totalview class
  5. SX-6 C++/SX and Fortran90/SX compiler updates - recompile?
  6. SX-6 - new versions of netCDF and NCO
  7. SX-6/TX7 - $TMPDIR re-work
  8. SX-6/TX7 file system stall
  9. SX-6 scheduling - single CPU jobs
  10. SX-6 job migration and use of local (non-GFS) files
  11. SX-6/TX7 - use of the -S shell parameter in NQSII batch jobs
  12. Global search capability in the req system
  13. On-line Learning
  14. Altix (cherax) system upgrade
  15. Altix (cherax) and CSIRO Data Store updates
  16. CSIRO Linux Cluster update
  17. CSIRO - Fortran compilers

1. HPCbull formats

There have been a number of requests for different formats of the HPCbull. We currently provide three: a text version, accessible on most of the HPCCC systems through the hpcbull command; an e-mail text version sent to users; and an html-format version on the HPCCC WWW pages. The html version has each item linked from the contents list, for easier navigation to particular items.

As well, the text e-mail version now includes a link at the top to the WWW pages, so that users who prefer to read the bulletins on-line can easily go there from the e-mail.

Because the HPCbull attempts to address the needs of two organisations collaborating in the HPCCC, and there is a diversity of users, some of the items are relevant to only some users. The titles of the items attempt to guide users to items relevant to them, with general items and SX-6/TX7 items coming first, then Altix (cherax) items, then other CSIRO HPSC items. But there are significant cross-overs between the organisations - both CSIRO and Bureau users use the SX-6/TX7 system and the Altix, and there are collaborative projects such as Bluelink spanning the organisations.


2. Access permissions - all systems

It is good policy to ensure that files on HPCCC systems have the tightest access control possible.

We have recently encountered an issue with NFS services, which are typically used to allow machines to share file systems and which underlie the NEC GFS services. In this case, granting execute permission on a file also granted read permission to it.

Users are urged to ensure their files are given the least access possible - consider commands like:

 
 chmod -R go-rwx $HOME

for example, to remove access for all other users.

As a minimum, users should remove write permissions for others from their files. To check for files with other-write permission, try:

 find . ! -type l -perm -2 -print -exec ls -ald {} \;

and to remove world write permission, try

 chmod -R o-w .

All user file systems should be checked: $HOME, $DATADIR, $WORKDIR, etc.
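
As a sketch only, the check above could be wrapped in a small shell function and run over each area in turn. The function name list_world_writable is made up for this illustration and is not an HPCCC utility; the find options are the same ones shown above, and nothing is modified.

```sh
# list_world_writable: hypothetical helper name, not an HPCCC utility.
# Prints every non-symlink with the other-write bit set under each
# directory given as an argument. It only reports; it changes nothing.
list_world_writable() {
    for dir in "$@"; do
        # Skip unset variables and areas that do not exist on this system.
        if [ -z "$dir" ] || [ ! -d "$dir" ]; then
            continue
        fi
        find "$dir" ! -type l -perm -2 -print
    done
}
```

It might be called as list_world_writable "$HOME" "$DATADIR" "$WORKDIR", and any files it reports could then be fixed with chmod o-w.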

For sharing with other users within a group, you can allow group read and execute permission. The HPCCC can set up special groups to allow sharing across sub-groups and between selected users across organisations.

(Removing as much write permission as possible is a good idea, as it provides extra protection - particularly for archive data.)

The HPCCC strongly recommends that users do not store passwords in files, for example, to expedite ftp usage. HPCCC staff can advise on alternatives.

We wouldn't be writing this item if we didn't think it important.


3. SX-6/TX7 outages - software upgrades

The final outage needed to complete the upgrade of the TX7 took place this morning.

During November, a further outage is scheduled for the SX-6 kernel, and for NQSII/ERSII and TX7 critical patches.

Tue 8 Nov. 0730-1030. Jobs to be held from 0700. Complete SX-6/TX7 outage.

HPCCC will advise closer to the time whether this outage will go ahead.

A patch is being installed on all SX-6 nodes to the tcsh shell - this corrects a problem which frequently prevents jobs using the tcsh from being checkpointed (and hence being saved in system shutdowns, or being migrated to less-busy nodes).


4. Totalview class

A PC-video of the recent Totalview class is available. Contact Greg Roff. The class presentation is available at http://www.hpccc.gov.au/hpccc/seminars/ .


5. SX-6 C++/SX and Fortran90/SX compiler updates - recompile?

No users have reported issues with making R313 the default Fortran SX compiler.

The HPCCC will go ahead with its plans to make R313 the default Fortran SX compiler at 09:00 on Wednesday 9 November on the SX-6s, the TX7s, and all the cross platforms.

At the same time we will also change the C++ compiler default to Rev.067.

Older versions of the compilers will remain available: for example, you can access the R302 compiler (the current default) with the command

 sxcross sxf90/rev302

sxcross -h shows all the available versions and options.

With the recent SUPER-UX upgrade to R15.1, and the change to the default compilers, we recommend that all users re-compile and re-link their applications (while keeping an old executable!). This may provide performance improvements, and may identify problems before it is too late to find an answer.

HPCbull 145 publicised the sxcoffinfo utility, which can show compilation and library versions and dates.

  • simple help is provided by "sxcoffinfo -version"
  • the man page is displayed by "man sxcoffinfo" or man -M /SX/local/man sxcoffinfo
  • for details of a particular object file use "-l objectfileName"


6. SX-6 - new versions of netCDF and NCO

A new version of netCDF is to be installed on the SX-6/TX7 system and gale at 09:00 on 9 November. The version to be installed is netCDF 3.6.0-p1.

There is a new version of NCO, release 3.0.2, which contains bug fixes; it will also be installed if we can adequately test it before the 9th.


7. SX-6/TX7 - $TMPDIR re-work

The HPCCC and NEC are working on an improved setup of the $TMPDIR file system areas, for job and session-temporary files.

The naming convention for what underlies $TMPDIR is likely to change soon, and the cleanup of these areas should be more tightly controlled, using the NEC NQSII user exit capability, which allows site-specific code to be executed as jobs move from one state to another, e.g. from running to post-running.

As well, interactive $TMPDIR areas for a specific host will be removed whenever a host is re-booted, along with batch $TMPDIR areas where appropriate.


8. SX-6/TX7 file system stall

There was a stall of the TX7 GFS services early on the morning of Sunday 15th. The services stalled as a file system filled.

HPCCC staff are taking measures to reduce the likelihood of file systems filling - user and group quotas are being imposed, and an automatic flushing script for the $WORKDIR areas will be run - see HPCbull 132, item 12, for a description of how the flushing will work.

NEC will have a correction for the stall on site by 25 November, and it will be installed at the first opportunity thereafter.


9. SX-6 scheduling - single CPU jobs

At times, the SX-6 system is very busy with multi-CPU jobs. However, there are difficulties with packing jobs onto the 8-CPU SX-6 nodes. For example, with a collection of 3-CPU jobs, only two at a time can run on a node, and 2 CPUs are wasted.

There is an opportunity for a good supply of single CPU jobs to be run at the same time as multi-CPU jobs, and they are likely to get good throughput, providing they are migratable between nodes.
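
The packing arithmetic can be sketched in a few lines of shell. This assumes, for illustration only, that a node runs as many whole jobs of one size as fit in its 8 CPUs; the real scheduler weighs other factors, and the function name idle_cpus is made up.

```sh
NODE_CPUS=8   # CPUs per SX-6 node, as stated above

# idle_cpus: hypothetical helper. Given the CPU count of each job,
# print how many CPUs on a node are left over when the node is filled
# with jobs of that size only.
idle_cpus() {
    job_cpus=$1
    jobs_per_node=$((NODE_CPUS / job_cpus))   # integer division
    echo $((NODE_CPUS - jobs_per_node * job_cpus))
}

idle_cpus 3   # prints 2: two 3-CPU jobs fit, leaving 2 CPUs
idle_cpus 1   # prints 0: single CPU jobs can fill any gap
```

The left-over CPUs are the ones that a supply of migratable single CPU jobs could use.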


10. SX-6 job migration and use of local (non-GFS) files

All SX-6 users should note that when using local files on SX-6 nodes, automatic job migration MUST be disabled or the job will fail if it is migrated. A migrated job does not take its files with it, so on restart it will fail and be put into the zombie state because the necessary files are no longer available to it. HPCCC staff will sometimes send e-mails notifying users of these jobs.

Common causes are redefining $TMPDIR away from the default a job starts with, and using local file systems or memory-resident file systems. If you must do any of these, also set the no-migrate option - see HPCbull 135, item 3 (http://www.hpccc.gov.au/hpccc/user_news_advice/news/).

Note that by default all jobs are set to enable migration (-J y). If you use files local to the SX-6 node you MUST specify -J n.

A recent investigation for a major Bureau application showed higher performance using GFS than using local SX-6 file systems, after i/o tuning had been carried out.


11. SX-6/TX7 - use of the -S shell parameter in NQSII batch jobs

Some users had been using the -S path parameter in qsub commands or in batch jobs, to specify a shell for the batch job. man qsub includes:

     -S path_name_list
          Specify the shell to execute the shell script for batch
          requests.  

The HPCCC has found problems with this approach - not all the environment gets set up correctly, and there was a major problem with batch jobs using -S hanging earlier in the year.

The preferred approach is to include a line like:

 #!/bin/ksh

as the first line of a batch job, since this gives compatibility with interactive execution of the script.

Indeed, the current version of the qsub wrapper in use at the HPCCC rejects jobs which specify the -S option.

For users who do not use the qsub wrapper, the HPCCC will work with the users to help ensure the environment stays the same after the removal of the -S path parameters.
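
As a minimal sketch, a user could check that a job script starts with an interpreter line before submitting it. The function name has_shebang is invented for this example; it is not part of the qsub wrapper, and it only tests that the first line begins with #! followed by an absolute path.

```sh
# has_shebang: hypothetical helper. Succeeds (exit status 0) if the
# first line of the given file starts with "#!/", e.g. "#!/bin/ksh".
has_shebang() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!'/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

For example, has_shebang myjob.sh || echo "myjob.sh has no interpreter line" would flag a script that still relies on -S (myjob.sh is a placeholder name).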


12. Global search capability in the req system

The HPCCC request system now has an extra button giving a global search capability - this allows searches across multiple categories, such as active/resolved.


13. On-line Learning

www.hpccc.gov.au/links/ includes links to useful on-line tutorial information from world-wide HPC sites. Particular attention is drawn to the many quality tutorials linked via "Computational Science and Engineering Learning Resources" and "Language Resources".


14. Altix (cherax) system upgrade

The CSIRO Altix (cherax) and the CSIRO Data Store will be unavailable for most of Monday 31st October, to allow the connection of the additional hardware - 64 additional processors and 64 Gbyte of memory.

All services depending on the availability of the Data Store will also be unavailable: for example, the http://www.hpsc.csiro.au/ WWW pages, and the $STOREDIR file system on the clusters.

There will also be a minor software upgrade to Service Pack 6 with patches, to correct a recently seen problem.


15. Altix (cherax) and CSIRO Data Store updates

There have been several service interruptions recently - details are given in http://intra.hpsc.csiro.au/user/incident_reports/ax/ .

We have revised the dump strategy for cherax, and backups are now completing successfully each night.

To ease the load and delays, we recently changed the strategy to flush far more off the primary disc overnight. This leaves the disc relatively empty to cope with incoming data during the day, and reduces the need to write data to tape during prime time, which will allow better access to tape drives for recalls. However, it leaves a longer window of vulnerability for data, when there is only one copy - on disc.

The i/o load on the system remains high, along with the amount of data being ingested - see http://intra.hpsc.csiro.au/user/usage/ds/ for information about the total holdings, and access to monthly reports of how much each user and group is holding. On one recent day, 860 Gbyte was ingested - that is about half of the maximum amount that can be ingested with the current tape drives and strategy. Every Gbyte ingested results in 5 Gbyte of i/o - writing to disc once, reading from disc twice and writing to tape twice.
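
The i/o multiplier can be checked with simple shell arithmetic. This is only a worked illustration of the figures quoted above; the variable and function names are made up.

```sh
# Each Gbyte ingested is written to disc once, read from disc twice
# and written to tape twice, as described above.
DISC_WRITES=1
DISC_READS=2
TAPE_WRITES=2

# total_io_gbyte: hypothetical helper. Print the total i/o in Gbyte
# generated by ingesting the given number of Gbyte.
total_io_gbyte() {
    echo $(( $1 * (DISC_WRITES + DISC_READS + TAPE_WRITES) ))
}

total_io_gbyte 860   # prints 4300
```

On these figures, the 860 Gbyte ingested on the day mentioned corresponds to about 4300 Gbyte of i/o.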

The HPSC is looking at future limits on data growth - financial constraints, or cost recovery for large usage.


16. CSIRO Linux Cluster update

There was a complete failure of the IBM RAID disk array on burnet on Thursday 6th October - a disc module failed, and the RAID unit failed in its attempt to rebuild the disc array. Consequently, all data on the disc unit was lost, including the complete configuration.

After several days, the system was re-built, using a newer operating system which had been scheduled to be installed. All user $HOME files were restored from a backup on cherax.

Application services were progressively restored - see the login messages.

HPCCC staff are taking measures to guard against a similar failure again, using the new second disc unit.

The batch queues on burnet have been updated to provide two execution queues - a short queue for jobs requesting up to 2 hours wall time and up to 4 nodes, and a general queue (dque) for other jobs. The default queue is now called world, and jobs automatically route to either short or dque.

The load in the dque will be limited, so that there will always be resources available to the short queue - typically for testing and development.
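
The routing rule can be sketched as follows. This is an illustration of the rule as described here, not the batch system's actual configuration; the function name route_queue is made up, and wall time is given in whole minutes to keep the arithmetic simple.

```sh
# route_queue: hypothetical helper. Given a requested wall time in
# minutes and a node count, print the execution queue a job submitted
# to the default queue would be routed to under the rule above.
route_queue() {
    wall_minutes=$1
    nodes=$2
    # short takes jobs of up to 2 hours wall time AND up to 4 nodes.
    if [ "$wall_minutes" -le 120 ] && [ "$nodes" -le 4 ]; then
        echo short
    else
        echo dque
    fi
}

route_queue 90 2     # prints short
route_queue 600 2    # prints dque
route_queue 90 8     # prints dque
```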

These changes were made because of the very long wall time being requested by many jobs, blocking other users from getting timely access. It is likely that we will have to limit very long jobs to a significantly smaller part of the machine.


17. CSIRO - Fortran compilers

CSIRO HPSC has recently acquired a floating licence for the Intel Windows Fortran compiler - this licence can be made available to users throughout CSIRO for use on any machine, as with the existing Intel Linux Fortran compilers.

Please contact CSIRO HPSC staff for more information.



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia