Bulletin 128 - 2004 Oct 29

  1. HPCCC TX7 extended outage on 28th October
  2. HPCCC all - reduced air-conditioning capability
  3. Data protection - warnings
  4. HPCCC SX-6 do_tx7 and TX7 do_sx6 utilities
  5. Submissions to the REQ system
  6. HPCCC limits on SX-6 interactive sessions
  7. HPCCC Web
  8. NEC FORTRAN90/SX Programmer Reference Guide
  9. HPCCC ersys command update
  10. cherax - update on problems
  11. CSIRO Software update - GRADS and ferret
  12. Shared cherax usage, and transferring data between cherax and gale
  13. cherax - using ftp

1. HPCCC TX7 extended outage on 28th October

An outage on the morning of 28th October to replace a disc interface card caused more extensive disruption than intended: mawson was rebooted, GFS failed over, and responses on eccles paused. Some jobs had to be killed to prevent possible I/O corruption. Sorry.

2. HPCCC all - reduced air-conditioning capability

From today until mid next week, the machine room will be running with reduced cooling capability, while chillers are being moved from the basement area to the roof of the building. If the forecast temperatures are correct, then some load shedding may be needed. This will start with some of the SX-6 nodes.

If the one working chiller has a problem, then much of the machine room equipment will need to be shut down hurriedly. Running jobs may not be saved.

The CSIRO cluster compute nodes (on nelson and burnet) may be switched off this evening.

3. Data protection - warnings

Further to the item hpcbull 127.4, here are the backup schedules for the SX-6/TX7 file systems.

The following are the only SX-6/TX7 user areas which are backed up:

Bureau users: weekly backup to disc:

 /bm/home /bm/share /bm/keep/bmrcshare 
 /bm/keep /bm/nmoc/keep /bm/nmoc/eeraq1
 /bm/nmoc/gasp /bm/nmoc/laps1 /bm/nmoc/gaspeps 
 /bm/nmoc/laps2 /bm/nmoc/lapstr /bm/nmoc/lapseps 
 /bm/nmoc/ocean /bm/nmoc/other 

CSIRO users: $HOME - an incremental dump is done every night into the cherax data store. (We will shortly be removing older files; so far we have kept every file caught by the backups since April.)

4. HPCCC SX-6 do_tx7 and TX7 do_sx6 utilities

Users are _strongly_discouraged_ from using either do_tx7 or do_sx6 without great care.

These utilities were developed to ease the transition from the tightly coupled SX-5 to the decoupled SX-6 environment. The utilities add significant risk to jobs because there is no direct way to know if the do_tx7 or do_sx6 worked correctly unless error checking is purposefully added.

One way to reduce the risk is to make scripts self-testing: create error paths in the do_tx7 script and leave a status file at the end of the do_tx7. When control resumes in the SX-6 script, check the file. The file could signal pass/fail through its contents, or serve as a simple locking mechanism.
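The status-file idea can be sketched as follows. This is only an illustration: STATUS_FILE, remote_step and check_step are hypothetical names, the do_tx7 invocation itself is left out because its syntax is not given here, and the sketch assumes the status file sits on a file system that both the SX-6 and the TX7 can see.

```shell
#!/bin/sh
# Sketch only: STATUS_FILE and both function names are hypothetical.
STATUS_FILE="${STATUS_FILE:-${TMPDIR:-/tmp}/tx7_step.$$.status}"

# TX7 side (inside the script handed to do_tx7): run the work given as
# arguments, then record the outcome. Only full success writes "pass".
remote_step() {
    rm -f "$STATUS_FILE"            # never trust a stale file from an earlier run
    if "$@"; then
        echo pass > "$STATUS_FILE"
    else
        echo fail > "$STATUS_FILE"
    fi
}

# SX-6 side: call this once control returns from do_tx7.
# Succeeds only if the status file exists and contains exactly "pass".
check_step() {
    [ -f "$STATUS_FILE" ] && [ "$(cat "$STATUS_FILE")" = "pass" ]
}
```

A missing status file is treated the same as "fail", so a do_tx7 that never ran, or died part way through, is also detected. The SX-6 script would then continue only after something like `check_step || exit 1`.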

Always use a floating IP version of the TX7 identifier, not a TX7 specific one, e.g. use tx7.

In reality anything operating between an SX-6 and TX7 that is a networked command will have similar risk issues. The differences are in being able to detect success/fail.

Another solution is to replace network connections (do_ or rxxx commands) with NQSII batch jobs. NQSII does not provide the tight synchronisation of a do_xx utility, but it is more reliable in that a host crash will cause the job to be rerun rather than to disappear. A similar link can be made by email, although with equally indeterminate lag.

Please consider these points when developing your scripts.

5. Submissions to the REQ system

The histories of problems submitted through the REQ system sometimes get very verbose. To assist HPCCC managing, and users reading, problem reports and resolutions, the following courtesies are requested for all submissions and follow up texts:

  1. Submit using "text only" email. Do not send in "html" or "text and html". Many mailers offer this as a user-specified setting in their options for sending mail. HTML shows in the REQ log as lengthy html commands, which have to be sifted through and eventually removed manually.

  2. Do not include previous text in a Reply. The previous text is already shown in the REQ to which you are replying. Replies that keep including all previous text get very long indeed.

  3. When using email "Reply", do NOT use the Subject line of "Systems Support Online Autoreply" which our system sends as an automated acknowledgement of your request. The only thing you should keep is the "[req #nnnn]". Change the text to something meaningful. Otherwise REQ will recognize it and assume that a mailing loop has occurred and will therefore reject your response.

Thank you for your assistance in supporting maximum clarity within the REQ system.
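For anyone who replies from a script, the Subject-line rule in item 3 could be applied along these lines. The helper name req_subject and the request number in the example are made up for illustration; only the "[req #nnnn]" tag format comes from the text above.

```shell
# Hypothetical helper: keep only the "[req #nnnn]" tag from the old Subject
# and attach a new, meaningful description to it.
req_subject() {
    old_subject=$1
    new_text=$2
    # Extract the tag; prints nothing if the Subject carries no tag.
    tag=$(printf '%s\n' "$old_subject" |
          sed -n 's/.*\(\[req #[0-9][0-9]*]\).*/\1/p')
    if [ -z "$tag" ]; then
        echo "no [req #nnnn] tag found in: $old_subject" >&2
        return 1
    fi
    printf '%s %s\n' "$tag" "$new_text"
}
```

For example, `req_subject "Re: [req #1234] Systems Support Online Autoreply" "TX7 job killed on 28 Oct"` prints `[req #1234] TX7 job killed on 28 Oct`.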

6. HPCCC limits on SX-6 interactive sessions

Although we have strongly discouraged interactive access to the SX-6 nodes, there are occasions when this is useful, e.g. in debugging.

Some months ago a runaway process consumed the entire memory of an SX-6 node. To prevent a recurrence of such problems, and in response to user requests for this protection, we will shortly impose interactive resource and disc quota limits on the SX-6 nodes.

The limits will be:

 process time - 30 minutes
 memory - 30 Gbyte
 $LOCALDIR - 80 Gbyte

7. HPCCC Web

www.hpccc.gov.au has been updated and is now beginning to be filled out with information. Today you should have access to a menu list on the left side. Specific attention is called to the "User Documentation" which, for the SX-6, now includes a search facility.

For some of the search results, you have an option to highlight the search term. If you click on the result link, you get the plain text. If you click on the "highlight" option, the searched term is highlighted throughout, and clicking on a highlighted term advances the display to its next occurrence.

The NEC SX-6 Seminar presentation files (from last April) are also available from the menu.

Access to this web site is automatically granted if you are on either the Bureau or CSIRO networks. If you are outside, there will be a username and password facility established. Usernames and an initial password will be provided by request. (The web site will not be integrated into NIS.)

There will now be continuous additions and improvements. Note that non-Bureau, non-CSIRO access does not see the left hand side menu, and that the menus are slightly different for Bureau and CSIRO staff.

8. NEC FORTRAN90/SX Programmer Reference Guide

NEC has delivered a new FORTRAN90/SX Programmer Reference Guide as part of their documentation improvement project. A pdf is available and can be accessed upon request to the HPCCC. During November an html version will be integrated into the documentation set on www.hpccc.gov.au.

For those missing printed manuals, this one is printer friendly, and you may print your own from the pdf.

9. HPCCC ersys command update

The ersys utility on the TX7s, gale and cherax has been changed to report time in UTC (rather than US EDT).

10. cherax - update on problems

Since the last HPCbull, we have had different problems with cherax.

  • We have had two further crashes - these are believed to be related to hardware connection problems - the system was checked on Tuesday 26th October.
  • We have identified the activities associated with slowdowns. We will attempt to keep these activities outside prime time, but DMF may need to initiate such activities at any time in response to high space usage, and so we are vulnerable to such slowdowns.
  • We have seen file system problems, particularly wrong ownership of new files, and these files being of zero length. We did a full check of the /cs/datastore file system on Tuesday 26th October. SGI is investigating the crossed-file problem.

11. CSIRO Software update - GRADS and ferret

The Grid Analysis and Display System (GrADS) has been installed on cherax, farrer and the clusters nelson and burnet.

Use pkgenv grads to access the software on cherax.

Ferret (Data Visualization and Analysis) will be installed soon on burnet.

12. Shared cherax usage, and transferring data between cherax and gale

Several Bureau users now have access to cherax, for data sharing.

Note that since the machine is now shared between Bureau and CSIRO users, users should check for any world read-write-execution permissions, and adjust according to their needs.

Bureau users will mostly be using shared files with other users, so permissions on files and directories must be set to allow access. We can set up special groups to allow easy sharing if requested.
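One way to carry out the permission check described above is with find and chmod. This is a sketch rather than a site-provided tool; the function names are hypothetical, and the starting directory defaults to $HOME.

```shell
# List everything under a directory that grants any permission to "other".
# -perm -004, -002 and -001 match world-read, world-write and world-execute.
world_accessible() {
    find "${1:-$HOME}" \( -perm -004 -o -perm -002 -o -perm -001 \) -print
}

# Remove all "other" permissions from a file or a whole directory tree.
# Use with care: files that are meant to be shared this way become private.
close_to_world() {
    chmod -R o-rwx "$1"
}
```

Reviewing the output of world_accessible before running close_to_world avoids closing off files that are shared deliberately. Where a group has been set up for sharing, group permissions are left untouched by this sketch.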

Bureau users can now transfer data directly between cherax and gale.

Transfer should be done using the jumbo frame network, using the names galejf and cheraxjf. You need to add the following line in your gale .rhosts:

 cheraxjf your_username

and you need to add a .rhosts file on cherax with:

 galejf your_username

This will allow you to use rlogin, remsh/rsh and rcp between the two machines. Note that the remote shell command is remsh on gale and rsh on cherax.
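The .rhosts entries above can be added without creating duplicates using a small helper like the one below. The function name add_rhosts_entry is hypothetical. The chmod 600 reflects the common behaviour that r-commands ignore a .rhosts file that others can write to; that this applies on gale and cherax is an assumption.

```shell
# Hypothetical helper: append "host user" to an .rhosts-style file only if
# that exact line is not already present, then restrict the file to its owner.
add_rhosts_entry() {
    file=$1
    host=$2
    user=$3
    line="$host $user"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string
    grep -qxF "$line" "$file" || printf '%s\n' "$line" >> "$file"
    chmod 600 "$file"
}
```

On gale this would be run as `add_rhosts_entry "$HOME/.rhosts" cheraxjf your_username`, and on cherax as `add_rhosts_entry "$HOME/.rhosts" galejf your_username`.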

Remember that cherax /bm/home is on a migrating file system, and there will be delays in accessing off-line files. If you want to check on the locality of a file, use the cherax command

 dmls -l  files*

for the list of files you want. Files in an off-line state are marked with OFL. You can explicitly request the retrieval of a batch of files with the command

 dmget files &

See the temporary guide at http://intra.hpsc.csiro.au/datastore/userdocs/Altix_HPSC_guide/ for more information about working with a migrating file system.
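The two commands above can be combined so that only the off-line files are recalled before a job or transfer starts. The sketch below rests on assumptions: that dmls -l prints the state as a separate field (OFL, possibly in parentheses) before the file name, that the file name is the last field, and that dmget run in the foreground returns once the files are back on-line. The function names are hypothetical, and file names containing spaces are not handled.

```shell
# Read "dmls -l" output on stdin and print the names of off-line files.
# Assumed format: state field "OFL" or "(OFL)" somewhere before the name,
# with the file name as the last field on the line.
offline_names() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i ~ /^\(?OFL\)?$/) { print $NF; next } }'
}

# Recall whichever of the given files are off-line, in the foreground.
# DMLS and DMGET may be overridden; they default to the cherax commands.
recall_offline() {
    names=$(${DMLS:-dmls} -l "$@" | offline_names)
    [ -z "$names" ] && return 0     # nothing off-line, nothing to do
    ${DMGET:-dmget} $names          # unquoted on purpose: one argument per name
}
```

A script would call `recall_offline files*` and start its real work only once that returns successfully.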

You can also access the files from the TX7 system with r- commands, using the names cheraxjf, tx7jf, mawsonjf and ecclesjf.

You need to augment your .rhosts file on cherax with:

 mawsonjf  your_username
 ecclesjf  your_username

13. cherax - using ftp

It has been reported that some ftp utilities to cherax can time-out and fail if there is a long delay in retrieving a migrated file.

Usually, rcp is recommended for transfers between machines in the HPCCC, and scp for remote file transfers. ftp will work if you first ensure that the files are on-line using touch and dmget.



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia