Bulletin 115 - 2004 May 26

  1. Close of service on the SX-5s, florey and russell
  2. Welcome to Polly Morgan
  3. Scheduling - continuing changes to get the best result
  4. Variation in execution times - memory access contention.
  5. SX-5 files mirrored on cherax for CSIRO users
  6. Use of the do_tx7 script
  7. cherax failures, and stress on DMF
  8. Newer versions of software on cherax, and the pkgenv command
1. Close of service on the SX-5s, florey and russell

The HPCCC plans to shut down the SX-5 systems, florey and russell, on

Monday 31st May at 4 pm.
Vale!

Please ensure that all the files you need from the SX-5s are stored elsewhere. (CSIRO users, please see item 5 below.)

2. Welcome to Polly Morgan

CSIRO HPSC and the HPCCC welcome Polly Morgan to the 24th Floor.

Polly comes from Monash University, and will be working in the systems administration area. She will initially concentrate on the configuration of the new CSIRO cluster systems, as well as administration of the HPCCC systems and the Altix.

3. Scheduling - continuing changes to get the best result

HPCCC staff are carefully watching the scheduling of jobs on the SX-6s.

The initial concentration is on the best possible performance of the operational jobs. This is sometimes at odds with getting the best throughput for all other jobs. For example, at the extreme, we could reserve enough nodes for exclusive use for the operational work, but would then lose throughput for other work.

We are trying to share nodes to some extent, but are still assessing the impact on operational jobs of lower-priority work sharing the same nodes. Possible contention is a major issue: memory access and I/O are the obvious sources.

We are awaiting a new capability (scheduling by requested number of CPUs) from NEC. (In the meantime, we are using memory-resident file-system (MRFS) allocation as a proxy for CPU allocation, by defining a default MRFS allocation for some queues, and load balancing using that proxy allocation.)

4. Variation in execution times - memory access contention.

We have seen jobs on the SX-6s vary in their user-CPU time by up to 60%. We believe that this could be caused by memory contention with other jobs on a node.

5. SX-5 files mirrored on cherax for CSIRO users

Please note that the area /cs/data/SX5userdata on cherax should be treated by users as a READ-ONLY area. Please don't create files in there, and for the few users still on the SX-5s, don't delete files on the SX-5 in the expectation that they will remain in /cs/data/SX5userdata: they won't!

Twice per day, the mirroring process makes an exact image of each user's files from florey:/cs/home onto cherax:/cs/data/SX5userdata . This includes deleting from the mirror all files that are not in florey:/cs/home .

At the close of the SX-5 service, we plan to move each CSIRO user's SX-5 files from the location cherax:/cs/data/SX5userdata/group/csabc into the directory ~abc123/SX-5 for each user on cherax (provided that directory does not already exist), so that the files will be under each user's control. If you do not want this action, please let us know.

6. Use of the do_tx7 script

The do_tx7 script provides convenient functionality for scripts running on the SX-6s, to allow commands to be executed on the TX-7s where more appropriate or essential. For example, the wider networks are not visible from the SX-6s, and you might need to execute something like:

  do_tx7 rcp file gale:

However, the do_tx7 command has some overhead, and in some cases it may be better to have a single do_tx7 command run a script on a TX-7 than to issue a multitude of do_tx7 commands.

For example, use:

do_tx7 sh << EOF 
rcp $HOME/file1 host1:
rcp file2 host2:dir2
rcp file3* host3:dir3
EOF

or

do_tx7 "rcp $HOME/file1 host1:; rcp file2 host2:dir2; rcp file3* host3:dir3" 

rather than

do_tx7 rcp $HOME/file1 host1:
do_tx7 rcp file2 host2:dir2
do_tx7 rcp file3* host3:dir3

Users are advised to check the status of any critical commands.
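
As a minimal sketch of one way to check status when several commands are sent through a single do_tx7 call: each command inside the here-document reports its own failure, and the calling script collects the report. The do_tx7 function defined here is only a local stand-in so that the sketch can be tried anywhere, and true and false are placeholders for the real rcp commands. This bulletin does not say whether the real do_tx7 passes back the exit status of the remote command, so the sketch does not rely on that; it does assume that output from the TX-7 side comes back on standard output.

```shell
#!/bin/sh
# Local stand-in for do_tx7 (assumption, for illustration only): it simply
# runs its arguments. On the SX-6s, remove this and use the real command.
do_tx7() { "$@"; }

# One do_tx7 call runs the whole script; each step reports its own failure.
# "true" and "false" are placeholders for commands such as rcp.
report=$(do_tx7 sh << 'EOF'
true  || echo "FAILED: step1"
false || echo "FAILED: step2"
true  || echo "FAILED: step3"
EOF
)

# An empty report means that every step succeeded.
if [ -n "$report" ]; then
    echo "$report" >&2
fi
```

A critical job script could stop at this point if the report is not empty, rather than carrying on with missing files.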

For very small files, it will be more efficient to run commands such as a copy on the SX-6s themselves than to use do_tx7 to initiate the copy on a TX-7.

7. cherax failures, and stress on DMF

cherax crashed between 09:30 and 09:40 on Friday 21 May, and again around 10:00 on Tuesday 25 May. We do not yet know why. Service was restored within 30 minutes.

Also on Friday afternoon, the /cs/datastore file system filled for a brief time. We had hoped that this would not occur on the new system. We have lowered some thresholds to make a recurrence less likely, and have also requested an enhancement from SGI.

On Monday 24 May, there was a heavy load of retrievals, which meant that some requests were delayed. We will consider re-introducing the wrapper script for dmget, to automatically break large recall requests into smaller bunches, to allow a better response for all users. We have also requested an enhancement from SGI.

Finally, one user found that files were being rapidly re-migrated after being recalled. This is because file migration is based on file size and access time, and recalling a file does not count as an access. If you are recalling large numbers of files in batches with the dmget command, we recommend that you judiciously use the touch -a command to update the access time of the files, e.g.

 touch -a files*
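
A sketch of the batching idea, combined with the touch -a advice above. This is not the wrapper script for dmget mentioned earlier; the function names recall_in_batches and process_first, the RECALL_CMD variable and the batch size are all illustrative assumptions. The sketch recalls the files in groups of a chosen size and updates the access time of each group once its recall has succeeded.

```shell
#!/bin/sh
# Hypothetical helper, not the site wrapper: recall files in small batches
# and update their access times so they are not re-migrated straight away.
# RECALL_CMD defaults to dmget; set RECALL_CMD="echo dmget" for a dry run.

# process_first COUNT FILE...: act on the first COUNT of the FILE arguments.
process_first() {
    keep=$1
    shift
    total=$#
    seen=0
    # Rotate through the arguments, re-appending only the first $keep of
    # them, so that "$@" ends up holding exactly one batch.
    while [ "$seen" -lt "$total" ]; do
        if [ "$seen" -lt "$keep" ]; then
            set -- "$@" "$1"
        fi
        shift
        seen=$((seen + 1))
    done
    ${RECALL_CMD:-dmget} "$@" && touch -a "$@"
}

# recall_in_batches BATCH_SIZE FILE...
recall_in_batches() {
    batch_size=$1
    shift
    if [ "$batch_size" -lt 1 ]; then
        echo "recall_in_batches: batch size must be at least 1" >&2
        return 2
    fi
    while [ "$#" -gt 0 ]; do
        take=$batch_size
        if [ "$#" -lt "$take" ]; then
            take=$#
        fi
        # Stop at the first batch whose recall or touch fails.
        process_first "$take" "$@" || return 1
        shift "$take"
    done
}
```

For example, recall_in_batches 100 files* would recall the matching files 100 at a time. The batch size of 100 is arbitrary; a suitable value would depend on the load on DMF.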


8. Newer versions of software on cherax, and the pkgenv command

We have put some new software on cherax, along with pkgenv scripts to set up the environment. If all is working well, the following commands will set up your path to use the corresponding packages:

pkgenv python-2.3.3
pkgenv perl-5.8.4
pkgenv tcl8.4.6

Also, we have installed a new autoconf and some related tools in /tools/gnu:

 pkgenv gnu

Users are invited to test these new versions.

Note that if pkgenv does not work for you in a given shell, you can enable it via:

 source /usr/local/etc/pkgenv.csh 

or

 . /usr/local/etc/pkgenv.sh
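
For sh-style shells, a start-up file could guard this step so that the setup file is read only when pkgenv is not already available. A minimal sketch: the function name ensure_pkgenv is hypothetical, and it assumes that sourcing the setup file is what defines the pkgenv command.

```shell
#!/bin/sh
# ensure_pkgenv [SETUP_FILE]
# Hypothetical helper: make pkgenv available in the current sh-style shell.
# SETUP_FILE defaults to the path given in this bulletin.
ensure_pkgenv() {
    setup_file=${1:-/usr/local/etc/pkgenv.sh}
    # Nothing to do if pkgenv is already known to this shell.
    if type pkgenv >/dev/null 2>&1; then
        return 0
    fi
    if [ -r "$setup_file" ]; then
        . "$setup_file"
    else
        echo "pkgenv setup file not readable: $setup_file" >&2
        return 1
    fi
}
```

After a successful call to ensure_pkgenv, commands such as pkgenv perl-5.8.4 can follow as shown above.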

In general, we plan to leave the released versions of software from SGI in the normal locations, and provide newer versions through the pkgenv facility, as above.



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

© Copyright 2010, CSIRO Australia