Bulletin 139 - 2005 June 20

  1. New HPCCC qsub local version
  2. HPCCC Feedback
  3. New FORTRAN90/SX version for SX-6s
  4. HPCCC SX-6 system - job migration
  5. Fortran reminder - implicit typing
  6. HPCCC Users' Liaison Meeting
  7. Data Stores at the HPCCC
  8. cherax and CSIRO Data Store downtimes and upgrades
  9. Upgrade to CSIRO Cluster system
  10. OPeNDAP on cherax
  11. Bureau SamFS Outage 2005-H008
  12. CSIRO Cluster system - batch job memory specification

1. New HPCCC qsub local version

The local qsub wrapper script will be updated on Tuesday morning, 21st June, at about 08:45. The new version corrects some problems and, in answer to a request from users (see req #5108), also prints the date and time of job submission.


2. HPCCC Feedback

A new feedback and suggestion form has been added to the HPCCC Help Desk menu on http://www.hpccc.gov.au.

Submissions are unattributed unless you choose to include contact details.


3. New FORTRAN90/SX version for SX-6s

The 302 version of the FORTRAN90/SX compiler has been available for some time on the SX-6 front-end and cross systems. This has been evaluated by NEC Australia staff, and interested early adopters. Many items have been fixed or improved, and there have been no significant performance penalties.

All users who have not already tried 302 should do so this week because it is planned to make 302 the default version on Wednesday 29th June.

Users can try the new version by using the commands

sxcross latest

or

sxcross sxf90/rev302

on the cross-systems.

Users who have serious application problems using 302 can continue accessing 285 by entering

sxcross sxf90/rev285

Continuing with 285 is a workaround while your problem is being researched and corrected, and will generally not be a viable long-term option.

Please immediately report any problems you might experience with 302.

Use the sxenv command to check your SX cross environment.

See the description under Cross Systems in the Local Userguides, SX-6 Cluster Userguide under User Documentation at http://www.hpccc.gov.au/hpccc/


4. HPCCC SX-6 system - job migration

Job migration is now working successfully on nodes sx600 to sx617 of the SX-6 system.

However, we are still seeing a few job failures, caused by jobs using local disc on nodes, but not setting the flag -J n to specify no migration. When such a job is migrated, it can't find its files because they are local to the node where the job started.

Jobs that cannot be restarted by the ERS scheduler are assigned a LOW priority and left in a Zombie state. They can be seen via the "erstatj" output with a "Z" flag. Refer to the ERS man pages for further explanations: man erstatj on the TX7s.

The local SX-6 UserGuide at http://www.hpccc.gov.au/hpccc/userdocs/ has been updated to include information on the -J n flag.

Users have a choice: either use local disc and specify no job migration, or avoid local disc and allow job migration. It is a trade-off between potentially faster I/O and faster throughput.

(Note that explicit copying of local files is available using the NQS II -G options, but the HPCCC needs to do further experimentation on this facility before making recommendations to users - see req #5043 for some work in progress.)


5. Fortran reminder - implicit typing

We recently had a case where some Fortran code failed to do what the user expected.

The problem was in calling a system function, which returned an integer result, but was converted to real because of implicit typing.

Our recommendation is that all Fortran code should include the declaration

Implicit None


6. HPCCC Users' Liaison Meeting

Minutes of the HPCCC Users' Liaison Meetings can be found at http://www.hpccc.gov.au/hpccc/meeting/user_liaison/.

The latest minutes contain some brief notes on HPCCC plans and reports.

Meetings are to be held on the second Thursday of each month for the remainder of this year. If you work remotely from the HPCCC site but would like to attend a meeting, please contact Rob Bell.


7. Data Stores at the HPCCC

The primary holdings in the CSIRO Data Store recently passed the 100 Tbyte mark. The Bureau's SAM-FS system holds 250 Tbyte of primary data.

Users are asked to consider removing any data that is of little or no further value, but please take care not to delete data that is still needed.

A good technique for increasing protection against accidental deletion or modification is to remove write permission from crucial files and directories, e.g.

chmod -R a-w keep*
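
A minimal sketch of this technique in a scratch directory follows. The directory and file names (keep_results, run1.dat) are hypothetical, chosen only to match the keep* pattern above. The sketch removes write permission, prints the resulting permission strings, and then restores write permission for the owner so the scratch area can be deleted.

```shell
#!/bin/sh
# Work in a temporary scratch directory; all names here are hypothetical.
demo=$(mktemp -d)
mkdir "$demo/keep_results"
echo "important" > "$demo/keep_results/run1.dat"

# Remove write permission for user, group and others, recursively.
chmod -R a-w "$demo"/keep*

# Capture the permission strings; no 'w' should remain in either.
perms_dir=$(ls -ld "$demo/keep_results" | cut -c1-10)
perms_file=$(ls -l "$demo/keep_results/run1.dat" | cut -c1-10)
echo "$perms_dir $perms_file"

# Restore write permission for the owner so the scratch area can be removed.
chmod -R u+w "$demo"/keep*
rm -rf "$demo"
```

To modify or delete protected data later, write permission has to be restored first, for example with chmod -R u+w.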


8. cherax and CSIRO Data Store downtimes and upgrades

The expansion of the disc system underlying the main file systems on cherax has been delayed, and is now likely to happen in July. There will be further downtime.

The upgrade to the cache disc went ahead, and the cache area is now 4.6 Tbyte - all small files are now resident there.


9. Upgrade to CSIRO Cluster system

An upgrade of the IBM Cluster system (burnet) commenced on Wednesday 15th June. The system will grow as follows:

Nodes                                   Old   New
dual 3.2 GHz Xeon processors
  with 2 Gbyte per node                  17    28
dual 3.2 GHz Xeon processors
  with 4 Gbyte per node                  17    28
dual 3.2 GHz EM64T processors
  with 2 Gbyte per node and
  with Infiniband interconnect            0    28

There will be a total of 84 dual-processor nodes, with a peak performance of 84 * 2 * 3.2 * 2 = 1.075 Tflop/s.
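
The peak figure can be checked with a short calculation. This sketch assumes the final factor of 2 is the number of floating-point operations per clock cycle per processor, which the bulletin does not state explicitly. The variable names are illustrative only.

```shell
#!/bin/sh
nodes=84            # total nodes after the upgrade
procs_per_node=2    # dual-processor nodes
clock_ghz=3.2       # processor clock in GHz
flops_per_cycle=2   # assumption: 2 floating-point operations per cycle

# Gflop/s = nodes * processors * GHz * flops per cycle; divide by 1000 for Tflop/s.
peak_tflops=$(awk -v n="$nodes" -v p="$procs_per_node" \
                  -v c="$clock_ghz" -v f="$flops_per_cycle" \
                  'BEGIN { printf "%.3f", n * p * c * f / 1000 }')
echo "Peak performance: $peak_tflops Tflop/s"
```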

The new EM64T processors will allow the use of 64-bit mode computing, while retaining backward compatibility with ia32 architecture.

The Infiniband interconnect will be used for highly-connected problems across multiple nodes, and should provide significantly better bandwidth and lower latency.

There will be an additional 1 Tbyte of disc on the management node.

On the 15th June, the following was accomplished:


Hardware:
  • installation of extra shared disk space on the management node
  • addition of eight extra 32-bit compute nodes
Software:
  • upgrade of batch system to Torque 1.2.0p4
  • change of OS on compute nodes from Red Hat 9 to CentOS 3.4 (CentOS 3.4 is a rebuild of Red Hat Enterprise Linux 3)

Below is a list of items to be completed at subsequent downtimes, which will be advertised through the message of the day on burnet's management node.

  • Configuring and installing a new Gigabit Ethernet switch
  • Installing new blades for the existing chassis, and moving existing blades to keep them ordered by the amount of RAM installed
  • Switching to the Maui scheduler
  • Setting up shared data space, which isn't backed up, on the new disk
  • Possibly moving /work to the new disk
  • Using XFS for the file systems on the new disk
  • Later, installing the hardware and software for the EM64T nodes and the Infiniband switches

On the cluster, to set up your environment to use the Intel compilers (including shared libraries) please use pkgenv ifort or pkgenv icc (which will set your environment up to use the latest 8.1 versions of these compilers).

Other applications set up with pkgenv include the following: lam, mpich, ncarg, netcdf.

Type pkgenv to see a complete listing of applications that use pkgenv.

If you need pkgenv in ksh and it is not available, type:

. /usr/local/etc/pkgenv/pkgenv.sh

to enable pkgenv for use.


10. OPeNDAP on cherax

OPeNDAP is now deployed on 'cherax' and operating successfully.

OPeNDAP is designed to serve public data directly from a visible host. It required some tailoring to fit into the HPCCC environment, where we wish to provide data to authenticated users only, users have access only to their own data, and the server (cherax) sits behind the proxy server (hpscworld).

A detailed administration guide is available on the intranet, and a brief end-user guide is available via the APAC software documentation system at http://nf.apac.edu.au/facilities/software/

For more information about setting up OPeNDAP services, contact the HPCCC.


11. Bureau SamFS Outage 2005-H008

SAM-FS will be unavailable between 8:15 am and 11 am on Tuesday, 21 June 2005, for system maintenance and upgrade preparatory work. During this period, file archiving and file retrievals will fail.


12. CSIRO Cluster system - batch job memory specification

We need to schedule jobs based on their memory usage on the CSIRO IBM cluster system.

The correct parameter to use is -l vmem=100MB or similar.

Would all users please start to specify a memory limit on their jobs. To help in this process, we will impose a default limit of 100 Mbyte on Wednesday 29th June. After that date, jobs without a vmem specification may well fail.

Note that the maximum memory that can be specified is about "2000MB" on the low-memory nodes and "4000MB" on the high-memory nodes - not "2GB" or "4GB".
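
A batch script following these rules might look like the sketch below. The script name (myjob.sh), the job name and the program (./my_model) are hypothetical, and the 500MB request is only an example value. The #PBS directive form is the usual Torque way of embedding qsub options in a script. The sketch writes the job script to a file and leaves the actual submission commented out.

```shell
#!/bin/sh
# Write a small Torque job script that requests virtual memory explicitly.
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -N vmem_example
#PBS -l vmem=500MB
# Use MB values such as 2000MB or 4000MB rather than 2GB or 4GB.
cd "$PBS_O_WORKDIR"
./my_model
EOF

# Confirm the memory request is present before submitting.
vmem_line=$(grep '^#PBS -l vmem=' myjob.sh)
echo "$vmem_line"

# Submit on the cluster (commented out in this sketch):
# qsub myjob.sh
```

The same request can also be given on the command line, as in qsub -l vmem=500MB myjob.sh.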



BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

Comments to:


© Copyright 2010, CSIRO Australia