Bulletin 197 - 2010 February 2

  1. Bureau of Meteorology Sun Constellation system (solar)
  2. Getting help with solar - solarhelp@bom.gov.au
  3. CSIRO Data Store usage report upgrade
  4. CSIRO ASC - 'procs' batch system resource
  5. CSIRO ASC - requesting resources on the GPU cluster
  6. CSIRO cherax error message
  7. CSIRO ASC New and Upgraded software

1. Bureau of Meteorology Sun Constellation system (solar)

The Sun Constellation system (solar) at the Bureau of Meteorology is being brought into production use, with work at present concentrated on porting the operational suite to the system. The system is expected to be available for general access by the end of February.

The Sun Constellation consists of 576 nodes, each with two quad-core 64-bit Intel Xeon processors (code-named Nehalem), totalling 4608 CPU cores. Each node has 24 Gbytes of main memory and 24 Gbytes of flash memory instead of local disc. All of the nodes are connected by a dual-rail Infiniband network, with data rates of 40 Gbit/s per connection.
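As a quick check on these figures, the totals can be reproduced with shell arithmetic. The aggregate memory value is derived here from the per-node figure and is not quoted elsewhere in this bulletin; the variable names are illustrative only.

```shell
# Figures taken from the system description above.
nodes=576
sockets_per_node=2
cores_per_socket=4
mem_per_node_gbyte=24

# Total CPU cores and aggregate main memory across all compute nodes.
echo "CPU cores: $(( nodes * sockets_per_node * cores_per_socket ))"
echo "Main memory (Gbyte): $(( nodes * mem_per_node_gbyte ))"
```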

The system runs the CentOS distribution of Linux, uses Sun Grid Engine for job management, and uses the Lustre global file system, which comprises 115 TB of disk space.

In addition, there are 4 user login nodes, and 6 data-mover nodes.


2. Getting help with solar - solarhelp@bom.gov.au

A Userguide is being prepared, and is accessible at:

Users requiring assistance with solar should log an incident report using cSupport:

  • Report via email -
  • Report a problem via your web browser - http://helpdesk1.bom.gov.au/User (Note: this link only works from within the BOM network)

Problems reported via a web browser or email are entered in the cSupport incident tracking system, and can then be made visible to the staff most able to solve them. You can use the web interface at http://helpdesk1.bom.gov.au/User to check the progress of a problem, and you can follow up the request by replying to email sent to you about it. Please include only the immediately relevant history in your reply; otherwise the information is duplicated in the system and the problem becomes difficult to follow.

If you have an urgent query out of hours, please contact Bureau operations staff on 03 9669 4006, who will contact the appropriate support personnel.


3. CSIRO Data Store usage report upgrade

Users can see reports on Data Store usage at http://intra.hpsc.csiro.au/user/usage/ds/

Reports are available for each group on the Data Store. Group names correspond to rather old names for CSIRO Divisions and special purpose groups.

We have added a new field to the reports - the average file size.

We would appreciate it if users aimed for larger rather than smaller file sizes, because there is a large overhead in recalling small files from tape. With the large-capacity T10000 drives, which read at 130 Mbyte/s (and faster with the average compression we see), recalling a 1 Gbyte file takes about 75 s to load, mount and position the tape, about 8 s to read the 1 Gbyte, and perhaps another minute to rewind, unload and put the tape away again.

So the tape read time is only about 5% of the total recall time for a 1 Gbyte file (about 8 s out of roughly 143 s).
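The recall estimate can be reproduced with a short calculation. This is only a sketch: it uses the timings quoted above, takes 1 Gbyte as 1000 Mbyte, ignores compression, and the function name recall_estimate is purely illustrative.

```shell
# Estimate the cost of recalling one file from tape.
# Usage: recall_estimate SIZE_IN_MBYTE
# Prints the total time, the read time, and the read time as a share of the total.
recall_estimate() {
    awk -v size="$1" 'BEGIN {
        setup_s    = 75     # load, mount and position the tape
        rate       = 130    # uncompressed drive read rate in Mbyte/s
        teardown_s = 60     # rewind, unload and put the tape away
        read_s  = size / rate
        total_s = setup_s + read_s + teardown_s
        printf "total=%.0fs read=%.1fs read_share=%.1f%%\n", total_s, read_s, 100 * read_s / total_s
    }'
}

recall_estimate 1000    # a 1 Gbyte file
recall_estimate 10      # a 10 Mbyte file
```

Comparing the two calls shows why small files are expensive: the fixed mount and rewind time barely changes, so almost none of the elapsed time is spent reading data.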

Please aim to have files around 1 Gbyte or bigger, or recall batches of files using the dmget command, so that multiple files are recalled from each tape.
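A batch recall might look like the sketch below, which passes many file names to dmget at once instead of touching files one at a time. The function name recall_tree and the example path are hypothetical, and the fallback branch exists only so that the sketch can be tried on a host without dmget.

```shell
# Recall every regular file under a directory in as few dmget calls as possible,
# so that files on the same tape can be recalled together.
recall_tree() {
    dir="$1"
    if command -v dmget >/dev/null 2>&1; then
        # xargs packs many file names onto each dmget command line.
        find "$dir" -type f -print0 | xargs -0 dmget
    else
        # Not on a Data Store host: just list what would be recalled.
        echo "dmget not found; files that would be recalled:"
        find "$dir" -type f
    fi
}

# Example (hypothetical path):
#   recall_tree /path/to/run_output
```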


4. CSIRO ASC - 'procs' batch system resource

A new feature in the batch system has been enabled on cherax, burnet and the gpu cluster (linuxgpu): a 'procs' resource, which you can use to simply request the number of cpu cores that your (MPI) jobs need. The 'procs' may be allocated on any mix of nodes (burnet and linuxgpu only - cherax is a single node). This may result in large jobs starting sooner, but it increases the possibility of contention with other jobs.

Note: You can still use the 'nodes' and 'ppn' resource syntax. This may be useful on burnet and linuxgpu to ensure that the system assigns the desired number of cpu cores per node, if required.
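The two request styles can be compared with the sketch below, which only prints the qsub command lines rather than submitting anything. The helper name resource_request, the script name job.sh and the core counts are illustrative assumptions, and the helper assumes the core count divides evenly by the cores per node.

```shell
# Build the -l argument for an MPI job.
# One argument:  request that many cores anywhere ('procs').
# Two arguments: request a fixed number of cores on each node ('nodes' and 'ppn').
resource_request() {
    cores="$1"
    per_node="$2"
    if [ -z "$per_node" ]; then
        echo "procs=$cores"
    else
        echo "nodes=$(( cores / per_node )):ppn=$per_node"
    fi
}

echo "qsub -l $(resource_request 16) job.sh"      # cores may land on any mix of nodes
echo "qsub -l $(resource_request 16 8) job.sh"    # 8 cores on each of 2 nodes
```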


5. CSIRO ASC - requesting resources on the GPU cluster

The scheduler configuration has been changed on the gpu cluster to define two gpu resources per node (up from one which was set as a temporary measure). The gpu resources are configured as generic countable resources (gres), which the scheduler assigns on a per-task basis.

At this time we don't recommend that users run concurrent separate gpu jobs on a node, so for simple serial gpu jobs it is important to request both gpus with gres=gpu:2, using either of:

  qsub -l procs=1,gres=gpu:2
  qsub -l nodes=1,gres=gpu:2

For jobs needing gpus on multiple nodes, one of the following would be appropriate (assuming 3 nodes).

If your code can only handle 1 gpu per node, still request both, so that no other gpu job shares the node:

  qsub -l nodes=3:ppn=1,gres=gpu:2

Or, if you can use both gpus (with one core each to drive them), request one gpu per task:

  qsub -l nodes=3:ppn=2,gres=gpu

Or, if you can use extra cores and just want the whole node, you don't need to (and can't!) request gpus:

  qsub -l nodes=3:ppn=8

Other cases are possible if we get to a situation where concurrent gpu jobs on a node are OK, but these are the main expected cases for now.
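Because gpu resources are assigned per task, the two multi-node requests above both end up with two gpus on each node. The sketch below shows the arithmetic, assuming that each requested core (ppn) counts as one task and that gres=gpu with no count means one gpu per task; both are assumptions made for illustration, and the function name is hypothetical.

```shell
# gpus requested on each node = tasks per node (ppn) x gpus per task (gres count).
gpus_per_node() {
    ppn="$1"
    gres_count="$2"
    echo $(( ppn * gres_count ))
}

echo "nodes=3:ppn=1,gres=gpu:2 -> $(gpus_per_node 1 2) gpus per node"
echo "nodes=3:ppn=2,gres=gpu   -> $(gpus_per_node 2 1) gpus per node"
```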


6. CSIRO cherax error message

Since the system software upgrade in November 2009, one of the system scripts on cherax has often raised the error message:

  Segmentation fault

The cause of this is not yet known. If any user sees this error, please contact us: it is probably not a problem with your script.


7. CSIRO ASC New and Upgraded software

  • Intel Development Tools
    • Fortran V11.1.064 (burnet,cherax)
    • C/C++ V11.1.064 (burnet,cherax)
    • Trace Analyzer and Collector V7.2.2.006 (burnet,cherax)

    For more information and usage instructions please see the software map: http://nf.nci.org.au/facilities/software/index.php?site=CSIRO

  • PGI Development Tools for GPUs
    • PGI Cluster Development Kit V10.0 for Linux (linuxgpu)

      PGI 2010 is the first general release to include full support for the PGI Accelerator Programming model v1.0 standard on x64 processor-based systems incorporating NVIDIA CUDA-enabled Graphical Processing Units (GPUs). In addition to supporting high-level programming of accelerators using the PGI Accelerator programming model, the PGI Release 2010 also includes PGI CUDA Fortran, an explicit GPU programming model and application programming interface (API) that gives expert programmers direct control of all aspects of programming NVIDIA GPUs.

      For more information on PGI 2010 Features and Performance see http://www.pgroup.com/support/new_rel.htm

      Please see the NCI software map for usage instructions: http://nf.apac.edu.au/facilities/software/index.php?site=CSIRO


BoM Solar Help:

CSIRO ASC Help:

For urgent help at all times:
  • CSIRO users 0428 108 333
  • Bureau out of hours emergencies are managed through internal policy
HPCCC WWW Site: http://www.hpccc.gov.au/
CSIRO External ASC Site: http://www.hpsc.csiro.au/
CSIRO ASC Users' Site: http://intra.hpsc.csiro.au/

Comments to:


© Copyright 2010, CSIRO Australia