Bulletin 135 - 2005 Mar 23
1. HPCCC - NQS II and ERS II upgrades, scheduler and batch changes

1.1 New versions of NQS II and ERS II

New versions of NQS II and ERS II were installed on the SX-6/TX7 system and front-ends between 07:30 and 10:30 on Tuesday 22nd March. The new versions support many new features, including the "-l cpunum_job=ncpus" parameter. Details of the changes can be found at http://www.hpccc.gov.au/hpccc/userdocs/NQS+ERS.upgrade.2005-03.shtml

The output of the erstatj and qstat commands has been altered slightly: there is a new field for the no-migrate flag (see item 1.3 below), and the priority field in the erstatj output has been widened (see item 3 below).

There was one unanticipated problem. For some users, rsh commands from the SX-6 nodes to the TX7s hung. If you encounter this problem, please contact HPCCC staff for assistance. We do not yet understand the exact environment that leads to the hangs, but it may be related to the use of the NQS II -S parameter, whose behaviour has changed (see HPCBull 134.2.3). It is preferable not to use this parameter to specify a batch job shell, but to put #!/bin/ksh as the first line of the script instead. We do know that in some circumstances, adding a -n flag to the SX-6 rsh commands (after the host name) bypasses the problem by suppressing the reading of standard input.

1.2 "-l cpunum_job=ncpus" parameter

Users should now add the "-l cpunum_job=ncpus" parameter to all NQS II jobs; the deadline for doing so is 11th April. Jobs should also continue to specify the parameter "-l cpunum_prc=ncpus", which tells the Gang Scheduler how many CPUs will be used by each executable. For example, ensemble jobs should specify values of perhaps 8 and 1 for these parameters. The qsub wrapper will be updated to allow processing of this parameter for all queues.

1.3 NQS II no-migrate, no-hold/checkpoint and no-rerun flags

There are three facilities that users should consider on all jobs. They are:
These abilities can be specified with the qsub parameters
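As an illustration only, the parameters from items 1.2 and 1.3 might be combined in a job script along the following lines. This is a sketch rather than a tested HPCCC template: the "#PBS" directive prefix is an assumption, the CPU counts are the ensemble-style values of 8 and 1 mentioned in item 1.2, and the shell variables NCPUS_JOB and NCPUS_PRC are hypothetical names introduced here. The -r, -H and -J values are the ones recommended in this item.

```shell
#!/bin/ksh
# Sketch of an NQS II job script (illustrative, not an official template).
# The shell is chosen by the first line above rather than by the -S parameter.
#
# ASSUMPTION: embedded directives use the "#PBS" prefix on this system.
# CPUs for the job as a whole (item 1.2):
#PBS -l cpunum_job=8
# CPUs used by each executable, for the Gang Scheduler (item 1.2):
#PBS -l cpunum_prc=1
# The three options from item 1.3, set to the recommended value "y":
#PBS -r y
#PBS -H y
#PBS -J y

# Hypothetical variables that mirror the directive values so the
# body of the script can refer to them.
NCPUS_JOB=8
NCPUS_PRC=1

echo "cpunum_job=$NCPUS_JOB cpunum_prc=$NCPUS_PRC"
```

The same options can also be given as qsub parameters on the command line instead of being embedded in the script.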
The default is "y" for these three options. We recommend that users select the options -r y -H y -J y. The exceptions are:
In general, allowing your job to be migratable should give you better throughput.

2. New versions of netCDF libraries and utilities

New versions of netCDF libraries and utilities are now available as development versions for evaluation. The new versions, netCDF 3.6.0p1 libraries and utilities ncdump and ncgen, are installed on all SX-6 nodes and the cross environments gale/eccles/mawson/cherax for execution on the SX-6s. Note that the default versions have not been changed, but will be at a later date if the new versions are found to be satisfactory.
Documentation for netCDF 3.6 is at the UCAR website:
The detailed release notes and directory structures for the new versions, prepared by Stephen Leak of NEC, are available on the HPCCC web site, under User Documentation. (However, please contact hpchelp rather than Stephen Leak for assistance.) NEC will consider the provision of additional netCDF utilities for the SX-6s as required.

3. Priority scheme

The ERS and NQS system on the SX-6s has provision for a priority scheme. This will allow users to specify a priority for a job with the -p parameter. The HPCCC sees the need for the scheme, to allow users to select the priority of their work: to get fast turnaround for development work, and to defer less time-critical work. The priority scheme will not alter the over-riding URGENT class jobs, e.g. operational work.

The scheduler can be set to give significant weight to the priority parameter, so that higher-priority work would be more likely to start, and more likely to stay running. The scheduler would also be set up so that higher priority would incur a higher 'charge' within ERS, so persistent use of higher priority would result in later work receiving lower priority. The priority specified on the jobs would also be reflected in the priority of access to processors at execution time, which would become active when nodes are over-committed. User feedback on this proposal is sought.

4. cherax and CSIRO Data Store updates

A patch was installed on cherax on Saturday 19th March, to enable better handling of the inode hash table in the kernel, and to enable accounting to be invoked without provoking crashes. The changes to the inode handling provided dramatic performance improvements, and the long delays in such things as character echo and file listing should be reduced. For example, we typically saw a delay of 10 to 100 s in the execution of a command (which should have taken 55 s) at least once per day.
In the first few days after the patch install, the worst case delay was 2 s, with an average of about 0.1 s. For system-related commands dependent on file system scans, the improvement has been dramatic - this will enable much more responsive services on the Data Store. Here are some examples of typical elapsed times for tasks before and after the patch install.