Bulletin 132 - 2005 Jan 31
1. HPCCC - SX-6/TX7 upgrade availability

From Friday 21st January, the upgraded SX-6/TX7 system was made available by NEC to the HPCCC. However, system stability testing is in place for 30 days to 20 February, and other acceptance tests will be run during this period, which may require some shutdowns.

The upgraded system has 28 SX-6 nodes (23 for the Bureau, 5 for CSIRO) and about 23 Tbyte of disc, and the TX7s have been upgraded to 16 processors each. The new nodes are available via the new queues bm2, bmml2 and bmmn2. We are attempting a different scheduling regime on the new nodes and queues to try to provide the maximum possible performance for each individual job, as well as consistency of run times. See item 9 below.

New disc areas for Bureau users will be made available, and the HPCCC will be spreading users' file holdings onto the new discs, especially the /bm/data* and /bm/flush* areas, to reduce the current frequent congestion on these areas. See item 4 below.

Note that a joint project between CSIRO and the Bureau will be using nodes sx625-sx627.

2. HPCCC and CSIRO HPSC relationships

Since 1997, CSIRO and the Bureau of Meteorology have collaborated in the HPCCC. Since early 2003, CSIRO has operated its High Performance Scientific Computing group in collaboration with the Bureau and the HPCCC, and co-located with the Bureau. Access to the HPCCC systems is one of the services provided by CSIRO HPSC.

CSIRO has decided to operate its own systems at the HPCCC under the HPCCC banner, to provide for simplicity of operations, and to allow support staff to operate across multiple systems. The CSIRO systems include the Altix system hosting the CSIRO Data Store, and cluster systems. Bureau staff may access these systems for defined purposes.

The complex relationship is reflected in the complexity of the emerging WWW pages for the HPCCC and CSIRO HPSC.

3. HPCCC - change of restart areas for Bureau SX-6 nodes

All Bureau jobs on the SX-6/TX7 system now use the same restart area. This will allow jobs to be migrated between nodes when higher priority work arrives and there is over-commitment. However, we will not enable migration until the no-migrate flag becomes available in NQS II, so that jobs using local disc can be shielded from migration. Patches for this have arrived, and we are waiting on clarification of the operating system upgrades needed to support this and other upgrades. This facility is expected to be in operation by the end of February.

4. HPCCC - SX-6/TX7 i/o saturation

We are continuing to see major slowdowns in jobs due to i/o contention and saturation. One job on one SX-6 node can saturate a file system. Multiple jobs, especially jobs doing small block i/o, can slow whole classes of jobs down. We have seen major slowdowns recently, particularly for jobs using the /bm/data2 and /bm/flush2 file systems.

The HPCCC is in the process of moving some Bureau users' files from the existing /bm/data1, /bm/data2, /bm/flush1 and /bm/flush2 file systems to newer file systems, /bm/data 1 to 11 and /bm/flush 1 to 4; each file system is 0.5 Tbyte. These moves will be undertaken in individual consultation with the users.

Users need to be vigilant in ensuring that their jobs select appropriate settings for file buffer sizes, so that applications using GFS do not finish up using NFS or make poor use of the file buffers. The HPCCC is working on tools to help detect poorly-performing applications. See http://www.hpccc.gov.au/hpccc/userguides/faq/fortran_default_buffer_values.php for further guidance. We have seen jobs with copy or file housekeeping tasks taking hours.

5. Use of gzip and compress

In the past, many of us made use of utilities like gzip and compress to reduce file sizes, both to conserve storage space and to shorten file transfer times.
However, gzip and compress make little sense in the HPCCC environment for most of us:
The only circumstance in which compression utilities should still be used is for large files that have to be transferred over slow links, and even then we recommend experimenting with rsync first. If you really do need to use gzip on the TX7s, then please use the tuned versions referred to in HPCbull 129.3. See item 8 below.

An extreme example: a 200 Mbyte file on a TX7 took 4 s to transfer to cherax. Standard gzip took 68 s to compress the file, while the tuned version took 61 s. The unzipping took 10 s and 9 s respectively. (The test file did not compress well.)

6. HPCCC SX-6/TX7 - GFS Notes

A reminder that the NEC GFS file systems are global to all SX-6 nodes, and to both TX7s. GFS is managed through NFS system hooks, which provide far fewer file locking protections than traditional local file systems. Because a continuing, but declining, number of possibly spurious GFS file system problems are being reported, all users should ensure that:
7. REQ submissions by e-mail

All users are again asked to ensure that e-mails submitted to the REQ system adhere to the following etiquette:
Thanks for your assistance and cooperation.

8. Use of TX7 utilities - changes to path

To make the tuned file i/o utilities easier to use on the TX7s, we intend to augment the default path for users so that these tuned utilities are selected by default, especially in do_tx7 commands. Please send any comments on this proposal to the Help Service before the target date for the change, Tuesday 8th February.

9. HPCCC SX-6/TX7 - new Bureau queues

New queues are available for Bureau users. We plan to use them to provide 'allocated' job scheduling and family CPU scheduling on the new nodes. Simply, this means that a job will be loaded onto nodes and will run, uninterrupted, until it completes. These queues are FIFO with priority weighting for job initiation: there will be no resource overcommitment on the nodes, and the effective user time will be Elapsed_Wall_Time_of_Script * Number_CPUs_Reserved. Each job will be "charged" as using the number of CPUs reserved, since no other job will be able to use them.

When using these queues it is important always to use the same number of CPUs for each job step or, if that is not practical, to run multiple jobs, each reserving only the number of CPUs that will be used for its application step. See also item 11 below.

10. Using clustered systems

Cluster optimisation techniques require attention to MPI time/overheads and I/O as well as computational issues. All time when your CPU(s) are in any wait condition (SYNCH, Wait I/O, etc.) is your time. Given the local state of development of many parallel, and especially distributed memory, applications, it is likely that many will gain much more elapsed-time performance from attention to MPI and I/O than from attention to computation.
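To put a number on this: under the charging rule for the new queues (item 9), reserved CPUs are charged for the whole elapsed time of the script, whether they are computing or waiting. The following Python sketch is illustrative only. The function names and the example figures are hypothetical and are not part of any HPCCC or NQS II tool.

```python
def charged_cpu_seconds(elapsed_wall_seconds, cpus_reserved):
    """Charge as described in item 9:
    Elapsed_Wall_Time_of_Script * Number_CPUs_Reserved."""
    if elapsed_wall_seconds < 0:
        raise ValueError("elapsed wall time must be non-negative")
    if cpus_reserved < 1:
        raise ValueError("at least one CPU must be reserved")
    return elapsed_wall_seconds * cpus_reserved


def wait_fraction(busy_cpu_seconds, elapsed_wall_seconds, cpus_reserved):
    """Fraction of the charged time NOT spent computing
    (MPI synchronisation, waiting on I/O, or reserved CPUs sitting idle)."""
    charged = charged_cpu_seconds(elapsed_wall_seconds, cpus_reserved)
    if charged == 0:
        return 0.0
    if not 0 <= busy_cpu_seconds <= charged:
        raise ValueError("busy CPU time must lie between 0 and the charged time")
    return 1.0 - busy_cpu_seconds / charged


# Hypothetical job: 8 CPUs reserved for one hour of wall time,
# during which the CPUs did 14400 CPU-seconds of actual computation.
print(charged_cpu_seconds(3600, 8))    # 28800
print(wait_fraction(14400, 3600, 8))   # 0.5
```

In this made-up case half of the charged time is wait time, which is why reducing MPI and I/O waits can be worth more than further tuning of the computation.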
I/O strategy is one example: some applications may run best when each task does its own I/O, others may do better if only the master task does I/O on behalf of all slave tasks, broadcasting input and collecting output, and there are all flavours in between. Some applications will need no change, but each application should undergo a sanity check of its execution architecture for clustered distributed memory systems. Consider reliability, robustness of the execution environment, and portability, not just raw performance (include "do I want to have to redo this again in the future?" in your overall applications review). Contact HPCCC staff for further discussion.

11. Lumpy jobs - scheduling and multi-processing executables

One of the problems for scheduling is how to handle 'lumpy' jobs, that is, jobs whose resource requirements vary markedly during the job's progress. For example, if a job requests a maximum of, say, 7 CPUs on an SX-6 node, but is currently using only one CPU, a scheduling decision might be to start another job on the same node. This will lead to higher utilisation of the system in the long term. If, however, the first job then springs into life and wants 7 CPUs, there is likely to be over-commitment on the node. This leads to slowdowns for one job or the other (or both), can lead to a blow-out in user CPU time (without Gang Scheduling), and can lead to extra i/o being required for swapping, checkpointing or migration.

This behaviour is typical of a job that requires multiple CPUs for its main program execution, but has long preliminary and post-processing tails, typically using at most one CPU.

There is a dilemma here. The decision to start another job was correct while the first job continued to make low CPU usage, and is correct for lumpy jobs that use variable numbers of CPUs. However, it is not a correct decision if jobs make good use of the requested numbers of CPUs.
And the scheduler has no hope of determining this in advance (since there are external factors that could slow down any job). However, scheduling by requested CPUs alone can lead to gross under-utilisation.

A further complication recently encountered on the SX-6s arises when a job contains multiple program executions, at least one of which requires more than one CPU, and requests this with the -l cpunum_prc job parameter. Unfortunately, all programs compiled with a multi-processing parameter (e.g. -P multi) will be given that same number of CPUs. The F_RSVTASK parameter has no effect on this behaviour. This is a function of Gang Scheduling, which is desirable for other reasons.

The only reasonable solution is to ensure that programs intended to run on only one CPU are not compiled with multi-processing options. In some cases, this will require two executables to be maintained: one for multi-process execution, and one for single-CPU execution.

12. Flushing of file systems

As some of the file systems fill up, we need to undertake flushing. This has already started on cherax, and will be invoked on other systems at the HPCCC as required. All of the file systems subject to flushing are referenced by either $WORKDIR or $TMPDIR on the HPCCC and CSIRO systems at the HPCCC. When a flushable file system reaches a critical threshold, the files and directories are listed from oldest to youngest, and files are removed starting with the oldest until either:
Users are warned that files associated with long-running jobs could be caught by the flushing.

13. Documentation update

New and updated userguides are accessible from http://hpsc.csiro.au/ -> Documentation -> HPSC User Guides and Documentation for installed HPSC Software, and will be available from http://www.hpccc.gov.au/ shortly.

Master Userguide
All of the following guides need to be read in conjunction with the Master Userguide.
SX-6 Cluster Userguide
Altix Userguide
Data Storage Management Userguide
IA32 Cluster Userguide

14. cherax update and outage notification

A patch is on-site to fix the three major problems reported late last year:
The patch will come into production at the next system interruption, or at a scheduled shutdown. We plan to have a shutdown on Wednesday morning 2nd February for the patch installation, and expect the system to be available again by 10 AM. Any jobs running at the time of the shutdown that do not have the no re-run flag set will be restarted from the beginning. (We are considering shortening the maximum time limit on jobs to allow for regular maintenance periods.)

There has been an extremely heavy recall load on cherax recently, with long delays. Jeroen is revising the algorithm controlling the 3 Tbyte disc cache, to improve the hit rate. We have seen some improvement in 'hits' already. We are also revising the algorithms used to select files for migration and for cache residency: a combination of size and age since last access is used.

The HPCCC is purchasing another T9940B tape drive for the Data Store, to help cope with the increasing load, and particularly to speed up the writing of data. Note that 17 Tbyte of primary data was brought across to the new system and site in February 2004, and already the store holds 64 Tbyte of primary data, an almost fourfold increase (64/17 is about 3.8) in about a year.

15. Tape services outages

There will be an outage of all tape services between 08:00 and 09:30 on Tuesday 1st February. This will affect SAM-FS, MARS and the CSIRO Data Store.

16. Staff News

CSIRO welcomes Dr Alfred Uhlherr to the HPSC group. Alf comes from CSIRO Molecular Science, and will take the position of Manager, Science Strategy for CSIRO HPSC.