Bulletin 122 - 2004 Aug 04
The HPCCC offices and staff started to move from 150 Lonsdale Street to 700 Collins Street, Docklands Vic 3008 on 29th July, and are now located there. There were some disruptions to HPCCC and CSIRO HPSC WWW services, and to some other services, as servers were moved, particularly between 07:00 and 10:00 on Thursday 29th, and then from 17:00 onwards.

The Help Desk 'phone number will not work for a while. Please send e-mail, or call Rob Bell on 03 9669 8102 or Len Makin on 03 9669 8109, or for urgent assistance, call the Bureau Operations, or 0428 108 333. The fax number will also not work for a while.

2. SX-6 Scheduling job checkpoints, job holds, job migrates and GFS

Some users have used the NQS II "no checkpoint" flag on jobs to avoid an error in ksh. The error has been corrected, and all users who added "no checkpoint" because of that error are now asked to remove it as soon as possible. In addition, others using the "no checkpoint" flag should review their usage. Below are some reasons.

The Enhanced Resource Scheduler has the ability to detect overloading of nodes, and to take action to checkpoint jobs, and optionally to migrate them to less busy nodes. However, although the scheduler has been attempting to checkpoint jobs (particularly when high-priority jobs start), we are finding that about half the jobs are not checkpointable. We are investigating the reasons why so many jobs cannot be checkpointed - most failures are because a process has an open socket.

Recently, we have been testing manual migration of jobs to less busy nodes. If you want your jobs to take part, please remove the "no checkpoint" flags. We are awaiting a resolution of the new GFS 'SYNC' feature before starting automatic migration (and continuing jobs over TX7 shutdowns). If your jobs are using a local file system (LOCALDIR or MMFSDIR), then the "no checkpoint" flag will be appropriate if job migration is in place.
Note that NEC will provide a no_migrate flag for jobs later this year, to enable checkpointing without migration when a job is locked to a node or set of nodes.

3. HPCCC SYSTEM CHANGES - scheduling and queues

The change "2004-C002 Separation of CSIRO multi-CPU jobs" was abandoned, because the feature notified in "2004-A004 - ERS upgrade - load balance option" improved the placement of jobs. As well, the first phase of the ocean modelling development is nearing completion, and there will be a reduced demand for resources on the CSIRO nodes.

4. HPCCC TX7 and SX-6 Services

All SX-6/TX7 users are reminded that the usage policy for SX-6-TX7 is
Upon login, the C shell executes your .cshrc file, and then your .login file. Every time a new shell is invoked (e.g. when using compound commands), the C shell executes your .cshrc file first. Some users have large .cshrc files, containing many settings which would be better made once per session or job in a .login file. Over the thousands of lines of scripts executed each day, this adds up to a lot of overhead and interrupts on systems like the SX-6s. Only aliases which are wanted in every command string should be in your .cshrc file - almost everything else should be in the .login file.

6. CSIRO - new service for Power User Groups

CSIRO HPSC has a mandate to encourage the use of 'real horsepower' for all CSIRO science needs, not just those traditionally involved with the HPCCC. As part of that strategy, HPSC has a growing range of computing systems, now including:
In addition, HPSC will have access to any new system installed at APAC and is planning to have an involvement in a system installed at IVEC in Perth. We therefore need to better understand the needs of Power Users, both in existing user communities and in groups who have not previously used the HPSC and HPCCC systems. To help with this process, we intend to reserve blocks of time on our computing resources, to port and tune major applications. This is part of the process of determining which style of system suits which applications, so we can all make well-informed decisions in future years about major computing systems. Groups that might class themselves as Power Users are asked to make contact with HPSC staff to discuss their science and to schedule any

7. CSIRO External Services Network (ESN)

CSIRO is implementing a re-structure of its network and servers, to provide an External Services Network separated from its internal network. Bringing this about will cause some changes to methods of accessing the HPCCC and CSIRO HPSC services from outside CSIRO. The HPSC Group will provide details of the implications for users in a subsequent HPCbull.

8. CSIRO - the new portal will be b2

The new portal will be b2.hpsc.csiro.au and the names farrer and portal will point to it. It is mostly intended for applications that run significantly better on X86 (aka IA32) than IA64. We have installed ferret, which is one such application. For the moment we ask farrer users to try b2 and report inadequacies to the portal@hpc.csiro.au address.

9. Intel compiler upgrade on the Altix (cherax)

On 2004-07-22, Version 8.0 of the Intel Fortran90 compiler was installed on cherax. Programs and libraries compiled with Version 7 or earlier will need to be recompiled to be compatible with code compiled with Version 8. Makefiles and scripts for compiling will need changes. The compiler should now be called 'ifort', although 'efc' will still work (with a warning).
From the Intel Release notes: Version 8.0 combines the Compaq Visual Fortran(tm) (CVF) front-end (Fortran language features) with the Intel Fortran back-end (code-generator and optimizer)...
Note: One symptom of a stack size problem is a "segmentation violation". Increase the stack size using, e.g.
ulimit -s 32768 # bash, ksh: 32 MB, given in KB (the default unit)
limit stacksize 32768 # csh, tcsh: 32 MB, given in KB (the default unit)
If using OpenMP, it is also necessary to set an environment variable KMP_STACKSIZE, e.g.
export KMP_STACKSIZE=32m #bash,ksh
setenv KMP_STACKSIZE 32m #csh,tcsh
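The two settings above can be combined in a small wrapper, so that a job script raises both limits before starting a program. The following is only a sketch for sh-compatible shells (bash, ksh): the function names setup_stack and run_with_stack and the variable STACK_KB are hypothetical, and the 32 MB value is simply the example figure used above, not a recommendation.

```shell
# Hypothetical helper functions: the names and the 32 MB figure are assumptions.
STACK_KB=32768   # 32 MB expressed in KB, the default unit for "ulimit -s"

setup_stack() {
    # Raise the stack limit of the main process; warn rather than abort
    # if the system does not permit the change.
    if ! ulimit -s "$STACK_KB" 2>/dev/null; then
        echo "warning: could not set stack size to ${STACK_KB} KB" >&2
    fi
    # Set the OpenMP stack size variable to the same amount, given in MB.
    KMP_STACKSIZE="$((STACK_KB / 1024))m"
    export KMP_STACKSIZE
}

run_with_stack() {
    # Run the command in a subshell so the changed limits do not
    # persist in the calling shell.
    ( setup_stack; "$@" )
}
```

A job script might then call, for example, run_with_stack ./my_model, where my_model stands in for your own executable.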
From user experience during testing, the new compiler optimisation is quite aggressive with full optimisation (-O). Anyone having problems should recheck results with -O1. There's also a -mp option, which is stricter about maintaining FP precision, though it can have a large impact on performance. It's also well worth using the -fpe0 option, which forces a crash on an FP error rather than continuing with NaNs and infinities (-fpe3). Use of this option can improve performance markedly.

Full documentation is available from http://www.hpsc.csiro.au/ by following the link to "Documentation", then to "SGI Altix documentation", then to "Intel Fortran90 Documentation".

Thanks to Martin Dix for his contribution to this item, and the following.

10. Use of the Altix (cherax) for production programs, and locality

With the upgrade of the Altix to 64 processors, there are opportunities for production work to be run on the system. Recently, one of our users explored the scaling of a multi-processor atmospheric model on the system. The user also explored how to control the locality of programs, that is, to ensure that the best processor is used to run code and data. Although the Altix has a global shared memory of 120 Gbyte, the system is actually divided into nodes, with each node having local memory. Performance varies with the location of the code and data relative to the processor used to execute the code. For an MPI code, the user got good results with the command line:

mpirun -np N dplace -s1 globpea

Here is a table of results:
However, timings were quite variable, particularly for the runs with large numbers of processors.

11. Altix (cherax) software upgrade - Propack

The upgrade due to go ahead last Saturday was cancelled, and is now scheduled for Saturday 7th August. On Saturday morning, 31st July, cherax was re-booted - some CSIRO SX-6 jobs making indirect references to cherax were lost. This may happen again during next Saturday's cherax outage - it is hard to tell which SX-6 jobs can safely continue while cherax is down.

12. Altix (cherax) DMF recall problem for executables

SGI has investigated the problem with the recall of executable files reported in HPCbull 120.5.1. The problem is believed to be in the interaction between the XFS file systems and Linux, and a workaround is expected shortly. See req #4157.