Bulletin 113 - 2004 May 07
The HPCCC plans to close the SX-5 systems, florey and russell, to general user service on Monday 17th May at 4 pm. The systems will be kept live for a further two weeks for Operational requirements, and for retrieving forgotten items. The firm schedule is for the SX-5 systems to be turned off on Monday 31st May at 4 pm and then de-commissioned.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ANY USERS WHO HAVE NOT STARTED TRANSFERRING THEIR APPLICATIONS TO THE SX-6/TX7 SYSTEM ARE URGED TO START IMMEDIATELY!!!
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

:-( :-( :-( :-( :-( :-( :-( :-( :-( :-( :-( :-(

You might ask, what about the files? The only announced transfer of files is for the /cs/home file system (see item 6 below). Please note that nothing in the temporary file areas on the SX-5s will be transferred. Backups, as a measure of last resort, will be taken of some of the user areas prior to the de-commissioning, depending on the volume of files remaining.

To assist with the final shutdown of russell and florey, could Bureau users please copy their files off both systems, then remove the files once they have been copied or if they are no longer required.

All users should ensure that all files of continuing usefulness are transferred somewhere for safe keeping well before the 17th May. Please check the following file systems for your files:

    /work  /stmp  /tmp  /htmp
    /bm/home  /bm/data  /bm/datas  /bm/bmrc/share  /bm/tmp
    /bm/nmoc/srt  /bm/nmoc/ltmp  /bm/nmoc/stmp  /bm/nmoc/lkeep  /bm/nmoc/skeep  /bm/nmoc/lrt
    /nec/home  /cs/data  /cs/home2

SX-6/TX7 problems

If you have a problem with the NQS II qsub to the SX-6 seeming to fail (i.e. a job is not submitted), please report to HPCCC which system you performed the qsub on, whether the qsub was called from another script or run stand-alone, and the exact time of the failure.
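One way to capture those details automatically is a small wrapper around qsub. The following is only a sketch, not an HPCCC tool: the function name logged_qsub, the log file location, and the QSUB_CMD override are illustrative assumptions, and it assumes that a failed submission shows up as a non-zero exit status, which may not hold for every failure mode.

```shell
#!/bin/sh
# Hypothetical wrapper (not an HPCCC tool): record where and when a qsub failed.
# QSUB_CMD and QSUB_LOG are illustrative names; override them as needed.
QSUB_CMD=${QSUB_CMD:-qsub}
QSUB_LOG=${QSUB_LOG:-$HOME/qsub_failures.log}

logged_qsub() {
    # Pass all arguments straight through to qsub.
    "$QSUB_CMD" "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        # Assumption: a failed submission returns a non-zero exit status.
        {
            echo "time:   $(date '+%Y-%m-%d %H:%M:%S %Z')"
            echo "host:   $(hostname)"
            echo "caller: $0"
            echo "args:   $*"
            echo "status: $status"
            echo "---"
        } >> "$QSUB_LOG"
    fi
    return "$status"
}
```

Calling logged_qsub in place of qsub (for example, logged_qsub myjob.sh, where myjob.sh is a placeholder name) leaves a record of the host, the calling script and the time that can be pasted into a report.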
If you believe you have a job failure related to any kind of SX-GFS anomaly, please report the details, the SX-6 nodes involved, and the exact time of the anomaly to HPCCC.

It has been observed that, because of the large I/O buffers we have recommended for your applications (F_SETBUF), closing files and flushing these large buffers to disk can take several seconds. There is some evidence that serial tasks which pass data to each other through files may not be performed in the correct order. It may be advisable to insert a wait between some processes to ensure there is no race between task1 flushing its buffers and task2 opening and reading the file.

If there are any other problems which may be concerned with data integrity, it is important that the code be re-compiled for the SX-6s using the SX-6 libraries.

3. Scheduling on the SX-6s

Scheduling on the SX-6s is of course interesting - the placement of many multi-processor jobs within and across nodes is a difficult task within a rapidly changing workload. We have set up each SX-6 node to allow only 7 processors for multi-processor jobs. We recently had jobs not running quickly because we had an over-supply of 4-CPU jobs, only one of which could run in any time-slice on a node (two of them would need 8 processors). So, some users with shared-memory multi-processor jobs might like to consider requesting only 3 processors instead of 4, or trying for seven instead, to make better use of the node resources. Other issues include the use of Gang Scheduling (recently switched on), process priorities, running multi-node jobs, and I/O slow-downs.

4. The right machines

Please continue to use the TX7 eccles for Bureau front-end access, and the TX7 mawson for CSIRO. If you need to login to an SX-6 node, e.g. to check on your login setup, then please continue to use the right SX-6 nodes.
Use rlogin or rsh from a TX7, or simply use the do_sx6 command from a TX7.

5. CSIRO suggestions for effective usage of the SX-6/TX7/Altix

Naturally, there has been some uncertainty about the best way to work on the new systems. For CSIRO, with the arrival of the SGI Altix (the new cherax running Linux), many of the functions previously best done on the portal (farrer) can be done just as easily on cherax, with the added advantage of being closely integrated with the Data Store.

So far, we have not cross-mounted the major CSIRO file systems between cherax and the TX7s, and probably will not do so, because of the added unreliability this can bring. Typical file transfer speeds are over 30 Mbyte/s. On cherax, do something like:

    rcp -p file mawson-direct:

Within jobs running on the SX-6s, do something like:

    do_tx7 rcp -p file cherax-direct:

So, the filesystems for the SX-6/TX7 cluster and the CSIRO Data Store are quite separate, with no cross mounting. The SX-6 cluster sees a common set of filesystems across all of the SX-6 nodes and the two TX7 front ends. You can (generally speaking) only login or transfer files to the TX7s directly, as there is only internal networking to the SX-6 nodes in the SX-6 cluster.

The CSIRO TX7 is mawson. Like cherax it is a capable ia64 machine running Linux - but the TX7s' main purpose is to host the shared filesystem and we cannot put other significant load on them (I/O and housekeeping are the exception, and there are NQS II queues to help manage short jobs).

There are SX-6 cross compilers on cherax. After you compile, you need to transfer your executable and data files to mawson before you run jobs - however, you can use the NQS II qsub on cherax to start the jobs. After your job has run you should probably transfer the outputs back to cherax so that they are automatically backed up in the Data Store. With some effort you can script all of this...

Cherax should be your main interactive HPSC/HPCCC host and the SX-6 seen as a vector computational backend.
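As a rough illustration of scripting the steps above, the sketch below stages files from cherax to mawson, submits a job with qsub, and has the job itself copy its output back. It uses only the rcp, do_tx7 and qsub forms quoted in this bulletin. The file names (myprog, input.dat, output.dat, job.sh), the helper names, and the DRY_RUN switch are all hypothetical, and it assumes the job starts in the directory the files were copied to. Any queue or resource directives your job needs are not shown.

```shell
#!/bin/sh
# Sketch of the cherax -> mawson -> SX-6 workflow; run on cherax.
# All file and function names here are hypothetical placeholders.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Write a job script whose last step copies the output back to cherax
# from within the SX-6 job, via the TX7.
write_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
# Assumption: the job starts where myprog and input.dat were copied.
./myprog < input.dat > output.dat
do_tx7 rcp -p output.dat cherax-direct:
EOF
}

# Usage: stage_and_submit job.sh file1 [file2 ...]
stage_and_submit() {
    job=$1
    shift
    # Copy the executable and data files to the TX7 front end.
    for f in "$@"; do
        run rcp -p "$f" mawson-direct: || return 1
    done
    # Submit from cherax with the NQS II qsub.
    run qsub "$job"
}
```

A trial run might be: write_job job.sh, then stage_and_submit job.sh myprog input.dat. Check the printed commands before setting DRY_RUN=0.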
The only residual use of farrer (portal) that we know of is for applications that are optimised for the X86 (IA32) architecture. The main example we know about is ferret. We are looking at supporting such applications using an X86 cluster that will be well integrated with cherax. Let us know what your needs are for that system. Alternatively, if you are using farrer for facilities not available on cherax, let us know about that.

6. Destiny of the /cs/home file system from florey/russell

For some months, the SX-5 /cs/home file system has been mirrored onto cherax at cherax:/cs/datastore/SX5userdata/csdiv/csxxx. This is updated twice daily. When the SX-5s are decommissioned, the updating will obviously be stopped, and we plan to move each user's directory from the above location to a directory ~/SX-5 for each user on cherax (provided that directory does not already exist), so that the files will be under each user's control. Backups of the /cs/home file system will be kept for up to one year, but we signal that we intend to remove all the old backups of the florey/russell /cs/home file system after 30 June 2005.