Bulletin 128 - 2004 Oct 29
An outage on the morning of 28 October to replace a disc interface card resulted in more extensive disruption, with mawson being rebooted, a GFS failover, and a pause in responses on eccles. Some jobs had to be killed to prevent possible i/o corruption. Sorry.

2. HPCCC all - reduced air-conditioning capability
From today until mid next week, the machine room will be running with reduced cooling capability while chillers are moved from the basement area to the roof of the building. If the forecast temperatures are correct, some load shedding may be needed. This will start with some of the SX-6 nodes. If the one working chiller has a problem, much of the machine room equipment will need to be shut down hurriedly, and running jobs may not be saved. The CSIRO cluster compute nodes (on nelson and burnet) may be switched off this evening.

3. Data protection - warnings
Further to the item hpcbull 127.4, here are the backup schedules for the SX-6/TX7 file systems. The following are the only SX-6/TX7 user areas which are backed up:

Bureau users: weekly backup to disc:
    /bm/home
    /bm/share
    /bm/keep/bmrcshare
    /bm/keep
    /bm/nmoc/keep
    /bm/nmoc/eeraq1
    /bm/nmoc/gasp
    /bm/nmoc/laps1
    /bm/nmoc/gaspeps
    /bm/nmoc/laps2
    /bm/nmoc/lapstr
    /bm/nmoc/lapseps
    /bm/nmoc/ocean
    /bm/nmoc/other

CSIRO users: $HOME - an incremental dump is done every night into the cherax data store. (We will shortly be removing older files - so far we have kept every file caught by the backups since April.)

4. HPCCC SX-6 do_tx7 and TX7 do_sx6 utilities
Users are _strongly_discouraged_ from using either do_tx7 or do_sx6 without great care. These utilities were developed to ease the transition from the tightly coupled SX-5 to the decoupled SX-6 environment. The utilities add significant risk to jobs because there is no direct way to know whether the do_tx7 or do_sx6 worked correctly unless error checking is purposefully added.
One way to reduce the risk is to make scripts self-testing: create error paths in the do_tx7 script and leave a status file at the end of the do_tx7. When control returns to the SX-6 script, check the file. The file can signal pass/fail through its contents, or serve as a simple locking mechanism. Always use the floating IP name for the TX7 (e.g. tx7), not the name of a specific TX7.

In reality, any networked command operating between an SX-6 and a TX7 carries similar risks. The differences lie in how well success or failure can be detected. Another solution is to replace network connections (do_ or rxxx commands) with NQSII batch jobs. NQSII does not provide the tight synchronisation of a do_xx, but it is more reliable in that a host crash will cause the job to rerun, not disappear. A similar link can be made by email, although with equally indeterminate lag. Please consider these points when developing your scripts.

5. Submissions to the REQ system
The histories of problems submitted through the REQ system sometimes get very verbose. To assist HPCCC in managing, and users in reading, problem reports and resolutions, the following courtesies are requested for all submissions and follow-up texts:
Thank you for your assistance in supporting maximum clarity within the REQ system.

6. HPCCC limits on SX-6 interactive sessions
Although we have strongly discouraged interactive access to the SX-6 nodes, there are occasions when it is useful, e.g. in debugging. To prevent a recurrence of problems such as the one some months ago, when a runaway process consumed the entire memory of an SX-6 node, we will shortly impose interactive resource and disc quota limits on the SX-6 nodes, in response to user requests for this protection. The limits will be:
    process time - 30 minutes
    memory - 30 Gbyte
    $LOCALDIR - 80 Gbyte

7. HPCCC Web
www.hpccc.gov.au has been updated and is now beginning to be filled out with information. Today you should have access to a menu list on the left side. Specific attention is called to the "User Documentation" which, for the SX-6, now includes a search facility. Some search results offer a "highlight" option. If you click on the result link you get the plain text; if you click on the "highlight" option, the searched term stays highlighted, and clicking on the term advances the display to its next occurrence. The NEC SX-6 Seminar presentation files (from last April) are also available from the menu.

Access to this web site is automatically granted if you are on either the Bureau or CSIRO networks. For users outside these networks, a username and password facility will be established. Usernames and an initial password will be provided on request. (The web site will not be integrated into NIS.) There will now be continuous additions and improvements. Note that non-Bureau, non-CSIRO access does not see the left hand side menu, and that the menus are slightly different for Bureau and CSIRO staff.

8. NEC FORTRAN90/SX Programmer Reference Guide
NEC has delivered a new FORTRAN90/SX Programmer Reference Guide as part of their documentation improvement project.
A pdf is available and can be accessed upon request to the HPCCC. During November an html version will be integrated into the documentation set on www.hpccc.gov.au. For those missing printed manuals, this one is printer friendly, and you may print your own from the pdf.

9. HPCCC ersys command update
The ersys utility on the TX7s, gale and cherax has been changed to report time in UTC (rather than US EDT).

10. cherax - update on problems
Since the last HPCbull, we have had different problems with cherax.
The Grid Analysis and Display System (GrADS) has been installed on cherax, farrer and the clusters nelson and burnet. Use pkgenv grads to access the software on cherax. Ferret (Data Visualization and Analysis) will be installed soon on burnet.

12. Shared cherax usage, and transferring data between cherax and gale
Several Bureau users now have access to cherax, for data sharing. Since the machine is now shared between Bureau and CSIRO users, users should check for any world read-write-execute permissions and adjust them according to their needs. Bureau users will mostly be sharing files with other users, so permissions on files and directories must be set to allow access. We can set up special groups to allow easy sharing if requested.

Bureau users can now transfer data directly between cherax and gale. Transfers should be done over the jumbo frame network, using the names galejf and cheraxjf. You need to add the following line to your gale .rhosts:
    cheraxjf your_username
and you need to add a .rhosts file on cherax with:
    galejf your_username
This will allow you to use rlogin, remsh/rsh and rcp between the 2 machines. Note that you need to use remsh from gale, and rsh from cherax.

Remember that cherax /bm/home is on a migrating file system, and there will be delays in accessing off-line files. If you want to check on the locality of a file, use the cherax command
    dmls -l files*
for the list of files you want. Files in an off-line state are marked with OFL. You can explicitly request the retrieval of a batch of files with the command
    dmget files &
See the temporary guide at http://intra.hpsc.csiro.au/datastore/userdocs/Altix_HPSC_guide/ for more information about working with a migrating file system.

You can also access the files from the TX7 system with r- commands, using the names cheraxjf, tx7jf, mawsonjf and ecclesjf. You need to augment your .rhosts file on cherax with:
    mawsonjf your_username
    ecclesjf your_username

13. cherax - using ftp
It has been reported that some ftp utilities can time out and fail when connecting to cherax if there is a long delay in retrieving a migrated file. Usually, rcp is recommended for transfers between machines in the HPCCC, and scp for remote file transfers. ftp will work if you first ensure the files are on-line using touch and dmget.
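As a rough illustration of bringing files on-line before an ftp transfer, the following shell sketch wraps the dmls and dmget commands mentioned above. The function names list_offline and ensure_online are made up for this example, and the parsing assumes that dmls -l prints the state as a separate field (OFL or (OFL)) with the file name as the last field, which may not match the output on cherax exactly. File names containing spaces are not handled.

```shell
#!/bin/sh
# Sketch only: recall migrated files on cherax before an ftp client fetches them.
# dmls and dmget are the commands named in this bulletin. The function names
# and the assumed layout of "dmls -l" output are illustrative assumptions.

# Print the names of files that dmls reports as off-line.
# Assumes the state appears as its own field ("OFL" or "(OFL)")
# and that the file name is the last field on the line.
list_offline() {
    dmls -l "$@" | awk '{
        for (i = 1; i < NF; i++)
            if ($i == "OFL" || $i == "(OFL)") { print $NF; break }
    }'
}

# Request retrieval of the given files in the foreground, so that a later
# ftp "get" is not started while the files are still being recalled.
ensure_online() {
    dmget "$@" || { echo "dmget failed for: $*" >&2; return 1; }
}
```

A script run on cherax could call, for example, ensure_online on the files to be fetched and start the ftp transfer only if it returns successfully. list_offline can be used beforehand to see which files would need to be recalled.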
© Copyright 2010, CSIRO Australia