Bulletin 170 - 2007 June 29
Note: "CSIRO" items can apply to BoM users of cherax and burnet.

1. Zombie SX-6 batch jobs

When a problem with holding, migrating or checkpointing a job prevents the system from resuming it, the job is put into a Zombie state (shown with a 'Z' status in erstatj output). HPCCC staff are notified of these events, and will notify users and suggest corrective actions. The most common cause of the failure is the use of local disk on a node without specifying the NQSII qsub no-migrate option -J n.

HPCCC staff will attempt to resume jobs manually. If that fails, users will be notified and asked to qdel the jobs. Because these jobs cause some disruption to the system, in future, if no response is received from the user within one business day, such doomed jobs may be qdel'ed by HPCCC staff.

2. Reduced local disk on SX-6 nodes

The available local disk on the SX-6 nodes will be reduced to 45GB, as disk is required for swap space. Note that, as a faster alternative, there is a memory-based local filesystem in /htmp (6GB capacity), and each SX-6 batch job gets a job-temporary directory made available as $MMFSDIR.

3. Default Fortran compiler for SX-6s is changing on 12th July

The default SX-6 Fortran compiler available on HPCCC machines will change to rev 360 from 12th July. Older versions will continue to be available by specifying the version in the "module load" command. Use "module avail" to see which versions are available.

4. NEC SX Series Quick Reference Version 3

We have sent supplies of the NEC Quick Reference Guide to the SX-6 users. If anyone has been missed or would like a few additional copies, please contact us.

5. Scheduled Outages (CSIRO)

CSIRO IM&T has announced "Network Maintenance Windows" for the CSIRO network of

"Local time" means local to the region of change activity. The establishment of these Network Maintenance Windows does NOT MEAN that the network will be unavailable during every Network Maintenance Window, but that where scheduled network outages are required, they will be scheduled to occur within the Network Maintenance Window. All Wednesday changes will be scheduled so that they do not cause an outage during business hours in any time zone. A change affecting more than the local region will start no earlier than 6pm AWST and extend no later than 10pm local time.

The CSIRO AARNet 3 gateway for the Victorian Region is located at the HPCCC. At present, if this gateway is down, then even machine room communications for CSIRO systems (e.g. from the burnet cluster to the CSIRO Data Store) can fail. A new switch is coming to bypass this problem.

CSIRO HPSC will take advantage of these windows for scheduled downtimes whenever possible. However, some of the outages (such as file system moves) will need to be longer than 1-2 hours, and will not always fit in the Wednesday time-slot. Other major outages will need to be carried out during business hours because vendor support staff need to be on site.

6. Revised backup strategy for CSIRO HPSC systems

cherax holds backups from a number of systems, e.g. the clusters, the SX-6 /cs/home areas, and various other services. There have been problems recently with the sheer number of files and directories in the backup area: the number of inodes exceeds 30 million, and scan times are delaying cherax backups and DMF housekeeping processes.

We are revising the backup strategy. At present, we hold backups for the last 30 days for most of these areas. We will now move to a strategy where we hold some backups for longer periods, but provide reduced coverage further back in time. We will use a modified Tower of Hanoi strategy, so that we will have at any time:
We use the rsync --link-dest option, so that only one physical copy of each unchanged file is stored, but the latest dump always contains complete coverage as at the time of the dump. See http://intra.hpsc.csiro.au/user/backupstatus/ for the times of the latest dumps.

7. Running jobs with dependencies on cherax and burnet

There is a new utility on cherax and burnet that will queue jobs with dependencies (for torque). To use it for a simple sequence of jobs (job1.q, job2.q and job3.q), run:

    /usr/local/bin/qdep.pl job1.q job2.q job3.q

It does a few other neat things as well:

    wil240@cherax:~> /usr/local/bin/qdep.pl -h
    usage:    qdep.pl [options] [script] [script ...]
              submits jobscripts from commandline or stdin (one filename
              per line)
              Sets up dependencies between jobs to run in sequence
              unless the -f and/or -l options are specified.
              Dependency syntax is compatible with torque qsub.
              All other qsub options must be in directives in the scripts.
    options:  -f first_script (run "first_script" first, then the rest
                              of the scripts run in parallel on successful completion)
              -l last_script  (run "last_script" after completion
                              of all the other scripts - but see option -t)
              -t              only run last_script if all the others succeed
                              or each script in sequence if the preceding
                              one succeeds
              -q              (be quiet)
              -h              (this message)
    examples: qdep.pl 1.q 2.q 3.q
              qdep.pl -t 1.q 2.q 3.q
              qdep.pl -f prep.q p1.q p2.q p3.q
              qdep.pl p1.q p2.q p3.q -l cleanup.q
              qdep.pl p1.q p2.q p3.q -t -l sum.q
    version:  1.0
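
For readers curious what such a chain looks like at the qsub level, here is a minimal sketch. It is not the qdep.pl source, which this bulletin does not show. It assumes torque's standard "-W depend=afterok:<jobid>" dependency syntax and approximates what the help text describes for -t, where each script runs only if the preceding one succeeds. The function name submit_chain and the QSUB override variable are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: this is NOT the qdep.pl source.
# Submit job scripts in sequence; each job starts only if the previous
# one finished successfully (torque dependency type "afterok").

# QSUB is a hypothetical override hook so the chain can be dry-run or tested.
QSUB=${QSUB:-qsub}

submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no dependency.
            prev=$($QSUB "$script") || return 1
        else
            # qsub prints the new job's ID; the next job depends on it.
            prev=$($QSUB -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$script -> $prev"
    done
}
```

With the function loaded in the current shell, "submit_chain job1.q job2.q job3.q" would submit the three scripts as a chain and print each script name with the job ID that qsub returned for it. For real work, qdep.pl itself is the supported tool.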

8. Totalview Debugger - cherax and burnet

The Totalview Debugger from Totalview Technologies (formerly Etnus) is now available on cherax and burnet. Visit www.totalviewtech.com for more product information. You can access totalview by doing:

    module load totalview

For more details on usage, visit the APAC software map: http://nf.apac.edu.au/facilities/software/index.php?site=CSIRO

We have a total of 8 license tokens available, 4 on each platform (IA64 and x86/x86_64). A token is used for each process being debugged by any number of users. The license tokens can potentially be used by other hosts in CSIRO with the same architecture. Please contact us if you need to use totalview from hosts other than cherax and burnet.

9. Scheduling with licensed software on burnet

There have been scheduling changes on burnet to allow accounting for software with limited licenses or deployment (such as MATLAB, vmware, and windows). To let the batch system know that you need to use such software, specify "-l software=MATLAB" as an argument or directive to qsub. For the list of supported software labels, and for the more complex syntax needed to specify multiple software licenses, please see the userguide at http://intra.hpsc.csiro.au/userguides/blade/localguide.php#pbsresources.

10. Cherax: high performance scp transfers with hpn-ssh

hpn-ssh is a patched version of ssh with the improved performance that high-speed scp transfers need. We are running an hpn-ssh sshd on port 22000 on cherax, configured to allow access only with key-based authentication. The HPSC firewall has been configured to allow access to hpn-ssh from APAC-NF and iVEC. The end of the ssh connection receiving the data is the one where performance/buffering matters most, so the following examples are for data transfers to cherax:

    cherax> module load hpn-ssh
    cherax> scp -rp me123@ac.apac.edu.au:forcherax/. fromapac/.
    cherax> rsync --rsh=ssh -av me123@ac.apac.edu.au:forcherax/. fromapac/.

Initiated from apac:

    ac> scp -i ~/keys/my_cherax_key -P 22000 -rp forcherax/. \
            abc123@cherax.hpsc.csiro.au:fromapac/.
    ac> rsync --rsh='ssh -i ~/keys/my_cherax_key -p 22000' -av forcherax/. \
            abc123@cherax.hpsc.csiro.au:fromapac/.

For a higher performance transfer solution, gridftp is also available.

11. Module talk: 11th July

"Introduction to the SX6/TX7 modules software environment"
By Aaron McDonough and Hai Doan, HPCCC
Place: BMRC seminar room (Floor 9, east side)
Time: 10.00am Wed 11th July 2007

12. New software

cherax: octave/2.9.12, gmt/4.2.0, hpn-ssh
burnet: matlab/7.4, gams/22.4 (demo - provide your own license), grix, idl/6.3 (runtime engine)
cherax and burnet: proj/4.5.0, GDAL/1.4.1, grass/6.2.2

In general, available software packages are listed at http://nf.apac.edu.au/facilities/software/index.php?l=&site=CSIRO and/or "module avail" and/or http://intra.hpsc.csiro.au/user/pkginfoweb/

13. Found on the web

http://www.llnl.gov/computing/tutorials/parallel_comp/ is a nice, comprehensive introduction to parallel programming, which is becoming increasingly unavoidable for high performance computing. Thanks to Justin Baker for the link.

Please send suggestions for the "found on the web" segment of the HPCbull to the editor:

© Copyright 2010, CSIRO Australia. Use of this web site and information available from it is subject to our Legal Notice and Disclaimer and Privacy Statement.