Bulletin 115 - 2004 May 26
The HPCCC plans to shut down the SX-5 systems, florey and russell, on Monday 31st May at 4 pm. Please ensure that all the files you need from the SX-5s are stored elsewhere. (CSIRO users: please see item 5 below.)

2. Welcome to Polly Morgan

CSIRO HPSC and the HPCCC welcome Polly Morgan to the 24th Floor. Polly comes from Monash University and will be working in the systems administration area. She will initially concentrate on the configuration of the new CSIRO cluster systems, as well as administration of the HPCCC systems and the Altix.

3. Scheduling - continuing changes to get the best result

HPCCC staff are carefully watching the scheduling of jobs on the SX-6s. The initial focus is on the best possible performance of the operational jobs. This is sometimes at odds with getting the best throughput for all other jobs. For example, at the extreme, we could reserve enough nodes for the exclusive use of the operational work, but would then lose throughput for other work. We are trying to share nodes to some extent, but are still assessing the impact on operational jobs of having lower-priority work share their nodes. Contention is a major issue; memory access and I/O are the obvious sources. We are awaiting a new capability (scheduling by requested number of CPUs) from NEC. (In the meantime, we are using memory-resident file-system (MRFS) allocation as a proxy for CPU allocation, by defining a default MRFS allocation for some queues and load balancing on that pseudo-allocation.)

4. Variation in execution times - memory access contention

We have seen the user-CPU time of jobs on the SX-6s vary by up to 60%. We believe that this could be caused by memory contention with other jobs on the same node.

5. SX-5 files mirrored on cherax for CSIRO users

Please note that users should treat the area /cs/data/SX5userdata on cherax as a READ-ONLY area.
Please don't create files there and, for the few users still on the SX-5s, don't delete files on the SX-5 in the expectation that they will remain in /cs/data/SX5userdata: they won't! Twice per day, the mirroring process makes an exact image of each user's files from florey:/cs/home onto cherax:/cs/data/SX5userdata. This includes deleting all files that are not in florey:/cs/home.

At the close of the SX-5 service, we plan to move each CSIRO user's SX-5 files from the location cherax:/cs/data/SX5userdata/group/csabc into the directory ~abc123/SX-5 on cherax (provided that directory does not already exist), so that the files will be under each user's control. If you do not want this action, please let us know.

6. Use of the do_tx7 script

The do_tx7 script provides a convenient way for scripts running on the SX-6s to execute commands on the TX-7s where that is more appropriate or essential. For example, the wider networks are not visible from the SX-6s, and you might need to execute something like:

    do_tx7 rcp file gale:

However, the do_tx7 command has some overhead, and in some cases it may be better to have a single do_tx7 command run a script on a TX-7 than to issue a multitude of do_tx7 commands. For example, use:

    do_tx7 sh << EOF
    rcp $HOME/file1 host1:
    rcp file2 host2:dir2
    rcp file3* host3:dir3
    EOF

or

    do_tx7 "rcp $HOME/file1 host1:; rcp file2 host2:dir2; rcp file3* host3:dir3"

rather than

    do_tx7 rcp $HOME/file1 host1:
    do_tx7 rcp file2 host2:dir2
    do_tx7 rcp file3* host3:dir3

Users are advised to check the status of any critical commands. For very small files, it will be more efficient to run commands such as a copy on the SX-6s themselves than to use do_tx7 to initiate the copy on a TX-7.

7. cherax failure, and stress on DMF

cherax crashed between 09:30 and 09:40 on Friday 21 May, and again at around 10:00 on Tuesday 25 May. We do not yet know why. Service was restored within 30 minutes.
Also on Friday afternoon, the /cs/datastore file system filled for a brief time. We had hoped that this would not occur on the new system. We have lowered some thresholds to make a recurrence less likely, and have also requested an enhancement from SGI.

On Monday 24 May, there was a heavy load of retrievals, which meant that some requests were delayed. We will consider re-introducing the wrapper script for dmget, which automatically breaks large recall requests into smaller batches, to allow a better response for all users. We have also requested an enhancement from SGI.

Finally, one user found that recalled files were being re-migrated rapidly. This is because file migration is based on file size and access time, and recalling a file does not count as an access. If you are recalling large numbers of files in batches with the dmget command, we recommend that you judiciously use the touch -a command to update the access time of the files, e.g.

    touch -a files*

8. Newer versions of software on cherax, and the pkgenv command

We have put some new software on cherax, along with pkgenv scripts to set up the environment. If all is working well, the following will set up your path to use the obvious packages:

    pkgenv python-2.3.3

We have also installed a new autoconf and some related tools in /tools/gnu:

    pkgenv gnu

Users are invited to test these new versions. Note that if pkgenv does not work for you in a given shell, you can enable it via:

    source /usr/local/etc/pkgenv.csh

or

    . /usr/local/etc/pkgenv.sh

In general, we plan to leave the released versions of software from SGI in the normal locations, and to provide newer versions through the pkgenv facility, as above.
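
The recall advice in item 7 (recall in batches, then update the access time) can be sketched as a small shell function. This is only an illustration and is not the HPCCC wrapper script for dmget: the function name recall_in_batches, the RECALL_CMD and BATCH_SIZE variables, and the default batch size of 20 are all assumptions, and xargs -0 is a widely available extension rather than part of POSIX. The function passes the files to dmget one batch at a time and, when a batch is recalled successfully, runs touch -a on that batch.

```shell
#!/bin/sh
# Hypothetical sketch: recall files in small batches, then refresh access times.
# RECALL_CMD and BATCH_SIZE are illustrative names, not part of DMF.
RECALL_CMD=${RECALL_CMD:-dmget}   # command used to recall files
BATCH_SIZE=${BATCH_SIZE:-20}      # files per recall request (arbitrary choice)

recall_in_batches() {
    # Nothing to do if no files were given.
    [ "$#" -gt 0 ] || return 0
    # Emit the file names NUL-separated so names containing spaces survive,
    # then let xargs hand them to a small inline script BATCH_SIZE at a time.
    # Inside that script, $0 is the recall command and "$@" is one batch.
    printf '%s\0' "$@" |
        xargs -0 -n "$BATCH_SIZE" sh -c '"$0" "$@" && touch -a "$@"' "$RECALL_CMD"
}

# Example use (commented out so that defining the function has no side effects):
# recall_in_batches files*
```

A smaller BATCH_SIZE should spread the load more evenly across users at the cost of more recall requests; a suitable value would depend on the system and is not something this sketch can determine.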