Bulletin 197 - 2010 February 2
1. Bureau of Meteorology Sun Constellation system (solar)

The Sun Constellation system (solar) at the Bureau of Meteorology is being brought into production use, with work at present concentrated on porting the operational suite to the system. The system is expected to be available for general access by the end of February.

The Sun Constellation consists of 576 nodes, each with 2 quad-core 64-bit Intel Xeon processors (code-named Nehalem), totalling 4608 CPU cores. Each node has 24 Gbytes of main memory and 24 Gbytes of flash memory instead of local disc. All of the nodes are connected by a dual-rail InfiniBand network, with data rates of 40 Gbit/s per connection. The system runs the CentOS distribution of Linux, uses Sun Grid Engine for job management, and uses the Lustre global file system, comprising 115 TB of disk space. In addition, there are 4 user login nodes and 6 data-mover nodes.

2. Getting help with solar - solarhelp@bom.gov.au

A Userguide is being prepared, and is accessible at:

Users requiring assistance with solar should log an incident report using cSupport:
Problems reported via a web browser or email will be entered in the cSupport incident tracking system, and can then be made visible to the staff most able to solve the problem. You can use the web interface at http://helpdesk1.bom.gov.au/User to check progress of a problem and/or follow up the request by replying to email sent to you about the request. Please include only immediately relevant history in your reply; otherwise the information is duplicated in the system and the problem becomes difficult to follow.

If there is an urgent query out of hours, please contact Bureau operations staff on 03 9669 4006, who will contact appropriate support personnel.

3. CSIRO Data Store usage report upgrade

Users can see reports on Data Store use at http://intra.hpsc.csiro.au/user/usage/ds/

Reports are available for each group on the Data Store. Group names correspond to rather old names for CSIRO Divisions and special purpose groups.

We have added a new field to the reports: the average file size. We would appreciate it if users aimed for larger rather than smaller file sizes, because there is a large overhead in the recall of small files from tape. With the large-capacity T10000 drives, which read at 130 Mbyte/s (and faster with the average compression we see), the recall of a 1 Gbyte file takes about 75 s to load, mount and position a tape, about 8 s to read the 1 Gbyte, and perhaps another minute to rewind, unload and put the tape away again. So the tape read time is only about 6% of the total recall time for a 1 Gbyte file (roughly 8 s out of 143 s).

Please aim to have files around 1 Gbyte or bigger, or recall batches of files using the dmget command, so that multiple files are recalled from each tape.

4. CSIRO ASC - 'procs' batch system resource

A new feature in the batch system has been enabled on cherax, burnet and the gpu cluster (linuxgpu). There is now a 'procs' resource which you can use to simply request the number of cpu-cores that your (MPI) jobs need.
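As a rough illustration of what such a request looks like, the Python sketch below assembles a qsub command line using the 'procs' resource, alongside the existing 'nodes' and 'ppn' form for comparison. The helper names, the core counts and the job script name myjob.sh are hypothetical and chosen only for the example; only the `-l procs=N` and `-l nodes=N:ppn=M` resource strings come from the batch system syntax described in this bulletin.

```python
def procs_request(cores):
    """Resource string asking for `cores` cpu-cores, wherever they are free."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return "procs={}".format(cores)


def nodes_request(nodes, ppn):
    """Resource string asking for `ppn` cpu-cores on each of `nodes` nodes."""
    if nodes < 1 or ppn < 1:
        raise ValueError("nodes and ppn must be at least 1")
    return "nodes={}:ppn={}".format(nodes, ppn)


def qsub_command(resources, script):
    """Build the qsub argument list; `script` is a hypothetical job script."""
    return ["qsub", "-l", resources, script]


if __name__ == "__main__":
    # 16 cores anywhere versus 16 cores pinned as 2 nodes x 8 cores per node.
    print(" ".join(qsub_command(procs_request(16), "myjob.sh")))
    print(" ".join(qsub_command(nodes_request(2, 8), "myjob.sh")))
```

The helpers only build the command text and submit nothing; they are meant to show the shape of the two request styles, not to replace typing the qsub command directly.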
The 'procs' may be allocated on any mix of nodes (burnet and linuxgpu only; cherax is a single node), which may result in large jobs starting sooner but increases the possibility of contention with other jobs.

Note: You can still use the 'nodes' and 'ppn' resource syntax. This may be useful on burnet and linuxgpu to ensure that the system assigns the desired number of cpu cores per node, if required.

5. CSIRO ASC - requesting resources on the GPU cluster

The scheduler configuration has been changed on the gpu cluster to define two gpu resources per node (up from one, which was set as a temporary measure). The gpu resources are configured as generic countable resources (gres), which the scheduler assigns on a per-task basis.

At this time we don't recommend that users run concurrent separate gpu jobs on a node, so for simple serial gpu jobs it is important to request both gpus with gres=gpu:2

    qsub -l procs=1,gres=gpu:2

or

    qsub -l nodes=1,gres=gpu:2

For jobs needing gpus on multiple nodes, one of the following would be appropriate (assuming 3 nodes).

If your code can only handle 1 gpu per node, request both:

    qsub -l nodes=3:ppn=1,gres=gpu:2

Or if you can use both gpus (with one core each to drive them):

    qsub -l nodes=3:ppn=2,gres=gpu

Or if you can use extra cores and just want the whole node, you don't need to (and can't!) request gpus:

    qsub -l nodes=3:ppn=8

Other cases are possible if we get to a situation where concurrent gpu jobs on a node are OK, but these are the main expected cases for now.

One of the system scripts on cherax has often been raising the error message "Segmentation fault" since the system software upgrade in November 2009. The cause of this is not yet known. If you see this error, please contact us: it is probably not a problem with your script.

7. CSIRO ASC New and Upgraded software
© Copyright 2010, CSIRO Australia