Bulletin 125 - 2004 Sep 16
From Tuesday 7th September, for about 6 days, there was a problem with the i/o on the Bureau SX-6 nodes. Many nodes showed high i/o rates to the $WORKDIR areas, and at times nodes with a full workload showed idle time of between 90 and 100%.

The problem was eventually tracked down to the use of the GFS file systems for accessing direct-access files. The codes had followed our earlier advice (see HPCbull 97.7) and set

    export F_SETBUF20=200000
    export F_HSDIR=20

for the i/o unit accessing the direct-access file. This selects a buffer size in memory of 200 Mbyte. (Such a setting avoids the default, in which the buffer size equals the record size; that default can lead to small blocks of i/o using the low-performance NFS protocol instead of the high-performance GFS capability.)

Unfortunately, the application had grown, and the files being accessed were now bigger than the buffer size setting of 200 Mbyte. Instead of the entire file being loaded into the memory buffer once at the beginning of a run and written out once at the end (if changed), a buffer of 200 Mbyte was being re-read every time a record was needed that lay outside the current buffer. This resulted in enormous amounts of i/o being done to access records in the file. Some programs read 150 times as much data from disc as was actually used.

The solution is to ensure that the buffer size is bigger than the file size when using the F_HSDIR directive for fast direct-access i/o on the GFS discs. Do not hesitate to use memory to gain performance, especially when your application is multitasking. Remember that on an SX-6 node you can use about 6 GB per processor without creating an anti-social job, whether as program, data space, or even i/o buffers assigned through the SETBUF facility. Note that resource limits are queue dependent.
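The rule that the buffer must be larger than the file can be checked in the job script before the program starts. The following is only a minimal sketch, not a tested site recipe. It assumes that the SETBUF value is given in kilobytes (inferred from the bulletin equating 200000 with 200 Mbyte), that unit 20 is the direct-access unit as in the settings quoted above, and that the file name `direct_access.dat` and the helper `buffer_kb_for_size` are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: pick a SETBUF value that exceeds the size of a direct-access file.
# ASSUMPTION: the value is in kilobytes (the bulletin equates 200000 with
# 200 Mbyte). Dividing by 1000 rather than 1024 errs on the large side.

# buffer_kb_for_size BYTES [MARGIN_PERCENT]
# Hypothetical helper: prints a buffer size in KB strictly larger than BYTES.
buffer_kb_for_size() {
    bytes=$1
    margin=${2:-10}
    # Round the file size up to whole kilobytes.
    kb=$(( (bytes + 999) / 1000 ))
    # Add percentage head-room plus 1 KB so the buffer exceeds the file.
    echo $(( kb + kb * margin / 100 + 1 ))
}

# Hypothetical file name; override by setting DA_FILE before running.
DA_FILE=${DA_FILE:-direct_access.dat}

if [ -f "$DA_FILE" ]; then
    # The arithmetic expansion strips any padding that wc may emit.
    size_bytes=$(( $(wc -c < "$DA_FILE") + 0 ))
    F_SETBUF20=$(buffer_kb_for_size "$size_bytes")
    export F_SETBUF20
    export F_HSDIR=20    # as quoted in the bulletin
    echo "unit 20 buffer set to ${F_SETBUF20} KB for a ${size_bytes}-byte file"
fi
```

If the file grows during a run, the margin needs to cover the final size rather than the size at job start, and the resulting buffer still has to fit within the memory limit of the queue being used.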
Would all users please re-check their codes, to make sure buffer sizes are being set for all large i/o streams, and that the enhanced version of the netCDF libraries is being used with suitable buffer sizes. The HPCCC support staff can assist.

2. Correction to the SX-6/TX7 qstat command option

In HPCbull 123.5, there was an error in the new qstat command option for the SX-6/TX7 system. The option to see accumulated processor time for jobs is -c 1 (one, not 'ell').

3. SGI Altix upgrade

Sixty-four new processors (1.3 GHz) have arrived for the SGI Altix. There will be some interruptions over the next few weeks to allow these processors to be installed to replace the 900 MHz processors.

4. VPAC Summer Internships (CSIRO)

Once again, CSIRO is looking for projects from users suitable for Summer Internees at VPAC, the Victorian Partnership for Advanced Computing. Last year, two projects were successfully completed, involving improvements to a data management project, and conversion of a program to OpenMP for multi-processing. If you have a project which might be suitable, please contact Gareth Williams on 03 9669 8114, gareth.williams@csiro.au.
© Copyright 2010, CSIRO Australia. Use of this web site and information available from it is subject to our Legal Notice and Disclaimer and Privacy Statement.