Local Userguide for the CSIRO Data Store

Before Starting

About this Guide

This Userguide should be used alongside the HPCCC Userguide for the Altix system which provides the underlying host for the CSIRO ASC Data Store.

The guides are intended to be introductory in nature though they will provide references to core documentation and detailed site specific information and FAQs.

System overview

The CSIRO ASC Data Store is hosted on the CSIRO ASC Altix system, cherax, cherax.hpsc.csiro.au.

The ASC Data Store provides a UNIX/Linux file system which is managed by the SGI Data Migration Facility (DMF), which is a Hierarchical Storage Management (HSM) system. This provides the users with a file-system view of data, which looks just like a normal UNIX file system. However, the data blocks for the files themselves can be stored in multiple places - it is sometimes called a migrating file system, since the data blocks of files can migrate to different locations within the hierarchy.

The file system managed by DMF looks to be infinite to the users! This has enormous advantages when dealing with large data sets. In addition, large amounts of free disc space can be made available automatically within minutes. As a bonus, users do not have to be concerned with managing off-line copies of their data on decaying media or on systems that are ageing. The DMF system as run by ASC staff provides semi-automatic moving of data to newer media and systems as these become available. Access to data in the ASC Data Store has been provided continually in the same transparent manner since 14 November 1991, at four sites, on five different hosts, and through nine different tape media types.

(By contrast, data stored on off-line media tends to be dead or dying data - floppy discs, old magnetic tape media types, and even CDs and DVDs are all transitory media. Storing data in the ASC Data Store overcomes these problems.)

For some of the reasons for running an HSM, see

Why HSM

ASC Data Store: Getting Started

Management Policy

Upon registration, users are given the following note:
    For further information on CSIRO ASC systems please see the local 
    user guides at
    http://intra.hpsc.csiro.au/userguides/. In particular, please read 
    the 'Filesystems' section for the host you will be using.

    It is imperative that you understand the file systems
    management policies on ASC systems, including:

      - automated flushing/removal of files
      - backups are limited to a few file systems
      - file migration to tape on the CSIRO ASC Data Store
      - no guarantee of any file recovery in the event of major disasters.

    Copies of important files should be kept elsewhere.

Users have in the past lost files because they were not aware of the policies that we use to manage file systems, particularly when they are close to filling.

About the CSIRO Data Store

Since February 2004, CSIRO ASC has operated an SGI Altix service to provide three functions:

The SGI Altix runs the Linux operating system. Most of the software familiar to desktop Linux systems is running on the Altix. If you want to use the system, and are unfamiliar with UNIX or Linux, please contact us for assistance.

Note that connection to the ASC Data Store is possible without logging in to cherax - the ASC Data Store file system is visible to several machines at the ASC Docklands site through the shell variable $STOREDIR. Upon request, it can be made accessible through the WWW at

http://www.hpsc.csiro.au/users/nexus_ident
and it can be accessed through SAMBA onto Windows systems at
\\cherax.hpsc.csiro.au\nexus_ident
For Windows access, files which are partially or fully offline will be shown by an icon which has a small black and white clock-face superimposed on the lower left-hand corner. This is intended to indicate that there may be a delay in accessing the contents of the file.

File Systems

Although the migrating $HOME file system is the major focus of this guide, there are other file systems hosted on cherax for various purposes - the appropriate file system should be used whenever possible. Variables are defined at login to refer to the parts of the filesystem available to users.

In the table below: 'properties' denotes the Management attributes of the underlying filesystem: back-up (b), quota (q), local disk (l), job-temporary (j), flush (f), by arrangement (a) and/or migrated (m).

Variable name   Properties   Purpose
$HOME           m, q, b      login settings and persistent backed up large capacity storage
$FLUSHDIR       q, f         working files (semi-)persistent between sessions.
                             Ensure that critical files left here are backed up elsewhere
$DATADIR        q            persistent files for use in multiple jobs
$TMPDIR         q, j         job-temporary files - automatic cleanup

Flushing is implemented on the $WORKDIR based on necessity but with a minimum lifetime of 7 days. Files newer than the minimum lifetime will never be (automatically) flushed.

The job temporary space $TMPDIR will have different names for interactive and batch sessions. System scripts will clean up on session logout, and file system monitoring and cron initiated jobs will be used to clean up if there are failures by the logout procedures.

On systems other than cherax, further areas will also be defined. In particular, the ASC clusters at Docklands have $STOREDIR defined for the NFS mount of the cherax $HOME, and $LOCALDIR for a job-temporary area on individual nodes of the cluster.
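
As a minimal sketch (not part of the ASC login scripts), the following Bourne-shell function reports which of the storage variables named in this guide are defined in the current session. The function name show_storage_vars is hypothetical; the variable names are those used in this section.

```shell
#!/bin/sh
# Report which storage variables from this guide are set in this session.
# show_storage_vars is an illustrative name, not an ASC utility.
show_storage_vars() {
    for name in HOME FLUSHDIR WORKDIR DATADIR TMPDIR STOREDIR LOCALDIR; do
        # eval gives a portable indirect lookup of the variable called $name
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%-10s %s\n' "$name" "$value"
        else
            printf '%-10s (not defined on this host)\n' "$name"
        fi
    done
}

show_storage_vars
```

Running this on each host you use is a quick way to see which areas are available there before a job relies on them.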

We have commenced moving significant shared data holdings out of users' home directories on the CSIRO Data Store, to enhance the access and management.

Data areas can be set up as users and groups desire. Please provide answers to the following:

  1. A name for the dataset/project
  2. The visibility required for the data: e.g. selected users within a group, users within an organisation, or users across organisations.
  3. The user name of the gatekeeper
  4. An estimate of the amount of data and number of inodes
  5. The module access required - part of a wider module, or separate
  6. Which users should have group access
  7. The name of any existing directory to be moved there. A link could be provided from the original location.

Monitoring Resource Usage

Use the quota command to see your current limits and your usage of on-line disc space and inodes (see below for more information).

Use the command df -H to see the total capacity and usage of all the file systems.

Reports on the ASC Data Store usage can be seen at

CSIRO ASC Data Store usage reports

These reports are updated monthly.

Status information about cherax, DMF, and the tape drives can be seen at the traffic lights at

Traffic lights

As well as basic status information, this gives access to incident reports (past and future downtimes), and further information, including tape drive details, the DMF status summary, and the DMF request queue.

File migration - a short introduction

The cherax home file system is subject to migration, controlled by a product called the Data Migration Facility (DMF) - this means that the home file system is virtually infinite.

There are three levels to the hierarchy on the CSIRO ASC Data Store.

  1. The /cs/datastore filesystem (6.6 Tbyte) on the primary disc area
  2. The secondary disc cache (12.6 Tbyte): there is no direct user access to this
  3. The robotic magnetic tape system. Data is stored on both T9940 and T10000 tape media.

When the primary disc area is filling (or when this task is scheduled to run in the evenings), the system makes copies of the data from files, normally onto two magnetic tapes, and may also make a copy into the second disc area ('the cache'). At a later stage, when the primary area continues to fill, the system selectively removes data from the primary disc, while continuing to make the files visible in the usual ways, e.g. to an ls command. (This process is called migration).

When the data is required again, DMF automatically recalls the data onto the primary disc. (This process is called recall). Users need to take no special action to make this happen - the only apparent change from a standard file system is that there is a delay - typically a few seconds to recall data from the cache, and a minimum of about one minute from tape.

From February 2005, the ASC Data Store policies were re-engineered as follows:

(In July 2009, out of the 15.1 million files stored on the /cs/datastore file system, 6.9 million (or 46%) were resident on the primary disc. Another 3.3 million files were on the cache disc.)

So, once a file is first recalled, it will usually go into the disc cache where smaller files will stay around for a lot longer. Our current policy uses a weighting system that determines whether a file is flushed from the disc cache. The weightings are equal for a 2 Mbyte file that has not been accessed in 6 months and a 32 Gbyte file that has not been accessed in 1 day.

So, while the disc areas can hold only about 2% of total holdings, they do hold around 67% of all files.
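
The percentages quoted above can be checked with a little arithmetic. This is only a worked example using the July 2009 file counts given in the text; the helper name percent is illustrative.

```shell
#!/bin/sh
# File counts in millions, as quoted for July 2009.
total=15.1
primary=6.9
cache=3.3

# percent PART WHOLE: print PART as a percentage of WHOLE, one decimal place.
percent() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

echo "resident on primary disc: $(percent "$primary" "$total")%"
on_disc=$(awk -v a="$primary" -v b="$cache" 'BEGIN { print a + b }')
echo "on primary or cache disc: $(percent "$on_disc" "$total")%"
```

The two results round to the 46% and "around 67%" figures given in the text.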

There is a large amount of activity on the /cs/datastore file system underneath the CSIRO ASC Data Store. On some days, more than 7.5 Tbyte of data has been generated. While there is a "tendency" for DMF to keep newer files on-line, even a file 1 week old is no longer new when there is 20 Tbyte of newer data ahead of it, and it will be flushed off to reside on tape only.

From October 2005, to ease the load and delays, we changed the strategy to migrate far more data off the primary disc overnight. This leaves the primary disc about 50% empty to cope with incoming data during the day, and reduces the need to write data to tape during prime time, which has allowed better access to tape drives for recalls during prime time. However, it leaves a longer window of vulnerability for data, when there is only one copy - on disc. Also, since mid-2009, the amount of data coming in has often exceeded 3 Tbyte during business hours, and data then needs to be written to tape as soon as possible to prevent the file system from filling. During business hours, recalls have precedence over writes to tape, but outside business hours, writes to tape have precedence, and so recalls can be long delayed.

More information about how to work with DMF is given in a later section.

File management - back-up, quotas, flushing, compression, consolidation and notification

Back-up

The $HOME file system is subject to back-up - a full back-up is made each Sunday morning, and an incremental every other day. However, the back-up makes copies of only the inodes (metadata) of all files, and the data of small files which are not to be migrated. There are not multiple copies of data going back in time, and the data from the larger files is not dumped.

However, when you delete or over-write a file which has been migrated by DMF, the file is only soft-deleted by DMF - the data is kept active in DMF, and can be recalled once the inode is restored from back-up. After 35 days, such files are hard-deleted, and recovery becomes increasingly unlikely.

If you accidentally delete or destroy a file, please contact us as soon as possible - after 35 days, we may not be able to help you.

A good technique for increasing protection against error is to remove write permission from crucial files and directories, e.g.

        chmod -R a-w keep*

There are no backups of the $TMPDIR, $WORKDIR and $DATADIR areas.

See the later section for more information about backups and restores.

Quotas

Use the quota command to see your usage and the quota limits on each file system. Note that there are quotas on both the space occupied and the number of inodes (loosely, the number of files and directories). Note that the units of space are blocks of 1 kbyte.

Use the quota -s command to see a report using units which are easier to read than long strings of digits.

For the migrating file system, the quota command shows under the heading 'blocks' only the on-line blocks for files - that is, it does not count the total size of all files in the hierarchical system.

The migrating file system can manage vast quantities of data, but there are overheads, so we discourage the storage of very large numbers of small files, by imposing an inode quota. The default is 150,000 files on $HOME.

When you exceed a quota limit, you will need to take action to enable you to continue to be able to create files. You will sometimes receive a system-generated e-mail; for example:

X, you have 174k files in /cs/datastore on cherax which is over your quota. In 5 days, the system will automatically prevent you creating any more there until your holdings drop below the "soft limit" of 170k. See the line containing "*" below:

   cherax$ quota -s
   Disk quotas for user abc123 (uid 12345): (blocks in kiB unless "-s" used)
   Filesystem    blocks   quota   limit grace  files  quota  limit  grace
   /cs/data           0    191G    210G            1   200k   210k
   /cs/datastore   145G    ***     ***         174k*   170k   180k  5days
   /var             816    977M   1075M            2  10000  10500
   /work          1299M    191G    201G         5002   200k   210k

   *** Space quotas only apply to online files and
       are normally irrelevant for /cs/datastore

You can reduce your holdings by deleting files, moving them elsewhere, or collecting them together with tar or tardir.
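
If you want to check for this condition yourself, the "*" marker can be picked out of the quota report with a short filter. The sketch below assumes the report layout shown in the sample e-mail above; over_quota is a hypothetical helper name, not an installed command.

```shell
#!/bin/sh
# over_quota: read "quota -s" style output on stdin and print the file
# systems where a usage figure carries the "*" over-quota marker.
over_quota() {
    awk '{
        for (i = 2; i <= NF; i++) {
            # A flagged value is a number, an optional unit letter, then "*".
            if ($i ~ /^[0-9][0-9.]*[kMGT]?\*$/) { print $1; next }
        }
    }'
}

# Sample taken from the e-mail shown above; in practice: quota -s | over_quota
over_quota <<'EOF'
Filesystem    blocks   quota   limit grace  files  quota  limit  grace
/cs/data           0    191G    210G            1   200k   210k
/cs/datastore   145G    ***     ***         174k*   170k   180k  5days
/var             816    977M   1075M            2  10000  10500
/work          1299M    191G    201G         5002   200k   210k
EOF
```

The "***" placeholders in the space columns are deliberately not matched, since they are not usage figures.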

Please email hpchelp@csiro.au if you have any queries, or need a quota increase to continue with your work; ASC staff will determine if your needs can be met.

Flushing

Flushing is implemented on the $WORKDIR based on necessity, but with a minimum lifetime of 7 days (current setting). Files newer than the minimum lifetime will never be (automatically) flushed.

The ASC has instituted automated flushing on the $WORKDIR area on cherax and the burnet cluster. When space becomes low, the system will sort files and empty directories from oldest to youngest, and remove these starting from the oldest until sufficient space is available, or until it runs out of old files to flush. The automated flush process stops flushing at a set age, currently 7 days, so that files younger than 7 days are not automatically flushed.

A log file showing the status of flushing can be found in the file flush.status in the top level of the file system where the automatic flushing has been invoked.

Flushing of the $TMPDIR area should occur at the close of each session or batch job.

Please don't rely on files either remaining in these temporary areas, or being removed by systems processes or ASC staff. Your own clean-ups would be appreciated.
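
To see which of your files are old enough to be at risk, a standard find command can be used. This is a sketch only: it assumes that age is judged by modification time, which may not be the criterion the flush process actually uses, and flush_candidates is a hypothetical name.

```shell
#!/bin/sh
# flush_candidates DIR [DAYS]: list regular files under DIR whose
# modification time is more than DAYS (default 7) whole days in the past.
# Illustrative only: the real flush process may use other criteria.
flush_candidates() {
    dir=$1
    days=${2:-7}
    find "$dir" -type f -mtime +"$days" -print
}

# Example (assumes $WORKDIR is set, as on cherax):
# flush_candidates "$WORKDIR"
```

Note that find counts whole 24-hour periods, so -mtime +7 selects files last modified at least 8 days ago.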

Compression

There are utilities to do file compression, e.g. gzip. However, we recommend that you do not use compression on files in the CSIRO Data Store, since the tape drives have hardware compression capabilities, and the compression is done 'for nothing' by the drives.

The only time we would recommend compression is prior to transmission over very slow network links, to reduce the data actually transmitted. However, the rsync utility can do compression automatically in that situation by using the -z option, and that is preferable.

Consolidation

The CSIRO ASC Data Store is engineered to cope with vast quantities of data, rather than vast numbers of small files. As mentioned earlier, there are additional overheads for every migrated file.

With the large-capacity T10000 drives, which read at 130 Mbyte/s (and faster with the average compression we see), the recall of a 1 Gbyte file takes about 75 s to load, mount and position a tape, about 8 s to read the 1 Gbyte, and perhaps another minute to rewind, unload and put the tape away again.

So, on these figures, the tape read time is only about 6% of the total recall time (8 s out of roughly 143 s) for a 1 Gbyte file.
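
The breakdown can be reproduced by taking the three quoted times at face value (no allowance for compression). The helper name recall_share is illustrative.

```shell
#!/bin/sh
# recall_share MOUNT_S READ_S UNLOAD_S: print the total recall time and the
# percentage of it spent actually reading data from tape.
recall_share() {
    awk -v m="$1" -v r="$2" -v u="$3" 'BEGIN {
        total = m + r + u
        printf "total %d s, read share %.1f%%\n", total, 100 * r / total
    }'
}

# Figures quoted above for a 1 Gbyte file on a T10000 drive.
recall_share 75 8 60
```

Calling the function with a larger read time (for example, several files read during one tape mount) shows how quickly the mount and unload overhead is amortised.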

Please aim to have files around 1 Gbyte or bigger, or recall batches of files using the dmget command, so that multiple files are recalled from each tape.

Files much larger than 100 Gbyte would take tens of minutes to read and need significant amounts of disc for working space. Be careful not to make files too large.

In some circumstances, there can be advantages in consolidating many files into an archive: typically a tar or zip command can be used. CSIRO ASC has a utility called tardir (see man tardir on cherax), which consolidates all the files in a directory into a tar file (called .tardir.tar), but also creates files containing a list of the files in the archive (.tardir.list), plus another file containing the output of a long listing of files in the directory (.tardir.ls). These lists, being small, are likely to stay on-line, and are therefore quickly accessible. The utility untardir can reverse the operation of tardir, while gettardir can extract specified files from a tardir archive.

An advantage of such an approach is that when you return to a directory that has been consolidated in this way, only one tape mount is needed to recall all the files, whereas if the files were left unconsolidated, there could be (in the worst case) as many tape mounts needed as files.
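
Where tardir is not available, a similar effect can be approximated with standard tar. The following is a sketch of the idea only, not the tardir utility: the function name consolidate_dir and the .tar/.list naming are assumptions made for illustration.

```shell
#!/bin/sh
# consolidate_dir DIR: pack the contents of DIR into DIR.tar and keep a
# small plain-text listing in DIR.list. This imitates the idea behind
# tardir with standard tar; it is not the tardir utility itself.
consolidate_dir() {
    dir=${1%/}
    tar -cf "$dir.tar" -C "$dir" . || return 1
    tar -tf "$dir.tar" > "$dir.list"
}

# Example: consolidate_dir results_2008
# The original directory is left in place; remove it only after checking
# the archive.
```

The small listing file is likely to stay on-line, so the contents of the archive can be inspected without recalling the archive itself.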

Notification

Since the $DATADIR areas have no automatic mechanism for dealing with the problem of the areas filling up, users of these areas will be subject to the 'name and shame' mechanism. When a $DATADIR area is close to filling (and ASC staff are available), then users will be notified by e-mail of their holdings, and asked to reduce them. Sometimes, the presence of dormant holdings will be highlighted.

Service levels

The Altix system forms the heart of the CSIRO ASC systems at Docklands. Its file storage capabilities are used by other systems (with NFS exports), and it is used to store back-ups of selected file systems for all the other CSIRO hosts at the site, and for some remote sites. It keeps the backups of the CSIRO home file system from the various clusters and other systems.

As such, continuous service on the Altix is very important. If you notice a problem, please contact us as soon as possible.

During the evening and night, DMF gives priority to the migration of newly created files. Recalls of older offline files back to disc may therefore take significantly longer at those times.

However, in order to provide upgrades, it will be necessary at times to have scheduled periods of unavailability. We will endeavour to have these periods outside normal business hours, but this depends on staff availability. We will try to align these outage periods with CSIRO IMT's scheduled outage windows: in particular, Wednesday afternoons from 16:00 local time and Saturdays.

Also, in order to support new applications on the system, we will, at times, reserve significant portions of the resources for these activities.

Getting the most from the migrating file system

The home file system on cherax provides automatic access to large quantities of data which can be stored at several levels in the hierarchy of storage repositories. This is managed by the Data Migration Facility (DMF).

The migrating file system can be un-nerving for first-time users: a simple command, e.g. to list a file, which the user expects to complete within a second (or never even thought about how long it would take!), may take a few minutes to complete because a file's data is deeper in the storage hierarchy. Unfortunately, the DMF system does not have the capability of prompting the user that a file recall is in progress, leading to a pause in processing.

DMF - hierarchies

The current system on cherax has the following storage levels:

The tape drives are attached to a StorageTek SL8500 tape library - this has four robots, and capacity to store about 6000 tapes, giving a notional capacity of the system of over 6.0 Pbyte using only T10000 (model B drives) cartridges.

DMF - migration

At various scheduled times, and in particular when DMF detects that space in the primary disc is getting low, DMF makes copies of the data from files onto two magnetic tape cartridges, and may make a copy into the cache disc. For small files, one copy will be on a T9940B tape, and the other on a high-capacity T10000 tape. For larger files, both copies will be on (separate) T10000 tapes. Then a file is classed as being dual-state, rather than just on-line.

The file is then a candidate to have its data blocks removed from the primary disc, which happens when a further space threshold is reached. The file is then classed as off-line. In many cases, we keep the first 32 kbyte of a file on-line - the file is then classed as a partial file.

Note that directories are never migrated, and the metadata of a file (its size, ownership, name, etc) are always available. Also, since February 2005, files smaller than 2 Mbyte will sometimes have a copy either in the primary or the cache disc. Thus a copy of data from small files will either be on-line, or may be able to be quickly recalled from the cache disc - typically in about 0.1 seconds.

DMF - recall

When a user needs a migrated file again, an attempt to read the file (from a command or program), will cause DMF to find a copy of the data blocks from the level which provides the fastest recall, and restore the data to primary disc. The command or program accessing the file then proceeds as normal. Users can recall groups of files with an explicit command - dmget - see below.

DMF - dm commands

The Data Migration Facility provides a suite of commands for users to use, to get more information from DMF, and to speed up some operations.

The DMF family of commands for users are:

On the Altix, man pages are available for all of these. In particular, the man page for dmget contains a lot of information about local enhancements.

Some of these commands could be made available in a client form on systems that have the migrating file system exported to them via NFS. We have not done this yet.

DMF - dmls command - list files

The ordinary ls command on the Altix preserves compatibility with other Linux/Unix ls commands, and does not show any information from DMF. However, the dmls -l command shows more details on the status of files, as an extra field just prior to the file name:

...
-rw-------    1 bel107   csssg     1035762 2008-08-20 13:29 (OFL) mig1
-rw-------    1 bel107   csssg     1035762 2008-08-20 13:29 (REG) unmig1
...

The status values are:

        Value   Description

        REG     File not managed by DMF
        MIG     Migrating file
        DUL     Dual-state file
        OFL     Offline file
        UNM     Unmigrating file
        NMG     Nonmigratable file
        PAR     Partial-state file 
        INV     DMF cannot determine the file's state
A REG file might be a directory, a small file which we never migrate, or a file which is yet to be migrated.

A Migrating file is one whose data blocks are being copied to secondary media, usually to tape.

A Dual-state file has copies of its data on both the primary file system and on a secondary medium, such as the disc cache and/or tape.

An Offline file does not have a copy of the data in the primary disc area. The file data is either on tape or on the data cache, or both.

A Partial file has some of its data in the primary disc area, and the rest off-line.
Note that for the CSIRO ASC Data Store, files in the data cache are considered "Offline", even though their retrieval is rapid.
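
For a directory with many files, it can be handy to summarise the states rather than read the whole listing. The sketch below assumes the dmls -l layout shown above, with the state in parentheses just before the file name; count_states is a hypothetical helper name.

```shell
#!/bin/sh
# count_states: read "dmls -l" output on stdin and print the number of
# files in each DMF state, one "STATE count" pair per line.
count_states() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # The state is a field of the form (XXX), e.g. (OFL) or (DUL).
            if ($i ~ /^\([A-Z][A-Z][A-Z]\)$/) {
                state = substr($i, 2, 3)
                count[state]++
                break
            }
        }
    }
    END { for (s in count) print s, count[s] }' | sort
}

# Sample lines from the listing above; in practice: dmls -l | count_states
count_states <<'EOF'
-rw-------    1 bel107   csssg     1035762 2008-08-20 13:29 (OFL) mig1
-rw-------    1 bel107   csssg     1035762 2008-08-20 13:29 (REG) unmig1
EOF
```

A large OFL count is a hint that a dmget is worthwhile before starting work in that directory.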

DMF - dmget command - recall files

The dmget command is probably the most useful action command for users - it requests DMF to immediately initiate the recall of a set of files. Doing this can speed up operations by orders of magnitude, because DMF collects all the requests into batches for each tape needed to be mounted.

For example, if you've been away from a directory for a while, and want to start working with the files again, upon login change to that directory and enter the command:

        dmget * &

to initiate the recall of all your files in that directory, and then go and have a cup of coffee! Better still, before that enter the command

        dmget --list * 
(using a locally-enhanced wrapper for dmget) to see how many tapes might need to be mounted to get back the files. You may not have time for the coffee.

The dmget command accepts a list of files as an argument, or will read file names from standard input, so it can be used in a pipe, e.g.

        ls -1 fil* | grep -v 2003 | dmget

Note: the recall of a file does not count as a file access - the access time is not updated - this preserves the POSIX functionality of the file system, despite the file system being multi-resident.

However, this means that when you recall a file with a dmget command, it can often be an immediate candidate for migration again - when the system looks for opportunities to create more free space, the dual-state recalled file is an easy target, with an old access time. In these circumstances, it is wise to use the dmget command with the additional flag -a, which causes DMF to update the access time of the file. For example, the command

        dmget -a files*

updates the access time and recalls files matching the pattern files*. (This obviates the need for a touch -a command on the files.)

The backgrounding of dmget recalls (with the &) is an important concept, as it allows other processing to continue while the dmget is being honoured. However, if there is a problem with the recall (e.g. maintenance on the tape library), then processing may block or fail. If you want to guarantee that a set of files is recalled and on-line, then the following style of commands is recommended, for dealing with batches of files.

        #!/bin/tcsh
        # Process numbered batches of files, recalling one batch ahead.
        set last_batch=10
        set batch=1
        @ next = $batch + 1

        # Start the recalls of the first two batches in the background.
        dmget -a file.$batch.* &
        dmget -a file.$next.* &

        while ($batch <= $last_batch)
          # Wait in the foreground until the current batch is on-line.
          dmget -a file.$batch.*
          if ($status != 0) then
             echo "File recall problem with batch $batch - exit"
             exit 1
          else
             # Initiate the recall of the next batch.
             if ($next <= $last_batch) then
               dmget -a file.$next.* &
             endif
             # process the batch of files.

          endif
          @ batch++
          @ next = $batch + 1
        end

Another technique, matched to the local dmget wrapper, is shown in example 2 in the dmget man page on cherax, and may be even more efficient, particularly when the order of access to the files does not matter.

        # Bourne shell (sh) syntax
        dmget --list file1 file2 file3 file4 > $TMPDIR/lof
        dmget < $TMPDIR/lof &
        for f in `cat $TMPDIR/lof`; do
             process_one_file $f
             dmput -r $f
        done

This can be extended to batches of files.

If you want to dmget files not just in the current directory, but in the current directory and all its sub-directories, then use commands like:

        find . -type f -print | dmget -a &

The local wrapper for dmget also accepts the argument --recurse, to ask for the recall of all files in the specified directory and sub-directories. For example:

        dmget -a --recurse mydir

Up to a point, recalls of bigger batches are better, because the system can optimise the retrievals - if more than one file is on a tape, then when the tape is mounted, all the files currently being retrieved from the tape can be recalled with only one tape mount.

However, in the extreme, attempting to recall too many files at once causes problems - there is a single queue of retrieval requests, and a request to retrieve lots will block recalls for other users for a long time. Setting off multiple dmgets can disadvantage not only other users, but also your own recalls. As well, the freeing process for disc space could get blocked behind the recalls, and a disaster can follow.

Please be considerate with the dmget command. DMF maintains a simple queue of recall requests for all users - if you initiate the recall of thousands of files and tens of Gbyte, it will take some time, and other users will have to wait until your recalls are finished before they can recall even one file. The local wrapper to dmget breaks the requests up into smaller groups, which helps to allow other users access, but please be considerate.

The local dmget wrapper now provides a new --defer option. This allows you to specify files which you would like recalled for you overnight, for use the next day. There is no guarantee that these requests will be acted on, but if they are, then access (via a second normal dmget) ought to be much faster.

If the system is too busy these deferred requests will be ignored. This means that use of this option is not a replacement for a dmget just prior to file use, but it may well accelerate it.

The local wrapper has the important attribute of sorting the recalls into tape order, and attempts to process all recalls from the same tape as part of a single batch, to bring an extra level of efficiency to recalls by minimising tape mounts.

As well, a new parameter, --list, is supported. If this is specified, the dmget wrapper will not initiate any dmgets, but will report on the order in which the files will be recalled, and the number of tapes required. This command will be useful when the files being recalled are to be input to another process, e.g. scp to another host.

When run interactively, the new wrapper attempts to give an indication of how much work has been requested and how long it might take:

 cherax$ dmget *
 You are recalling 12 of the 19 files specified.
 The oldest currently queued recall request has been waiting for 0h 2m
 5 tape mounts may be required.

After you have used dmget on a large number of files, or a large quantity of data, then when you have finished processing, please issue the command
   dmput -r files*
to release the disc copies, and make more on-line space available for yourself and other users.

Please consult with CSIRO ASC staff for advice before attempting to process very large quantities of data or very large numbers of files.

DMF - dmput command - migrate files

The dmput command can be used in several ways. Firstly, it can advise DMF that the specified files can be migrated. However, on the CSIRO ASC Data Store, migration is handled automatically at set times each day, or when space is becoming low, so this use of the dmput command is not necessary.

Secondly, if you know that you are unlikely to need access to a batch of files for a while, then the dmput command with the -r parameter advises DMF to remove the on-line data after the off-line copies are made - this can make more space available on-line for files belonging to you or your colleagues.

Thirdly, and this is the most important use, if you are recalling large quantities of data, then to prevent this having a large impact on other users, please use a dmput -r on the retrieved files when you have finished with them. This will release the on-line disc space, and lessen the actions of the DMF free-space manager on your own and other people's files.
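
One way to make the release step hard to forget is to wrap recall, processing and release in a single shell function. This is a minimal sketch: process_and_release and my_analysis are hypothetical names, and the DMGET and DMPUT variables exist only so the function can be tried with stand-in commands on a host without DMF.

```shell
#!/bin/sh
# The real commands are used by default; set DMGET and DMPUT to stand-ins
# (e.g. "echo") to try the sketch on a host without DMF.
: "${DMGET:=dmget}"
: "${DMPUT:=dmput}"

# process_and_release COMMAND FILE...: recall the files, run COMMAND on
# them, then release the on-line copies whether or not COMMAND succeeded.
process_and_release() {
    cmd=$1
    shift
    "$DMGET" -a "$@" || { echo "recall failed; nothing processed" >&2; return 1; }
    "$cmd" "$@"
    rc=$?
    "$DMPUT" -r "$@"
    return "$rc"
}

# Example (my_analysis is a hypothetical program):
# process_and_release my_analysis run1.dat run2.dat
```

The function returns the exit status of the processing command, so a calling script can still detect a failed run after the disc copies have been released.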

CSIRO ASC runs its own version of dmput, which, unlike the system version, allows users to dmput other people's files for which they have the appropriate access. This is particularly useful when one user is processing data belonging to another user, and needs to release the on-line copies to avoid making the second user hit a space quota limit.

You can instruct dmput to ignore unmigrated files (those in REG state) by using the -Q option. In conjunction with -r, this allows the space used by migrated files (in DUL and PAR states) to be freed to avoid quota restrictions while avoiding the premature migration of newly created files.

When you issue a dmput command, the default is to remove all the disc blocks of a file. If you want to keep the first part of the file on line, which is the default case for automatic space management by DMF on the /cs/datastore file system, then issue a command like:

	dmput -r -K 0:40959 file
See man dmput for more details.
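
The byte range given to -K in the command above is inclusive and starts at zero, so keeping the first N kbyte (taken here as 1024 bytes each) means ending the range at N x 1024 - 1. A minimal sketch of that arithmetic follows; the helper name keep_range is an assumption for illustration only.

```shell
# Sketch: build the inclusive byte range that covers the first N kbyte
# of a file, in the start:end form shown in the dmput example above.
# keep_range is a hypothetical helper, not a DMF command.
keep_range() {
    kib=$1
    echo "0:$((kib * 1024 - 1))"
}

# Possible use (not run here): dmput -r -K "$(keep_range 40)" file
```

With an argument of 40 the helper prints 0:40959, the range used in the example above.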

DMF - dmfind command - find files

The ordinary find command on the Altix preserves compatibility with other Linux/Unix find commands, and does not show any information from DMF. However, the dmfind command with the -state parameter can be used to find files matching various statuses. The states are the same as shown by the dmls command. For example, to find all files that are migrating, off-line or partial in the current and lower-level directories, and initiate their recall, use:

        dmfind . -state MIG -o -state OFL -o -state PAR | dmget -a
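
Before starting a recall over a large directory tree, it may be worth counting how many files the same dmfind expression matches. The sketch below reuses the expression from the example above; the function name count_to_recall and the DMFIND override variable are assumptions made for illustration.

```shell
# Sketch: count the files under a directory that a recall would touch,
# using the same -state tests as the dmfind example above.
# count_to_recall and DMFIND are hypothetical names.
count_to_recall() {
    dir=${1:-.}
    # One path per line is assumed, so wc -l gives the number of files.
    ${DMFIND:-dmfind} "$dir" -state MIG -o -state OFL -o -state PAR | wc -l
}
```

If the count is very large, that is a good moment to consult ASC staff before piping the list into dmget.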

DMF - dmcopy command - copy (part of) migrated file

The dmcopy command can be used to make copies of migrated files, and also has the capability to extract portions of a migrated file, and pack portions into other files. This could be extremely useful for some data extraction tasks. However, please consult ASC staff before using this command - there are some cautions.

DMF - dmattr command - list file attributes

This command lists various DMF-attributes of files. It can also be used to check on DMF status.

DMF - partial-state files

This feature allows DMF-managed files to have different residency states (online or offline) for different regions of a file. This means that a file can have one region that is online for immediate access and another region that is offline and will need to be recalled to online media in order to be accessed. DMF allows for up to 4 distinct file regions. A file which has more than one region is called a partial-state file. A region is simply a contiguous range of bytes which have the same residency state. The maximum number of 4 regions means that a file which is in a static state (not currently being migrated or unmigrated) can have a maximum of 2 online and a maximum of 2 offline regions.

A partial-state file is shown with the status 'PAR' in dmls -l listings, and can be matched with the dmfind command.

Currently, DMF is set to keep the first 32 kbyte of all new files in /cs/datastore on-line, although the threshold may change in the future.

This feature will allow commands such as "file" to access details of a file without recalling the entire file from off-line media. It may also allow access to metadata in things like netCDF files without recalling the whole file. It is also useful in file browsers, allowing the file types to be discovered.

(This facility has been used to allow a crawler to traverse an OPeNDAP repository without recalling the entire contents of all the files).

Users can specify byte ranges for partial files in dmget and dmput commands - see the man pages. You can see the state of the regions of a file with the dmattr command, for example:

        dmattr -a nregn,regn,state -l file
might show:
     nregn : 2
      regn : 0  DUL  0:40959
      regn : 1  OFL  40960:EOF
     state : PAR
and dmattr -r -l file will show full details. Please note that the local dmget wrapper does support byte ranges.
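
The region listing can also be filtered in a script. The following sketch picks out the byte ranges of the off-line regions; it assumes that dmattr output is laid out exactly as in the sample above, and the function name offline_regions is hypothetical.

```shell
# Sketch: print the byte range of each off-line (OFL) region from
# dmattr output in the layout shown above, i.e. lines of the form
#   regn : <index>  <state>  <start>:<end>
# offline_regions is a hypothetical helper, not a DMF command.
offline_regions() {
    awk '$1 == "regn" && $4 == "OFL" { print $5 }'
}

# Possible use (not run here):
#   dmattr -a nregn,regn,state -l file | offline_regions
```

For the sample output above this prints 40960:EOF.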

DMF - status

There is a local command dmfstatus in /tools/ascutils/bin/dmfstatus on cherax. This provides, in answer to long-standing requests from users, a clue as to the load on DMF.

For example:

DMF queue last updated 2010-05-28 15:26:13+1000 Fri

DMF Status: Friday 2010-05-28 15:26

                   Recalls
          Current              Today
VolGrp Queued MiB-Queued  Total   MiB-Total HitRate
dcm         0        0.0   1416    494335.8     38
sec         7     5120.7   1702    632155.0     46
se2         4    38785.8    173   1365924.0      4
te2         1      420.0    390    173050.1     10
test        0        0.0      0         0.0      0
bu2         0        0.0      0         0.0      0
Total      12    44326.5   3681   2665465.0

                  Migrates
          Current              Today
VolGrp Queued MiB-Queued  Total   MiB-Total
dcm         1        0.6   1707    670421.9
sec         0        0.0      0         0.0
se2         1       33.4    736    841999.9
te2         0        0.0      0         0.0
test        0        0.0      0         0.0
bu2         0        0.0    905     25840.0
Total       2       34.1   3348   1538261.9
 
Longest recall wait  0:26:19
Tbyte total stored
2010-05-27 2306360767.939
End of report

This provides an instantaneous snapshot of the state of DMF recalls/migrates. The heading 'VolGrp' indicates volume groups: smaller files are written to sec and te2, other files to se2 and te2, and bu2 is used for the /backup* file systems. dcm is the disc cache.

This information is also available through the traffic lights (see reference above), along with details of the DMF request queue.
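
A script that is about to issue a large recall could look at the dmfstatus report first. The sketch below pulls out the total number of queued recalls, the queued volume in MiB and the longest recall wait. It assumes the report keeps the layout of the example above; the function name recall_load is hypothetical.

```shell
# Sketch: summarise the recall load from a dmfstatus report read on
# standard input, assuming the layout of the example report above.
# recall_load is a hypothetical helper, not a local ASC command.
recall_load() {
    awk '
        # Remember which table we are in, so that only the Total line
        # of the Recalls table is used.
        $1 == "Recalls"  { section = "recalls" }
        $1 == "Migrates" { section = "migrates" }
        section == "recalls" && $1 == "Total" { queued = $2; mib = $3 }
        /^Longest recall wait/ { wait = $4 }
        END { printf "queued=%s MiB=%s longest_wait=%s\n", queued, mib, wait }
    '
}

# Possible use (not run here):
#   /tools/ascutils/bin/dmfstatus | recall_load
```

For the example report above this prints queued=12 MiB=44326.5 longest_wait=0:26:19.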

More about backups and restores

This section provides more information about the protection for your files offered through the backup and restore services.

Backups

On the HPCCC and CSIRO ASC systems, in general only the $HOME areas are subject to backup. This allows restoration of files that are accidentally deleted or corrupted if:

We had an incident once when a restart of a job caused output files from the earlier run of the job to be over-written. The user asked us to restore the files from before the rerun, but even though the files were in a backed-up area ($HOME), there were no backup copies, because files which are open to jobs (probably for writing) are not caught by most of the backup processes. This is an additional reason why jobs should run in areas like $WORKDIR, and copy important files back to $HOME on completion.

Users are urged to make good use of the $TMPDIR, $WORKDIR and $DATADIR areas, but NOT rely on these areas to provide any backup or disaster recovery. If the hardware or software fails, or there is an administrator error, there can be no recovery.

Status of the backups for user file systems can be seen at:

CSIRO ASC file systems backup status

For the CSIRO systems, apart from the /cs/datastore (cherax $HOME), the backups are managed in a Tower of Hanoi pattern, whereby backup snapshots are kept from several points in time, but the separation between the points in time increases the further back you go. Here is an example of the coverage for the burnet cluster /cs/home file system.

2010-05-28 08:00:36+1000 Fri /backup1/burnet-fs1-dumps
systemcshome.20080409.seq.0 to set 0
systemcshome.20081228.seq.256 to set 9
systemcshome.20090919.seq.512 to set 10
systemcshome.20100210.seq.640 to set 8
systemcshome.20100414.seq.704 to set 7
systemcshome.20100430.seq.720 to set 5
systemcshome.20100516.seq.736 to set 6
systemcshome.20100520.seq.740 to set 3
systemcshome.20100523.seq.743.recycle to set 1
systemcshome.20100524.seq.744 to set 4
systemcshome.20100525.seq.745 to set 1
systemcshome.20100526.seq.746 to set 2
systemcshome.20100527.seq.747 to set 1
	Found 13 dumps as expected up to sequence number 747

Note that for most files, only one physical copy is kept.
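
In the listing above, the set number of each dump can be derived from its sequence number: sequence 0 is set 0, and otherwise the set number is one more than the number of times the sequence number can be halved exactly. This rule is inferred from the listing, not taken from the ASC backup scripts, and the function name hanoi_set is hypothetical.

```shell
# Sketch: compute the Tower of Hanoi set number for a dump sequence
# number, using the rule inferred from the listing above.
# hanoi_set is a hypothetical helper, not part of the backup system.
hanoi_set() {
    seq_no=$1
    # Sequence 0 is the base dump, listed as set 0.
    if [ "$seq_no" -eq 0 ]; then
        echo 0
        return
    fi
    set_no=1
    # Each exact halving moves the dump to the next, less frequent set.
    while [ $((seq_no % 2)) -eq 0 ]; do
        seq_no=$((seq_no / 2))
        set_no=$((set_no + 1))
    done
    echo "$set_no"
}
```

For example, hanoi_set 747 gives 1, hanoi_set 744 gives 4 and hanoi_set 512 gives 10, matching the listing. Odd sequence numbers always fall in set 1, which is why that set is reused most often and the higher sets reach further back in time.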

For discontinued systems and services (e.g. the SX-6/TX7 system from the cessation of backups of /cs/home in May 2010), backups will be successively consolidated and removed if they are more than 3 years old. The last backup will be deleted 3 years after the end of service.

It is important to note that these backups of discontinued systems should never be relied upon as an archive of files. Users should ensure that files they need from discontinued systems are copied to somewhere for safe keeping prior to the end of the service.

For the /cs/datastore file system, (cherax $HOME), two full dumps (taken on Sundays), and incremental dumps for the last week are kept on-line in a system disc area, and copies of the dumps are kept on tapes for about 45 days. Some dump tapes are kept from further back in time.

For the /cs/datastore area, copies of the full dumps are sent offsite weekly. These dumps hold around 45% of the files (the small ones), but less than 1% of the data. THERE ARE NO SECOND-SITE COPIES FOR OVER 99% OF THE DATA, AND USERS SHOULD PROVIDE EXTRA PROTECTION FOR IMPORTANT FILES. Each day, the incremental dump of the /cs/datastore area is also sent off-site, and three days' dumps are kept there.

The off-site information also includes dumps of the DMF information, allowing restorations from old data bases, and dumps of the /backup areas. Dumps are taken off-site in a Tower of Hanoi system, with coverage back for a month or two. For example:

 
workbackup_stagecurrent.20100425.seq.64/
workbackup_stagecurrent.20100627.seq.72/
workbackup_stagecurrent.20100725.seq.75.recycle/
workbackup_stagecurrent.20100801.seq.76/
workbackup_stagecurrent.20100808.seq.77/
workbackup_stagecurrent.20100817.seq.78/
workbackup_stagecurrent.20100822.seq.79/

As well, we have older dump tapes back to early 2009.
Tape    Date        Sequence number  Set number
G61009  2009-03-15                0           0
G61006  2009-10-25               32           6
G60575  2010-02-21               48           5
G61003  2010-04-18               56           4
G60581  2010-05-30               64           7
G60583  2010-06-27               68           3
G60593  2010-07-04               69           1
G60588  2010-07-11               70           2
G61007  2010-07-25               71           1

Tapes containing second copies of files from the Bureau SAM-FS systems are taken offsite regularly.

Restores

Around July 2009, a user asked how long files remain recoverable after a user deletes them. An answer follows.

For any file system, if a user accidentally deletes a file before any dump copies are made, there is no recovery process available that we can provide. Dumps are done only once per day (or less frequently for some areas).

For this reason, if you are working intensively on a file, e.g. source code, it is very wise to make second copies on another system regularly, and not over-write all the past copies. Use a subversion repository to provide greater protection.

For the main cherax $HOME file system, if a user deletes a non-migrated file (i.e. small or new), recovery is easily done if the file existed at the time of a previous early-morning dump, and not otherwise.

For the main cherax $HOME file system, if a user deletes a migrated file (or a file that had previously been migrated), it is only soft-deleted, and not hard-deleted out of the databases for at least 35 days. During the 35 days, the file is easily recoverable (if the metadata was caught on an early-morning dump) and this requires restoring the metadata about the file from a dump.

After that, copies of the files will remain on tape for a long time, but will be hard to find without old copies of the DMF databases that were live when the file was last alive. We now keep copies of these databases back in time at ever expanding intervals, so we might in theory be able to restore a file from further back in time if the tapes haven't been merged, and we still have the file caught in an old version of the databases.

We have copies (offsite) of the databases and dumps as shown in the example in the previous section, and on-site for all of the last 35 days.

When we back up a remote file system from cherax, e.g. the burnet cluster or the Clayton tardis cluster, we keep backups in a Tower of Hanoi pattern, with increasing distance between dumps as you go back in time. All of the files captured in these dumps are readable and recoverable. For example, here are sample holdings of the files from the tardis system:


2010-05-28 08:00:39+1000 Fri /backup2/backup/tardis-dumps
With up to 1 recycled directories to be kept.

cshome.20081217.seq.0 to set 0          
cshome.20090829.seq.256 to set 9         
cshome.20100104.seq.384 to set 8         
cshome.20100310.seq.448 to set 7         
cshome.20100411.seq.480 to set 6         
cshome.20100427.seq.496 to set 5         
cshome.20100513.seq.512 to set 10         
cshome.20100521.seq.520 to set 4         
cshome.20100523.seq.522.recycle to set 2         
cshome.20100524.seq.523 to set 1         
cshome.20100525.seq.524 to set 3         
cshome.20100526.seq.525 to set 1         
cshome.20100527.seq.526 to set 2         
	Found 13 dumps as expected up to sequence number 526

However, we have set the Hierarchical Storage Management of these areas to copy the data for any file to only one tape rather than two (since there is already a copy at the source), and have also set these backups up so that if a file is the same between any of these dumps, we have only one physical copy. We don't make a new physical copy for every dump cycle - only for new incoming files.

Users don't have access to these backup areas, but it would be feasible to provide this so that users could see their holdings.

So, if a user deletes a file locally, recovery depends on a copy of the file having been caught in the backup process at a schedule like that above, and on the user noticing the deletion in time. The longer the file existed before the deletion, the better the chance of recovery. In the above set of dumps, the first one, labelled set 0, is intended to be kept for as long as we are running this backup scenario.

Storage terminology

A note about storage terminology (backup, archive, migration, etc) can be found at

Storage Terminology

ASC Data Store History

CSIRO had been using a Cray UNICOS platform from March 1990 to April 2004, and had great service from the three Crays (Y-MP2/216 SN1409, Y-MP4/64 SN 1918, and J90se/162048 SN 9730), the UNICOS operating system, the compilers, tools and libraries, and the Data Migration Facility (since 14 November 1991). This is a long reign for any platform in CSIRO for large-scale computing, probably equalled only by the Csironet CDC 3600 from about 1964 to 1977, or the CDC Cyber 76 from about 1973 to 1985.

In February 2004, CSIRO ASC brought an SGI Altix 3700 system into service to provide three functions:

In mid-2004, the system was upgraded to also provide:

In August 2008, the Altix 3700 was replaced by an Altix 4700 with the following hardware:

In June-July 2009, the two StorageTek T9840C magnetic tape drives were de-commissioned from DMF, the Powderhorn tape library was de-commissioned and removed, and four T10000B drives were added, eventually to replace the T10000A drives. The T10000B magnetic tape drives provide 130 Mbyte/s, 1 Tbyte per cartridge, and 1 minute average access time after tape load.

In January 2010, 90 Tbyte of disc and SSD was delivered to upgrade the ASC Data Store disc.

In February 2010, the copying of all the data from T10000A format to T10000B format was completed, and the T10000A drives de-commissioned.

In May 2010, 4 more T10000B tape drives were delivered.

CSIRO user data from the file system set up on the first CSIRO Cray in March 1990 has been carried forward to the current /cs/datastore file system, with interruptions only for the weekend transitions between sites (3 times). See the Monday Mail article from March 2010 for a historical perspective on the CSIRO ASC Data Store and the underlying hosts.

Due to user requests in 2004 that the new ASC Data Store host retain the previous Cray name cherax, the SGI Altix is known as cherax.hpsc.csiro.au. The name was originally chosen as a play on cherax - the scientific name for the yabby, the Australian Cray :-). Although it is no longer pertinent as a pun, cherax will continue to crunch data for CSIRO. The SGI system, like all the previous Crays, was manufactured in Chippewa Falls, Wisconsin, USA, a place associated with supercomputing from the beginning.

We initially pronounced cherax with a starting sound like "chips", but at the opening of the first CSIRO Cray on 23rd March 1990, Barry Jones remarked that "since the name is derived from the Greek, the pronunciation is cherax, as in chemistry" (with a hard "k" sound). We stood corrected.

So far, CSIRO staff of the ASC and predecessors have managed the CSIRO ASC Data Store facility with very few file losses over a period of more than 18 years. Users have been insulated from media changes for their data storage during this period, because of the near-line nature of the storage system.

This contrasts with off-line data storage, which is highly dependent on human effort to copy data to new media types. The computer and storage industries are replete with data media and formats which are now obsolete, e.g. paper tape, punched cards, reel-to-reel magnetic tape, floppy discs, Exabyte tapes, optical discs. CDs and DVDs will be the next fading storage medium. This leads to the mantra among ASC staff: "off-line data is dead data, or at least dying."

Key dates

Ancient History

Notably, CSIRO (Division of Computing Research) had an HSM called the Document Region on its CDC 3600 from the 1960s until it was de-commissioned in 1977: it used operator-mounted 7-track magnetic tape. CSIRO (Csironet) had an automated tape store (Braegan/Calcomp ATL) using 6250 bpi 9-track tapes hosted on Fujitsu systems from about 1980 to 1990, with a Terabit File Store built upon this, but without automatic migration.

Known Problems

Changelog

Here is a list of recent updates in this userguide for quick reference for users returning to this guide.

To Do

Here is a list of pending updates to this Userguide.

Last updated: 2010-05-28 16:13:31+1000 Fri