Using rsync for file transfer
What is rsync?
We recommend rsync for transferring files between systems. It has
many advantages in terms of performance and data availability,
especially for bulk transfers.
Some advantages are:
- rsync pipelines its work: separate generator, sender and receiver
processes run concurrently, so checking files and transferring data
overlap. (rsync itself is not multithreaded.)
- rsync will optimise the volume of data being copied,
especially when files on the destination have already been
transferred and remain unchanged. In that case rsync does not
copy the same file again, saving time.
- rsync will do incremental copies, avoiding the need to copy
identical files every time the copy command is issued.
- An interrupted rsync can be resumed by re-running the same command:
files that have already been transferred are skipped.
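As a minimal sketch of the incremental and resume behaviour listed above, the wrapper below adds the --partial option, which keeps a partly transferred file after an interruption instead of deleting it. The function name resumable_copy, the host name and the paths are hypothetical and used only for illustration.

```shell
# resumable_copy SRC DEST
# Hypothetical helper: copy a tree so that an interrupted run can be restarted.
#   -a         archive mode (recursive; preserves times, permissions, links, ...)
#   -v         list files as they are transferred
#   --partial  keep partially transferred files instead of deleting them
resumable_copy() {
    rsync -av --partial "$1" "$2"
}

# Example (hypothetical host and paths). After an interruption, run the same
# command again; files that are already up to date are skipped:
#   resumable_copy remote-host:src/mydir /data/dest/
```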
rsync works by using ssh (or rsh) to start another instance of rsync in server
mode at the remote end. This has a number of implications:
- rsync must be installed on both the source and destination hosts, and must be
found in your path (but see the --rsync-path option).
- If using rsh, it must be set up for password-less operation, but...
- ssh is usually used as a replacement for rsh. With ssh
you may be prompted for authentication if required (provided you have
a terminal or ssh-askpass). A particular ssh implementation (say hpn-ssh),
or ssh with particular options (say hpn-ssh with null encryption), can be
specified with the -e option.
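The -e and --rsync-path options mentioned above can be combined as in the following sketch. The function name pull_via_ssh, the ssh port 2222 and the remote path /opt/rsync/bin/rsync are assumptions chosen for illustration; substitute the values that apply to your hosts.

```shell
# pull_via_ssh REMOTE_SRC LOCAL_DEST
# Hypothetical helper showing the -e and --rsync-path options.
# Assumptions (illustrative only): sshd on the remote host listens on
# port 2222, and the remote rsync binary is /opt/rsync/bin/rsync, which is
# not in the default path there.
pull_via_ssh() {
    rsync -av \
        -e "ssh -p 2222" \
        --rsync-path=/opt/rsync/bin/rsync \
        "$1" "$2"
}

# Example (hypothetical host and paths):
#   pull_via_ssh remote-host:src/mydir /data/dest/
```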
Examples
The most common use is to transfer/synchronise
a whole directory structure as in the following examples.
rsync -av remote-host:src/mydir /data/dest/
This would recursively transfer/update all files from the directory
src/mydir on the machine remote-host into the /data/dest/mydir directory
on the local machine. The files are transferred in "archive" mode, which
ensures that symbolic links, devices, attributes, permissions, ownerships,
etc. are preserved in the transfer (where possible).
Or for synchronisation in the other direction (to a remote host):
rsync -av src/mydir remote-host:/data/dest/
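In these examples the source has no trailing '/', so the directory mydir itself is created under the destination. A trailing '/' on the source changes this: only the contents of the directory are copied. The sketch below contrasts the two forms; the helper names copy_dir and copy_contents are hypothetical, and SRC is assumed not to be the root directory.

```shell
# Both helpers are hypothetical names used only for illustration.

# copy_dir SRC DEST
# No trailing slash on SRC: the directory itself is created under DEST
# (src/mydir/file -> DEST/mydir/file).
copy_dir() {
    rsync -av "${1%/}" "$2"      # ${1%/} strips one trailing slash, if any
}

# copy_contents SRC DEST
# Trailing slash on SRC: only the contents of the directory are placed
# directly in DEST (src/mydir/file -> DEST/file).
copy_contents() {
    rsync -av "${1%/}/" "$2"     # make sure SRC ends in exactly one slash
}
```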
The basic syntax for rsync is the same as for scp/rcp.
For example, to transfer files to or from a directory, the rsync commands:
rsync files user@remote-host:mydir/
rsync user@remote-host:'file1 file2' mydirhere/
would replace the scp commands:
scp files user@remote-host:mydir/
scp user@remote-host:'file1 file2' mydirhere/
Note: the trailing '/' on the destination is optional, but makes it clear
that the destination is a directory.
Some more options
- Try the dry-run option -n to see what files would be transferred
before trying some examples.
- The --whole-file option is important when transferring files into
an HSM system (such as cherax), where you don't want to recall files from tape to
compare content; only times and sizes are checked.
- The -av --stats options are useful when doing bulk transfers
and backups. Add --delete --delete-after when you need to sync
and create a mirror of a directory on a remote computer.
- The --exclude=PATTERN option allows you to recursively
synchronise a directory structure, except for files that match a particular
pattern (see the man page for the pattern details, which are based on shell
meta-characters). Multiple --exclude options are allowed.
- Bandwidth limiting is available to stop a transfer
saturating a shared link. This can be useful for low priority
background transfers that leave some bandwidth for interactive
usage. Use the option --bwlimit=KBPS to limit I/O bandwidth (KBytes
per second).
- rsync can also do pipelining of file transfers to minimise latency
costs - this could be useful for transfers across the continent.
- The -z option causes compression to be used to reduce the size of
data portions of the transfer. However, we recommend not using compression
except over very slow links. Try some experiments: a transfer using the
-z option was over an order of magnitude slower between HPCCC machines.
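The options above can be combined as in the following sketch. The helper names, the 1000 KBytes per second limit and the exclude patterns are hypothetical values chosen for illustration, not site recommendations.

```shell
# preview_sync SRC DEST
# -n (dry run) lists what would be transferred without copying anything.
preview_sync() {
    rsync -avn "$1" "$2"
}

# mirror_dir SRC DEST
# Make DEST a mirror of SRC and print transfer statistics. Files that no
# longer exist in SRC are deleted from DEST once the transfer has finished.
mirror_dir() {
    rsync -av --stats --delete --delete-after "$1" "$2"
}

# background_sync SRC DEST
# Low-priority transfer limited to about 1000 KBytes per second, skipping
# editor backup files and object files (illustrative patterns).
background_sync() {
    rsync -av --bwlimit=1000 --exclude='*~' --exclude='*.o' "$1" "$2"
}

# Example (hypothetical host and paths): preview first, then mirror.
#   preview_sync src/mydir remote-host:/data/dest/
#   mirror_dir   src/mydir remote-host:/data/dest/
```

Because --delete removes files on the destination, running the dry-run form first is a cheap way to check that the source and destination arguments are the ones intended.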
Unattended rsync
We have a solution for unattended rsync over ssh using key-based
authentication with forced commands. This can be used to restrict clients to
retrieving updates of particular data from a user account, with no
possibility of shell access. Contact us if you need this facility.
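The details of that solution are not given here. As a generic illustration of the forced-command idea only, the sketch below assumes OpenSSH, where a key whose authorized_keys entry carries a command="..." option always runs that command and the command requested by the client is made available in SSH_ORIGINAL_COMMAND. The function name allow_rsync_pull is hypothetical, the check is not an exhaustive filter, and it does not restrict which paths may be read.

```shell
# allow_rsync_pull CMD
# Hypothetical check for a forced-command wrapper: succeed only when the
# requested command starts an rsync server in sender mode, which is what
# the remote end is asked to run when a client retrieves files.
allow_rsync_pull() {
    case "$1" in
        *";"*|*"&"*|*"|"*|*'`'*|*'$'*|*"<"*|*">"*|*"("*|*")"*)
            return 1 ;;     # reject common shell metacharacters
        "rsync --server --sender "*)
            return 0 ;;     # client is pulling data
        *)
            return 1 ;;     # anything else, including a login shell
    esac
}

# Inside the wrapper script named by command="..." one might then write:
#   allow_rsync_pull "$SSH_ORIGINAL_COMMAND" || { echo "rejected" >&2; exit 1; }
#   set -f                        # disable filename globbing
#   exec $SSH_ORIGINAL_COMMAND    # unquoted on purpose: split into words
```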