


Saturday, 11 April 2015

How To: Speed Up File Transfers in Linux using RSync with GNU Parallel

    RSync (Remote Sync) is a Linux command commonly used to back up files and directories and to synchronize them, locally or remotely, in an efficient way. One of the reasons RSync is preferred over the alternatives is its speed of operation: it copies data to the other location at a significantly faster rate. The first time RSync is executed, it transfers all the data from the source to the destination; on subsequent runs, it copies only the files and directories whose contents have changed.


    Another plus point of this utility is that it uses the SSH protocol to encrypt the data being replicated, which makes it much more secure and trustworthy. A further advantage is that it compresses the data at the source end and decompresses it at the destination, so the bandwidth used during the sync operation is considerably lower. File permissions, user/group ownership, and timestamps can also be preserved.

    In one of our previous tutorials, we had thrown light on how to use RSync Command to Backup and Synchronize Files in Linux; please go through it once before proceeding.

    To rsync a huge chunk of data (containing a considerably large number of smaller files), the best option is to run multiple instances of rsync in parallel. This is quite effective, but at the cost of a higher load average and more I/O operations and network bandwidth utilization.

    To parallelize multiple rsync commands, one might use xargs or a series of rsync commands run in the background using &. But over all of those alternatives, I prefer GNU Parallel, a utility for executing jobs in parallel. It is a single command that can replace certain loops in your code or a sequence of commands run in the background.
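For illustration, here is a minimal sketch of the three approaches on a hypothetical list of project directories (the directory names and hosts are placeholders, not from a real setup); `echo` stands in for the real transfer so you can preview what would run without touching any server:

```shell
# Hypothetical list of top-level directories to sync (placeholder names).
printf '%s\n' proj1 proj2 proj3 > /tmp/dirs.txt

# 1. Plain background jobs: simple, but nothing caps how many run at once.
#    while read -r d; do rsync -az "/data/projects/$d" 192.168.1.2:/data/ & done < /tmp/dirs.txt; wait

# 2. xargs: -P 5 caps concurrency at five simultaneous processes.
#    xargs -a /tmp/dirs.txt -I{} -P 5 rsync -az /data/projects/{} 192.168.1.2:/data/

# 3. GNU Parallel: same cap, with each job's output kept unscrambled.
#    parallel -j 5 rsync -az /data/projects/{} 192.168.1.2:/data/ :::: /tmp/dirs.txt

# Preview what xargs would execute, substituting echo for rsync:
xargs -a /tmp/dirs.txt -I{} -P 5 echo rsync -az /data/projects/{} 192.168.1.2:/data/
```

All three patterns spawn one rsync (and hence one SSH connection) per item; the difference is only in how concurrency is limited and output is handled.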

Scenario:

  • I have to transfer a large amount of data (1.2 TB, but tested on 25GB) from my local server (say, 192.168.1.1) to a remote server (say, 192.168.1.2).
  • The data to be transferred is located at /data/projects and needs to be copied into /data/ on the remote server.
  • The directory structure of the data to be copied must be preserved.
    (For Ex: The file /data/projects/proj1/sub_projA/test_file1 should be seen at /data/projects/proj1/sub_projA/test_file1 on the remote host.)

Solution:

1. Run rsync with --dry-run first, in order to get the list of files that would be affected.

rsync -avzm --stats --safe-links --ignore-existing --dry-run  --human-readable /data/projects 192.168.1.2:/data/ > /tmp/transfer.log
2. I fed the contents of /tmp/transfer.log to parallel in order to run 5 rsyncs in parallel, as follows:

cat /tmp/transfer.log | parallel --will-cite -j 5 rsync -avzm  --relative --stats --safe-links --ignore-existing --human-readable {} 192.168.1.2:/data/ > result.log
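One caveat: the --dry-run log is not a pure file list. It also contains a header line, bare directory entries, and (because of --stats) summary lines, none of which should be handed to rsync as paths. A minimal filtering sketch follows; the sample log below is fabricated for illustration, and a real --stats log has more summary lines that would need additional patterns.

```shell
# Simulated dry-run output (the real log comes from the rsync command above).
cat > /tmp/transfer.log <<'EOF'
building file list ... done
projects/
projects/proj1/test_file1
projects/proj1/sub_projA/test_file1

Number of files: 3
Total file size: 1.2K
EOF

# Keep only real file paths: drop the header, blank lines, the stats lines,
# and bare directories (rsync recreates directories for the files it sends).
grep -v -e '^building' -e '^$' -e '^Number' -e '^Total' -e '/$' /tmp/transfer.log
# → projects/proj1/test_file1
# → projects/proj1/sub_projA/test_file1
```

Feeding only genuine paths to parallel avoids spawning rsync jobs that immediately fail on non-path lines.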

Performance Improvements:

Following are the results when a small portion of the data (2.5 GB) was used to compare normal rsync with parallelized rsync:

Normal Rsync:

(timing screenshot omitted)

Rsync with GNU Parallel:

(timing screenshot omitted)

That's almost 25% (real) time saved, for a 5-core CPU (both local and remote servers being containers) and 5 parallel jobs!

If you have a more powerful server, you can increase the number of parallel jobs (keep it to at most twice the number of CPU cores) and make things much faster. Just give it a try and let me know about your experience in the comments.
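That rule of thumb can be derived directly from the core count; a small sketch (the commented-out command mirrors the article's own invocation):

```shell
# Cap parallel jobs at twice the number of CPU cores, per the rule above.
cores=$(nproc)
jobs=$(( cores * 2 ))
echo "Using up to $jobs parallel jobs on $cores cores"

# Then, for example (hosts and paths as in the article's scenario):
# cat /tmp/transfer.log | parallel --will-cite -j "$jobs" rsync -avzm --relative \
#     --stats --safe-links --ignore-existing --human-readable {} 192.168.1.2:/data/
```

Going much beyond this tends to trade throughput for context-switching and I/O contention rather than gaining anything.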

8 comments:

  1. Hi,
    Interesting, but do you have a performance comparison with and without parallel?

  2. Added performance comparison in the article. Please check.

  3. Hi,
    is it possible to skip the dry run and parallel the rsync?

  4. The dry run part is time consuming (I have ~ 13 TB of data to transfer). Is it possible to skip it?

  5. Does this really always speed things up? From my understanding, for every file to be transferred another dedicated instance of rsync is started, initiating another dedicated SSH connection to the remote end. At the very least this creates massive additional context switching and CPU overhead on both ends, and probably additional network traffic. I would guess it may speed up the transfer if you have lots of often-changing large files, but it may even be slower if you sync a larger directory tree with a large number of changed small files.

  6. I'm running into a problem with the output from the dry run: it looks like it produces relative paths, and I think I might need absolute ones.

    My dry run command:
    rsync -avzm --stats --safe-links --ignore-existing --dry-run --human-readable /some/dir/a/b/log user@remote:/remote/dir/ > result.log

    The result.log:

    building file list ... done
    log/
    log/file1.txt
    log/file2.txt
    log/file3.txt
    log/file4.txt


    Using this results in file not found errors. Shouldn't the file list look like this:
    building file list ... done
    /some/dir/a/b/log/
    /some/dir/a/b/log/file1.txt
    /some/dir/a/b/log/file2.txt
    /some/dir/a/b/log/file3.txt
    /some/dir/a/b/log/file4.txt


    Thank you

    Replies
    1. Please check the '--relative' option in the rsync command.
