Parallelise rsync using GNU Parallel
The following steps did the job for me:
- Run rsync with --dry-run first in order to get the list of files that would be affected:
$ rsync -avzm --stats --safe-links --ignore-existing --dry-run \
--human-readable /data/projects REMOTE-HOST:/data/ > /tmp/transfer.log
- I fed the output of cat /tmp/transfer.log to parallel in order to run 5 rsyncs in parallel, as follows:
$ cat /tmp/transfer.log | \
parallel --will-cite -j 5 rsync -avzm --relative \
--stats --safe-links --ignore-existing \
--human-readable {} REMOTE-HOST:/data/ > result.log
Here, the --relative option ensured that the directory structure for the affected files remains the same at the source and destination (inside the /data/ directory), so the command must be run in the source folder (in this example, /data/projects).
I personally use this simple one:
\ls -1 | parallel rsync -a {} /destination/directory/
This is only useful when you have more than a few non-nearly-empty directories; otherwise almost every rsync will terminate quickly and the last one will end up doing all the work alone.
Note the backslash before ls, which skips any alias and thus ensures the output is as desired.
I would strongly discourage anybody from using the accepted answer; a better solution is to crawl the top-level directory and launch a proportional number of rsync operations.
I have a large zfs volume and my source was a cifs mount. Both are linked with 10G, and in some benchmarks the link can be saturated. Performance was evaluated using zpool iostat 1.
The source drive was mounted like:
mount -t cifs -o username=,password= //static_ip/70tb /mnt/Datahoarder_Mount/ -o vers=3.0
Using a single rsync
process:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/ /StoragePod
the io meter reads:
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.61K 0 130M
StoragePod 30.0T 144T 0 1.62K 0 130M
In synthetic benchmarks (CrystalDiskMark), sequential-write performance approaches 900 MB/s, which means the link is saturated. 130 MB/s is not very good, and it is the difference between waiting a weekend and waiting two weeks.
So, I built the file list and tried to run the sync again (I have a 64 core machine):
cat /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount.log | \
parallel --will-cite -j 16 rsync -avzm --relative \
--stats --safe-links --size-only --human-readable \
{} /StoragePod/ > /home/misha/Desktop/rsync_logs_syncs/Datahoarder_Mount_result.log
and it had the same performance!
StoragePod 29.9T 144T 0 1.63K 0 130M
StoragePod 29.9T 144T 0 1.62K 0 130M
StoragePod 29.9T 144T 0 1.56K 0 129M
As an alternative, I simply ran rsync on the root folders:
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/Marcello_zinc_bone /StoragePod/Marcello_zinc_bone
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/fibroblast_growth /StoragePod/fibroblast_growth
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/QDIC /StoragePod/QDIC
rsync -h -v -r -P -t /mnt/Datahoarder_Mount/Mikhail/sexy_dps_cell /StoragePod/sexy_dps_cell
This actually boosted performance:
StoragePod 30.1T 144T 13 3.66K 112K 343M
StoragePod 30.1T 144T 24 5.11K 184K 469M
StoragePod 30.1T 144T 25 4.30K 196K 373M
In conclusion, as @Sandip Bhattacharya brought up, write a small script to get the directories and run parallel over them. Alternatively, pass a file list to rsync. But don't create a new rsync instance for each file.