tar + rsync + untar. Any speed benefit over just rsync?
When you send the same set of files, rsync is better suited because it will only send the differences. tar will always send everything, and this is a waste of resources when a lot of the data is already there. tar + rsync + untar loses this advantage, as well as the advantage of keeping the folders in sync with rsync --delete.
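For that in-sync use case, a minimal sketch (paths and host are placeholders) would be
rsync -a --delete /src/dir/ user@server:/dest/dir/
where -a preserves permissions and timestamps and --delete removes files on the destination that no longer exist in the source.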
If you are copying the files for the first time, packing first, then sending, then unpacking (AFAIK rsync doesn't take piped input) is cumbersome and always worse than just rsyncing, because rsync won't have to do any more work than tar anyway.
Tip: rsync version 3 or later does incremental recursion, meaning it starts copying almost immediately instead of counting all the files first.
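To check which behaviour you get, and to fall back to the old whole-tree scan if you ever need it (flag name per the rsync 3 manual):
rsync --version
rsync -r --no-inc-recursive src/ server:dest/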
Tip 2: If you use rsync over ssh, you may also use either tar+ssh
tar -C /src/dir -jcf - ./ | ssh user@server 'tar -C /dest/dir -jxf -'
or just scp
scp -Cr srcdir user@server:destdir
General rule: keep it simple.
UPDATE:
I've created 59M of demo data
mkdir tmp; cd tmp
for i in {1..5000}; do dd if=/dev/urandom of=file$i count=1 bs=10k; done
and tested the file transfer to a remote server (not on the same LAN) several times, using both methods
time rsync -r tmp server:tmp2
real 0m11.520s
user 0m0.940s
sys 0m0.472s
time (tar cf demo.tar tmp; rsync demo.tar server: ; ssh server 'tar xf demo.tar; rm demo.tar'; rm demo.tar)
real 0m15.026s
user 0m0.944s
sys 0m0.700s
while keeping separate logs of the ssh traffic packets sent
wc -l rsync.log rsync+tar.log
36730 rsync.log
37962 rsync+tar.log
74692 total
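One way to produce such per-run packet logs (not necessarily how it was done here) is to let tcpdump count the ssh packets during each transfer, e.g.
tcpdump -i eth0 'port 22' > rsync.log
with the interface name adjusted to your setup; tcpdump prints one line per packet by default, so wc -l gives the packet count.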
In this case, I can't see any advantage in reduced network traffic from using rsync+tar, which is expected given the default MTU of 1500 and files of 10k in size. rsync+tar generated more traffic, was slower by 2-3 seconds, and left two garbage files that had to be cleaned up.
I did the same tests on two machines on the same LAN, and there rsync+tar achieved much better times and much, much less network traffic, presumably because of jumbo frames.
Maybe rsync+tar would be better than plain rsync on a much larger data set, but frankly I don't think it's worth the trouble: you need double the space on each side for packing and unpacking, and there are a couple of other options, as I've already mentioned above.
rsync also does compression. Use the -z flag. If running over ssh, you can also use ssh's compression mode. My feeling is that repeated levels of compression are not useful; they just burn cycles without significant result. I'd recommend experimenting with rsync compression. It seems quite effective. And I'd suggest skipping tar or any other pre/post compression.
I usually use rsync as rsync -abvz --partial ...
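A sketch of letting rsync do the compression while turning ssh's own compression off, so cycles aren't spent compressing twice (paths and host are placeholders):
rsync -avz -e 'ssh -o Compression=no' /src/dir/ user@server:/dest/dir/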
I had to back up my home directory to a NAS today and ran into this discussion, so I thought I'd add my results. Long story short, tar'ing over the network to the target file system is way faster in my environment than rsyncing to the same destination.
Environment: source machine is an i7 desktop with an SSD; destination machine is a Synology NAS DS413j on a gigabit LAN connection to the source machine.
The exact spec of the kit involved will impact performance, naturally, and I don't know the details of my exact setup with regard to quality of network hardware at each end.
The source files are my ~/.cache folder, which contains 1.2 GB of mostly very small files.
1a/ tar files from source machine over the network to a .tar file on remote machine
$ tar cf /mnt/backup/cache.tar ~/.cache
1b/ untar that tar file on the remote machine itself
$ ssh admin@nas_box
[admin@nas_box] $ tar xf cache.tar
2/ rsync files from source machine over the network to remote machine
$ mkdir /mnt/backup/cachetest
$ rsync -ah .cache /mnt/backup/cachetest
I kept 1a and 1b as completely separate steps just to illustrate the task. For practical applications I'd recommend what Gilles posted above, involving piping tar output via ssh to an untarring process on the receiver (sketched below).
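A sketch of that pipelined form for this case (the backup path on the NAS is a placeholder):
$ tar cf - -C ~ .cache | ssh admin@nas_box 'tar xf - -C /path/to/backup'
This streams the archive straight into an untar on the NAS, so no intermediate .tar file is left on either side.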
Timings:
1a - 33 seconds
1b - 1 minute 48 seconds
2 - 22 minutes
It's very clear that rsync performed amazingly poorly compared to the tar operation, which can presumably be attributed to the network performance mentioned above.
I'd recommend that anyone who wants to back up large quantities of mostly small files, such as a home directory backup, use the tar approach. rsync seems a very poor choice here. I'll come back to this post if it turns out I've been inaccurate in any part of my procedure.
Nick