rolling checksums in the rsync algorithm
(I'm working on a remote-synchronization problem myself right now, and @tonfa gives a good answer but I found their example a tad confusing. So I offer my own example, using Lorem Ipsum text as an example of the data being synchronized. I note that human-readable-text files can have their sentance/paragraph structure extracted for domain-specific synchronization optimizations, we aren't doing that: as far as rsync
and the rolling-hash algorithm is concerned, every file is an opaque binary blob without any structure)
Supposing a fileserver (Alice
) has this file 641 byte: lipsum.txt
:
(Assume ASCII or UTF-8 encoding. Line-wraps at column 80 have been added for readability, the line-breaks would not be present in the actual file)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla nisl enim,
consectetur quis quam consequat, pharetra tempus enim. Fusce iaculis libero
vitae ipsum accumsan efficitur. Fusce iaculis est et justo sollicitudin, sed
porttitor augue sagittis. Mauris aliquam nisl nibh, sed tempus magna venenatis
ac. Curabitur molestie nisl elit, suscipit egestas ex aliquam ac. Donec
dignissim, mauris nec malesuada pellentesque, ipsum sem porttitor est, quis
laoreet urna orci a leo. Cras tincidunt porttitor sapien, quis cursus metus
pulvinar id. Pellentesque nec mollis eros. Fusce sagittis vehicula ligula, nec
ullamcorper sapien sagittis non.
And supposing a remote client (Bob
) has a slightly modified copy of lipsum.txt
:
Proin finibus ullamcorper ante sit amet egestas. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Nulla nisl enim, consectetur quis quam consequat,
pharetra tempus enim. Fusce iaculis libero vitae ipsum accumsan efficitur.
Fusce iaculis est et justo sollicitudin, sed porttitor augue sagittis. Mauris
aliquam nisl nibh, sed tempus magna venenatis ac. Nulla ex metus, malesuada
eget ultricies vel, fermentum quis nisl. Etiam ac venenatis tellus. Curabitur
molestie nisl elit, suscipit egestas ex aliquam ac. Donec dignissim, mauris nec
malesuada pellentesque, ipsum sem porttitor est, quis laoreet urna orci a leo.
Cras tincidunt porttitor sapien, quis cursus metus pulvinar id. Pellentesque
nec mollis eros.
Here's a visual diff
for reference:
With rsync, it doesn't matter which host (computer) is running the server or client software, what matters is who the sender is and who the receiver is (as an rsync server can send and receive, and an rsync client can also send and receive). For this demonstration, the remote client (Bob
) is sending its modified copy of lipsum.txt
to the rsync server (Alice
).
Part 1: Receiver (Alice) sends existing file's fixed block hashes to sender (Bob):
Alice breaks-up lipsum.txt
into equally sized blocks (for this demonstration lets use 32-byte sized blocks) and computes the weak hashes (checksums) of those blocks using the weak rolling hash of that block (note that the rolling-hash algorithm is not computed across block boundaries yet):
block offset 32-byte block bytes (as ASCII) block weak-hash (as UInt32)
--------------------------------------------------------------------------
0 |Lorem ipsum dolor sit amet, cons| 3262843843
16 |ectetur adipiscing elit. Nulla n| 1213926365
32 |isl enim, consectetur quis quam | 3307146210
48 |consequat, pharetra tempus enim.| 1702104107
64 |Fusce iaculis libero vitae ipsu | 113380311
80 |m accumsan efficitur. Fusce iacu| 3788835775
96 |lis est et justo sollicitudin, s| 2668366836
112 |ed porttitor augue sagittis. Mau| 212601872
128 |ris aliquam nisl nibh, sed tempu| 3149335490
144 |s magna venenatis ac. Curabitur | 1050020743
160 |molestie nisl elit, suscipit ege| 726010903
176 |stas ex aliquam ac. Donec dignis| 169282427
192 |sim, mauris nec malesuada pellen| 1806109673
208 |tesque, ipsum sem porttitor est,| 1415646245
224 | quis laoreet urna orci a leo. C| 1034619651
240 |ras tincidunt porttitor sapien, | 4085386299
256 |quis cursus metus pulvinar id. P| 4195748849
272 |ellentesque nec mollis eros. Fus| 3458665474
288 |ce sagittis vehicula ligula, nec| 3870297057
304 | ullamcorper sapien sagittis non| 2035682393
320 |.| 3014702
Alice
then sends these block offsets and weak-hashes to Bob
.
In this example, the block-size of 32 bytes with 32-bit sized hashes means Alice sends 84 bytes to Bob to represent a 641 byte-sized file - or about 13% of the size of the file that Bob wants to send. (Note that rsync
uses block sizes sized in the hundreds of bytes, so the ratio is even smaller in real-life, making it very practical for slow connection speeds).
Part 2: Sender (Bob) computes rolling hashes for a moving-window through their file:
First, Bob receives the hashes from Alice and loads them into a hashtable for O(1)
lookup (yes, this means there's a "hash-of-a-hash").
Then the real-magic happens: Bob computes the rolling-hash of the 32-byte-sized window through their file. Bob's file is 714 bytes long, which means 682 hashes will be computed. rsync
's weak-hash algorithm is specifically designed to work on a rolling-basis (such that you can push-and-pop individual byte values) instead of reiterating the window on every byte (i.e. rsync does it in O(n)
time instead of O(n * m)
time, where n
is the size of the file and m
is the window size).
This is what Bob generates (showing the first 100 blocks for brevity):
Offset| 32-byte block bytes (as ASCII) | Weak-hash
-------------------------------------------------
0 |Proin finibus ullamcorper ante s| 3787000889
1 |roin finibus ullamcorper ante si| 2983660626
2 |oin finibus ullamcorper ante sit| 1966279764
3 |in finibus ullamcorper ante sit | 663751685
4 |n finibus ullamcorper ante sit a| 3578072061
5 | finibus ullamcorper ante sit am| 757926908
6 |finibus ullamcorper ante sit ame| 1417546817
7 |inibus ullamcorper ante sit amet| 382274639
8 |nibus ullamcorper ante sit amet | 2866482182
9 |ibus ullamcorper ante sit amet e| 2888829949
10 |bus ullamcorper ante sit amet eg| 1780157435
11 |us ullamcorper ante sit amet ege| 763694078
12 |s ullamcorper ante sit amet eges| 1458113532
13 | ullamcorper ante sit amet egest| 2940865533
14 |ullamcorper ante sit amet egesta| 206310462
15 |llamcorper ante sit amet egestas| 539757628
16 |lamcorper ante sit amet egestas.| 1384516606
17 |amcorper ante sit amet egestas. | 1941179314
18 |mcorper ante sit amet egestas. L| 4284812189
19 |corper ante sit amet egestas. Lo| 3270773663
20 |orper ante sit amet egestas. Lor| 4045278126
21 |rper ante sit amet egestas. Lore| 2546076580
22 |per ante sit amet egestas. Lorem| 4200926111
23 |er ante sit amet egestas. Lorem | 4163898191
24 |r ante sit amet egestas. Lorem i| 2578123603
25 | ante sit amet egestas. Lorem ip| 2996308817
26 |ante sit amet egestas. Lorem ips| 4092857252
27 |nte sit amet egestas. Lorem ipsu| 1856506808
28 |te sit amet egestas. Lorem ipsum| 2946436023
29 |e sit amet egestas. Lorem ipsum | 1533676387
30 | sit amet egestas. Lorem ipsum d| 483134306
31 |sit amet egestas. Lorem ipsum do| 476908465
32 |it amet egestas. Lorem ipsum dol| 320277418
33 |t amet egestas. Lorem ipsum dolo| 3243707312
34 | amet egestas. Lorem ipsum dolor| 1496386478
35 |amet egestas. Lorem ipsum dolor | 3433761710
36 |met egestas. Lorem ipsum dolor s| 1480264640
37 |et egestas. Lorem ipsum dolor si| 1083640764
38 |t egestas. Lorem ipsum dolor sit| 232393675
39 | egestas. Lorem ipsum dolor sit | 3691055991
40 |egestas. Lorem ipsum dolor sit a| 4223208376
41 |gestas. Lorem ipsum dolor sit am| 1521290176
42 |estas. Lorem ipsum dolor sit ame| 4025158590
43 |stas. Lorem ipsum dolor sit amet| 1003228109
44 |tas. Lorem ipsum dolor sit amet,| 1503267718
45 |as. Lorem ipsum dolor sit amet, | 3582200626
46 |s. Lorem ipsum dolor sit amet, c| 3172993844
47 |. Lorem ipsum dolor sit amet, co| 3615230768
48 | Lorem ipsum dolor sit amet, con| 3300133744
49 |Lorem ipsum dolor sit amet, cons| 3262843843
50 |orem ipsum dolor sit amet, conse| 4056353756
51 |rem ipsum dolor sit amet, consec| 2500266960
52 |em ipsum dolor sit amet, consect| 2453212114
53 |m ipsum dolor sit amet, consecte| 3549891538
54 | ipsum dolor sit amet, consectet| 3963358169
55 |ipsum dolor sit amet, consectetu| 379456558
56 |psum dolor sit amet, consectetur| 3206351927
57 |sum dolor sit amet, consectetur | 2648968167
58 |um dolor sit amet, consectetur a| 3918072789
59 |m dolor sit amet, consectetur ad| 3511225284
60 | dolor sit amet, consectetur adi| 2267089856
61 |dolor sit amet, consectetur adip| 2145979408
62 |olor sit amet, consectetur adipi| 2017070101
63 |lor sit amet, consectetur adipis| 1143409689
64 |or sit amet, consectetur adipisc| 3407809552
65 |r sit amet, consectetur adipisci| 3452242954
66 | sit amet, consectetur adipiscin| 1381174278
67 |sit amet, consectetur adipiscing| 3012037709
68 |it amet, consectetur adipiscing | 863898618
69 |t amet, consectetur adipiscing e| 1819020278
70 | amet, consectetur adipiscing el| 1786383342
71 |amet, consectetur adipiscing eli| 3535604791
72 |met, consectetur adipiscing elit| 2849705034
73 |et, consectetur adipiscing elit.| 3906866187
74 |t, consectetur adipiscing elit. | 3028814790
75 |, consectetur adipiscing elit. N| 1042025376
76 | consectetur adipiscing elit. Nu| 3860663273
77 |consectetur adipiscing elit. Nul| 2577534005
78 |onsectetur adipiscing elit. Null| 3386641470
79 |nsectetur adipiscing elit. Nulla| 1332743216
80 |sectetur adipiscing elit. Nulla | 2328497122
81 |ectetur adipiscing elit. Nulla n| 1213926365
82 |ctetur adipiscing elit. Nulla ni| 2679376865
83 |tetur adipiscing elit. Nulla nis| 1645546481
84 |etur adipiscing elit. Nulla nisl| 1719798761
84 |etur adipiscing elit. Nulla nisl| 1719798761
85 |tur adipiscing elit. Nulla nisl | 3454143396
86 |ur adipiscing elit. Nulla nisl e| 3142519701
87 |r adipiscing elit. Nulla nisl en| 3224963982
88 | adipiscing elit. Nulla nisl eni| 579210117
89 |adipiscing elit. Nulla nisl enim| 907021266
90 |dipiscing elit. Nulla nisl enim,| 1625361309
91 |ipiscing elit. Nulla nisl enim, | 804916057
92 |piscing elit. Nulla nisl enim, c| 3002403667
93 |iscing elit. Nulla nisl enim, co| 1436224338
94 |scing elit. Nulla nisl enim, con| 1020398423
95 |cing elit. Nulla nisl enim, cons| 2978286423
96 |ing elit. Nulla nisl enim, conse| 62524249
97 |ng elit. Nulla nisl enim, consec| 2620984147
98 |g elit. Nulla nisl enim, consect| 1986661209
99 | elit. Nulla nisl enim, consecte| 460917591
You might notice something special about Bob's block 49: it matches Alice's block 0 exactly (both have the ASCII bytes for "Lorem ipsum dolor sit amet, cons
" but they also share the hash value of 3262843843
!).
Bob then continues, byte-by-byte, noting which blocks also match those in Alice's copy (this is possible in O(1)
time for each block, or O(n)
in total) because Bob loaded Alice's block hashes into a hashtable.
There also exists an optimization: if Bob sees one of his block matches, he can jump-forward to the next integral block and see if that matches and so skip the intervening blocks if it's a match (so we know the 32-byte block at offset 49 matches, so we can skip to block 81 - and block 81 also matches (with hash 1213926365
, so jump to block 113, and so on...) ).
Keep doing this for all windows in Bob and eventually you'll get this output:
Offset Bytes (as ASCII text)
----------------------------------------------
49 |Lorem ipsum dolor sit amet, cons|
81 |ectetur adipiscing elit. Nulla n|
113 |isl enim, consectetur quis quam |
145 |consequat, pharetra tempus enim.|
177 | Fusce iaculis libero vitae ipsu|
209 |m accumsan efficitur. Fusce iacu|
241 |lis est et justo sollicitudin, s|
273 |ed porttitor augue sagittis. Mau|
305 |ris aliquam nisl nibh, sed tempu|
463 |molestie nisl elit, suscipit ege|
495 |stas ex aliquam ac. Donec dignis|
527 |sim, mauris nec malesuada pellen|
559 |tesque, ipsum sem porttitor est,|
591 | quis laoreet urna orci a leo. C|
623 |ras tincidunt porttitor sapien, |
655 |quis cursus metus pulvinar id. P|
So that's 16 blocks (or windows) of 32 bytes each for 512 bytes total that Bob knows he doesn't need to send to Alice. Bob then only needs to send the non-overlapping windows that aren't covered by the blocks above:
Offset Bytes (as ASCII text)
----------------------------------------------
0 |Proin finibus ullamcorper ante s|
33 |t amet egestas. Lorem ipsum dolo|
330 |d tempus magna venenatis ac. Nul|
363 |a ex metus, malesuada eget ultri|
396 |ies vel, fermentum quis nisl. Et|
429 |am ac venenatis tellus. Curabitu|
Not bad, eh?
The receiver computes and sends rolling checksums only for non overlapping blocks. The sender on the contrary computes it for every possible block (but keep the result local). Then for the sender, it's just a matter of checking if one of the non overlapping block (sent by the receiver) match with any (overlapping) local block.
Your example is too simple to see anything interesting, the two last blocks simply won't match and will be sent for merging.
With a more interesting example (uppercase is a block):
sender:
A B Cabc D
receiver:
A B C D
The receiver will send the MD5 and rolling hash for A, B, C and D.
The sender will compute the rolling hash for every (overlapping) block, it will match for A, for B, for C and for D. Since abc
isn't match it will send it with the information where to merge it.