Can you find out how big the changes are by comparing two hashes?
No, at least not with a good hash function.
You can test this yourself by creating a hash over a specific data set, and then a second hash over a slightly modified version of that data set. You will see that every bit of the resulting hash has about a 50% chance of flipping.
I'll demonstrate this by creating the SHA-256 hash of the string "MechMK1":
$ echo -n "MechMK1" | sha256sum
2c31be311a0deeab37245d9a98219521fb36edd8bcd305e9de8b31da76e1ddd9
When converting this into binary, you get the following result:
00101100 00110001 10111110 00110001 00011010 00001101 11101110 10101011
00110111 00100100 01011101 10011010 10011000 00100001 10010101 00100001
11111011 00110110 11101101 11011000 10111100 11010011 00000101 11101001
11011110 10001011 00110001 11011010 01110110 11100001 11011101 11011001
Now I calculate the SHA-256 hash of the string "MechMK3", which changes one bit of the input:
$ echo -n "MechMK3" | sha256sum
3797dec3453ee07e60f8cf343edb7643cecffcf0af847a73ff2a1912535433cd
When converted to binary again, you get the following result:
00110111 10010111 11011110 11000011 01000101 00111110 11100000 01111110
01100000 11111000 11001111 00110100 00111110 11011011 01110110 01000011
11001110 11001111 11111100 11110000 10101111 10000100 01111010 01110011
11111111 00101010 00011001 00010010 01010011 01010100 00110011 11001101
I compared both results and counted how many bit positions differ between the two hashes: exactly 128, or 50% of all bits. If you would like to play around with this yourself and see what kind of results you get, I created a simple C program that does exactly that.
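For readers who would rather not compile anything, here is a minimal Python sketch of the same comparison. It is not the C program mentioned above; the function name hamming_distance is made up for this illustration, and it relies only on the standard hashlib module.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 exactly where the two bits differ; count those ones.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


if __name__ == "__main__":
    first = hashlib.sha256(b"MechMK1").digest()
    second = hashlib.sha256(b"MechMK3").digest()

    # The two inputs differ in one bit ('1' is 0x31, '3' is 0x33).
    print("input bits changed: ", hamming_distance(b"MechMK1", b"MechMK3"))

    changed = hamming_distance(first, second)
    total = len(first) * 8
    print(f"output bits changed: {changed} of {total} ({changed / total:.1%})")
```

Feeding the two hex digests quoted above through bytes.fromhex() and into hamming_distance gives 128, which matches the count in the answer.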
TL;DR: For cryptographic hash functions, the hashes of any two distinct messages should appear statistically independent.$
I realize that the hash is a one-way function and that the changes in the hash are supposed to tell us that the original data has changed (that the entire hash changes on even the slightest change to the data).
The avalanche criterion, in addition to one-wayness, is what we want from a good cryptographic hash function:
- Single-bit change: a single bit change in the input changes each output bit with 50% probability.
- Multiple-bit changes: this is a bit trickier. If we assume the hash function models a pseudorandom function, as in the random oracle model, then each output bit still changes with 50% probability on average, no matter how many input bits are changed.
One can see this by considering one output bit and flipping a coin: on heads, flip the bit; on tails, leave it alone. The bit ends up flipped with 50% probability. Now toss another coin and do the same. The bit ends up flipped only if exactly one of the two coins came up heads, which again has probability 1/2·1/2 + 1/2·1/2 = 1/2 (simple math).
Of course, we cannot achieve the random oracle model. Therefore, the output bits are not truly independent of each other. They appear to be, as long as nobody can find a distinguisher; finding one would constitute a cryptanalytic attack against the hash function. If one is found for a widely used cryptographic hash function, you will see it in the news.
Showing that a hash function satisfies the avalanche criterion is a statistical process: you need to test many random input values. Not every input and bit flip changes exactly half of the output bits, and exactly half every time is not the expected behavior either. You also need to show that the output bits change randomly.
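As a rough sketch of what such a statistical test might look like, the Python below flips one randomly chosen bit in many random messages and averages the fraction of SHA-256 output bits that change. The function names, the default message length and the fixed seed are assumptions made for this illustration. It checks only the average, not whether each individual output bit flips with 50% probability or whether the flips look independent.

```python
import hashlib
import random


def flipped_fraction(message: bytes, bit_index: int) -> float:
    """Flip one input bit; return the fraction of SHA-256 output bits that change."""
    mutated = bytearray(message)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    before = hashlib.sha256(message).digest()
    after = hashlib.sha256(bytes(mutated)).digest()
    differing = sum(bin(x ^ y).count("1") for x, y in zip(before, after))
    return differing / (len(before) * 8)


def average_avalanche(trials: int = 1000, length: int = 32, seed: int = 0) -> float:
    """Average flipped_fraction over random messages and random bit positions."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        message = bytes(rng.getrandbits(8) for _ in range(length))
        total += flipped_fraction(message, rng.randrange(length * 8))
    return total / trials


if __name__ == "__main__":
    print(f"average fraction of output bits flipped: {average_avalanche():.4f}")
```

Individual trials scatter around one half rather than hitting it exactly, which is the point made above; the average over many trials should sit close to 0.5.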
If the criterion is not satisfied, the hash function can fail to provide preimage resistance, 2nd-preimage resistance, and collision resistance *.
- preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., to find any preimage x' such that h(x') = y when given any y for which a corresponding input is not known.
- 2nd-preimage resistance, weak-collision: it is computationally infeasible to find any second input which has the same output as any specified input, i.e., given x, to find a 2nd-preimage x' != x such that h(x) = h(x').
- collision resistance, strong-collision: it is computationally infeasible to find any two distinct inputs x, x' which hash to the same output, i.e., such that h(x) = h(x').
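To make these definitions a little more concrete, here is a toy Python sketch that is not part of the quoted definitions. It truncates SHA-256 to 16 bits, an assumption chosen purely so that brute force finishes in moments, and then searches for a collision and for a second preimage. The names tiny_hash, find_collision and find_second_preimage are made up for this illustration.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes, bits: int = 16) -> int:
    """Toy hash: keep only the top `bits` bits of SHA-256 (deliberately weak)."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)


def find_collision(bits: int = 16):
    """Return (x, x2, tries) with x != x2 and tiny_hash(x) == tiny_hash(x2)."""
    seen = {}  # maps each hash value seen so far to the input that produced it
    for i in count():
        x = str(i).encode()
        h = tiny_hash(x, bits)
        if h in seen:
            return seen[h], x, i + 1
        seen[h] = x


def find_second_preimage(x: bytes, bits: int = 16):
    """Return (x2, tries) with x2 != x and tiny_hash(x2) == tiny_hash(x)."""
    target = tiny_hash(x, bits)
    for i in count():
        candidate = str(i).encode()
        if candidate != x and tiny_hash(candidate, bits) == target:
            return candidate, i + 1


if __name__ == "__main__":
    a, b, tries = find_collision()
    print(f"collision: {a!r} and {b!r} after {tries} tries")
    x2, tries = find_second_preimage(b"MechMK1")
    print(f"second preimage of b'MechMK1': {x2!r} after {tries} tries")
```

For an n-bit output, a collision is expected after roughly 2^(n/2) tries and a second preimage after roughly 2^n tries, which is why both searches are trivial at 16 bits and hopeless against the full 256 bits.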
Failure of any of these can enable attacks, and a successful attack can be devastating. For example, suppose someone finds a second message that has the same hash value as your original message (or as the hash of a Linux CD ISO):
This is a signed message representing the payment is $1.00, have a nice day
I will pay you $1,000,000.00 have a nice day
Fortunately, even SHA-1 and MD5 still resist this attack. Therefore, a changed hash value tells you that the data has changed, and the probability that a different text has the same hash value as yours is negligible.
But is there a way to find out to what degree has the original data changed when two hashes are different?
Hopefully not. Even a single bias that leaks information about the changes could be exploited by clever attackers.
* These are formal definitions, taken from Rogaway and Shrimpton's seminal paper Cryptographic Hash-Function Basics:...
$ Thanks to FutureSecurity for the simplification
As the other answers have already noted, the answer is "no" for cryptographic hash functions. These are generally designed to behave as much like a perfectly random function as possible, and any detectable similarity in the hash outputs generated for similar inputs would also allow the hash to be distinguished from a random function.*
However, there are other kinds of hash functions, such as locality-sensitive hashes, for which the answer can at least be "yes, sometimes".
In particular, locality-sensitive hashes typically feature properties such as "any two inputs differing by at most δ according to some similarity metric will, with probability p > 0, have hashes that differ by at most ε(δ) by some other (possibly the same) similarity metric." Typically, the distance metric for the hashes may be something like Hamming distance, while the corresponding metric for the inputs might be e.g. edit distance. The choice of a suitable locality-sensitive hash function mainly depends on which particular distance metric you're interested in.
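As one illustration of the idea, here is a minimal Python sketch of SimHash, a well-known locality-sensitive scheme in which the Hamming distance between fingerprints tracks the cosine similarity of the inputs' feature vectors. Using whitespace-separated words as features and SHA-256 as the per-word hash are assumptions made for this sketch, so the input metric here is word overlap rather than edit distance. The names simhash and hamming are likewise made up.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint using lowercase words as features."""
    counts = [0] * bits
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode()).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each word votes +1 or -1 on every fingerprint bit.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # the majority vote decides the bit; ties become 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = ("the quick brown fox jumps over the lazy dog while the "
                "farmer sleeps under an old oak tree near the river")
    near = original.replace("lazy", "sleepy")
    far = ("cryptographic hash functions are designed to behave like random "
           "functions so similar inputs give unrelated outputs every single time")
    print("one word changed:", hamming(simhash(original), simhash(near)))
    print("unrelated text:  ", hamming(simhash(original), simhash(far)))
```

The per-word hash can still be a cryptographic one. The locality comes from the majority vote over shared features, so two texts with mostly the same words agree on most fingerprint bits.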
*) Technically, the classical definition of a secure cryptographic hash only requires collision resistance and first and second preimage resistance. I don't see any obvious way to prove that a hash function could not have these properties while also being locality-sensitive in some way, although they do impose some rather significant constraints. In particular, the number of hash outputs within a distance of ε(δ) from any given hash output H(x) would have to grow faster than the number of other inputs within distance δ of the corresponding input x for any reasonable values of δ, as otherwise simply testing a bunch of similar inputs would very likely yield a collision. In any case, I'm not aware of any locality-sensitive hash functions that would meet even this weaker definition of cryptographic security, and I have no idea what such a hash might look like if it existed.