Does an identical cryptographic hash or checksum for two files mean they are identical?
When the hashes are identical, does this mean that the file contents are 1:1 the same?
All files are a collection of bytes (values 0-255). If two files MD5 hashes match, both those collections of bytes are extremely likely the exact same (same order, same values).
There's a very small chance that two files can generate the same MD5, which is a 128 bit hash. The probability is:
Probability of just two hashes accidentally colliding is 1/2128 which is 1 in 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456. (from an answer on StackOverflow.)
Hashes are meant to work in "one direction only" - i.e. you take a collection of bytes and get a hash, but you can't take a hash and get back a collection of bytes.
Cryptography depends on this (it's one way two things can be compared without knowing what those things are.)
Around the year 2005, methods were discovered to take an MD5 hash and create data that matches that hash create two documents that had the same MD5 hash (collision attack). See @user2357112's comment below. This means an attacker can create two executables, for example, that have the same MD5, and if you are depending on MD5 to determine which to trust, you'll be fooled.
Thus MD5 should not be used for cryptography or security. It's bad to publish an MD5 on a download site to ensure download integrity, for example. Depending on an MD5 hash you did not generate yourself to verify file or data contents is what you want to avoid.
If you generate your own, you know you're not being malicious to yourself (hopefully). So for your use, it's OK, but if you want someone else to be able to reproduce it, and you want to publicly publish the MD5 hash, a better hash should be used.
Note that it's possible for two Excel files to contain the same values in the same rows and columns, but for the bytestream of the file to be completely different due to different formatting, styles, settings, etc.
If you are wanting to compare the data in the file, export it to CSV with the same rows and columns first, to strip out all formatting, and then hash or compare the CSV's.
In practice, yes, an identical cryptographic hash means the files are the same, as long as the files were not crafted by an attacker or other malicious entity. The odds of random collisions with any well-designed cryptographic hash function is so small as to be negligible in practice and in the absence of an active attacker.
In general, however, no, we cannot say that two arbitrary files having the same hash definitely means that they are identical.
The way a cryptographic hash function works is to take an arbitrary-length input, and output a fixed-length value computed from the input. Some hash functions have multiple output lengths to choose from, but the output is still to some degree a fixed-length value. This value will be up to a few dozen bytes long; the hash algorithms with the longest output value in common use today have a 512-bit output, and a 512-bit output is 64 bytes.
If an input to a hash function is longer than the output of the hash function, some fidelity must be removed to make the input fit in the output. Consequently, there must exist multiple inputs of lengths greater than the length of the output, which generate the same output.
Let's take the current workhorse, SHA-256, as an example. It outputs a hash of 256 bits, or 32 bytes. If you have two files which are each exactly 32 bytes long, but different, these should (assuming no flaw in the algorithm) hash to different values, no matter the content of the files; in mathematical terms, the hash is a function mapping a 2256 input space onto a 2256 output space, which should be possible to do without collisions. However, if you have two files that are each 33 bytes long, there must exist some combination of inputs that give the same 32-byte output hash value for both files, because we're now mapping a 2264 input space onto a 2256 output space; here, we can readily see that there should, on average, exist 28 inputs for every single output. Take this further, and with 64-byte files there should exist 2256 inputs for every single output!
Cryptographic hash functions are designed such that it's computationally difficult to compose an input that gives a particular output, or compose two inputs that give the same output. This is known as preimage attack resistance or collision attack resistance. It's not impossible to find these collisions; it's just intended to be really, really, really, really hard. (A bit of a special case of a collision attack is a birthday attack.)
Some algorithms are better than others at resisting attackers. MD5 is generally considered completely broken these days, but last I looked, it still sported pretty good first preimage resistance. SHA-1 is likewise effectively broken; preimage attacks have been demonstrated, but require specific conditions, though there's no reason to believe that will be the case indefinitely; as the saying goes, attacks always get better, they never get worse. SHA-256/384/512 are currently still believed safe for most purposes. However, if you're just interested in seeing if two non-maliciously-crafted, valid files are the same, then any of these should be sufficient, because the input space is sufficiently constrained already that you'd be mostly interested in random collisions. If you have any reason to believe that the files were crafted maliciously, then you need to at the very least use a cryptographic hash function that is currently believed safe, which puts the lower bar at SHA-256.
First preimage is to find an input that yields a specific output hash value; second preimage is to find one input that gives the same output as another, specified input; collision is to find two inputs that yield the same output, without regard to what that is and sometimes without regard to what the inputs are.
All that said, it's important to keep in mind that the files may have very different data representations and still display exactly the same. So they can appear to be the same even though their cryptographic hashes don't match, but if the hashes match then they are extremely likely to appear the same.
It's a probability game... hashes are able to represent a finite number of values.
If we consider a hypothetical (and very weak) 8-bit hashing algorithm, then this can represent 256 distinct values. As you start to run files through the algorithm, you will start to get hashes out... but before long you will start to see "hash collisions". This means that two different files were fed into the algorithm, and it produced the same hash value as its output. Clearly here, the hash is not strong enough, and we cannot assert that "files with matching hashes have the same content".
Extending the size of the hash, and using stronger cryptographic hashing algorithms can significantly help to reduce collisions, and raise our confidence that two files with the same hash have the same content.
This said, we can never reach 100% certainty - we can never claim for sure that two files with the same hash truly have the same content.
In most / many situations this is fine, and comparing hashes is "good enough", but this depends on your threat model.
Ultimately, if you need to raise the certainty levels, then I would recommend that you do the following:
- Use strong hashing algorithms (MD5 is no longer considered adequate if you need to protect against potentially malicious users)
- Use multiple hashing algorithms
- Compare the size of the files - an extra data point can help to identify potential collisions, but note that the demonstrated MD5 collision did not need to alter the data's length.
If you need to be 100% sure, then by all means start with a hash, but if the hashes match, follow it up with a byte-by-byte comparison of the two files.
Additionally, as pointed out by others... the complexity of documents produced by applications such as Word and Excel means that the text, numbers, visible layout can be the same, but the data stored in the file can be different.
Excel is particularly bad at this - simply opening a spreadsheet saving it (having done nothing) can produce a new file, with different content.