Choosing the right algorithm for the HASHBYTES function
The HASHBYTES function only accepts up to 8000 bytes of input (the limit was lifted in SQL Server 2016). Because your inputs are potentially larger than that, values that are identical within the portion of the field that gets hashed will collide, regardless of the algorithm chosen. Carefully consider the range of data you plan to hash -- using the first 4000 characters is the obvious choice, but may not be the best choice for your data.
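To make the truncation point concrete, here is a small sketch in Python's `hashlib` (standing in for HASHBYTES; the 16-byte prefix is an arbitrary stand-in for the 8000-byte limit): two values that differ only beyond the hashed prefix are guaranteed to collide, no matter which algorithm you pick.

```python
import hashlib

PREFIX = 16  # stand-in for the 8000-byte / 4000-character limit

# Two values that differ only beyond the hashed prefix.
a = b"x" * PREFIX + b"value-A"
b = b"x" * PREFIX + b"value-B"

hash_a = hashlib.md5(a[:PREFIX]).hexdigest()
hash_b = hashlib.md5(b[:PREFIX]).hexdigest()

print(hash_a == hash_b)  # True: identical prefixes guarantee a collision
print(a == b)            # False: the full values differ
```

Swapping `md5` for `sha256` changes nothing here -- the collision comes from the truncation, not the algorithm.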
In any event, because of what a hash function is, even if the inputs are 8000 bytes or less, the only way to ensure 100% correctness in the results is to compare the base values at some point (read: not necessarily first). Period.
The business will dictate whether or not 100% accuracy is required, and that answer tells you which path to take: (a) if it is, comparing the base values is required at some step; (b) if it isn't, you can skip the base-value comparison and decide how much accuracy should be traded off for performance.
While hash collisions are possible even in a set of unique inputs, they are infinitesimally rare, regardless of the algorithm chosen. The whole idea of using a hash value in this scenario is to efficiently narrow down the join results to a more manageable set, not necessarily to arrive at the final set of results immediately. Again, for 100% accuracy, this cannot be the final step in the process. This scenario isn't using hashing for the purpose of cryptography, so an algorithm such as MD5 will work fine.
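The two-step pattern -- join on the hash to narrow candidates, then compare base values to confirm -- can be sketched in Python (`hashlib` stands in for HASHBYTES; the sample values are hypothetical):

```python
import hashlib

def h(value: str) -> bytes:
    # MD5 is fine here: the hash only narrows candidates, it doesn't decide.
    return hashlib.md5(value.encode("utf-8")).digest()

left = ["alpha", "bravo", "charlie"]
right = ["bravo", "charlie", "delta"]

# Step 1: index one side by hash -- cheap comparisons on 16-byte digests.
right_by_hash: dict[bytes, list[str]] = {}
for r in right:
    right_by_hash.setdefault(h(r), []).append(r)

# Step 2: compare the base values to guarantee 100% accuracy.
matches = [
    (l, r)
    for l in left
    for r in right_by_hash.get(h(l), [])
    if l == r  # this check eliminates any hash collision
]
print(matches)  # [('bravo', 'bravo'), ('charlie', 'charlie')]
```

In SQL terms, step 1 is the join predicate on the hash column and step 2 is an additional equality predicate on the base columns.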
It would be extremely hard for me to justify moving up to a SHA-x algorithm for "accuracy" purposes because if the business is going to freak out about the minuscule collision possibilities of MD5, chances are they're also going to freak out that the SHA-x algorithms aren't perfect either. They either have to come to terms with the slight inaccuracy, or mandate that the query be 100% accurate and live with the associated technical implications. I suppose if the CEO sleeps better at night knowing you used SHA-x instead of MD5, well, fine; it still doesn't mean much from a technical point of view in this case.
Speaking of performance, if the tables are read-mostly and the join result is needed frequently, consider implementing an indexed view to eliminate the need to compute the entire join every time it's requested. Of course you trade off storage for that, but it may be well worth it for the performance improvement, particularly if 100% accuracy is required.
For further reading on indexing long string values, I published an article that walks through an example of how to do this for a single table, and presents things to consider when attempting the full scenario in this question.
MD5 should be fine, and its 16-byte output can be stored in a binary(16) column. The probability of a collision (see the birthday paradox) is still very low, even with a very large number of rows. The output of SHA-1 takes 20 bytes and the output of SHA-256 takes 32 bytes. Unless you have so many records that the birthday collision probability becomes significant (physically impossible, or at least impractical, with current hardware technologies), it will probably be OK.
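As a rough sanity check, the standard birthday-bound approximation p ≈ 1 − e^(−n²/2^(b+1)) for n random inputs into a b-bit hash can be evaluated in Python (the one-billion row count is an arbitrary example, not from the question):

```python
import math

def birthday_collision_probability(n_rows: int, hash_bits: int) -> float:
    # Approximation: p ≈ 1 - exp(-n^2 / 2^(bits + 1)).
    # -expm1(-x) computes 1 - e^(-x) accurately for very small x.
    return -math.expm1(-(n_rows ** 2) / (2 ** (hash_bits + 1)))

# Digest sizes: MD5 = 128 bits (16 bytes), SHA-1 = 160, SHA-256 = 256.
for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    p = birthday_collision_probability(10 ** 9, bits)  # one billion unique rows
    print(f"{name}: ~{p:.3e}")
```

Even for MD5 at a billion unique rows, the probability is on the order of 10^-21, which is why the digest size matters far less here than it would in a cryptographic setting.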
I would go with SHA-1: it is the best of the available algorithms, with the highest best-known collision-attack complexity of the set (2^51 operations, compared to 2^20.96 for MD5). MD5 has also been proven vulnerable to collision attacks in practice.
Sources:
http://en.wikipedia.org/wiki/SHA-1
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions#Cryptanalysis
http://en.wikipedia.org/wiki/MD5