How has Hash changed in 11.3?
My guess is that this has nothing to do with the "MD2" or "MD5" algorithms themselves, which haven't changed. What has changed is the input being fed to them. To create a hash from an expression, Mathematica needs to convert the expression into a serialized form that can be hashed.
Details
Before we consider what changed between versions 11.2 and 11.3, let us first look at what happens when an expression is hashed. As I said, the expression needs to be converted into bytes, which are then hashed. Kuba pointed out that the documentation describes this conversion in terms of ToString[FullForm[expr]].
Trying this in version 11.2 gives three different hashes:
e = {};
str = ToString[FullForm[e]];
bytes = StringToByteArray[str];
Hash[{}, "SHA1"]
Hash[str, "SHA1"]
Hash[bytes, "SHA1"]
(* 623731294305055464111497716552294880943817129159 *)
(* 574126028549497177030915842606154604998516284935 *)
(* 763583645927281726897351532107330233032372985102 *)
In version 11.3, at least the str and bytes hashes are the same on my machine, but only the str hash matches between the versions. This gives the clue that we need to look at how bytes are constructed from an expression.
I'll leave the digging as an exercise for the interested reader. Between 11.2 and 11.3, a great deal of the hash implementation has changed. One of those changes is in
ImportExport`HashDump`exprToString
This function is responsible for turning an expression into a string. In version 11.2 it takes a single parameter, the expression; in version 11.3 it takes a second Boolean argument, LegacyQ.
Here is what 11.2 gives me for {}:
ImportExport`HashDump`exprToString[{}]
(* "ÑJ ¾þ\.1eQcb\.16,kïMq\.17¹ \.12½\.1ca\[CenterDot]+?Ýg=ÉeList[]" *)
It contains a prefix string that is indeed followed by the full form of {}. This prefix is the reason why simply hashing the FullForm gives a different result. In version 11.2, you can therefore use
Hash[{}, "SHA1"]
Hash[ImportExport`HashDump`exprToString[{}], "SHA1"]
(* 623731294305055464111497716552294880943817129159 *)
(* 623731294305055464111497716552294880943817129159 *)
to get a string that hashes to the same value as the expression itself. In 11.3, things are more complicated, since the string is further mangled by a FromCharacterCode@ToCharacterCode[str, "UTF-8"] step that changes the prefix. That is not important right now, because the main difference is that the transformation from expression to string changed in 11.3. The standard Hash call does not use LegacyQ, but we can call the inner function and show the difference:
ImportExport`HashDump`toString[{}, #] & /@ {True, False}
(* {"List[]", "System`List[]"} *)
So 11.3 includes the context in the FullForm. This is the reason for the difference in the hashes when using expressions.
Solution
In version 11.2 we had
Hash[{},#]&/@{"CRC32","SHA1","MD5"}//Column
(*
2594213469
623731294305055464111497716552294880943817129159
272934427398090264974473461931457450337
*)
In version 11.3 you can achieve the same behavior by creating the string form in legacy mode and converting it to a byte array with a single-byte encoding, so that each character of the prefix stays exactly one byte:
hash[expr_, args__]:=With[{str=ImportExport`HashDump`exprToString[expr,True]},
Hash[StringToByteArray[str,"ISO8859-1"],args]
]
hash[{},#]&/@{"CRC32","SHA1","MD5"}//Column
(*
2594213469
623731294305055464111497716552294880943817129159
272934427398090264974473461931457450337
*)
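The choice of "ISO8859-1" in the hash wrapper above matters because the prefix contains characters with codes above 127. As a minimal sketch (the names ch, latin1 and utf8 are only for illustration), this compares the bytes produced for the first prefix character under the two encodings:

```wolfram
(* Character code 209 is the first character of the prefix string. *)
ch = FromCharacterCode[209];

(* One byte per character under ISO8859-1, two bytes under UTF-8. *)
latin1 = Normal[StringToByteArray[ch, "ISO8859-1"]];
utf8 = Normal[StringToByteArray[ch, "UTF-8"]];

{latin1, utf8}
(* {{209}, {195, 145}} *)
```

With UTF-8, the 32-character prefix would no longer be a 32-byte sequence, so the hash input would differ from the one used in 11.2.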
There is already discussion about String and ByteArray in the linked previous Q & A, so I'll comment a bit about general expressions.
This only concerns the named hash algorithms like "MD5" or "SHA" etc.
The single-argument form Hash[expr], which is equivalent to Hash[expr, "Expression"], is completely separate and based on the internal representation of expr. It has not been changed for 11.3.
Using @Kuba's example of a very simple expression, in 11.3 we have the following new hash value
Hash[{}, "MD5"]
(* 68244457821771821570522625853545795031 *)
The previous hash value can still be obtained using
Developer`LegacyHash[{}, "MD5"]
(* 272934427398090264974473461931457450337 *)
Why the change? The previous scheme for converting expressions (to strings) for hashing had a number of severe problems.
For example, it did not take the contexts of symbols into account, so the same expression could hash to a different value under a different $ContextPath; it also had issues with evaluation leaks, etc.
These have now been addressed, but the fixes mean that the hash values inevitably changed. Developer`LegacyHash is provided for people who in some way depend on the old hash values.
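To make the $ContextPath point concrete, here is a small sketch. It is not the code that Hash used internally; it only assumes that ToString[FullForm[...]] prints a symbol with its context when that context is not on $ContextPath. The names withSystem and withoutSystem are illustrative.

```wolfram
(* Serialize the same expression with the default $ContextPath ... *)
withSystem = ToString[FullForm[{}]];

(* ... and again with an empty $ContextPath, so System` is not visible. *)
withoutSystem = Block[{$ContextPath = {}}, ToString[FullForm[{}]]];

(* If the assumption holds, the two strings differ only in the context prefix. *)
{withSystem, withoutSystem}
```

Any hash computed from such a string would depend on the session state rather than on the expression alone, which is the kind of problem that fully qualified symbol names avoid.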
The wording in the documentation isn't fully accurate, because ToString[FullForm[expr]] is not used literally.
What is actually true is that the input given to the hashing algorithm is based on the bytes of ToString[Unevaluated[FullForm[expr]]], where all symbols in expr are qualified with their full contexts.
Furthermore, a constant 32-byte prefix is added for expressions other than String and ByteArray to avoid collisions. This ensures that the number 2 and the string "2" do not end up with the same hash value: ToString[Unevaluated[FullForm[2]]] is the same as the string "2", but 2 and "2" are different expressions.
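The collision argument can be checked directly. This is a sketch for 11.3 or later; sameText and sameHash are illustrative names.

```wolfram
(* The integer and the string serialize to the same characters ... *)
sameText = ToString[Unevaluated[FullForm[2]]] === "2";

(* ... but only the integer is hashed with the 32-byte prefix. *)
sameHash = Hash[2, "MD5"] === Hash["2", "MD5"];

{sameText, sameHash}
```

If the description above holds, sameText is True while sameHash is False.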
Below is a mock-up example (not the actual implementation) that could be used to replicate the 11.3 hash value even on earlier versions. It uses the byteHash utility defined in my previous answer.
prefix = {209, 74, 9, 190, 254, 30, 81, 99, 147, 98, 22, 44, 107, 239, 77, 113,
23, 185, 9, 18, 189, 28, 97, 183, 43, 63, 221, 103, 61, 127, 201, 101};
byteHash[Join[prefix, ToCharacterCode["System`List[]"]], "MD5"]
(* 68244457821771821570522625853545795031 *)
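Since byteHash is not reproduced here, a possible stand-in on 11.3 or later is to wrap the same bytes in a ByteArray and pass that to Hash. This is only a sketch, and it rests on the assumption that Hash applied to a ByteArray hashes the raw bytes without adding any prefix of its own.

```wolfram
(* The constant 32-byte prefix from the mock-up above. *)
prefix = {209, 74, 9, 190, 254, 30, 81, 99, 147, 98, 22, 44, 107, 239, 77, 113,
   23, 185, 9, 18, 189, 28, 97, 183, 43, 63, 221, 103, 61, 127, 201, 101};

(* Prefix followed by the context-qualified full form of {}. *)
input = ByteArray[Join[prefix, ToCharacterCode["System`List[]"]]];

(* Compare against the built-in hash of the expression itself. *)
Hash[input, "MD5"] === Hash[{}, "MD5"]
```

If the assumption holds and the mock-up matches the real serialization, the comparison gives True.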
While the documentation ideally should give some idea of what serialization is used for general expressions, I would not hold the expectation that it must go into any deep level of detail or provide sufficient information to actually write an alternative implementation. Besides, the serialization could conceivably change some day again.
I think the moral is that people who want full control should create the sequence of bytes themselves, in whatever way they see fit, and give that as input to the hashing method. Then what a named algorithm like "MD5" or "SHA" must return is fully determined, and a different result would certainly be a bug.
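As a minimal sketch of that approach, the following serializes a string to bytes explicitly and compares the digest with a published test vector. It assumes version 11.3 or later, where Hash accepts a third argument for the output format and hashes a ByteArray as raw bytes; bytes and digest are illustrative names.

```wolfram
(* Decide on the byte sequence yourself instead of relying on Hash's serialization. *)
bytes = StringToByteArray["abc", "UTF-8"];

(* Ask for the digest as a hexadecimal string. *)
digest = Hash[bytes, "MD5", "HexString"];

(* RFC 1321 lists this value as the MD5 digest of "abc". *)
digest === "900150983cd24fb0d6963f7d28e17f72"
```

Because the input bytes are fixed by you, the same check can be run against any other MD5 implementation.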