How to generate short fixed length cryptographic hashes?
The cryptographical hash function with output size n
has √n
collision with probability 50% due to the birthday paradox.
You took 3 hexadecimal digits which means you get 12 bits. After 26=64 hash generations, you will see a collision with 50% probability.
Collision in hash functions is inevitable due to the fixed-sized output space but the arbitrary length of input space. Pigeonhole Principle tells us that collision is inevitable.
If you want low collision probability you have to use larger output as 128-bit. For 128-bit we expect at least one collision in 264 hashes with 50% probability.
We are not accepting that the system is secret, we assume it is known. E-Mails are not secret and any attacker can create his code by using your source code. An attacker or a group of attackers can try to attack your system due to short token. The mitigation is increasing the token size.
For most applications using read("/dev/urandom", 16)
will be fine or you can use HMAC($email, $key)
without trimming. Copy and paste shouldn't be a problem for the user.
If you need higher resistance you may use digital signatures; first hash the e-mail then sign.
There are a couple of pretty serious problems with this approach which will increase the risk of collisions beyond what would ordinarily be expected for a random 4-digit value.
- Take the first 3 characters of the digest
- Convert them to an integer
Three hexadecimal digits gives you a range of 0 – 4095 (0x000 - 0xfff
). This will give you a lot more collisions than expected for a 4-digit number, because you're only using 40% of the possible values.
Use a larger excerpt of the digest and use a modulus operation (% 10000
) to force the result to be four decimal digits long. For numerical reasons, using the first 10 hexadecimal digits (40 bits) is pretty close to ideal. (Multiples of 10 bits give you rough powers of 1000, and 40 bits is large but below the maximum safe integer of 253.)
- If length < 4, concatenates one or multiples 0 at the end
This introduces even more collisions. Values from 1 - 4, 10 – 40, and 100 – 400 all get mapped onto the same 1000 – 4000 range, and there's some secondary collisions for sets of values like 5, 50, and 500.
If you need to pad with zeroes, do it on the left side, not the right.