Hash collision: "NO" means "YES"
32-bit Python 2.x (19)
hash(w*9)%537105043
RSA uses a semiprime modulus, and that makes it secure, so using one with my hash algorithm should surely make it even better!1
This is a pure math function, works for all strings (hell, works for any hashable Python object), and doesn't contain any conditionals or special-casing! 32-bit Python can typically be called as python-32
on most systems that have both installed2.
I've tested this, and it returns 18,278 different values for the 18,279 3-letter-or-less uppercase strings. Assigning this to a function takes 11 more bytes:
h=lambda w:hash(w*9)%537105043
and h('YES') == h('NO') == 188338253
.
64-bit Python 2.x (19)
hash(w*2)%105706823
Same deal as above.
To come up with these numbers, a little bit of modular math was used. I was looking for a function f
and a modulus n
such that hash(f('YES')) % n == hash(f('NO')) % n
. This is equivalent to testing that n
divides d = hash(f('YES')) - hash(f('NO'))
, i.e. we only have to check the factors of d
for suitable values of n
.
The ideal n
is in the neighbourhood of 20000**2 to reduce the chance of a birthday paradox collision. Finding a suitable n
turns out to be a bit of trial and error, playing with all the factors of d
(there usually aren't many) and different choices for the function f
. Notice though that the trial and error is only needed because I wanted to make n
as small as possible (for golfing). If that wasn't a requirement, I could just choose d
as my modulus, which is usually sufficiently large.
Note too that you can't pull this trick off using just f(s) = s
(the identity function) because the rightmost character of the string has essentially a linear relationship (actually an XOR
relationship) with the final hash (the other characters contribute in a much more nonlinear way). The repetition of the string therefore ensures that the differences between the strings are amplified to eliminate the effect of changing only the rightmost character.
1 This is patent nonsense.
2 Python string hashing depends on major version (2 vs 3) and bitness (32-bit vs 64-bit). It does not depend on platform AFAIK.
Perl, 53 49 40 bytes
sub h{hex(unpack H6,pop)-20047||5830404}
Test:
h('YES') = 5830404
h('NO') = 5830404
Keys: 18279
Values: 18278
The hash values for YES
and NO
are the same and there are 18279 strings ^[A-Z]{0,3}$
, which are collision free except for the only collision for YES
and NO
.
Ungolfed:
sub h {
hex(unpack("H6", pop())) - 20047 || 5830404;
# The argument is the first and only element in the argument array @_.
# "pop" gets the argument from array @_ (from the end).
# The first three bytes of the argument or less, if the argument
# is shorter, are converted to a hex string, examples:
# "YES" -> "594553"
# "NO" -> "4e4f"
# Then the hex string is converted to a number by function "hex":
# 0x594553 = 5850451
# 0x4e4f = 20047
# The value for "NO" is subtracted, examples:
# case "YES": 5850451 - 20047 = 5830404
# case "NO": 20047 - 20047 = 0
# If the argument is "NO", the subtraction is zero, therefore
# 5830404 is returned, the result of "YES".
}
# Test
my %cache;
sub addcache ($) {$cache{$_[0]} = h($_[0])}
# Check entries 'YES' and 'NO'
addcache 'YES';
addcache 'NO';
print "h('YES') = $cache{'YES'}\n";
print "h('NO') = $cache{'NO'}\n";
# Fill cache with all strings /^[A-Z]{0-3}$/
addcache '';
for my $one (A..Z) {
addcache $one;
for (A..Z) {
my $two = "$one$_";
addcache $two;
for (A..Z) {
my $three = "$two$_";
addcache $three;
}
}
}
# Compare number of keys with number of unique values
my $keys = keys %cache;
my %hash;
@hash{values %cache} = 1 x $keys;
$values = keys %hash;
print "Keys: $keys\n";
print "Values: $values\n";
Older version, 49 bytes
Since the new algorithm is slightly different, I keep the old version.
sub h{($_=unpack V,pop."\0"x4)==20302?5457241:$_}
Test:
h('YES') = 5457241
h('NO') = 5457241
Keys: 18279
Values: 18278
Ungolfed:
sub h {
$_ = unpack('V', pop() . ($" x 4);
# pop(): gets the argument (we have only one).
# $" x 4: generates the string " " (four spaces);
# adding the four spaces ensures that the string is long
# enough for unpack's template "V".
# unpack('V', ...): takes the first four bytes as
# unsigned long 32-bit integer in little-endian ("VAX") order.
$_ == 20302 ? 5457241 : $_;
# If the hash code would be "NO", return the value for "YES".
}
Edits:
- Using
"\0"
as fill byte saves 4 bytes in comparison to$"
.
Python: 63
An incredibly lame solution:
def h(s):
try:r=int(s,36)
except:r=0
return(r,44596)[r==852]
It works by interpreting alphanumeric strings as base-36 numbers, and returning 0 for everything else. There's an explicit special case to check for a return value of 852 (NO), and return 44596 (YES) instead.