Un-hash the Java Hash
C#, 1.052 × 1010
The idea is to do the hash function in reverse to get all possible un-hashed strings from the given number. Integer overflow of the 32-bit hash value is also emulated in reverse by incrementing the given hash value by 2^32 on each iteration.
UPDATE: I've made several micro-optimisations that significantly improved the performance of my program. The new version produced 10.52 billion results in 15 minutes (compared to 9.05 billion in 1 hour for the previous version).
Futher improvement may be difficult since the output file was 117GB in size and that's about how much free space I have on my SSD. Writing results to the screen doesn't seem to be a viable option since it's almost a thousand times slower than writing to a file.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
namespace Unhash
{
class Program
{
const int minChar = 32;
const int maxChar = 126;
// limit maxStrLength to make sure that max hash value does not exceed long.MaxValue (2^63 - 1)
const int maxStrLength = 12;
static long[] minHashValues;
static long[] maxHashValues;
static Program()
{
minHashValues = new long[maxStrLength + 1];
for (int strLength = 0; strLength <= maxStrLength; strLength++)
minHashValues[strLength] = Pow(31, strLength) / (31 - 1) * minChar;
maxHashValues = new long[maxStrLength + 1];
for (int strLength = 0; strLength <= maxStrLength; strLength++)
maxHashValues[strLength] = Pow(31, strLength) / (31 - 1) * maxChar;
}
static void GuessHash(int hash, int maxLength = 256)
{
Stopwatch sw = Stopwatch.StartNew();
var timeLimit = new TimeSpan(0, 15, 0);
if (maxLength > maxStrLength)
maxLength = maxStrLength;
long currentHash = (uint)hash;
long step = 1L << 32;
const int bufferSize = 32 * 1024;
using (var writer = new StreamWriter("output.txt", false, new UTF8Encoding(false), bufferSize))
{
for (int strLength = 0; strLength <= maxLength; strLength++)
{
long maxHash = maxHashValues[strLength];
if (currentHash > maxHash)
continue;
long minHash = minHashValues[strLength];
while (currentHash < minHash)
currentHash += step;
while ((currentHash <= maxHash) && (sw.Elapsed < timeLimit))
{
GuessLongHash(writer, currentHash, new char[strLength], strLength - 1);
currentHash += step;
}
}
}
}
static void GuessLongHash(StreamWriter writer, long hash, char[] chars, int charPos)
{
if (hash <= maxChar)
{
char ch = (char)hash;
if (ch >= minChar)
{
chars[charPos] = ch;
writer.WriteLine(new string(chars));
}
return;
}
char c = (char)(hash % 31);
while (c < minChar)
c += (char)31;
while (c <= maxChar)
{
chars[charPos] = c;
GuessLongHash(writer, (hash - c) / 31, chars, charPos - 1);
c += (char)31;
}
}
static long Pow(int value, int exponent)
{
long result = 1;
for (int i = 0; i < exponent; i++)
result *= value;
return result;
}
static void Main(string[] args)
{
const int hash = 69066349;
GuessHash(hash);
}
}
}
Scala
One "useful" characteristic of Java's String.hashCode
method is that if you have a string z
with z.hashCode() == 0
, then you also have that (z + x).hashCode() == x.hashCode()
for any other string x
. So even if we can find a relatively small number of strings with a given hashCode, we can just prefix it with a bunch of zero-hash strings, and generate as many strings as we like with that hashCode - the performance limit becomes string concatenation and IO at that point.
So how do we find a string with a particular hashCode? In truth, brute force would probably be enough, due to the zero-hash trick, but we can easily get more speed with a meet-in-the-middle trick.
Let's suppose we want a string of length 10. To do this, we calculate the hashes of all strings of length 5. In this case, the hash of x + y
will be x.hashCode() * 31 ** 5 + y.hashCode
, so we can either start with x
, and look for strings y
whose hash is target - x.hashCode() * 31 ** 5
, or start with y
and look for strings whose hash is (target - y.hashCode) / (31 ** 5)
(where division is modulo 2^32). In fact, we do both.
Right now, even length strings are better optimised than odd length strings, for reasons that are probably apparent.
Using the zero-hash trick, I can get 1.6 million strings per second on my laptop - printing the output is the main bottleneck.
Without the zero-hash trick (for example, if you want short strings - too short to pad with zero-hash strings), the performance is more modest. The birthday paradox means it gets faster as it runs, but it seems to level-off at around 70000 strings per second, for strings of length 10.
There's one worthwhile optimisation to note. If we're storing the hashes for a large number of strings, it would seem to make sense to either keep them in a hashtable, or an equivalent on-disk structure. However, an on-disk structure is too slow, and there isn't room for everything in memory (on my laptop, at least). So instead, we keep the data in a fixed-size 1-way cache. This means that data can be lost from the cache, but it improves performance. By increasing hashtableSize
, you can increase the size of this cache, using more memory, and leading to more matches.
import scala.collection.mutable
import scala.collection.JavaConversions._
import java.io.File
import java.io.RandomAccessFile
import java.io.FileWriter
object Unhash {
val maxChar = 126.toChar
val minChar = 32.toChar
val hashtableSize = 0x4000000
def allStrings(l: Int): Iterator[String] = {
new Iterator[String] {
val currentArray = Array.fill(l)(minChar)
var _hasNext = l > 0
def hasNext = _hasNext
def next(): String = {
if (!_hasNext) throw new NoSuchElementException
val result = new String(currentArray)
var i = 0
while (i < l) {
if (currentArray(i) != maxChar) {
val result = new String(currentArray)
currentArray(i) = (currentArray(i) + 1).toChar
return result
} else {
currentArray(i) = minChar
i += 1
}
}
_hasNext = false
return result
}
}
}
def pow(x: Int, n: Int) = {
var r = 1
var p = n
var m = x
while (p > 0) {
if ((p & 1) == 1) r *= m
m *= m
p >>= 1
}
r
}
def guessHash(target: Int, length: Int) = {
length match {
case l if l % 2 == 0 => findEven(target, length / 2)
case _ => findOdd(target, (length - 1) / 2)
}
}
def findEven(target: Int, halfLength: Int) = {
val hashes = new HashCache(halfLength, hashtableSize)
val multFactor = pow(31, halfLength)
val divideFactor = BigInt(multFactor).modInverse(BigInt(2) pow 32).intValue
new Traversable[String] {
def foreach[U](f: String => U) = {
for (string <- allStrings(halfLength)) {
val hash = string.hashCode
hashes.put(string)
// Find possible suffixes
{
val effectiveTarget = target - hash * multFactor
for (x <- hashes.get(effectiveTarget)) f(string + x)
}
// Find possible prefixes
{
val effectiveTarget = (target - hash) * divideFactor
for (x <- hashes.get(effectiveTarget)) f(x + string)
}
}
}
}
}
def findOdd(target: Int, halfLength: Int) = {
val hashes = new HashCache(halfLength, hashtableSize)
val multFactor = pow(31, halfLength)
val divideFactor = BigInt(multFactor).modInverse(BigInt(2) pow 32).intValue
new Traversable[String] {
def foreach[U](f: String => U) = {
for (string <- allStrings(halfLength);
hash = string.hashCode;
_ = hashes.put(string);
firstChar <- minChar to maxChar) {
// Adjust the target, by subtracting the contribution from the first char
val currentTarget = target - firstChar * pow(31, 2 * halfLength)
// Find possible suffixes
{
val effectiveTarget = currentTarget - hash * multFactor
for (x <- hashes.get(effectiveTarget)) f(firstChar + string + x)
}
//Find possible prefixes
{
val effectiveTarget = (currentTarget - hash) * divideFactor
for (x <- hashes.get(effectiveTarget)) f(firstChar + x + string)
}
}
}
}
}
def guessAnyLength(target: Int) = {
new Traversable[String] {
def foreach[U](f: String => U) = {
val zeroHashes = findEven(0, 4).toArray
for (len5 <- findEven(target, 4);
zero1 <- zeroHashes.iterator;
zero2 <- zeroHashes.iterator;
zero3 <- zeroHashes.iterator) f(zero1 + zero2 + zero3 + len5)
}
}
}
def main(args: Array[String]) = {
val startTime = System.nanoTime
val Array(output, length, target) = args
val t = target.toInt
val resultIterator = length match {
case "any" =>
guessAnyLength(t)
case _ =>
val l = length.toInt
guessHash(t, l)
}
var i = 0L
var lastPrinted = System.nanoTime
var lastI = 0L
val outputWriter = new FileWriter(output)
try {
for (res <- resultIterator) {
outputWriter.write(res + "\n")
i += 1
val now = System.nanoTime
if (now > lastPrinted + 1000000000L) {
println(s"Found $i total, at ${i - lastI} per second")
lastPrinted = System.nanoTime
lastI = i
}
}
} finally {
outputWriter.close()
}
}
}
class HashCache(stringLength: Int, size: Int) {
require ((size & (size - 1)) == 0, "Size must be power of 2")
val data = new Array[Char](stringLength * size)
def get(hash: Int) = {
val loc = hash & (size - 1)
val string = new String(data, stringLength * loc, stringLength)
if (string.hashCode == hash) Some(string)
else None
}
def put(string: String) = {
val loc = string.hashCode & (size - 1)
string.getChars(0, stringLength, data, stringLength * loc)
}
}
C, 3.2 × 1010 preimages
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[])
{
char output[29] = { 0 }, preimages[64][8], zeros[512][8] = { "/*$+06(" };
unsigned i, j, k, l, t;
t = atoi(argv[1]);
for(i = 7; i --> 0; t /= 31)
preimages[0][i] = zeros[0][i] + (t % 31);
for(i = 0, j = 2, k = 1; i < 6; i += 1, j *= 2)
for(l = 0; k < j; k++, l++)
strcpy(preimages[k], preimages[l]),
preimages[k][i] -= 1,
preimages[k][i + 1] += 31;
for(i = 0, j = 3, k = 1; i < 6; i += 1, j *= 3)
for(l = 0; k < j && k < 512; k++, l++)
strcpy(zeros[k],zeros[l%(j/3)]),
t = 1 + (l/(j/3)),
zeros[k][i] -= t,
zeros[k][i + 1] += 31 * t;
for(i = 0, k = 0; i < (1L << 36); i++)
{
strcpy(&output[0], zeros[(i >> 27) & 511]);
strcpy(&output[7], zeros[(i >> 18) & 511]);
strcpy(&output[14], zeros[(i >> 9) & 511]);
strcpy(&output[21], zeros[(i >> 0) & 511]);
for(j = 0; j < 64; j += 4)
{
// k += 4;
// if(k++ % 1000000 == 0) printf("%d\n", k);
printf("%s%s\n%s%s\n%s%s\n%s%s\n",
output, preimages[j + 0],
output, preimages[j + 1],
output, preimages[j + 2],
output, preimages[j + 3]
);
}
}
return 0;
}
Calculates a preimage and adds and prefixes strings with hash 0. The time spent to generate the preimages should be negligible; all the code has to do is piece together five strings. That means that most of the time will be spent printing the results.
In one hour, only 3.2 × 1010 preimages can be printed. If I comment the printf statement out, the program reports that it has found 5.5 × 1012 preimages. However, I don't know if the compiler just ignores parts of the code (and I don't have the assembly knowledge to verify).