Un-hash the Java Hash

C#, 1.052 × 1010

The idea is to do the hash function in reverse to get all possible un-hashed strings from the given number. Integer overflow of the 32-bit hash value is also emulated in reverse by incrementing the given hash value by 2^32 on each iteration.

UPDATE: I've made several micro-optimisations that significantly improved the performance of my program. The new version produced 10.52 billion results in 15 minutes (compared to 9.05 billion in 1 hour for the previous version).

Futher improvement may be difficult since the output file was 117GB in size and that's about how much free space I have on my SSD. Writing results to the screen doesn't seem to be a viable option since it's almost a thousand times slower than writing to a file.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;

namespace Unhash
{
    class Program
    {
        const int minChar = 32;

        const int maxChar = 126;

        // limit maxStrLength to make sure that max hash value does not exceed long.MaxValue (2^63 - 1)
        const int maxStrLength = 12;

        static long[] minHashValues;

        static long[] maxHashValues;

        static Program()
        {
            minHashValues = new long[maxStrLength + 1];
            for (int strLength = 0; strLength <= maxStrLength; strLength++)
                minHashValues[strLength] = Pow(31, strLength) / (31 - 1) * minChar;

            maxHashValues = new long[maxStrLength + 1];
            for (int strLength = 0; strLength <= maxStrLength; strLength++)
                maxHashValues[strLength] = Pow(31, strLength) / (31 - 1) * maxChar;
        }

        static void GuessHash(int hash, int maxLength = 256)
        {
            Stopwatch sw = Stopwatch.StartNew();
            var timeLimit = new TimeSpan(0, 15, 0);

            if (maxLength > maxStrLength)
                maxLength = maxStrLength;

            long currentHash = (uint)hash;
            long step = 1L << 32;

            const int bufferSize = 32 * 1024;
            using (var writer = new StreamWriter("output.txt", false, new UTF8Encoding(false), bufferSize))
            {
                for (int strLength = 0; strLength <= maxLength; strLength++)
                {
                    long maxHash = maxHashValues[strLength];
                    if (currentHash > maxHash)
                        continue;

                    long minHash = minHashValues[strLength];
                    while (currentHash < minHash)
                        currentHash += step;

                    while ((currentHash <= maxHash) && (sw.Elapsed < timeLimit))
                    {
                        GuessLongHash(writer, currentHash, new char[strLength], strLength - 1);
                        currentHash += step;
                    }
                }
            }
        }

        static void GuessLongHash(StreamWriter writer, long hash, char[] chars, int charPos)
        {
            if (hash <= maxChar)
            {
                char ch = (char)hash;
                if (ch >= minChar)
                {
                    chars[charPos] = ch;
                    writer.WriteLine(new string(chars));
                }
                return;
            }

            char c = (char)(hash % 31);
            while (c < minChar)
                c += (char)31;

            while (c <= maxChar)
            {
                chars[charPos] = c;
                GuessLongHash(writer, (hash - c) / 31, chars, charPos - 1);

                c += (char)31;
            }
        }

        static long Pow(int value, int exponent)
        {
            long result = 1;
            for (int i = 0; i < exponent; i++)
                result *= value;
            return result;
        }

        static void Main(string[] args)
        {
            const int hash = 69066349;
            GuessHash(hash);
        }
    }
}

Scala

One "useful" characteristic of Java's String.hashCode method is that if you have a string z with z.hashCode() == 0, then you also have that (z + x).hashCode() == x.hashCode() for any other string x. So even if we can find a relatively small number of strings with a given hashCode, we can just prefix it with a bunch of zero-hash strings, and generate as many strings as we like with that hashCode - the performance limit becomes string concatenation and IO at that point.

So how do we find a string with a particular hashCode? In truth, brute force would probably be enough, due to the zero-hash trick, but we can easily get more speed with a meet-in-the-middle trick.

Let's suppose we want a string of length 10. To do this, we calculate the hashes of all strings of length 5. In this case, the hash of x + y will be x.hashCode() * 31 ** 5 + y.hashCode, so we can either start with x, and look for strings y whose hash is target - x.hashCode() * 31 ** 5, or start with y and look for strings whose hash is (target - y.hashCode) / (31 ** 5) (where division is modulo 2^32). In fact, we do both.

Right now, even length strings are better optimised than odd length strings, for reasons that are probably apparent.

Using the zero-hash trick, I can get 1.6 million strings per second on my laptop - printing the output is the main bottleneck.

Without the zero-hash trick (for example, if you want short strings - too short to pad with zero-hash strings), the performance is more modest. The birthday paradox means it gets faster as it runs, but it seems to level-off at around 70000 strings per second, for strings of length 10.

There's one worthwhile optimisation to note. If we're storing the hashes for a large number of strings, it would seem to make sense to either keep them in a hashtable, or an equivalent on-disk structure. However, an on-disk structure is too slow, and there isn't room for everything in memory (on my laptop, at least). So instead, we keep the data in a fixed-size 1-way cache. This means that data can be lost from the cache, but it improves performance. By increasing hashtableSize, you can increase the size of this cache, using more memory, and leading to more matches.

import scala.collection.mutable
import scala.collection.JavaConversions._
import java.io.File
import java.io.RandomAccessFile
import java.io.FileWriter

object Unhash {
  val maxChar = 126.toChar
  val minChar = 32.toChar
  val hashtableSize = 0x4000000

  def allStrings(l: Int): Iterator[String] = {
    new Iterator[String] {
      val currentArray = Array.fill(l)(minChar)
      var _hasNext = l > 0
      def hasNext = _hasNext

      def next(): String = {
        if (!_hasNext) throw new NoSuchElementException
        val result = new String(currentArray)
        var i = 0
        while (i < l) {
          if (currentArray(i) != maxChar) {
            val result = new String(currentArray)
            currentArray(i) = (currentArray(i) + 1).toChar
            return result
          } else {
            currentArray(i) = minChar
            i += 1
          }
        }
        _hasNext = false
        return result
      }
    }
  }

  def pow(x: Int, n: Int) = {
    var r = 1
    var p = n
    var m = x
    while (p > 0) {
      if ((p & 1) == 1) r *= m
      m *= m
      p >>= 1
    }
    r
  }

  def guessHash(target: Int, length: Int) = {
    length match {
      case l if l % 2 == 0 => findEven(target, length / 2)
      case _ => findOdd(target, (length - 1) / 2)
    }
  }

  def findEven(target: Int, halfLength: Int) = {
    val hashes = new HashCache(halfLength, hashtableSize)
    val multFactor = pow(31, halfLength)
    val divideFactor = BigInt(multFactor).modInverse(BigInt(2) pow 32).intValue

    new Traversable[String] {
      def foreach[U](f: String => U) = {
        for (string <- allStrings(halfLength)) {
          val hash = string.hashCode
          hashes.put(string)

          // Find possible suffixes
          {
            val effectiveTarget = target - hash * multFactor
            for (x <- hashes.get(effectiveTarget)) f(string + x)
          }

          // Find possible prefixes
          {
            val effectiveTarget = (target - hash) * divideFactor
            for (x <- hashes.get(effectiveTarget)) f(x + string)
          }
        }
      }
    }
  }

  def findOdd(target: Int, halfLength: Int) = {
    val hashes = new HashCache(halfLength, hashtableSize)
    val multFactor = pow(31, halfLength)
    val divideFactor = BigInt(multFactor).modInverse(BigInt(2) pow 32).intValue

    new Traversable[String] {
      def foreach[U](f: String => U) = {
        for (string <- allStrings(halfLength);
             hash = string.hashCode;
             _ = hashes.put(string);
             firstChar <- minChar to maxChar) {
          // Adjust the target, by subtracting the contribution from the first char
          val currentTarget = target - firstChar * pow(31, 2 * halfLength)

          // Find possible suffixes
          {
            val effectiveTarget = currentTarget - hash * multFactor
            for (x <- hashes.get(effectiveTarget)) f(firstChar + string + x)
          }

          //Find possible prefixes
          {
            val effectiveTarget = (currentTarget - hash) * divideFactor
            for (x <- hashes.get(effectiveTarget)) f(firstChar + x + string)
          }
        }
      }
    }
  }

  def guessAnyLength(target: Int) = {
    new Traversable[String] {
      def foreach[U](f: String => U) = {
        val zeroHashes = findEven(0, 4).toArray
        for (len5 <- findEven(target, 4);
             zero1 <- zeroHashes.iterator;
             zero2 <- zeroHashes.iterator;
             zero3 <- zeroHashes.iterator) f(zero1 + zero2 + zero3 + len5)
      }
    }
  }

  def main(args: Array[String]) = {
    val startTime = System.nanoTime
    val Array(output, length, target) = args
    val t = target.toInt

    val resultIterator = length match {
      case "any" =>
        guessAnyLength(t)
      case _ =>
        val l = length.toInt
        guessHash(t, l)
    }

    var i = 0L
    var lastPrinted = System.nanoTime
    var lastI = 0L

    val outputWriter = new FileWriter(output)
    try {
      for (res <- resultIterator) {
        outputWriter.write(res + "\n")

        i += 1

        val now = System.nanoTime
        if (now > lastPrinted + 1000000000L) {
          println(s"Found $i total, at ${i - lastI} per second")
          lastPrinted = System.nanoTime
          lastI = i
        }
      }
    } finally {
      outputWriter.close()
    }

  }
}

class HashCache(stringLength: Int, size: Int) {
  require ((size & (size - 1)) == 0, "Size must be power of 2")
  val data = new Array[Char](stringLength * size)

  def get(hash: Int) = {
    val loc = hash & (size - 1)
    val string = new String(data, stringLength * loc, stringLength)
    if (string.hashCode == hash) Some(string)
    else None
  }

  def put(string: String) = {
    val loc = string.hashCode & (size - 1)
    string.getChars(0, stringLength, data, stringLength * loc)
  }
}

C, 3.2 × 1010 preimages

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char output[29] = { 0 }, preimages[64][8], zeros[512][8] = { "/*$+06(" };
    unsigned i, j, k, l, t;

    t = atoi(argv[1]);

    for(i = 7; i --> 0; t /= 31)
        preimages[0][i] = zeros[0][i] + (t % 31);

    for(i = 0, j = 2, k = 1; i < 6; i += 1, j *= 2)
        for(l = 0; k < j; k++, l++)
            strcpy(preimages[k], preimages[l]),
            preimages[k][i] -= 1,
            preimages[k][i + 1] += 31;

    for(i = 0, j = 3, k = 1; i < 6; i += 1, j *= 3)
        for(l = 0; k < j && k < 512; k++, l++)
            strcpy(zeros[k],zeros[l%(j/3)]),
            t = 1 + (l/(j/3)),
            zeros[k][i] -= t,
            zeros[k][i + 1] += 31 * t;

    for(i = 0, k = 0; i < (1L << 36); i++)
    {
        strcpy(&output[0],  zeros[(i >> 27) & 511]);
        strcpy(&output[7],  zeros[(i >> 18) & 511]);
        strcpy(&output[14], zeros[(i >>  9) & 511]);
        strcpy(&output[21], zeros[(i >>  0) & 511]);

        for(j = 0; j < 64; j += 4)
        {
           // k += 4;
           // if(k++ % 1000000 == 0) printf("%d\n", k);

           printf("%s%s\n%s%s\n%s%s\n%s%s\n",
               output, preimages[j + 0],
               output, preimages[j + 1],
               output, preimages[j + 2],
               output, preimages[j + 3]
           );
        }
    }
    return 0;
}

Calculates a preimage and adds and prefixes strings with hash 0. The time spent to generate the preimages should be negligible; all the code has to do is piece together five strings. That means that most of the time will be spent printing the results.

In one hour, only 3.2 × 1010 preimages can be printed. If I comment the printf statement out, the program reports that it has found 5.5 × 1012 preimages. However, I don't know if the compiler just ignores parts of the code (and I don't have the assembly knowledge to verify).