What is the best 32bit hash function for short strings (tag names)?

I'm not sure if it's the best choice, but here is a hash function for strings:

The Practice of Programming (HASH TABLES, pg. 57)

/* hash: compute hash value of string */
unsigned int hash(char *str)
{
   unsigned int h;
   unsigned char *p;

   h = 0;
   for (p = (unsigned char*)str; *p != '\0'; p++)
      h = MULTIPLIER * h + *p;
   return h; // or, h % ARRAY_SIZE;
}

Empirically, the values 31 and 37 have proven to be good choices for the multiplier in a hash function for ASCII strings.


I'm sorry for the very late reply on this. Earlier this year I composed a page titled Hashing Short Strings which might be helpful in this discussion. In summary, I found that CRC-32 and FNV-1a are superior for hashing short strings. They are efficient and produced widely distributed and collision free hashes in my tests. I was surprised to find that MD5, SHA-1 and SHA-3 produced small numbers of collisions when the output was folded down to 32-bits.


That depends on your hardware. On modern hardware, i.e. Intel/AMD with SSE4.2 or arm7 you should use the internal _mm_crc32_uxx intrinsics, as they are optimal for short strings. (For long keys also, but then better use Adler's threaded version, as in zlib)

On old or unknown hardware, either run-time probe for the SSE4.2 or CRC32 feature or just use one if the simple good hash functions. E.g. Murmur2 or City

An overview of quality and performance is here: https://github.com/rurban/smhasher#smhasher

There are also all the implementations. Favored are https://github.com/rurban/smhasher/blob/master/crc32_hw.c and https://github.com/rurban/smhasher/blob/master/MurmurHash2.cpp

If you know the keys in advance, use a perfect hash, not a hash function. E.g. gperf or my phash: https://github.com/rurban/Perfect-Hash#name

Nowadays perfect hash generation via a c compiler is so fast, you can even create them on the fly, and dynaload it.


If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.