What is the best 32-bit hash function for short strings (tag names)?
I'm not sure if it's the best choice, but here is a hash function for strings:
The Practice of Programming (HASH TABLES, pg. 57)
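/* hash: compute hash value of string */
#define MULTIPLIER 31   /* 31 or 37; see the note below */

unsigned int hash(const char *str)
{
    unsigned int h = 0;
    const unsigned char *p;

    /* read bytes as unsigned char so values >= 0x80 are not sign-extended */
    for (p = (const unsigned char *)str; *p != '\0'; p++)
        h = MULTIPLIER * h + *p;
    return h;   /* or h % ARRAY_SIZE to get a bucket index */
}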
Empirically, the values 31 and 37 have proven to be good choices for the multiplier in a hash function for ASCII strings.
I'm sorry for the very late reply on this. Earlier this year I composed a page titled Hashing Short Strings which might be helpful in this discussion. In summary, I found that CRC-32 and FNV-1a are superior for hashing short strings. They are efficient and produced widely distributed, collision-free hashes in my tests. I was surprised to find that MD5, SHA-1 and SHA-3 produced small numbers of collisions when the output was folded down to 32 bits.
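For reference, here is a minimal sketch of 32-bit FNV-1a. It is not taken from the page mentioned above; it uses the published FNV offset basis and prime, and the function name fnv1a_32 is only an illustrative choice.

```c
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime. */
uint32_t fnv1a_32(const char *str)
{
    uint32_t h = 2166136261u;           /* FNV offset basis (0x811C9DC5) */
    const unsigned char *p;

    for (p = (const unsigned char *)str; *p != '\0'; p++) {
        h ^= *p;                        /* FNV-1a: XOR first ...          */
        h *= 16777619u;                 /* ... then multiply (0x01000193) */
    }
    return h;
}
```

For a table with a power-of-two size, the bucket index would be something like fnv1a_32(tag) & (size - 1); for other sizes, use fnv1a_32(tag) % size.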
That depends on your hardware.
On modern hardware, i.e. Intel/AMD with SSE4.2 or ARMv8 with the CRC32 extension, you should use the hardware CRC32 intrinsics (_mm_crc32_u8 through _mm_crc32_u64 on x86), as they are optimal for short strings. (They work for long keys too, but there you are better off with Adler's threaded version, as in zlib.)
On old or unknown hardware, either probe at run time for the SSE4.2 or CRC32 feature, or just use one of the simple good hash functions, e.g. Murmur2 or City.
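A rough sketch of the run-time probe with a software fallback, assuming GCC or Clang on x86 (the target attribute and __builtin_cpu_supports are compiler extensions; other compilers and ARM need a different probe). The function names are made up for illustration. Note that the SSE4.2 instruction computes CRC-32C (Castagnoli polynomial), which differs from the zlib CRC-32, so the fallback uses the same polynomial to keep hashes identical on both paths.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <nmmintrin.h>   /* SSE4.2 intrinsics: _mm_crc32_u8 and friends */
#define HAVE_X86_CRC32 1
#endif

/* Portable bitwise CRC-32C, reflected polynomial 0x82F63B78. */
static uint32_t crc32c_sw(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef HAVE_X86_CRC32
/* Same checksum via the SSE4.2 instruction, one byte at a time.
   The target attribute lets this compile without -msse4.2. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Hypothetical entry point: probe the CPU, fall back to software. */
uint32_t crc32c(const void *data, size_t len)
{
#ifdef HAVE_X86_CRC32
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(data, len);
#endif
    return crc32c_sw(data, len);
}
```

Processing one byte per instruction keeps the sketch short and is adequate for tag names; for longer keys the 32-bit or 64-bit variants of the intrinsic consume more input per step.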
An overview of quality and performance is here: https://github.com/rurban/smhasher#smhasher
All the implementations are there as well. Favored are https://github.com/rurban/smhasher/blob/master/crc32_hw.c and https://github.com/rurban/smhasher/blob/master/MurmurHash2.cpp
If you know the keys in advance, use a perfect hash, not a hash function. E.g. gperf or my phash: https://github.com/rurban/Perfect-Hash#name
Nowadays perfect hash generation via a C compiler is so fast that you can even create the hashes on the fly and load them dynamically.
If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.
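As a sketch of the reduction step only (computing the digest itself is left to whatever crypto library you already use), here are two ways to get 32 bits out of a longer digest: plain truncation, and the XOR folding mentioned earlier in this thread. Both function names are hypothetical, and the big-endian byte order is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Truncation: keep only the first four digest bytes, read big-endian. */
uint32_t digest_truncate32(const unsigned char *digest)
{
    return ((uint32_t)digest[0] << 24) | ((uint32_t)digest[1] << 16) |
           ((uint32_t)digest[2] << 8)  |  (uint32_t)digest[3];
}

/* XOR folding: combine every complete 4-byte word of the digest.
   len is the digest length in bytes (16 for MD5, 20 for SHA-1). */
uint32_t digest_fold32(const unsigned char *digest, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i + 4 <= len; i += 4)
        h ^= digest_truncate32(digest + i);
    return h;
}
```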