Time complexity of creating the hash value of a string in a hashtable
Inserting, finding, and so on in a hashtable is O(1) in the sense that the cost is constant (or, more precisely, bounded) with regard to the number of elements in the table.
The "O(1)" in this context makes no claim about how fast you can compute your hashes. If the effort for this grows in some way, that is the way it is. However, I find it unlikely that the complexity of a decent (i.e. "fit for this application") hash function will ever be worse than linear in the "size" (i.e. the length, in our string example) of the object being hashed.
It's usually said that inserting and finding a string in a hashtable is O(1). But how is the hash key of a string computed? Why isn't it O(L), where L is the length of the string? It's clear to me why it's O(1) for integers, but not for strings.
The O(1) commonly quoted means the time doesn't grow with the number of elements in the container. As you say, the time to generate a hash value from a string might not itself be O(1) in the length of the string - though for some implementations it is: for example, Microsoft's C++ `std::hash<std::string>` has:
```cpp
size_t _Val = 2166136261U;          // FNV offset basis
size_t _First = 0;
size_t _Last = _Keyval.size();
size_t _Stride = 1 + _Last / 10;    // step through roughly a tenth of the characters

if (_Stride < _Last)
    _Last -= _Stride;
for (; _First < _Last; _First += _Stride)
    _Val = 16777619U * _Val ^ (size_t)_Keyval[_First];   // FNV-style multiply and XOR
return (_Val);
```
The `_Stride` is a tenth of the string length, so a fixed number of characters that far apart will be incorporated in the hash value. Such a hash function is O(1) in the length of the string.
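To see what that buys (and costs), here is a self-contained sketch of the same stride-sampling idea - the names are mine, and the real code lives inside MSVC's `std::hash<std::string>` - showing that only about ten characters are ever read, however long the key is, and that strings differing only in unsampled positions therefore collide:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative stand-alone version of the stride-sampling excerpt above.
std::size_t sampled_hash(const std::string& key)
{
    std::size_t val    = 2166136261U;
    std::size_t first  = 0;
    std::size_t last   = key.size();
    std::size_t stride = 1 + last / 10;   // roughly ten samples, whatever the length

    if (stride < last)
        last -= stride;
    for (; first < last; first += stride)
        val = 16777619U * val ^ static_cast<std::size_t>(key[first]);
    return val;
}

int main()
{
    std::string a(100, 'x');
    std::string b = a;
    b[1] = 'y';   // index 1 is never sampled when the stride is 11

    // Same hash despite different contents: the price of O(1)-in-length hashing.
    std::cout << sampled_hash(a) << '\n' << sampled_hash(b) << '\n';
}
```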
GCC's C++ Standard library takes a different approach: in v4.7.2 at least, it calls down through a `_Hash_impl` support class to the static non-member function `_Hash_bytes`, which does a Murmur hash incorporating every byte. GCC's `hash<std::string>` is therefore O(N) in the length of the string.
- GCC's higher prioritisation of collision minimisation is also evident in its use of prime numbers of buckets for `std::unordered_set` and `std::unordered_map`, which MS's implementation doesn't do - at least up until VS2013/VC12; in summary, MS's approach will be lighter-weight/faster for keys that aren't collision prone, and at lower load factors, but degrades earlier and more dramatically otherwise.
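To make the bucket-count remark concrete, here is a hedged sketch (illustrative only, not either library's literal code) of the two usual ways a hash value is reduced to a bucket index: a prime modulus pays a division but folds in all of the hash's bits, while a power-of-two mask is cheaper but keeps only the low bits, so any weakness there shows up as collisions:

```cpp
#include <cstddef>

// Prime bucket count: the modulo depends on every bit of the hash, which
// helps when the hash itself is weak or the keys are collision prone.
std::size_t bucket_prime(std::size_t hash, std::size_t bucket_count /* prime, e.g. 97 */)
{
    return hash % bucket_count;
}

// Power-of-two bucket count: the mask is cheaper than a division, but only
// the low bits of the hash ever influence the chosen bucket.
std::size_t bucket_pow2(std::size_t hash, std::size_t bucket_count /* power of two, e.g. 128 */)
{
    return hash & (bucket_count - 1);
}
```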
And is there any difference between how hash keys for strings are produced for a `Hashtable` in Java and an `unordered_map` in C++?
How strings are hashed is not specified by the C++ Standard - it's left to the individual compiler implementations. Consequently, different compromises are struck by different compilers - even different versions of the same compiler.
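Because the hasher is a template parameter of `std::unordered_map`, you can even pick your own compromise. The functor below is a toy of my own (not any library's implementation) that bounds the hashing cost by only reading a fixed-length prefix, trading away collision resistance for long keys that share a prefix:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>

// Toy hasher: FNV-style hash of at most the first 16 characters, so the
// hashing cost is bounded regardless of key length.
struct PrefixHash {
    std::size_t operator()(const std::string& s) const
    {
        std::size_t h = 2166136261U;
        std::size_t n = std::min<std::size_t>(s.size(), 16);
        for (std::size_t i = 0; i < n; ++i)
            h = 16777619U * h ^ static_cast<unsigned char>(s[i]);
        return h;
    }
};

int main()
{
    // The third template argument replaces the implementation-defined std::hash.
    std::unordered_map<std::string, int, PrefixHash> table;
    table["some key"] = 42;
}
```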
The documentation David Pérez Cabrera's answer links to explains the `hashCode` function in Java:

> Returns a hash code for this string. The hash code for a String object is computed as
>
> `s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]`
>
> using `int` arithmetic, where `s[i]` is the *i*th character of the string, `n` is the length of the string, and `^` indicates exponentiation. (The hash value of the empty string is zero.)
That's clearly O(N) in the length of the string.
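The docs state the formula as a sum of powers of 31, but in practice it is evaluated with Horner's rule: one multiply-and-add per character, in a single pass, which is exactly where the O(N) comes from. Here is a hedged rendering in C++ (to stay consistent with the other snippets here; Java's real implementation lives in `java.lang.String` and works on 16-bit chars, whereas this sketch works on bytes):

```cpp
#include <cstdint>
#include <string>

// Horner's-rule form of s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using
// wrapping 32-bit arithmetic to mimic Java's int. One pass over the string,
// so the cost is O(N) in its length.
std::int32_t java_style_hash(const std::string& s)
{
    std::uint32_t h = 0;                     // unsigned for well-defined wraparound
    for (unsigned char c : s)
        h = 31u * h + c;                     // h = h*31 + s[i]
    return static_cast<std::int32_t>(h);     // empty string hashes to 0, as documented
}
```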
Returning quickly to...

> It's usually said that inserting and finding a string in a hashtable is O(1).
...a "key" ;-P insight is that in many problem domains, the real-world lengths of the strings is known not to vary significantly, or hashing for the worst-case length is still plenty fast enough. Consider a person's or company's name, a street address, an identifier from some source code, a programming-language keyword, a product/book/CD etc name: you can expect a billion keys to take roughly a million times more memory to store than the first thousand. With a hash table, most operations on the entire data set can be expected to take a million times longer. And this will be as true in 100 years' time as it is today. Importantly, if some request comes in related to a single key, it shouldn't take much longer to perform than it used to with a thousand keys (assuming sufficient RAM, and ignoring CPU caching effects) - though sure, if it's a long key it may take longer than for a short key, and if you have ultra-low-latency or hard-realtime requirements, you may care. But, the average throughput for requests with random keys will be constant despite having a million times more data.
Only when you have a problem domain with massive variance in key size, and the key-hashing time is significant given your performance needs, or when you expect the average key size to increase over time (e.g. if the keys are video streams, and every few years people bump up resolutions and frame rates, creating exponential growth in key size), will you need to pay close attention to the hashing (and key comparison) costs.