unordered_map excess calls to hash function

Firstly, a couple of observations:

  • The unordered map is both a hash table, and a singly-linked list.

    See here that begin returns an iterator which models LegacyForwardIterator.

  • Inserting an entry into the map requires updating both the hash table and the linked list.

Secondly, a couple of notes on these containers' implementation decisions:

  • For singly-linked lists, it's common to have a sentinel node which doesn't contain any data (for something like Node<T>, it'll still have a T, just default-initialized). We only want it for its next pointer, because it helps keep list operations regular (ie, we don't have to write insert-at-the-head and insert-after-node as different special cases).

  • For hash tables (assuming linked-list buckets, since it's required by the standard) we can either use Node table[N] (so each bucket has its own sentinel preallocated) or Node* table[N].

    In this case, since we're actually using Node<T> and don't know the size of T, it seems reasonable to store a pointer for each bucket.

  • For a hash table which is also a singly-linked list, it makes sense to use the per-bucket list as (part of) the all-elements list. Otherwise we'd need to store two pointers per node, next_in_bucket and next_in_list.

    This means that the "sentinel" (one-before-the-beginning) node pointed to by a bucket is actually the last node of the previous bucket ... except for the bucket at the front of the list, when it really is the overall list sentinel.

    The comments in the code say

      /* ...
      *  The non-empty buckets contain the node before the first node in the
      *  bucket. This design makes it possible to implement something like a
      *  std::forward_list::insert_after on container insertion and
      *  std::forward_list::erase_after on container erase
      *  calls. _M_before_begin is equivalent to
      *  std::forward_list::before_begin. Empty buckets contain
      *  nullptr.  Note that one of the non-empty buckets contains
      *  &_M_before_begin which is not a dereferenceable node so the
      *  node pointer in a bucket shall never be dereferenced, only its
      *  next node can be.
    

    (the sentinel is _M_before_begin in this code)

So, when we add an element to an already-populated bucket, the steps are roughly

void insert_to_non_empty_bucket(Node *n, Key k) {
  Node *sentinel = table[k];
  n->next = sentinel->next;
  sentinel->next = n;
}

Note again that we don't know or care whether the sentinel here is the last element of the previous bucket, or the overall list sentinel. The code is the same either way (which was one of the reasons for using a sentinel in the first place).

However, when we add the first element to an empty bucket (and it is not the only non-empty bucket), we have one extra step: we need to update the sentinel pointer for the next bucket, to point to our new node. Otherwise we'd have two buckets both pointing to the list sentinel.

void insert_to_empty_bucket(Node *n, Key k) {
  Node *sentinel = &list_sentinel; // ie, &_M_before_begin
  n->next = sentinel->next;
  sentinel->next = n;

  // update the *next* bucket in the table
  table[n->next->key] = n;
}

Finally: in this implementation, Node doesn't cache the key, so there is no n->next->key. There is actually a trait controlling this, but it's clearly false in this case, which means that final line has to re-compute the hash in order to update the next bucket.


NB. just to clarify, when I say previous bucket or next bucket, I'm just talking about position in the list, where buckets appear in reverse order of when they became non-empty. It doesn't have anything to do with position in the table, or imply any intrinsic ordering.


As others pointed out, an unordered map, which is just a form of a hash table, is in libstdc++ implemented basically just as a single ("global") linked list. Plus, there is an array of buckets that point into this list. What is important is that the pointer stored in bucket[i] does not point to the first node that belongs to this bucket (according to hash function mapping), but to its predecessor in the global list instead. The reason is obvious — when you add an item into the singly-linked list, you need to update its predecessor. Here, when you need to insert an element into some bucket, you need to update the predecessor of the first node of this bucket.

However, the very first node of the global linked list does not have any predecessor. To make things unified, there is a sentinel node that plays this role. In libstdc++, it's a member variable _M_before_begin.

Let's assume that we have a hash table with keys A and B that belong to bucket[0] and a key C that belongs to bucket[1]. It may, for instance, look like as follows:

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

Now, when a new key, say D, is added into an empty bucket, say bucket[2], libstdc++ inserts it at the beginning of the global linked list.

Therefore, the situation after this insertion is as follows:

global linked list          buckets[]
------------------          ---------

_M_before_begin  <--------  bucket[2]
       |
       v
node_with_key_D  <--------  bucket[0]
       |
       v
node_with_key_A 
       |
       v
node_with_key_B  <--------  bucket[1]
       |
       v
node_with_key_C
       |
       x

Note that bucket[0] that corresponds with node_with_key_A pointed to by _M_before_begin needs to be updated. And, since, as again pointed to by others, libstdc++ does not cache hash values by default, the only option how to find a bucket index for node_with_key_A is to trigger a hash function.

Note that basically I just said the same as others, but wanted to add some illustrations that may help.


Another consequence of this approach is that hash function might be called during lookup: https://godbolt.org/z/K6qhWc. The reason is that the first element for some bucket is known, but not the last one. Therefore, the hash function for node keys needs to be resolved to find out whether a node still belongs to the actual bucket during the linked list traversal.


I can't explain why it is done that way, but it doesn't fit in a comment, so I leave it here in the answer section. You have two parts in the stdlib (10.1.0) upon insertion of an element:

__hash_code __code = __h->_M_hash_code(__k);

Which calculates the hash value of the element to insert __k.

And later on this part of the code:

    {
      // The bucket is empty, the new node is inserted at the
      // beginning of the singly-linked list and the bucket will
      // contain _M_before_begin pointer.
      __node->_M_nxt = _M_before_begin._M_nxt;
      _M_before_begin._M_nxt = __node;
      if (__node->_M_nxt)
        // We must update former begin bucket that is pointing to
        // _M_before_begin.
        _M_buckets[_M_bucket_index(__node->_M_next())] = __node;
      _M_buckets[__bkt] = &_M_before_begin;
    }

Where _M_bucket_index calculates the hash for __node->_M_next(), __node referes to the node created for __k.

Maybe that helps you or someone else to explain it further.