List<T>.Contains() is very slow?

If you are just checking for existence, HashSet<T> in .NET 3.5 is your best option - dictionary-like performance, but no key/value pair - just the values:

    HashSet<int> data = new HashSet<int>();
    for (int i = 0; i < 1000000; i++)
    {
        data.Add(rand.Next(50000000));
    }
    bool contains = data.Contains(1234567); // etc

List.Contains is a O(n) operation.

Dictionary.ContainsKey is a O(1) operation, since it uses the hashcode of the objects as a key, which gives you a quicker search ability.

I don't think that it 's a good idea to have a List which contains a million entries. I don't think that the List class was designed for that purpose. :)

Isn't it possible to save those millon entities into a RDBMS for instance, and perform queries on that database ?

If it is not possible, then I would use a Dictionary anyway.


I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it still should be quite fast. But using the CLR Profiler [free download from Microsoft], I discovered that Contains() is boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]

Here's the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList" which is an abstract type for a list (array) of VI objects. The original implementation was in the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values don't have to be boxed. It's true and it worked... almost.

The CLR Profiler revealed a surprise. Here's a portion of the Allocation Graph:

  • ValueIdList::Contains bool(VI) 5.5MB (34.81%)
  • Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
  • Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
  • Values.VI 7.7MB (49.03%)

As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires the boxing of a VI value, which requires expensive heap allocation. It's odd that Microsoft would eliminate boxing on the list, only to require it again for a simple operation such as this.

Our solution was to re-write the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here's the simple code:

public bool Contains(VI id) 
{
  return IndexOf(id) >= 0;
}

public int IndexOf(VI id) 
{ 
  int i, count;

  count = _items.Count;
  for (i = 0; i < count; i++)
    if (_items[i] == id)
      return i;
  return -1;
}

public bool Remove(VI id) 
{
  int i;

  i = IndexOf(id);
  if (i < 0)
    return false;
  _items.RemoveAt(i);

  return true;
}

The comparison of VI values is now being done in our own version of IndexOf() which requires no boxing, and it's very fast. Our particular program sped up by 20% after this simple re-write. O(n)... no problem! Just avoid the wasted memory usage!