List<T>.Contains() is very slow?
If you are just checking for existence, HashSet<T>
in .NET 3.5 is your best option - dictionary-like performance, but no key/value pair - just the values:
HashSet<int> data = new HashSet<int>();
for (int i = 0; i < 1000000; i++)
{
data.Add(rand.Next(50000000));
}
bool contains = data.Contains(1234567); // etc
List.Contains is a O(n) operation.
Dictionary.ContainsKey is a O(1) operation, since it uses the hashcode of the objects as a key, which gives you a quicker search ability.
I don't think that it 's a good idea to have a List which contains a million entries. I don't think that the List class was designed for that purpose. :)
Isn't it possible to save those millon entities into a RDBMS for instance, and perform queries on that database ?
If it is not possible, then I would use a Dictionary anyway.
I think I have the answer! Yes, it's true that Contains() on a list (array) is O(n), but if the array is short and you are using value types, it still should be quite fast. But using the CLR Profiler [free download from Microsoft], I discovered that Contains() is boxing values in order to compare them, which requires heap allocation, which is VERY expensive (slow). [Note: This is .Net 2.0; other .Net versions not tested.]
Here's the full story and solution. We have an enumeration called "VI" and made a class called "ValueIdList" which is an abstract type for a list (array) of VI objects. The original implementation was in the ancient .Net 1.1 days, and it used an encapsulated ArrayList. We discovered recently in http://blogs.msdn.com/b/joshwil/archive/2004/04/13/112598.aspx that a generic list (List<VI>) performs much better than ArrayList on value types (like our enum VI) because the values don't have to be boxed. It's true and it worked... almost.
The CLR Profiler revealed a surprise. Here's a portion of the Allocation Graph:
- ValueIdList::Contains bool(VI) 5.5MB (34.81%)
- Generic.List::Contains bool(<UNKNOWN>) 5.5MB (34.81%)
- Generic.ObjectEqualityComparer<T>::Equals bool (<UNKNOWN> <UNKNOWN>) 5.5MB (34.88%)
- Values.VI 7.7MB (49.03%)
As you can see, Contains() surprisingly calls Generic.ObjectEqualityComparer.Equals(), which apparently requires the boxing of a VI value, which requires expensive heap allocation. It's odd that Microsoft would eliminate boxing on the list, only to require it again for a simple operation such as this.
Our solution was to re-write the Contains() implementation, which in our case was easy to do since we were already encapsulating the generic list object (_items). Here's the simple code:
public bool Contains(VI id)
{
return IndexOf(id) >= 0;
}
public int IndexOf(VI id)
{
int i, count;
count = _items.Count;
for (i = 0; i < count; i++)
if (_items[i] == id)
return i;
return -1;
}
public bool Remove(VI id)
{
int i;
i = IndexOf(id);
if (i < 0)
return false;
_items.RemoveAt(i);
return true;
}
The comparison of VI values is now being done in our own version of IndexOf() which requires no boxing, and it's very fast. Our particular program sped up by 20% after this simple re-write. O(n)... no problem! Just avoid the wasted memory usage!