How to create a HashSet<List<Int>> with distinct elements?
This starts off wrong, it has to be a HashSet<ReadOnlyCollection<>>
because you cannot allow the lists to change and invalidate the set predicate. This then allows you to calculate a hash code in O(n) when you add the collection to the set. And an O(n) test to check if it is already in the set with a very uncommon O(n^2) worst case if all the hashes turn out to be equal. Store the computed hash with the collection.
Here is a possible comparer that compares an IEnumerable<T>
by its elements. You still need to sort manually before adding.
One could build the sorting into the comparer, but I don't think that's a wise choice. Adding a canonical form of the list seems wiser.
This code will only work in .net 4 since it takes advantage of generic variance. If you need earlier versions you need to either replace IEnumerable
with List
, or add a second generic parameter for the collection type.
class SequenceComparer<T>:IEqualityComparer<IEnumerable<T>>
{
public bool Equals(IEnumerable<T> seq1,IEnumerable<T> seq2)
{
return seq1.SequenceEqual(seq2);
}
public int GetHashCode(IEnumerable<T> seq)
{
int hash=1234567;
foreach(T elem in seq)
hash=hash*37+elem.GetHashCode();
return hash;
}
}
void Main()
{
var hashSet = new HashSet<List<int>>(new SequenceComparer<int>());
List<int> test=new int[]{1,3,2}.ToList();
test.Sort();
hashSet.Add(test);
List<int> test2=new int[]{3,2,1}.ToList();
test2.Sort();
hashSet.Contains(test2).Dump();
}
Is there a reason you aren't just using an array? int[]
will perform better. Also I assume the lists contain duplicates, otherwise you'd just be using sets and not have a problem.
It appears that their contents won't change (much) once they've been added to the HashSet
. At the end of the day, you are going to have to use a comparer that falls back on SequenceEqual
. But you don't have to do it every single time. Instead or doing an exponential number of sequence compares (e.g. -- as the hashset grows, doing a SequenceEqual
against each existing member) -- if you create a good hashcode up front, you may have to do very few such compares. While the overhead of generating a good hashcode is probably about the same as doing a SequenceEqual
you're only doing it a single time for each list.
So, the first time you operate on a particular List<int>
, you should generate a hash based on the ordered sequence of numbers and cache it. Then the next time the list is compared, the cached value can be used. I'm not sure how you might do this with a comparer off the top of my head (maybe a static dictionary?) -- but you could implement List
wrapper that does this easily.
Here's a basic idea. You'd need to be careful to ensure that it isn't brittle (e.g. make sure you void any cached hash code when members change) but it doesn't look like that's going to be a typical situation for the way you're using this.
public class FasterComparingList<T>: IList<T>, IList, ...
/// whatever you need to implement
{
// Implement your interfaces against InnerList
// Any methods that change members of the list need to
// set _LongHash=null to force it to be regenerated
public List<T> InnerList { ... lazy load a List }
public int GetHashCode()
{
if (_LongHash==null) {
_LongHash=GetLongHash();
}
return (int)_LongHash;
}
private int? _LongHash=null;
public bool Equals(FasterComparingList<T> list)
{
if (InnerList.Count==list.Count) {
return true;
}
// you could also cache the sorted state and skip this if a list hasn't
// changed since the last sort
// not sure if native `List` does
list.Sort();
InnerList.Sort();
return InnerList.SequenceEqual(list);
}
protected int GetLongHash()
{
return .....
// something to create a reasonably good hash code -- which depends on the
// data. Adding all the numbers is probably fine, even if it fails a couple
// percent of the time you're still orders of magnitude ahead of sequence
// compare each time
}
}
If the lists won't change once added, this should be very fast. Even in situations where the lists could change frequently, the time to create a new hash code is not likely very different (if even greater at all) than doing a sequence compare.