Algorithm - How to delete duplicate elements in a list efficiently?
Special Case: Hashing and Equality
Firstly, we need to determine something about the assumptions, namely the existence of an equals and has function relationship. What do I mean by this? I mean that for the set of source objects S, given any two objects x1 and x2 that are elements of S there exists a (hash) function F such that:
if (x1.equals(x2)) then F(x1) == F(x2)
Java has such a relationship. That allows you to check to duplicates as a near O(1) operation and thus reduces the algorithm to a simple O(n) problem. If order is unimportant, it's a simple one liner:
List result = new ArrayList(new HashSet(inputList));
If order is important:
List outputList = new ArrayList();
Set set = new HashSet();
for (Object item : inputList) {
if (!set.contains(item)) {
outputList.add(item);
set.add(item);
}
}
You will note that I said "near O(1)". That's because such data structures (as a Java HashMap or HashSet) rely on a method where a portion of the hash code is used to find an element (often called a bucket) in the backing storage. The number of buckets is a power-of-2. That way the index into that list is easy to calculate. hashCode() returns an int. If you have 16 buckets you can find which one to use by ANDing the hashCode with 15, giving you a number from 0 to 15.
When you try and put something in that bucket it may already be occupied. If so then a linear comparison of all entries in that bucket will occur. If the collision rate gets too high or you try to put too many elements in the structure will be grown, typically doubled (but always by a power-of-2) and all the items are placed in their new buckets (based on the new mask). Thus resizing such structures is relatively expensive.
Lookup may also be expensive. Consider this class:
public class A {
private final int a;
A(int a) { this.a == a; }
public boolean equals(Object ob) {
if (ob.getClass() != getClass()) return false;
A other = (A)ob;
return other.a == a;
}
public int hashCode() { return 7; }
}
This code is perfectly legal and it fulfills the equals-hashCode contract.
Assuming your set contains nothing but A instances, your insertion/search now turns into an O(n) operation, turning the entire insertion into O(n2).
Obviously this is an extreme example but it's useful to point out that such mechanisms also rely on a relatively good distribution of hashes within the value space the map or set uses.
Finally, it must be said that this is a special case. If you're using a language without this kind of "hashing shortcut" then it's a different story.
General Case: No Ordering
If no ordering function exists for the list then you're stuck with an O(n2) brute-force comparison of every object to every other object. So in Java:
List result = new ArrayList();
for (Object item : inputList) {
boolean duplicate = false;
for (Object ob : result) {
if (ob.equals(item)) {
duplicate = true;
break;
}
}
if (!duplicate) {
result.add(item);
}
}
General Case: Ordering
If an ordering function exists (as it does with, say, a list of integers or strings) then you sort the list (which is O(n log n)) and then compare each element in the list to the next (O(n)) so the total algorithm is O(n log n). In Java:
Collections.sort(inputList);
List result = new ArrayList();
Object prev = null;
for (Object item : inputList) {
if (!item.equals(prev)) {
result.add(item);
}
prev = item;
}
Note: the above examples assume no nulls are in the list.
Assuming order matters:
- Create an empty set S and an empty list M.
- Scan the list L one element at a time.
- If the element is in the set S, skip it.
- Otherwise, add it to M and to S.
- Repeat for all elements in L.
- Return M.
In Python:
>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> S = set()
>>> M = []
>>> for e in L:
... if e in S:
... continue
... S.add(e)
... M.append(e)
...
>>> M
[2, 1, 4, 3, 5, 6]
If order does not matter:
M = list(set(L))