C++ algorithm to compare variable-length vectors and isolate the "unique" ones, sort of
Loop through the vectors and, for each vector, build a map of the counts of the unique values occurring in it. An `unordered_map<int, int>` would suffice for this; let's call it `M`. Also maintain a `set<unordered_map<int, int>>`, say `S`, ordered by the size of the `unordered_map<int, int>`s in decreasing order (ordering by size needs a custom comparator, since `unordered_map` has no ordering of its own).
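A minimal sketch of the counting step (the helper name `makeCountMap` is mine):

```cpp
#include <unordered_map>
#include <vector>

// Build the count map "M" for one vector: value -> number of occurrences.
std::unordered_map<int, int> makeCountMap(const std::vector<int>& v) {
    std::unordered_map<int, int> m;
    for (int x : v) ++m[x];
    return m;
}
```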
Now we have to compare the contents of `M` with the contents of the `unordered_map`s already in `S`. Let `M'` be the current `unordered_map` in `S` being compared with `M`. `M` is a subset of `M'` only when the count of every element in `M` is less than or equal to the count of its respective element in `M'`. If that's the case, it's a duplicate and we don't insert `M`; in every other case, we do. Also notice that if the size of `M` is greater than the size of `M'`, `M` can't be a subset of `M'`, which means we can insert `M` into `S`. This can be used as a precondition to speed things up. Maintain the indices of the vectors whose maps weren't inserted into `S`; these are the duplicates and have to be deleted from `vector_list` at the end.
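The subset test described above can be sketched like this (the function name is mine; it also applies the size precondition):

```cpp
#include <unordered_map>

// True when m ("M") is a sub-multiset of mPrime ("M'"): every count in m
// is <= the count of the same key in mPrime.
bool isSubsetOf(const std::unordered_map<int, int>& m,
                const std::unordered_map<int, int>& mPrime) {
    if (m.size() > mPrime.size()) return false;  // precondition from above
    for (const auto& [key, count] : m) {
        auto it = mPrime.find(key);
        if (it == mPrime.end() || count > it->second) return false;
    }
    return true;
}
```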
Time complexity: `O(N*M) + O(N^2*D) + O(N*log(N)) = O(N^2*D)`, where `N` is the number of vectors in `vector_list`, `M` is the average size of the vectors in `vector_list`, and `D` is the average size of the `unordered_map`s in `S`. This is for the worst case, when there aren't any duplicates. In the average case, when there are duplicates, the second term comes down.
Edit: The above procedure creates a problem. To fix it, we need to build the `unordered_map`s of all the vectors first, store them in a vector `V`, and sort that vector in decreasing order of the size of the `unordered_map`s. Then we start from the biggest map in `V` and apply the above procedure to it. This is necessary because a subset, say `M1`, of a set `M2` can be inserted into `S` before `M2` if the respective vector of `M1` comes before the respective vector of `M2` in `vector_list`; the subset would then wrongly survive because its superset hadn't been seen yet. So now we don't really need `S`; we can compare the maps within `V` itself. The complexity doesn't change.
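A sketch of this edit, building `V` and sorting it by decreasing map size (the function name is mine):

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Build one count map per input vector, then sort the maps biggest-first so
// that a bigger superset is always examined before any smaller subset of it.
std::vector<std::unordered_map<int, int>>
buildSortedMaps(const std::vector<std::vector<int>>& vectorList) {
    std::vector<std::unordered_map<int, int>> V;
    V.reserve(vectorList.size());
    for (const auto& vec : vectorList) {
        std::unordered_map<int, int> m;
        for (int x : vec) ++m[x];
        V.push_back(std::move(m));
    }
    std::sort(V.begin(), V.end(),
              [](const auto& a, const auto& b) { return a.size() > b.size(); });
    return V;
}
```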
Edit 2: The same problem can occur again when two `unordered_map`s in `V` have the same size, since their relative order after sorting is then unspecified. To fix that, we need to keep the contents of the maps in some order too. So replace `unordered_map` with `map`, and in the comparator function, when two `map`s have the same size, compare them element by element: at the first position where the keys are not the same, or the keys are the same but the mapped counts differ, put the map with the bigger element before the other in `V`.
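A sketch of such a comparator (the name is mine). Since `std::map` iterates its keys in sorted order, its built-in lexicographic comparison already performs the element-by-element tie-break described above:

```cpp
#include <map>

// Sort key for V: bigger maps first; on equal size, fall back to the maps'
// ordered contents so equal-size maps get a deterministic order and a
// superset never lands after its same-size subset.
bool mapBefore(const std::map<int, int>& a, const std::map<int, int>& b) {
    if (a.size() != b.size()) return a.size() > b.size();
    return a > b;  // lexicographic over (key, count) pairs, bigger first
}
```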
Edit 3: New time complexity: `O(N*M*log(D)) + O(N*D*log(N)) + O(N^2*D*log(D)) = O(N^2*D*log(D))`. Also, you might want to pair each `map` with the index of its respective vector in `vector_list`, so that you know which vector to delete from `vector_list` when you find a duplicate in `V`.
IMPORTANT: In the sorted `V`, we must start checking from the end, just to be safe (in case we choose to delete a duplicate from `vector_list`, as well as from `V`, whenever we encounter it). So for the last `map` in `V`, compare it with the rest of the `map`s before it to check whether it is a duplicate.
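Putting the edits together, a sketch of the duplicate-removal pass over the sorted `V` (names are mine); scanning from the end makes erasing in place safe:

```cpp
#include <map>
#include <vector>

// V must already be sorted biggest-first. Erase every map that is a
// sub-multiset of some map before it; the survivors are the "unique" ones.
void removeDuplicates(std::vector<std::map<int, int>>& V) {
    auto isSubsetOf = [](const std::map<int, int>& m,
                         const std::map<int, int>& big) {
        for (const auto& [key, count] : m) {
            auto it = big.find(key);
            if (it == big.end() || count > it->second) return false;
        }
        return true;
    };
    // Start from the last (smallest) map and compare with the maps before it.
    for (int i = static_cast<int>(V.size()) - 1; i > 0; --i) {
        for (int j = 0; j < i; ++j) {
            if (isSubsetOf(V[i], V[j])) {  // duplicate: drop it
                V.erase(V.begin() + i);
                break;
            }
        }
    }
}
```

Running this on the sorted `V` from the example below leaves exactly the four maps shown after deletion.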
Example:
vector_list = { {1, 2, 3}, {2, 3, 1}, {3, 2, 2}, {4, 2, 3, 2, 5}, {1, 2, 3, 4, 6, 2}, {2, 3, 4, 5, 6}, {1, 5} }
Creating maps of the respective vectors:
V = { {1->1, 2->1, 3->1}, {1->1, 2->1, 3->1}, {2->2, 3->1}, {2->2, 3->1, 4->1, 5->1}, {1->1, 2->2, 3->1, 4->1, 6->1}, {2->1, 3->1, 4->1, 5->1, 6->1}, {1->1, 5->1} }
After sorting:
V = { {1->1, 2->2, 3->1, 4->1, 6->1}, {2->1, 3->1, 4->1, 5->1, 6->1}, {2->2, 3->1, 4->1, 5->1}, {1->1, 2->1, 3->1}, {1->1, 2->1, 3->1}, {1->1, 5->1}, {2->2, 3->1} }
After deleting duplicates:
V = { {1->1, 2->2, 3->1, 4->1, 6->1}, {2->1, 3->1, 4->1, 5->1, 6->1}, {2->2, 3->1, 4->1, 5->1}, {1->1, 5->1} }
Edit 4: I tried coding it up. Running it 1,000 times on a list of 100 vectors, with each vector's size in the range [1-250] and the vector elements in the range [0-50], and assuming the input is available for all 1,000 runs, it takes around 2 minutes on my machine. It goes without saying that there is room for improvement in my code (and my machine).