Efficiently merge lists chronologically without duplicates?
Update: You can use TopologicalSort
after turning each list to a list of rules reflecting the ordering of its elements:
ClearAll[topoSort]
topoSort = TopologicalSort[Flatten[Thread[Most@# -> Rest@#] & /@ #]] &;
lists = {{a1, a5}, {a3, a2, a5}, {a1, a2}} ;
topoSort @ lists
{a1, a3, a2, a5}
Timings: Using a scaled down version of OP's data:
SeedRandom[1]
{n, m, k} = {1000, 300, 500};
alist = RandomSample[Table[ToExpression["a" <> ToString[i]], {i, 1, n}]];
range = Range[n];
Do[
len = RandomInteger[{m, n}];
select = Sort[RandomSample[range, len]];
sublists[j] = alist[[select]];
, {j, 1, k}]
allsublists = Array[sublists, k];
f = mergeChron[allsublists]; // RepeatedTiming // First
1.14
g = merge[allsublists]; // RepeatedTiming // First
1.1
h = topoSort[allsublists]; // RepeatedTiming // First
0.313
Equal[f, g, h]
True
With {n, m, k} = {2000, 600, 1000}
, the timings are 9.44
, 5.69
and 1.64
for mergeChron
, merge
and topoSort
, respectively,
Original answer: (this needs further work ...)
Fold[Experimental`ShortestSupersequence, sublists /@ Range[3]]
{a3, a1, a2, a5}
Also
Fold[Experimental`ShortestSupersequence, RotateRight[sublists /@ Range[3]]]
{a1, a3, a2, a5}
In case this produces an output with duplicates we can use
DeleteDuplicates @ Fold[Experimental`ShortestSupersequence, sublists /@ Range[5000]]
I think the following works but please give it a try with your data. I've tried quite a few random examples and it has worked for all of them.
First, construct a directed graph with edges defined by neighbors in the sublists:
allsublists = Array[sublists, 5000]; (* or whatever the max count is *)
G = Graph[Join @@ (BlockMap[Apply[Rule], #, 2, 1] & /@ allsublists)]
Next, look at the GraphDistanceMatrix
and count how many times the symbol ∞
appears in each row:
infcount = Count[∞] /@ GraphDistanceMatrix[G]
Next, sort the graph vertices according to this infcount
:
alist = SortBy[Transpose[{VertexList[G], infcount}], Last][[All, 1]]
All together in one function:
merge[L_] :=
With[{G = Graph[Join @@ (BlockMap[Apply[Rule], #, 2, 1] & /@ L)]},
SortBy[Transpose[{VertexList[G], Count[∞] /@ GraphDistanceMatrix[G]}], Last][[All, 1]]]
example
The given example is
allsublists = Array[sublists, 3]
(* {{a1, a5}, {a3, a2, a5}, {a1, a2}} *)
Construct the directed neighbor graph:
G = Graph[Join @@ (BlockMap[Apply[Rule], #, 2, 1] & /@ allsublists),
VertexLabels -> Automatic]
Have a look at the graph distance matrix:
VertexList[G]
(* {a1, a5, a3, a2} *)
GraphDistanceMatrix[G]
(* {{0, 1, ∞, 1},
{∞, 0, ∞, ∞},
{∞, 2, 0, 1},
{∞, 1, ∞, 0}} *)
count the infinities in each row, i.e., for each vertex we count how many other vertices are unreachable:
infcount = Count[∞] /@ GraphDistanceMatrix[G]
(* {1, 3, 1, 2} *)
sort the vertices by the number of infinities (i.e. by the number of unreachable other vertices):
alist = SortBy[Transpose[{VertexList[G], infcount}], Last][[All, 1]]
(* {a1, a3, a2, a5} *)
Using PositionIndex
and Merge
,
mergeChron[x_] := Merge[PositionIndex /@ x, Max] // Sort // Keys;
Test:
Case1:
allsublists= {{a1, a5}, {a3, a2, a5}, {a1, a2}};
mergeChron[allsublists]
{a1, a3, a2, a5}
Case2:
alist = RandomSample[Table[ToExpression["a" <> ToString[i]], {i, 1, 1000}]];
range = Range[1000];
Do[len = RandomInteger[{450, 1000}];
select = Sort[RandomSample[range, len]];
sublists[j] = alist[[select]];, {j, 1, 300}]
For num = 300
, all three of the methods produce the same output. However, that's not the case for num = 30
. It might be because of the existence of multiple solutions (as discussed in comments).
num = 300;
allsublists = Array[sublists, num];
f = mergeChron[allsublists];
g = merge[allsublists]; (*Roman's*)
h = topoSort[allsublists]; (*kglr's*)
f == g == h
True