Efficiently Delete Duplicates
Don't use Catenate
; it is unpacking your lists!
On["Packing"]
Catenate[list] // ByteCount
Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>
Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>
Developer`FromPackedArray::punpack1: Unpacking array with dimensions {30}. >>
General::stop: Further output of Developer`FromPackedArray::punpack1 will be suppressed during this calculation. >>
1440000080
Join
instead:
Join @@ list // ByteCount (* no messages *)
480000144
- On this packed array
DeleteDuplicates
works OK. MaxMemoryUsed
is significant but not overwhelming for modern hardware.Union
is slightly more memory efficient if order does not matter.
Code:
DeleteDuplicates[Join @@ list] // ByteCount (* 479855632 *)
DeleteDuplicates[Join @@ list] // MaxMemoryUsed (* 2039547560 *)
Union @@ list // MaxMemoryUsed (* 1936000536 *)
I would use Union @@ list
:
a := RandomInteger[10^10, 30];
list = Table[a, 2*^6];
MaxMemoryUsed[Union @@ list;]/MaxMemoryUsed[2 list;]
2.01666
If it is important that no sorting is applied, the following uses less memory than your Fold
approach and its faster, though still slow (~sqrt(300)x more time)
Catenate[Reap[Nest[(Sow[First[#]]; DeleteCases[Rest[#],
Alternatives @@ First[#]]) &, list, Length[list]]][[2, 1]]]