Julia: How to return the number of unique elements in an Array
If you want an exact answer length(unique(x))
is as efficient as it gets for general objects. If your values have a limited domain, eg UInt8
, it may be more efficient to use a fixed size table. If you can accept an approximation, then you can use the HyperLogLog data structure / algorithm, which is implemented in the OnlineStats package:
https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog
It appears that length(Set(x))
is somewhat faster than length(unique(x))
.
julia> using StatsBase, BenchmarkTools
julia> num_unique(x) = length(Set(x));
julia> a = sample(1:100, 200);
julia> num_unique(x) == length(unique(x))
true
julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 498.130 μs (0.00% GC)
median time: 570.588 μs (0.00% GC)
mean time: 579.011 μs (2.41% GC)
maximum time: 2.321 ms (63.03% GC)
--------------
samples: 5264
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 288.68 KiB
allocs estimate: 8
--------------
minimum time: 283.031 μs (0.00% GC)
median time: 393.317 μs (0.00% GC)
mean time: 397.878 μs (4.24% GC)
maximum time: 33.499 ms (98.80% GC)
--------------
samples: 6704
evals/sample: 1
And another benchmark for an array of strings:
julia> using Random
julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 818.024 μs (0.00% GC)
median time: 895.944 μs (0.00% GC)
mean time: 906.568 μs (1.61% GC)
maximum time: 1.964 ms (51.19% GC)
--------------
samples: 3049
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 144.68 KiB
allocs estimate: 8
--------------
minimum time: 367.018 μs (0.00% GC)
median time: 378.666 μs (0.00% GC)
mean time: 384.486 μs (1.07% GC)
maximum time: 1.314 ms (70.80% GC)
--------------
samples: 4527
evals/sample: 1
if you don't need the x array after, length(unique!(x))
is slightly faster.
with Floats and Integers, you can use map reduce if your array is already sorted.
function count_unique_sorted(x)
f(a) = (a,0)
function op(a,b)
if a[1] == b[1]
return (b[1],a[2])
else
return (b[1],a[2]+1)
end
end
return mapreduce(f,op,x)[2]+1
end
If you don't care about the order of the array x
, you can sort and count in one function:
count_unique_sorted!(x)=count_unique_sorted(sort!(x))
Some benchmarks:
using Random,StatsBase, BenchmarkTools
x = sample(1:100,200)
length(unique(x)) == count_unique_sorted(sort(x)) #true
Using length(unique(x))
:
@benchmark length(unique(x))
BenchmarkTools.Trial:
memory estimate: 6.08 KiB
allocs estimate: 17
--------------
minimum time: 3.350 μs (0.00% GC)
median time: 3.688 μs (0.00% GC)
mean time: 5.352 μs (24.35% GC)
maximum time: 6.691 ms (99.90% GC)
--------------
samples: 10000
evals/sample: 8
Using Set
:
@benchmark length(Set(x))
BenchmarkTools.Trial:
memory estimate: 2.82 KiB
allocs estimate: 8
--------------
minimum time: 2.256 μs (0.00% GC)
median time: 2.467 μs (0.00% GC)
mean time: 3.654 μs (26.04% GC)
maximum time: 5.297 ms (99.91% GC)
--------------
samples: 10000
evals/sample: 9
Using count_unique_sorted!
:
x2 = copy(x)
@benchmark count_unique_sorted!(x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 948.387 ns (0.00% GC)
median time: 990.323 ns (0.00% GC)
mean time: 1.038 μs (0.00% GC)
maximum time: 2.481 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 31
Using count_unique_sorted
with an already sorted array
x3 = sort(x)
@benchmark count_unique_sorted(x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 140.962 ns (0.00% GC)
median time: 146.831 ns (0.00% GC)
mean time: 154.121 ns (0.00% GC)
maximum time: 381.806 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 852
Using count_unique_sorted
and sorting the array
@benchmark count_unique_sorted(sort(x))
BenchmarkTools.Trial:
memory estimate: 1.77 KiB
allocs estimate: 1
--------------
minimum time: 1.470 μs (0.00% GC)
median time: 1.630 μs (0.00% GC)
mean time: 2.367 μs (21.82% GC)
maximum time: 4.880 ms (99.94% GC)
--------------
samples: 10000
evals/sample: 10
For strings, sorting and counting is slower than making a Set.