How to approximate the count of distinct values in an array in a single pass through it

For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
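
For 16-bit values, a minimal sketch of that table-based count (the function name is just illustrative):

#include <cstdint>
#include <cstddef>
#include <vector>

// Single pass: one counter per possible 16-bit value. A counter going from
// zero to one means a new distinct value has been found.
std::size_t count_distinct_u16(uint16_t const* data, std::size_t n) {
  std::vector<std::size_t> counts(65536, 0);
  std::size_t distinct = 0;
  for (std::size_t i = 0; i != n; ++i) {
    if (counts[data[i]]++ == 0) { ++distinct; }
  }
  return distinct;
}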

For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
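
A sketch of that std::map variant, which stops as soon as the distinct count passes the threshold you care about (the names and the cap parameter are illustrative):

#include <cstdint>
#include <cstddef>
#include <map>

// Count distinct values with an ordered map; give up (return the cap) once
// there are more distinct values than we care to distinguish.
std::size_t count_distinct_capped(int32_t const* data, std::size_t n, std::size_t cap) {
  std::map<int32_t, std::size_t> counts;
  for (std::size_t i = 0; i != n; ++i) {
    counts[data[i]] += 1;
    if (counts.size() > cap) { return cap; }
  }
  return counts.size();
}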


I'm pretty sure you can do it by:

  1. Create a Bloom filter
  2. Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
  3. Count how many bits are set in the Bloom filter
  4. Compute back from the density of the filter to an estimate of the number of distinct values (a sketch of this follows the list). I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into it, because it's vital to the probability of the filter giving a false positive on a lookup.

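The density calculation the last step refers to is standard: after inserting n distinct values into a filter of m bits with k hash functions, roughly m*(1 - e^(-k*n/m)) bits end up set, so from X set bits you can solve back for n ~ -(m/k)*ln(1 - X/m). A minimal sketch, with the hash mixing and sizing chosen purely for illustration:

#include <cstdint>
#include <cstddef>
#include <cmath>
#include <vector>

struct BloomCounter {
  std::vector<bool> bits;
  std::size_t k;                          // number of hash functions

  BloomCounter(std::size_t m, std::size_t hashes) : bits(m, false), k(hashes) {}

  // Two cheap mixing hashes, combined by double hashing to get k probe
  // positions; they are illustrative, not carefully chosen.
  static uint32_t hash1(int32_t value) {
    uint32_t x = static_cast<uint32_t>(value);
    x ^= x >> 16; x *= 0x85ebca6bu; x ^= x >> 13; x *= 0xc2b2ae35u; x ^= x >> 16;
    return x;
  }
  static uint32_t hash2(int32_t value) {
    return static_cast<uint32_t>(value) * 2654435761u + 0x9e3779b9u;
  }

  void insert(int32_t value) {
    uint32_t const h1 = hash1(value);
    uint32_t const h2 = hash2(value) | 1u;        // odd stride
    for (std::size_t i = 0; i != k; ++i) {
      bits[(h1 + i * h2) % bits.size()] = true;
    }
  }

  // Estimate the number of distinct insertions from the bit density:
  //   n ~ -(m / k) * ln(1 - X / m),  X = set bits, m = filter size.
  double estimate() const {
    std::size_t set = 0;
    for (std::size_t i = 0; i != bits.size(); ++i) { if (bits[i]) { ++set; } }
    double const m = static_cast<double>(bits.size());
    if (set == bits.size()) { set -= 1; }         // saturated: only a rough lower bound
    return -(m / k) * std::log(1.0 - static_cast<double>(set) / m);
  }
};

To use it on an array, size the filter to several bits per expected distinct value, call insert for every element, then read estimate(); the denser the filter ends up, the noisier the estimate.
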
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are fewer than 10 distinct values you'll know exactly what they are and you don't need an estimate.

I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the single most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most frequently used value, since any of them could have a run of 100k occurrences at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them.

So in order to work out in a single pass which value is the most frequently used, I think you need to know the exact count of each value that appears in the first 9.9 million entries. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less -- but that doesn't help very much.

So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
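
A sketch of that bookkeeping, assuming the full table of counts fits in memory (the function and type names are just illustrative):

#include <cstdint>
#include <cstddef>
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

typedef std::pair<int32_t, std::size_t> ValueCount;

struct ByCountDescending {
  bool operator()(ValueCount const& left, ValueCount const& right) const {
    return left.second > right.second;    // larger counts first
  }
};

// One pass builds exact counts; the map's size is the number of distinct
// values, and the k most frequent values fall out of a partial sort.
std::vector<ValueCount> topKAndDistinct(int32_t const* begin, int32_t const* end,
                                        std::size_t k, std::size_t& distinct) {
  std::map<int32_t, std::size_t> counts;
  for (; begin != end; ++begin) { counts[*begin] += 1; }

  distinct = counts.size();

  std::vector<ValueCount> entries(counts.begin(), counts.end());
  std::size_t const top = std::min(k, entries.size());
  std::partial_sort(entries.begin(), entries.begin() + top, entries.end(),
                    ByCountDescending());
  entries.resize(top);
  return entries;
}

A hash map would do the same job with less overhead per element; the point is only that the distinct count comes for free once you are keeping exact counts anyway.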


One approach that can work, even for big values, is to spread them into lazily allocated buckets.

Suppose that you are working with 32-bit integers: creating an array of 2**32 bits is relatively impractical (that's 2**29 bytes, i.e. 512 MB). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes with 8-byte pointers, about 512 kB), so we create 2**16 buckets (null pointers).

The big idea therefore is to take a "sparse" approach to counting, and hope that the integers won't be too dispersed, and thus that many of the bucket pointers will remain null.

#include <cstdint>
#include <algorithm>
#include <utility>
#include <vector>

typedef std::pair<int32_t, int32_t> Pair;   // (value, count)
typedef std::vector<Pair> Bucket;           // kept sorted by value
typedef std::vector<Bucket*> Vector;        // 2**16 lazily allocated buckets

struct Comparator {
  bool operator()(Pair const& left, Pair const& right) const {
    return left.first < right.first;
  }
};

void add(Bucket& v, int32_t value) {
  Pair const pair(value, 1);
  Bucket::iterator it = std::lower_bound(v.begin(), v.end(), pair, Comparator());
  if (it == v.end() or it->first > value) {
    v.insert(it, pair);   // first occurrence of this value in the bucket
    return;
  }

  it->second += 1;        // value already known: bump its count
}

// `v` must have been created with 2**16 null entries: Vector v(65536, 0);
void gather(Vector& v, int32_t const* begin, int32_t const* end) {
  for (; begin != end; ++begin) {
    // Use the high 16 bits to pick a bucket; cast first so negative values
    // map cleanly into the 0..65535 range.
    uint16_t const index = static_cast<uint32_t>(*begin) >> 16;

    Bucket*& bucket = v[index];

    if (bucket == 0) { bucket = new Bucket(); }

    add(*bucket, *begin);
  }
}

Once you have gathered your data, you can count the number of different values or find the top or bottom pretty easily.
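
For instance, counting the distinct values once gather has run is just a matter of summing the bucket sizes (a small sketch built on the types above):

#include <cstddef>

// Each Pair in a bucket is one distinct value, so the total number of
// distinct values is the sum of the bucket sizes.
std::size_t countDistinct(Vector const& v) {
  std::size_t total = 0;
  for (Vector::const_iterator it = v.begin(); it != v.end(); ++it) {
    if (*it != 0) { total += (*it)->size(); }
  }
  return total;
}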

A few notes:

  • the number of buckets is completely customizable (thus letting you control how much memory is used up front)
  • the strategy for distributing values across buckets is customizable as well (what I have made here is just a cheap hash table)
  • it is possible to monitor the number of allocated buckets and to abandon, or switch gears, if it starts blowing up.
  • if every value is different, then it just won't work, but once you realize that, you will already have collected many counts, so you'll at least be able to give a lower bound on the number of different values, and you'll also have a starting point for the top/bottom.

If you manage to gather those statistics, then the rest of the work is straightforward.