Monitoring a data stream

An SMS/tweet answer: you need fewer than $4$ million samples and less than $2$ seconds on most modern computing platforms (including some smartphones), without sophisticated algorithms.

Below, there's a more complete answer, in six parts. The first part provides a rough intuition of what one may want to do. The second part details an implementation that attempts to strike a balance between simplicity and efficiency. The third part provides a rigorous proof of correctness (essentially applying Chernoff bounds to the probability of an incorrect answer). The fourth part estimates the time and computational resources required. The fifth part describes some optimizations that are not really worth the effort in this case, but may be for different problem parameters. The sixth part looks at how things change if you change the parameters of the problem, e.g. increasing the keyspace, changing the probability distributions, or reducing the available memory.

1 A ROUGH INTUITION

The basic idea is to choose a data structure that

a) can be "filled" with values just once and is then both b) compact and c) efficient to query for the presence or absence of a given value.

The simplest implementation would probably be a sorted array: in $2^{20}$ bytes one can store $2^{18}$ values, and check the presence or absence in the array of any given value with just $18$ comparisons through binary search.

After filling the data structure, one then simply checks whether each further value produced by the data stream is present or absent. Informally, if the data structure holds $k$ values, and every $32$-bit value has a probability $2^{-32}$ of appearing in a given sample, then the probability that a given sample is found in the data structure is $k 2^{-32}$. If, on the other hand, half the values have a probability $\approx 2^{-31}$ of appearing in a given sample, and half the values have a probability $\approx 2^{-41}$, then almost all the values in the data structure are "high probability" ones and the probability that each new sample will be present in the data structure is $\approx k 2^{-31}$, i.e. about twice that for the case of equiprobable values. Then, determining whether values are equiprobable or not is simply a matter of taking a "sufficiently large" number of samples $s$, checking for each whether a collision occurs (i.e. whether the sample is already present in the data structure), and determining whether the number of collisions is closer to $s k 2^{-32}$ (what we'd expect with equiprobable values) or to $s k 2^{-31}$ (what we'd expect otherwise).

What is a "sufficiently large" number of samples? A simple Chernoff bound shows that if the probability gap between the "high probability" and "low probability" values is sufficiently large (as in this case), you need of the order of $30 \cdot \frac{\text{values in keyspace}}{\text{values in data structure}} \cdot \log_e\frac{1}{\text{error probability}}$ samples (in addition to the samples needed to fill your data structure in the first place). The simple algorithm below keeps about $0.25$M samples in the data structure, and needs to look at about $3.4$M additional samples.
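
For the concrete parameters used below ($2^{18}$ values in the data structure, a $2^{32}$-value keyspace, and a tolerable error probability of $0.001$), this works out to

$$30 \cdot \frac{2^{32}}{2^{18}} \cdot \log_e\frac{1}{0.001} = 30 \cdot 2^{14} \cdot \log_e(1000) \approx 3.4 \text{ million samples.}$$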

Note that to a first approximation, the number of samples one has to take to achieve a given confidence is inversely proportional to the number of values in the data structure. That’s why requisite b) (a “compact” data structure) is important: a data structure with twice the capacity requires half the samples. Of course, one has to process samples sufficiently quickly to keep up with the data stream. That’s why requisite c) (efficiency in determining whether a given sample produced a collision) is also important.

2 A SIMPLE IMPLEMENTATION

This is the pseudocode of a simple implementation of the scheme above. Formatting is ugly, but it should be legible and very easy to recode in your favourite programming language.

  1. $CAPACITY=2^{20}$ bytes/($4$ bytes/value)=$2^{18}$ values
  2. $CONFIDENCE = 0.999$
    • N.B. does not work for arbitrarily high CONFIDENCE, see analysis below
  3. allocate new array, capable of holding CAPACITY values
  4. for $i=0…CAPACITY-1$ array[i]=new_sample from datastream
  5. sort array
  6. $duplicates = 0$
  7. for $i=1…CAPACITY-1$ if (array[i]==array[i-1]) then $duplicates=duplicates+1$
  8. if $duplicates > CAPACITY/2$ restart from line 4!

  9. $SAMPLES=\frac{30 \cdot 2^{32} \cdot \log_e(\frac{1}{1-CONFIDENCE})}{CAPACITY-duplicates}$

  10. $THRESHOLD=40 \cdot \log_e(\frac{1}{1-CONFIDENCE})$

  11. $collisions = 0$

  12. for $i=0…SAMPLES-1$
    12a. sample = new_sample from datastream
    12b. if (array contains sample) then collisions=collisions+1
  13. if ($collisions \leq THRESHOLD$) then return EQUIPROBABLE else return NOT_EQUIPROBABLE

Note: To check whether the array contains a sample, just use binary search:

  1. $left=0$, $right=CAPACITY-1$
  2. while ($left<right$)

    2a. $middle =$ round_down$((left+right)/2)$

    2b. if $(sample > array[middle])$ then $left = middle+1$ else $right = middle$

  3. if (array[left]==sample) then return TRUE else return FALSE
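
For concreteness, here is a minimal Python sketch of the same scheme. The function new_sample() is a placeholder for whatever call reads one $32$-bit value from your data stream; everything else follows the pseudocode above (with the binary search done via bisect_left).

    import math
    from bisect import bisect_left

    CAPACITY = 2**20 // 4      # 2^18 values fit in 2^20 bytes at 4 bytes/value
    CONFIDENCE = 0.999
    KEYSPACE = 2**32

    def is_equiprobable(new_sample):
        # new_sample: placeholder callable returning one 32-bit value from the stream.
        while True:
            # Pseudocode lines 3-7: fill the array, sort it, count duplicates.
            array = sorted(new_sample() for _ in range(CAPACITY))
            duplicates = sum(array[i] == array[i - 1] for i in range(1, CAPACITY))
            if duplicates <= CAPACITY // 2:      # line 8: restart only in a pathological case
                break
        log_term = math.log(1.0 / (1.0 - CONFIDENCE))
        samples = int(30 * KEYSPACE * log_term / (CAPACITY - duplicates))   # line 9
        threshold = 40 * log_term                                           # line 10
        collisions = 0                                                      # line 11
        for _ in range(samples):                                            # line 12
            s = new_sample()
            i = bisect_left(array, s)            # binary search in the sorted array
            if i < CAPACITY and array[i] == s:
                collisions += 1
        return collisions <= threshold           # line 13: True means EQUIPROBABLE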

3 TERMINATION AND CORRECTNESS

Note that the algorithm eventually terminates with an answer: the only possible loop is the "restart" on line 8, triggered when more than half the values in memory are duplicates of some other value. This occurs with probability less than $1/2$ if the number of values stored in memory is less than $2^{30}$ (and with a really, really tiny probability when it is $2^{18}$, as here).

All we have to prove is that the probability of an incorrect answer is at most $1-CONFIDENCE$. In other words, we want to prove that the probability that $collisions > THRESHOLD$ if the values are equiprobable, and the probability that $collisions \leq THRESHOLD$ if the values are not equiprobable, are both at most $1-CONFIDENCE$.

To prove it, we use the simple Chernoff bound stating that

if the sum of an arbitrary number of independent random variables with values in $[0,1]$ has expectation $\mu$ then for any $\delta\in(0,1]$ the probabilities that the actual sum

a. is at least $(1+\delta)\mu$

b. is at most $(1-\delta)\mu$

are both no more than $e^{-\frac{\delta^2}{3}\mu}$.
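
In symbols, writing $X$ for the sum and $\mu$ for its expectation:

$$\Pr[X \geq (1+\delta)\mu] \leq e^{-\frac{\delta^2}{3}\mu} \quad\text{and}\quad \Pr[X \leq (1-\delta)\mu] \leq e^{-\frac{\delta^2}{3}\mu} \qquad \text{for any } \delta\in(0,1].$$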

Informally, since in one case we expect an average number of collisions about twice that of the other, we've chosen a threshold somewhere between the two, and a number of samples that makes the expected number of collisions $\mu$ proportional to the logarithm of the inverse of the "tolerable" probability of error.

If the values are equiprobable, then it’s easy to see that the number of expected collisions is $expect_{equiprobable}=SAMPLES \cdot (CAPACITY -duplicates) \cdot 2^{-32}=30 \log_e(\frac{1}{1-CONFIDENCE})$. The probability that it's at least $(1+\frac{1}{3})$ times this value, i.e. at least THRESHOLD, is at most $e^{-\frac{(1/3)^2}{3}expect_{equiprobable}}<e^{- \log_e(\frac{1}{1-CONFIDENCE})}=1-CONFIDENCE$, and we are fine.
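
As a sanity check, with $CONFIDENCE=0.999$ these quantities are concrete: $expect_{equiprobable}=30\log_e(1000)\approx 207$, $THRESHOLD=40\log_e(1000)\approx 276$, and the bound above evaluates to $e^{-207/27}\approx 0.0005 < 0.001 = 1-CONFIDENCE$.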

If the values are not equiprobable, let's consider only collisions between "high probability" values and show that even those alone are very likely to exceed THRESHOLD. We expect a fraction greater than $1-2^{-10}$ of all values in the array, and of all subsequent samples from the data stream, to be "high probability" values. The probability that the fraction of "low probability" values exceeds $(1+\delta) 2^{-10}$ is in each case at most, respectively, $e^{-\frac{\delta^2}{3}(2^{-10})(CAPACITY-duplicates)}$ and $e^{-\frac{\delta^2}{3}(2^{-10})(SAMPLES)}$. Taking the first of the two bounds (which is the higher one), setting e.g. $\delta=1$, and doubling it to cover both cases gives a bound on the probability of having more than twice the expected number of "low probability" values in either the array or the samples: this probability is at most $p_{lowval}=2\cdot e^{-\frac{2^7}{12}}<0.00005$. If that unlikely event does not occur, we have at least $(1-2^{-9})(CAPACITY-duplicates)$ "high probability" values in our array, and at least $(1-2^{-9})SAMPLES$ "high probability" values among our samples. The expected number of collisions between "high probability" values is then at least $expect_{notequiprobable}=(1-2^{-9})^2 \cdot SAMPLES \cdot (CAPACITY-duplicates) \cdot 2^{-31}=60 (1-2^{-9})^2 \log_e(\frac{1}{1-CONFIDENCE})>\frac{3}{2} (1-2^{-8}) THRESHOLD$. Applying the Chernoff bound again with $1-\delta=\frac{1}{(3/2)(1-2^{-8})}$, and remembering to add the probability $p_{lowval}$ of more than twice the expected number of "low probability" values computed above, after some tedious number crunching we find that the probability that non-equiprobable values yield at most $THRESHOLD$ collisions is again at most $1-CONFIDENCE$; so again we are fine.
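
Making that number crunching explicit for $CONFIDENCE=0.999$: $expect_{notequiprobable}\geq 60(1-2^{-9})^2\log_e(1000)\approx 413$, $\delta=1-\frac{1}{(3/2)(1-2^{-8})}\approx 0.33$, so the Chernoff bound gives $e^{-\frac{\delta^2}{3}\cdot 413}\approx e^{-15}<10^{-6}$; adding $p_{lowval}<0.00005$ keeps the total probability of error well below $0.001=1-CONFIDENCE$.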

4 PERFORMANCE ANALYSIS

On most modern computing platforms (including laptops, and possibly some high-performance smartphones) a relatively straightforward coding of the pseudocode above will need fewer than $4$ million samples and less than $2$ seconds to determine an answer with the desired confidence. To see this, note that the algorithm above (and in fact any algorithm for similar tasks) requires:

a) a preprocessing phase in which you “fill” the data structure

b) a certain number of additional samples from the datastream and

c) a certain amount of processing for each of these samples. If the processing time/sample is less than the time between samples, your algorithm will be waiting between one sample and the next; if it is greater, all that will happen is that you’ll be forced to discard a fraction of samples from the datastream, and take proportionally longer.

Let’s look at the terms individually.

a) In terms of preprocessing, the algorithm simply fills the memory with values, sorts them, and counts duplicates. Modern sorting routines can easily sort integers at well over $10$ MB/s, and one more pass to count duplicates takes almost negligible time; so filling your $2^{20}$ bytes of data structure will take less than $0.2$ s (i.e. $0.1$ s to read the data, and less than $0.1$ s to process it).

b) In terms of subsequent samples, for the desired accuracy you need $30 \cdot 2^{14} \cdot \log_e(1/0.001) < 3.4$M samples. Your datastream produces them at $2.5$M/second ($10$ MB/s of $4$-byte values), so if you can keep up with the datastream, you'll need less than $1.5$ seconds.

c) Let's look at the processing time for each sample (after initializing the data structure). Each sample requires only $18$ lookups and comparisons for the binary search, plus possibly incrementing two counters (collisions and samples seen). On most modern computers (in fact, even on some smartphone processors) the entire $2^{20}$ bytes of the data structure fit in the processor cache, and performing those operations probably takes, in total, no more than $200$ cycles - so on a $1$ GHz processor you can process $5$ million values per second, i.e. $20$ MB/s, or twice what you need.

5 FURTHER OPTIMIZATIONS

To a first approximation, if you can double the number of samples in your data structure, you can halve the number of samples you need afterwards.

One way to increase the number of values stored is to use some compression scheme; a simple one that works really well is to note that you can group your data into $2^{16}$ "bins", each holding the values that share the same $16$-bit prefix, which you then do not need to store. This "reorganization" is almost free with most efficient sorting schemes, does not significantly slow down search (in fact, it can speed it up if for each "bin" you annotate the start and end position, so you can quickly "zoom" to it) and saves you almost a factor $2$ in space (and thus in samples).
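
To make the binning idea concrete, here is a minimal Python sketch (an illustration only: it assumes the sorted array of $32$-bit values built earlier, stores only the low $16$ bits of each value at $2$ bytes apiece, and keeps a separate table of per-bin start offsets):

    from array import array
    from bisect import bisect_left

    def build_binned(values):
        # values: the sorted array of 32-bit samples built earlier.
        low16 = array('H', (v & 0xFFFF for v in values))   # 2 bytes/value instead of 4
        starts = [0] * (2**16 + 1)        # starts[p]..starts[p+1] delimit the bin for prefix p
        for v in values:
            starts[(v >> 16) + 1] += 1
        for p in range(2**16):            # prefix sums turn counts into bin boundaries
            starts[p + 1] += starts[p]
        return low16, starts

    def contains(low16, starts, sample):
        lo, hi = starts[sample >> 16], starts[(sample >> 16) + 1]   # "zoom" to the right bin
        i = bisect_left(low16, sample & 0xFFFF, lo, hi)             # search only within it
        return i < hi and low16[i] == sample & 0xFFFF

Here contains() does a binary search only inside the bin matching the sample's $16$-bit prefix, which is exactly the "zoom" described above.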

A more sophisticated scheme involves using data structures such as Bloom filters, or the even better Cuckoo filters, that only return a probabilistic answer to the query "have I seen this value before?". After all, since you are looking at two distributions that yield collision counts differing by almost a factor $2$, a small fraction of errors only minimally increases the number of samples needed to separate them. If you are aiming for a $5\%$ error in your collision count, for example, you can do it with no more than $1$ byte/value. The problem with these solutions is that they are more complex to code, more complex to analyze, and ultimately more computationally onerous (for the current parameters of the system) than the simple "compressed store-and-search" above, so depending on your computing platform you may not manage to keep up with the data stream.

Note that with the current implementation and in the current scenario you have about $0.25$M values in the data structure, and about $3.4M$ subsequent samples (which is a bound that's loose, but not too loose), so you are already relatively close to the optimum, which would be a little less than $1M$ samples in the data structure and a little less than $1M$ subsequent samples - saving you about a factor $2$ in terms of time and samples. It's probably not worth the effort; but it may be if you change the parameters of the problem (see below).

6 CHANGING THE PARAMETERS OF THE PROBLEM

What if you want more confidence, the probability distributions change, the key-space is larger, or you have less memory?

In terms of accuracy, note that the number of samples needed is roughly proportional to the logarithm of the inverse of the "tolerable" error rate. So if you want $0.999999$ confidence you'll need twice the samples needed for $0.999$ confidence. In other words, you can increase your confidence very quickly with relatively few additional samples. This is not a property of the scheme above, but intrinsic to the fact that you are counting sums of independent variables.

What if one changes the probability of "low probability" values, or their fraction? Ultimately, what really counts is the expected number of collisions in the two cases. Remember the Chernoff bound above, which is not tight but reasonably so: your probability of erring on this count by a factor $1+\delta$ or $1-\delta$ is at most $e^{-\frac{\delta^2}{3}\mu}$, where $\mu$ is the expected number of collisions. If your two distributions produce numbers of collisions within a factor $1+\epsilon$ of each other, you'll need to make $\delta\approx \epsilon/2$, and the number of samples you need will grow quadratically with $1/\epsilon$ (i.e. halve $\epsilon$, quadruple the number of samples needed). If the distribution with the higher collision count sees its collision count fall, $\mu$ will fall, and no matter how few collisions the other distribution has, your number of samples will have to increase, in inverse proportion to $\mu$.

This also tells you what happens if your keyspace grows (e.g. from $2^{32}$ possible values to $2^{64}$ possible values): the number of collisions proportionally decreases, and the number of samples needed proportionally increases. E.g. moving up to $2^{64}$ possible values increases the number of needed samples by a huge factor $2^{32}$.

It also tells you what happens if you reduce the available memory - again, you proportionally decrease the number of collisions, and proportionally increase the number of samples: half the memory, half the collisions, twice the samples.


As a rougher, back-of-the-envelope version of the same analysis: in situation $A$ (equiprobable values), we're drawing from $2^{32}$ possible values; to a first approximation, in situation $B$ (not equiprobable), we're drawing from $2^{31}$ possible values.

The only reasonable thing to do is to count collisions.

Option A: Fill memory with samples, then count

Read the first $2^{18}$ values and store them in memory. To a first approximation, they are all different.

Now read $N$ more values, and increment a counter whenever the value collides with one of the values we've stored. The final count $c$ approximately follows a Poisson distribution. In situation $A$, its mean and variance are equal to $2^{-14}N$, and in situation $B$, its mean and variance are equal to $2^{-13}N$. Let us guess $A$ if $c<x$ and $B$ if $c\geq x$ for some decision boundary $x$.

We would like a wrong guess to be equally unlikely in both cases, so we want $x$ to be a bit closer to the expectation of $A$; but since everything is "rough", let's put it halfway between the means, at a distance of $2^{-15}N$ from each. Now if we want that distance to be at least three standard deviations, with a worst-case variance of $2^{-13}N$, then $$2^{-15}N\geq3\sqrt{2^{-13}N}$$ $$2^{-30}N^2\geq9\times2^{-13}N$$ $$N \geq 9\times2^{17}= 1179648$$ so we need about 1.2 million values. Adding that to the original $2^{18}$, we need about 1.5 million values in all. You can get away with fewer if you calculate a better decision boundary, but since we're getting 10 MB/sec, it's not really worth the added effort.

Actually, I doubt that this algorithm can keep up with the data stream; it's probably limited by the time it takes to search the storage for the incoming values.

Option B: Bloom filter

A Bloom filter would be both faster and more space-efficient, allowing for a better balance between the storage phase and the test phase. It would also allow us to combine the two phases, testing for collisions with every addition. You could probably get under half a million values if you're clever about it. You'd want to tune the filter error rate carefully.
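
For what it's worth, here is a minimal, untuned Python sketch of the Bloom filter idea; the bit-array size and hash count below are placeholder choices, not values derived from a target false-positive rate:

    import hashlib

    M_BITS = 2**23      # 2^20 bytes = 1 MB of bits (placeholder size)
    K_HASH = 4          # number of hash functions (placeholder, would need tuning)

    class Bloom:
        def __init__(self):
            self.bits = bytearray(M_BITS // 8)
        def _indices(self, value):
            # Derive K_HASH bit positions from one digest (the usual double-hashing trick).
            d = hashlib.blake2b(value.to_bytes(4, "little"), digest_size=16).digest()
            h1 = int.from_bytes(d[:8], "little")
            h2 = int.from_bytes(d[8:], "little") | 1
            return [(h1 + i * h2) % M_BITS for i in range(K_HASH)]
        def add(self, value):
            for h in self._indices(value):
                self.bits[h >> 3] |= 1 << (h & 7)
        def __contains__(self, value):
            return all(self.bits[h >> 3] & (1 << (h & 7)) for h in self._indices(value))

Combining the two phases as suggested is then a matter of testing each incoming value for membership, counting a collision whenever the filter reports it as already seen, and adding values until the filter holds the chosen number of samples.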