Opposite of Coupon Collector / Birthday Problems?

The following develops an (approximate) MLE for $n$. I suppose that you take $s$ samples and see $d$ distinct elements. Note that

$$P( d \textrm{ distinct values in } s \textrm{ samples}; n) = \binom{n}{d} \frac{1}{n^s} \times\sum_{\substack{s_1, \dots s_d \ge 1\\ \sum s_i = s}} \frac{s!}{\prod s_i!} $$

Now note that the term after $\times$ depends only on $s$ and $d$, so it is the same for every value of $n$. Thus, the MLE of $n$ is $$ \arg\max_{ n \ge d} \log \binom{n}{d} -s \log n$$

Using Stirling's approximation, one can show that $$ \log \binom{n}{d} = nh(d/n) + \frac{1}{2}\log \frac{n}{d(n-d)} + O(1),$$ where $h$ is the binary entropy (in nats).
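(As a quick sanity check on this, and nothing more: a throwaway Python snippet, using only the standard library, that compares the exact $\log\binom{n}{d}$ with the entropy approximation; the gap should settle near the constant $-\tfrac{1}{2}\log 2\pi \approx -0.92$.)

```python
from math import lgamma, log

def log_binom(n, d):
    """Exact log C(n, d) via log-gamma."""
    return lgamma(n + 1) - lgamma(d + 1) - lgamma(n - d + 1)

def entropy_approx(n, d):
    """n*h(d/n) + 0.5*log(n / (d*(n-d))), with h the binary entropy in nats."""
    p = d / n
    h = -p * log(p) - (1 - p) * log(1 - p)
    return n * h + 0.5 * log(n / (d * (n - d)))

for n, d in [(100, 30), (1000, 344), (10**6, 960_000)]:
    # The difference should hover around -0.5*log(2*pi) ≈ -0.92.
    print(n, d, round(log_binom(n, d) - entropy_approx(n, d), 4))
```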

So, dropping the $-\frac{1}{2}\log d$ term (which does not depend on $n$), the MLE is roughly the maximiser of $f(n) := nh(d/n) + \frac{1}{2} \log \frac{n}{n-d} - s\log n $ (assuming $n \gg 1$).

Now, $$ \partial_n f = \log \frac{n}{n-d} - \frac{s}{n} - \frac{d}{2n(n-d)}.$$ Setting this to $0$ and multiplying through by $n$ gives $$ \hat{n}\log \frac{\hat{n}}{\hat{n} - d} - \frac{d}{2(\hat{n}-d)} = s,$$

which is very similar to the estimate presented by Interstellar probe. In particular, for $s = 1000$, $d = 344$, it gives the solution $\hat{n} \approx 367.78$.
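(For concreteness, here is a small numerical sketch of this, my own and assuming SciPy is available, which maximises $\log\binom{n}{d} - s\log n$ directly over integers $n \ge d$ and also finds the root of the stationarity equation; for $s = 1000$, $d = 344$ both land near $\hat n \approx 368$. The bracket passed to `brentq` is chosen to suit this example rather than being general.)

```python
from math import lgamma, log
from scipy.optimize import brentq

def log_lik(n, s, d):
    """The n-dependent part of the log-likelihood: log C(n, d) - s*log(n)."""
    return lgamma(n + 1) - lgamma(d + 1) - lgamma(n - d + 1) - s * log(n)

def stationarity(n, s, d):
    """n*log(n/(n-d)) - d/(2*(n-d)) - s; its root is the approximate MLE."""
    return n * log(n / (n - d)) - d / (2 * (n - d)) - s

s, d = 1000, 344
n_int = max(range(d, 10 * d), key=lambda n: log_lik(n, s, d))  # exact integer maximiser
n_cont = brentq(stationarity, d + 1, 100 * d, args=(s, d))     # continuous approximation
print(n_int, n_cont)  # both should come out near 368 (continuous root ≈ 367.78)
```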

Usually, of course, the point estimate is less interesting than the error bars you can put around it, and how they change with the number of samples $s$. This seems a little non-trivial. I'll see if I can find something.


One can do the error-bar-style analysis I mentioned above for the mean estimator proposed by Michael Lugo. For this:

Let $E_i := \{ \textrm{item } i \textrm{ appears in the sample}\}$, and set $$ f_1 := (1 - 1/n)^s, \qquad f_2 := (1- 2/n)^s. $$

Note that \begin{align} P(E_i ) &= 1 - f_1 \\ P(E_i \cap E_j) &= 1 - 2f_1 + f_2 \end{align} the latter for $i \neq j$ (proof by inclusion-exclusion).

So, as already noted, $$\mathbb{E}[D] = n (1 - f_1).$$

Further \begin{align} \mathbb{E}[D^2] &= n(1 - f_1) + (n^2 - n) (1 - 2f_1 + f_2)\\ &= n(f_1 - f_2) + n^2(f_2 - f_1^2) + n^2(1-f_1)^2\end{align}

which means that $$ \mathrm{var}[D]= n(f_1 - f_2) + n^2(f_2 - f_1^2) \le n f_1.$$ It follows that $D$ concentrates well: with probability greater than $1 - \delta,$ $$ D \in n(1 - f_1) \pm \sqrt{ nf_1/\delta^2}. $$

(The above is Chebyshev's inequality. Note that the events $E_i$ are rather weakly correlated, since $f_2 \approx f_1^2 + O(sn^{-2} e^{-s/n})$ for large $s,n$, so quite possibly an exponential concentration bound can be derived using Dobrushin-style methods. Too much work for now.)
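(In the meantime, a quick Monte Carlo sketch of mine, assuming NumPy, that checks the mean/variance formulas and the Chebyshev interval above for one choice of $n, s$; since the interval is very conservative, the empirical coverage should be well above $1-\delta$.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, trials, delta = 368, 1000, 20_000, 0.05

# D = number of distinct values among s uniform draws from {0, ..., n-1}
D = np.array([np.unique(rng.integers(0, n, size=s)).size for _ in range(trials)])

f1, f2 = (1 - 1 / n) ** s, (1 - 2 / n) ** s
mean_th = n * (1 - f1)
var_th = n * (f1 - f2) + n**2 * (f2 - f1**2)

print(D.mean(), mean_th)  # empirical vs. theoretical mean
print(D.var(), var_th)    # empirical vs. theoretical variance

radius = np.sqrt(n * f1 / delta**2)            # Chebyshev radius from above
print(np.mean(np.abs(D - mean_th) <= radius))  # coverage; should exceed 1 - delta
```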

It remains to invert this, which is messy. I'll fill this in when I get the time. At a high level, I'm hoping for $O(\sqrt{D})$ error bars when $s \gtrsim n$.
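(In that spirit, a rough numerical inversion, just a sketch under the assumption $D < s$ and not the promised analysis: solve $D = n(1-(1-1/n)^s)$ for $n$, then re-invert the two endpoints of the Chebyshev interval to get crude error bars on $\hat n$.)

```python
import numpy as np
from scipy.optimize import brentq

def mean_D(n, s):
    """E[D] = n * (1 - (1 - 1/n)^s), increasing in n."""
    return n * (1 - (1 - 1 / n) ** s)

def invert(d_obs, s, hi=1e9):
    """Solve mean_D(n, s) = d_obs for n; needs d_obs < s for a finite root."""
    return brentq(lambda n: mean_D(n, s) - d_obs, d_obs, hi)

s, d_obs, delta = 1000, 344, 0.05
n_hat = invert(d_obs, s)                                   # point estimate
radius = np.sqrt(n_hat * (1 - 1 / n_hat) ** s / delta**2)  # Chebyshev radius at n_hat
lo = invert(max(d_obs - radius, 1.0), s)
hi = invert(d_obs + radius, s) if d_obs + radius < s else np.inf
print(n_hat, (lo, hi))  # wide interval, reflecting how loose the bound is
```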


Lastly, note that generic support-size estimators come with guarantees of estimating $n$ to within an error of $\varepsilon n$ using roughly $ \frac{n}{\log n} \log^2(1/\varepsilon)$ samples (see https://arxiv.org/abs/1504.01227). Note how amazing this is: you can get a really good estimate of the support size while (asymptotically) drawing far fewer samples than the support size! (Of course, these estimators require knowing more than just $D$.) I don't remember the details anymore, but I believe there are lower bounds, derived using uniform distributions, that impose a generic sample-complexity constraint of $\Omega(n/\log n)$.


Let's say you have $n$ items, and you randomly sample them $k$ times. You wind up with $D$ distinct items. If the sampling behaved exactly at the mean (the expected number of draws needed to collect $D$ distinct items, by the standard coupon-collector argument), you would expect that

$$\sum_{i=0}^{D-1} \dfrac{n}{n-i} = k$$

To simplify,

$$k = n(H_n-H_{n-D}) \approx n\log \dfrac{n}{n-D} \Longrightarrow n \approx \dfrac{kD}{D\cdot W\left( -\dfrac{ke^{-k/D}}{D} \right)+k}$$

where $W$ is the Lambert-W function.

So, for your examples:

$$k=10^6,\quad D=9.6\times 10^5,\quad n\approx 12{,}164{,}395$$

$$k=1000,\quad D=334,\quad n \approx 355.29$$
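(If it helps, the closed form is easy to evaluate with SciPy's `lambertw`; here is a quick sketch of mine reproducing the two figures above, using the principal branch.)

```python
from math import exp
from scipy.special import lambertw

def estimate_n(k, D):
    """n ≈ k*D / (D*W(-(k/D)*exp(-k/D)) + k), inverting k = n*log(n/(n-D))."""
    w = lambertw(-(k / D) * exp(-k / D)).real  # principal branch of Lambert W
    return k * D / (D * w + k)

print(estimate_n(10**6, 960_000))  # ≈ 1.2e7  (first example)
print(estimate_n(1000, 334))       # ≈ 355.3  (second example)
```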


Say I take $n$ samples and there are $N$ days in the year. The probability of observing any given birthday in these $n$ samples is $1-(1-1/N)^n$ and so the expected number of distinct birthdays is $f(N, n) = N (1-(1-1/N)^n)$.

In the particular numerical example you gave, $f(368, 1000) \approx 343.79$ and $f(369, 1000) \approx 344.54$, so if I observe 344 distinct birthdays among 1000 draws I should figure there are about 369 distinct days in the year.

Unfortunately, solving $k = N (1-(1-1/N)^n)$ for $N$ in closed form looks unpleasant (here $k$ is the observed number of distinct birthdays), but it is easy to do numerically.
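For instance, a bracketed root find (a sketch assuming SciPy) does it, since $f(N, n)$ is increasing in $N$:

```python
from scipy.optimize import brentq

def f(N, n):
    """Expected number of distinct birthdays among n draws from N equally likely days."""
    return N * (1 - (1 - 1 / N) ** n)

n, k = 1000, 344                                 # draws, observed distinct birthdays
N_hat = brentq(lambda N: f(N, n) - k, k, 10**9)  # f(k, n) < k and f(N, n) -> n as N -> inf
print(N_hat)                                     # ≈ 368.3, between f(368) ≈ 343.8 and f(369) ≈ 344.5
```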