"reverse birthday problem" - inferring days in the year from collisions in sample

Nice question!

It has much in common with the German tank problem, so you might want to look at that article to get some more ideas.

Let's denote the number of days in the alien year by $d$.

You could certainly estimate $d$ from the number $X$ of students involved in collisions. That's not the best way to do it, but one advantage that it does have is that you can readily calculate the expected number of such students in terms of $d$ and solve for $d$.

The probability that a given student is involved in a collision is $1-\left(1-\frac1d\right)^{n-1}$, so the expected number of students involved in collisions is

$$ E[X]=n\left(1-\left(1-\frac1d\right)^{n-1}\right)\;. $$

Solving for $d$ yields:

$$ d=\frac1{1-\sqrt[n-1]{1-E[X]/n}}\;. $$

As a rather crude estimate, you could plug in the value $X=x$ that you observed as if it were the expected value, yielding the estimate

$$ d=\frac1{1-\sqrt[n-1]{1-x/n}}\;. $$

This doesn't make much sense in extreme cases: If you didn't observe any collisions, the result is infinite, and if all $n$ students were involved in collisions, the result is $d=1$. We'll see later that there's not much you can do in the first case, but the unhelpful result in the second case is due to the suboptimal use of $x$.

The reason that the number of students involved in collisions isn't the best way to use your data is that it's not a sufficient statistic. Ideally, you want to summarize your data in a sufficient statistic, that is, a statistic that contains all the information about $d$ that your data contain. This is not the case for the number of students involved in collisions, since $4$ students all with the same birthday and $4$ students with two pairs of birthdays contain different information about $d$, but both cases contribute $4$ to the number of students involved in collisions.

A sufficient statistic is given by the number $K$ of different birthdays among the students. For instance, while one student having a unique birthday and three students having the same birthday has a different probability than two pairs of students sharing birthdays, the ratio of these probabilities doesn't depend on $d$, so the two cases contain the same information about $d$.

So let's try to estimate $d$ given $K$ (and $n$, which is part of the setup, not of the data).

From a frequentist viewpoint, we'd like to get an unbiased estimator; that is, an estimator such that if you carry out the same experiment many times, the expected value of the estimator is the true value of the parameter $d$. However, such an estimator doesn't exist in this case, as we can only get $n$ different values of $K$, whereas there are infinitely many values of the parameter $d$ to be estimated. (This is a qualitative difference from the German tank problem, which does allow for an unbiased estimator.)

So let's try a Bayesian approach. It seems reasonable to assume a flat prior, i.e., we assign the same a priori likelihood to all possible values of $d$. Up to factors independent of $d$, the probability to observe $K=k$ different birthdays among $n$ students is proportional to

$$ \binom dk\left(\frac kd\right)^n\;, $$

since we can choose $k$ out of $d$ days in $\binom dk$ ways and the $n$ students each have probability $\frac kd$ to have their birthday on one of these days. (The actual probability involves a more complicated calculation to make sure that all $k$ birthdays do in fact occur, but the resulting factors are independent of $d$.)

For $k=n$, this value tends towards $1$ from below for $d\to\infty$, so there's no finite maximum likelihood estimate. This corresponds to the case above where $x=0$ yielded infinite $d$. You need at least one collision to get any information at all about $d$.

So let's look at the case $k=n-1$, with a single collision. Here's a plot for $n=10$, $k=9$. Here we do have a maximum in the likelihood; in the example in the plot it occurs at $d=42$. So if you observe $10$ students and they have $9$ different birthdays, corresponding to one collision involving two of them, the maximum likelihood estimate (assuming a flat prior) would be $d=42$. Our estimate above, based on the calculation of the expected value of the number of students involved in collisions, yields, with $x=2$ and $n=10$:

$$ d=\frac1{1-\sqrt[9]{1-2/10}}\approx40.8\;, $$

in good agreement (which I suspect is slightly coincidental).

Another way to use the likelihood would be to calculate the expectation of $d$. However, this doesn't exist in the current case of a single collision, since in this case the likelihood only decays as $d^{-1}$, so we can't sum it to infinity. Even with two collisions (or one triple collision), i.e. $k=n-2$, though the likelihood decays as $d^{-2}$ and is thus summable, it still isn't summable when we multiply it by $d$ to obtain the expected value, so again the expected value doesn't exist in this case. As in the German tank problem, we need at least three collisions, i.e. $k\le n-3$, to obtain a finite expectation.

So let's see what happens for $n=10$, $k=7$. The maximum likelihood is at $d=12$ (here's a plot). The expected value of $d$ is approximately $30.4$, i.e. much greater, due to the long tail. The estimate above using $x$ comes out different depending on the collisions. If we have $3$ pairs of students sharing birthdays, that makes $x=6$, and the estimate is

$$ d=\frac1{1-\sqrt[9]{1-6/10}}\approx10.3\;; $$

if we have one pair and one triplet, that makes $x=5$, and the estimate is

$$ d=\frac1{1-\sqrt[9]{1-5/10}}\approx13.5\;; $$

and if we have one quadruplet, that makes $x=4$, and the estimate is

$$ d=\frac1{1-\sqrt[9]{1-4/10}}\approx18.1\;, $$

all of which are more in line with the maximum likelihood estimation than with the expected value of $d$.

As usual, we'd expect the differences between the various approaches to become less pronounced as we get more data. So let's see what happens for $n=100$, $k=90$. Here's a plot of the likelihood function, which looks somewhat more like a Gaussian now than it did before. The maximum likelihood is at $d=461$. The expected value of $d$ is about $569.0$. So there's still a considerable difference, but the agreement is considerably better. Using the approach with the number of students involved in collisions, in this case you could get an estimate anywhere from

$$ d=\frac1{1-\sqrt[99]{1-20/100}}\approx444.2 $$

for the case where $10$ pairs of students share a birthday, leading to $d=20$, to

$$ d=\frac1{1-\sqrt[99]{1-11/100}}\approx850.0 $$

for the unlikely case where $11$ students share one birthday, leading to $d=11$.

The most likely case, with eight pairs and one triplet, and thus $d=19$, yields

$$ d=\frac1{1-\sqrt[99]{1-19/100}}\approx470.3\;, $$

which is quite close to the maximum likelihood estimate using the number of different birthdays. So you might want to use the maximum likelihood estimator, which is also somewhat easier to calculate than the expected value of $d$.

In the other extreme case $k\ll n$, both the maximum likelihood estimate and the expected value of $d$ tend to $k$. For instance, for $n=100$, $k=10$, the maximum likelihood estimate is $d=10$, and the expected value of $d$ is about $10.0008$. Note that this is a much more meaningful result than the estimate $d=1$ we got for $x=n$ above. For $k\ll n$, almost all students are involved in collisions, no matter how many days the year has, so you can't conclude anything about $d$ from $x$ in that case, whereas $k$ in that case gives you a very precise estimate of $d$.