Simple example of "Maximum A Posteriori"

Imagine you sent a message $S$ to your friend that is either $1$ or $0$ with probability $p$ and $1-p$, respectively. Unfortunately that message gets corrupted by Gaussian noise $N$ with zero mean and unit variance. Then what your friend would receive is a message $Y$ given by

$$Y = S + N$$

Given that your friend observed the particular value $Y = y$, he wants to know which value of $S$ you most probably sent. In other words, he wants the value $s$ that maximizes the posterior probability

$$P(S = s \mid Y = y)$$

That last sentence can be written as

$$\hat{s} = \arg\max_s P(S = s \mid Y = y)$$

The task, then, is to compute $P(S = s \mid Y = y)$ for $S=1$ and for $S=0$, and then to pick the value of $S$ for which that probability is greater. We call that value $\hat{s}$.

It is sometimes easier to model the uncertainty about a consequence given a cause than the other way around, namely the distribution of $Y$ given $S$, $f_{Y \mid S}(y \mid s)$, rather than $P(S = s \mid Y = y)$. So let's first work out the former and worry about the latter afterwards.

Given that $S=0$, $Y$ becomes equal to the noise $N$, and therefore

$$f_{Y \mid S}(y \mid 0) = \frac{1}{\sqrt{2\pi}}e^{-y^2/2}\tag{1}$$

Given that $S=1$, $Y$ becomes $Y = N + 1$, which is just $N$ "displaced" by $1$ unit; therefore it is also a Gaussian random variable with unit variance, but now with mean equal to $1$, thus

$$f_{Y \mid S}(y \mid 1) = \frac{1}{\sqrt{2\pi}}e^{-(y-1)^2/2}\tag{2}$$
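
As a quick numerical sanity check, here is a minimal Python sketch that evaluates the two conditional densities $(1)$ and $(2)$ at an observed value $y$ (the function and variable names are my own, chosen just for illustration):

```python
import math

def gaussian_pdf(y, mean=0.0):
    """Unit-variance Gaussian density with the given mean, evaluated at y."""
    return math.exp(-(y - mean) ** 2 / 2) / math.sqrt(2 * math.pi)

y = 0.8  # hypothetical observed value

f_y_given_0 = gaussian_pdf(y, mean=0.0)  # equation (1): S = 0, so Y = N
f_y_given_1 = gaussian_pdf(y, mean=1.0)  # equation (2): S = 1, so Y = N + 1

print(f_y_given_0, f_y_given_1)  # with p = 1/2, the larger density already identifies the likelier S
```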

How do we now compute $P(S = s \mid Y = y)$? By using Bayes' rule, we have

\begin{align} P(S = 0 \mid Y = y) &= \frac{f_{Y\mid S}(y \mid 0)P(S = 0)}{f_Y(y)}\\ P(S = 1 \mid Y = y) &= \frac{f_{Y\mid S}(y \mid 1)P(S = 1)}{f_Y(y)} \end{align}

We would get $\hat{s}=1$ if

$$P(S = 1 \mid Y = y) \gt P(S = 0 \mid Y = y)$$

or equivalently (after cancelling the common denominator $f_Y(y)$) if

$$f_{Y\mid S}(y \mid 1)p \gt f_{Y\mid S}(y \mid 0)(1-p)\tag{3}$$

This last expression wouldn't help your friend very much; what he really needs is a criterion based on the value of $Y$ he observed and the known statistics. Getting there may make this example look a bit less simple, but let's give it a chance.

Replacing $(1)$ and $(2)$ in $(3)$ and taking the natural logarithm on both sides, we get

$$-\frac{(y-1)^2}{2}+\text{log}(p) \gt -\frac{y^2}{2}+\text{log}(1-p)$$

which, after expanding $(y-1)^2$ and cancelling the $-y^2/2$ terms on both sides, simplifies to

$$y \gt \frac{1}{2} + \text{log}\left( \frac{1-p}{p} \right)\tag{4}$$

Now this is more helpful. Your friend just has to check if the observed value of $y$ satisfies that inequality to decide if $S=1$ was sent or not. In other words, if the observed value $y$ satisfies $(4)$, then the value that maximizes the posterior probability $P(S = s \mid Y = y)$ is $S=1$, and therefore $\hat{s} = 1$.
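
Here is a minimal Python sketch of the resulting detector, taking $p = 0.7$ as a hypothetical value just for illustration (the function names are mine). It also checks the threshold rule $(4)$ against the direct comparison $(3)$:

```python
import math

def map_decision_threshold(y, p):
    """MAP estimate of S from observation y using inequality (4), with P(S=1) = p."""
    threshold = 0.5 + math.log((1 - p) / p)
    return 1 if y > threshold else 0

def map_decision_direct(y, p):
    """Same decision via (3): compare likelihood times prior for each hypothesis."""
    score_1 = math.exp(-(y - 1) ** 2 / 2) * p   # proportional to P(S=1 | Y=y)
    score_0 = math.exp(-y ** 2 / 2) * (1 - p)   # proportional to P(S=0 | Y=y)
    return 1 if score_1 > score_0 else 0

p = 0.7
for y in (-1.0, -0.2, 0.3, 1.2):
    assert map_decision_threshold(y, p) == map_decision_direct(y, p)
    print(y, map_decision_threshold(y, p))
```

Both functions agree, as they must, since $(4)$ is just $(3)$ after taking logarithms and rearranging.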


Side note:
The result given by $(4)$ is quite intuitive. If $0$ and $1$ are equiprobable, i.e. $p=1/2$, we would choose $S=1$ if $y > 1/2$. That is, we put the threshold right in the middle of $0$ and $1$. If $1$ is more probable ($p \gt 1/2$), then the threshold moves closer to $0$, thus favoring $S=1$, which makes sense because it is the more probable one.
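
For a concrete number, take $p = 0.7$ as an example: the threshold becomes

$$\frac{1}{2} + \text{log}\left(\frac{0.3}{0.7}\right) \approx 0.5 - 0.85 = -0.35$$

so even mildly negative observations are decided in favor of $S = 1$.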


Consider flipping a fair coin. The outcome of a flip is described by the random variable $C$ that can take on the values heads ($h$) and tails ($t$). The probability of heads is $P(C=h) = 0.5$ and the probability of tails is $P(C=t) = 1 - P(C=h)$. Consider also flipping a biased coin whose probability of turning up heads is $P(C=h) = 0.7$ and tails $P(C=t) = 1 - P(C=h) = 0.3$. We can easily solve problems such as calculating the probability of getting three tails in a row.
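
For instance, with the biased coin the probability of three tails in a row is just the product of three independent tail probabilities:

$$P(C=t)^3 = 0.3^3 = 0.027$$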

We can make the model more general by introducing a parameter k to describe the probability of getting heads. We write that as $P(C=h)=k$ and $P(C=t)=1-k$. The model becomes more complicated, but we can now describe coins with any bias!

We can now ask questions like "Given a coin with bias $k$, what is the probability of observing a particular sequence of 2 heads and 3 tails?" The answer is:

$$ P(C=h)P(C=h)P(C=t)P(C=t)P(C=t) $$ which simplifies to $$ P(C=h)^2P(C=t)^3 = k^2(1-k)^3 $$

Using this we can calculate the maximum likelihood estimate (MLE) of the parameter $k$ given the flips we have observed. How do we do that? Remember calculus and the method for finding stationary points? Yes! We differentiate the expression and set the derivative equal to 0:

$$ D\, k^2(1-k)^3 = 2k(1-k)^3 - 3k^2(1-k)^2 = 0 $$

Solving for $k$ yields $k = 0.4$. So according to the MLE the probability of getting heads is 40%.
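
If you want to double-check the algebra numerically, here is a minimal Python sketch (the names are mine) that maximizes the likelihood $k^2(1-k)^3$ by brute force over a grid:

```python
# Brute-force check of the MLE: maximize k^2 (1 - k)^3 over a fine grid in (0, 1).
ks = [i / 10000 for i in range(1, 10000)]

def likelihood(k):
    return k ** 2 * (1 - k) ** 3

k_mle = max(ks, key=likelihood)
print(k_mle)  # 0.4, matching the analytic result
```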

That's it for the MLE. Now let's tackle the maximum a posteriori estimate (MAP). To compute it we need to think about the parameter $k$. The idea behind MAP is that we have some rough idea about how biased our coin is. Then we flip it a few times and update our rough idea.

Our "rough idea" is called the prior, the coin flips the observations, and our "rough idea," after considering the obvervations, the posterior.

The big epiphany that brings us from MLE to MAP is that the parameter $k$ should be thought of as a random variable! At first it seems strange to think of a probability as a random variable, but it makes perfect sense after a while. A priori, we don't know how biased the coin is but our hunch could be that it is biased in favor of heads.

We therefore introduce the random variable $K$ and say that its values are drawn from the Beta distribution: $P(K=k) = \mathrm{Beta_K}[a,b]$. This is our prior. We won't explain how the Beta distribution works because it is beyond the scope of this answer. It suffices to know that it is perfect for modeling the bias of coins. For the distribution's parameters we choose $a=6$ and $b=2$, which corresponds to a coin that is heavily biased in favor of heads. So $P(K=k) = \mathrm{Beta_K}[6,2]$.

To get the posterior from the prior, we simply multiply it by the likelihood of the observations (up to a normalizing constant that does not depend on $k$):

$$ P(K=k|C=\{3t, 2h\}) \propto P(C=h,C=h,C=t,C=t,C=t|K=k)P(K=k) $$ We simplify the right-hand side the same way as we did for the MLE expression $$ k^2(1-k)^3\mathrm{Beta_K}[6,2] $$

Wikipedia tells us how to expand out the Beta distribution: $$ k^2(1-k)^3\frac{1}{B(6,2)}k^{6-1}(1-k)^{2-1} = \frac{1}{B(6,2)}k^7(1-k)^4 $$

Notice how similar the posterior is to the prior... Perhaps it can be turned into a Beta distribution itself?! But to get the MAP we don't need that. All that is left is differentiating the above expression and setting the derivative to 0, exactly as we did when computing the MLE: $$ D\,k^7(1-k)^4 = 7k^6(1-k)^4 - 4k^7(1-k)^3 = 0 $$ The $\frac{1}{B(6,2)}$ factor is omitted because it is non-zero and constant, so it won't affect the maximum.

Solving for $k$ yields $k = 7/11$. So according to the MAP the probability of getting heads is about 64%.
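
The same brute-force check (again, the names are mine) works for the MAP, this time maximizing the unnormalized posterior $k^7(1-k)^4$:

```python
# Brute-force check of the MAP: maximize k^7 (1 - k)^4 over a fine grid in (0, 1).
ks = [i / 100000 for i in range(1, 100000)]

def posterior(k):
    return k ** 7 * (1 - k) ** 4

k_map = max(ks, key=posterior)
print(k_map)  # about 0.63636, i.e. 7/11
```

Incidentally, this also answers the teaser above: $k^7(1-k)^4$ has the shape of a $\mathrm{Beta_K}[8,5]$ density, and the mode of a Beta distribution with parameters $a=8$ and $b=5$ is $(a-1)/(a+b-2) = 7/11$.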

TL;DR I think two things confused you. First, the argmax syntax. My method of differentiating to find the maximizing value of the parameter analytically works in this example, but often it doesn't. Then you have to use other methods to find or approximate it.

Second, not only events but the parameters themselves can be thought of as random variables drawn from suitable distributions. You are uncertain of the outcome, whether a coin will land heads or tails, but you are also uncertain of the coin's fairness. That is a "higher level of uncertainty" which is hard to grasp in the beginning.