Intuitive explanation of the tower property of conditional expectation

First, recall that in $E[X|Y]$ we are taking the expectation with respect to $X$, and so it can be written as $E[X|Y]=E_X[X|Y]=g(Y)$. Because it's a function of $Y$, it's a random variable, and hence we can take its expectation (with respect to $Y$ now). So the double expectation should be read as $E_Y[E_X[X|Y]]$.
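
To see why this reading gives back $E[X]$, here is a short sketch for the discrete case only. It assumes that $X$ and $Y$ take countably many values and that $E[X]$ is finite; the notation $g(y)=E[X|Y=y]$ follows the paragraph above.

$$
E_Y\big[E_X[X|Y]\big] = \sum_y g(y)\,P(Y=y) = \sum_y \sum_x x\,P(X=x \mid Y=y)\,P(Y=y) = \sum_x x \sum_y P(X=x, Y=y) = \sum_x x\,P(X=x) = E[X].
$$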

As for the intuitive meaning, there are several approaches. I like to think of the expectation as a kind of predictor/guess (indeed, it's the predictor that minimizes the mean squared error).

Suppose, for example, that $X, Y$ are two (positively) correlated variables, say the weight and height of persons from a given population. The expectation of the weight $E(X)$ would be my best guess of the weight of an unknown person: I'd bet on this value if not given more data (my uninformed bet is constant). If instead I know the height, I'd bet on $E(X | Y)$: that means that for different persons I'd bet a different value, and my informed bet would not be constant: sometimes I'd bet more than the "uninformed bet" $E(X)$ (for tall persons), sometimes less. The natural question arises: can I say something about my informed bet on average? Well, the tower property answers: on average, you'll bet the same.


For simple discrete situations, from which one obtains most basic intuitions, the meaning is clear.

I have a large bag of biased coins, all of which favour heads. Half of them come up heads with probability $0.7$. Two-fifths of them come up heads with probability $0.8$. And the rest (one-tenth) come up heads with probability $0.9$.

Pick a coin at random and toss it, say, once. To find the expected number of heads, calculate the expectation given each of the biasing possibilities. Then average the answers, taking into consideration the proportions of the various types of coin.
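
Written out for a single toss, the procedure looks like this. The symbols are introduced here only for illustration: $X$ is the number of heads and $Y$ is the type of coin drawn, and the remaining coins make up $1 - \tfrac12 - \tfrac25 = \tfrac1{10}$ of the bag.

$$
E[X] = E\big[E[X|Y]\big] = 0.5 \cdot 0.7 + 0.4 \cdot 0.8 + 0.1 \cdot 0.9 = 0.35 + 0.32 + 0.09 = 0.76.
$$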

It is intuitively clear that this formal procedure "should" give about the same answer as the highly informal process of, say, repeating the experiment $1000$ times and dividing the total number of heads by $1000$. For if we do that, in about $500$ cases we will get the first type of coin, and out of these $500$ we will get about $350$ heads, and so on. The informal arithmetic mirrors exactly the more formal process described in the preceding paragraph.

If it is more persuasive, we can imagine tossing the chosen coin $12$ times.
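
For readers who would rather check this numerically, here is a minimal Python sketch of both routes: the weighted average of the conditional expectations, and the "repeat the experiment many times" average. The names `COIN_TYPES`, `exact_expected_heads` and `simulated_mean_heads` are made up for this illustration, and the one-tenth share for the third type of coin is just what is left over from the bag.

```python
import random

# Coin types from the example: (share of the bag, probability of heads).
# The 0.1 share is the remainder after one half and two fifths.
COIN_TYPES = [(0.5, 0.7), (0.4, 0.8), (0.1, 0.9)]


def exact_expected_heads(coin_types, tosses=1):
    """Weighted average of the conditional expectations E[heads | coin type]."""
    return sum(share * tosses * p_heads for share, p_heads in coin_types)


def simulated_mean_heads(coin_types, tosses=1, trials=100_000, seed=0):
    """Repeat 'pick a coin, toss it `tosses` times' and average the head counts."""
    rng = random.Random(seed)
    shares = [share for share, _ in coin_types]
    biases = [p_heads for _, p_heads in coin_types]
    total_heads = 0
    for _ in range(trials):
        # Pick a coin type according to its share of the bag.
        p_heads = rng.choices(biases, weights=shares)[0]
        # Toss the chosen coin and count the heads.
        total_heads += sum(rng.random() < p_heads for _ in range(tosses))
    return total_heads / trials


if __name__ == "__main__":
    for n in (1, 12):
        print(n, exact_expected_heads(COIN_TYPES, n), simulated_mean_heads(COIN_TYPES, n))
```

With one toss the exact value is $0.76$, and with $12$ tosses it is $12 \cdot 0.76 = 9.12$; the simulated averages should land close to these, though the exact digits depend on the seed and the number of trials.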


The expected value of $X$ is still the expected value of $X$ when you take into account the possible values of $Y$.