Why do lower probability messages contain more information?
You are getting there with the idea that if all cases of $m_2$ imply $m_1$ (but there are other ways for $m_1$ to be true), then $m_2$ transmits more information than $m_1$, but you don't need inclusion for that. Information received measures how much better you know the state of the world after the message. We define it as the log of the factor by which the number of possible states is reduced, because the log gives us the additivity we want.

Imagine that four coins are tossed. Before you are told anything about the result, there are 16 possible states of the world. If I tell you they all came up heads (a surprising message), the number of states has been reduced by a factor of 16. If I tell you they didn't all come up heads (not a surprising message), the number of states has been reduced by a factor of 16/15. So in the first case, there are many fewer possible states of the world; in the second, there are almost as many possible states after the message as before.
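To put numbers on it, using the coin example above, the two messages are worth

$$\log_2\frac{16}{1} = 4 \text{ bits} \qquad\text{versus}\qquad \log_2\frac{16}{15} \approx 0.093 \text{ bits},$$

and each of these is just $-\log_2$ of the message's probability ($1/16$ and $15/16$ respectively), which is exactly why the less probable message carries more information.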
Suppose you are to build a communication system. Every second, a randomly chosen message is delivered to you at point A, and you are to make sure that with high probability it will be delivered at point B within a finite time. You know in advance which messages are possible and what their distribution is. (Successive messages are assumed to be independent).
However, bandwidth (the number of bits you can send per second) is expensive, and you want to lease a channel with the minimal capacity you need to deliver messages without building up an ever-increasing backlog (again, with high probability).
If there are $n$ possible messages, you could meet your goal by buying enough bandwidth to send $\log_2(n)$ bits per second. But that could be wasteful -- say that 99% of the time the message is "no comments". Then you can encode that message as a single 0 bit, and send everything else as a 1 bit followed by a message number. That way you only have to buy bandwidth for about $1+\log_2(n-1)/100$ bits per second. This leaves you enough room to send a 0 each time nothing interesting happens. Once in a while, when something interesting does happen, you send a 1 plus additional bits, which takes about $\log_2(n)$ extra seconds and builds up a small backlog of messages that are all likely to be single 0's. But since you can send those 0's slightly faster than one per second, on average you can expect the backlog to be cleared by the time the next interesting thing happens.
(There are safety-margin refinements from queueing theory here that I won't go into).
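For completeness, here is where the $1+\log_2(n-1)/100$ figure comes from (a rough estimate that ignores rounding code lengths up to whole bits): the cheap message costs 1 bit and is sent 99% of the time, the expensive ones cost about $1+\log_2(n-1)$ bits and are sent 1% of the time, so

$$\mathbb{E}[\text{bits per message}] \approx 0.99\cdot 1 + 0.01\bigl(1+\log_2(n-1)\bigr) = 1 + \frac{\log_2(n-1)}{100}.$$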
The moral of this example is that if you are designing a coding system and are interested in minimizing the expected number of bits needed to send a message, you can "afford" to spend more bits on a rare message because you don't have to do it so often.
And it turns out that, in the limit as $N\to\infty$, the lowest possible expected number of bits needed to send $N$ independent identically distributed messages (where the minimum is taken over all possible coding strategies) approaches $N$ times the Shannon entropy of the message distribution -- this is Shannon's source coding theorem.
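As a rough numerical illustration of that claim (a sketch, not part of the original setup: I'm assuming $n = 1024$ possible messages, with the leftover 1% of probability spread uniformly over the $n-1$ rare ones):

```python
import math

# Assumed example distribution (for illustration only): "no comments"
# with probability 0.99, and the remaining 1% spread uniformly over n-1 messages.
n = 1024
p = [0.99] + [0.01 / (n - 1)] * (n - 1)

# Shannon entropy of the distribution: the asymptotic lower bound on
# expected bits per message, over all coding strategies.
entropy = -sum(q * math.log2(q) for q in p)

fixed_length = math.log2(n)                          # naive log2(n)-bit code
two_part = 0.99 * 1 + 0.01 * (1 + math.log2(n - 1))  # the 0 / "1 + message number" scheme above

print(f"fixed-length code: {fixed_length:.2f} bits/message")
print(f"two-part code:     {two_part:.2f} bits/message")
print(f"entropy bound:     {entropy:.2f} bits/message")
```

The two-part code already beats the naive $\log_2(n) = 10$ bits per message by a wide margin, and the entropy (about 0.18 bits here) says that by coding long blocks of messages together you could, in the limit, do better still.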
You are in the Sahara desert. Then you meet someone who says:
- Today it did not rain!
How much information did you get? A little? A lot? Why?