Intuitive explanation of entropy
Here's one mildly informal answer.
How surprising is an event? Informally, the lower the probability you would have assigned to an event, the more surprising it is, so surprise seems to be some kind of decreasing function of probability. It's reasonable to ask that it be continuous in the probability. And if event $A$ has a certain amount of surprise, event $B$ has a certain amount of surprise, and you observe them together and they're independent, it's reasonable that the amounts of surprise add.
From here it follows that the surprise you feel at event $A$ happening must be a positive constant multiple of $- \log \mathbb{P}(A)$ (exercise; this is related to the Cauchy functional equation). Taking surprise to just be $- \log \mathbb{P}(A)$, it follows that the entropy of a random variable is its expected surprise, or in other words it measures how surprised you expect to be on average after sampling it.
Closely related is Shannon's source coding theorem, if you think of $- \log \mathbb{P}(A)$ as a measure of how many bits you need to tell someone that $A$ happened.
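To make the "expected surprise" reading concrete, here is a minimal Python sketch (the three-outcome weather distribution is invented purely for illustration): it computes the surprise $-\log_2 \mathbb{P}(A)$ of each outcome and then averages it under the distribution.

```python
import math

# Minimal sketch: entropy as expected surprise.
# The three-outcome distribution below is invented purely for illustration.
p = {"sunny": 0.7, "cloudy": 0.2, "rain": 0.1}

def surprise(prob):
    """Surprise (in bits) of observing an event of probability `prob`."""
    return -math.log2(prob)

# Entropy = expected surprise = sum_x P(x) * (-log2 P(x))
entropy = sum(prob * surprise(prob) for prob in p.values())

for outcome, prob in p.items():
    print(f"{outcome}: P = {prob}, surprise = {surprise(prob):.3f} bits")
print(f"entropy (expected surprise) = {entropy:.3f} bits")
```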
We want to define a measure of the amount of information a discrete random variable produces. Our basic setup consists of an information source and a recipient. We can think of our recipient as being in some state. When the information source sends a message, the arrival of the message causes the recipient to go to a different state. This "change" is exactly what we want to measure.
Suppose we have a set of $n$ events with respectively the following probabilities
$$p_1, p_2, \dots, p_n.$$
We want a measure of how much choice is involved in the outcome, or equivalently, of how uncertain we are.
Intuitively, it should satisfy the following four conditions.
Let $H$ be our "measure".
1. $H$ is continuous in each $p_i$.
2. If some $p_i = 1$ (so all the others are $0$), then $H$ is minimal with a value of $0$: no uncertainty.
3. If $p_1 = p_2 = \dots = p_n$, i.e. $p_i = \frac{1}{n}$, then $H$ is maximal. In other words, when every outcome is equally likely, the uncertainty is greatest, and hence so is the entropy.
4. If a choice is broken down into two successive choices, the value of the original $H$ should be the weighted sum of the values of the two new ones.
An example of condition 4 is that $$H\left(\frac{1}{2}, \frac{1}{3}, \frac{1}{6} \right) = H\left(\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} H\left(1 \right) + \frac{1}{2} H\left(\frac{2}{3}, \frac{1}{3} \right)$$
Here we decide to take either a) the first element or b) one of the other two elements. In case a) there is no further decision, but in case b) we still have to decide which of the two remaining elements to take.
The only $H$ satisfying the conditions above is:
$$H = -K\sum^n_{i=1} p_i \log(p_i)$$
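As a quick numerical sanity check of the example from condition 4, here is a small Python sketch (assuming $K = 1$ and base-2 logarithms, which is just one convenient normalization):

```python
import math

def H(*probs):
    """Entropy with K = 1 and base-2 logarithm (terms with p = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

lhs = H(1/2, 1/3, 1/6)
rhs = H(1/2, 1/2) + (1/2) * H(1.0) + (1/2) * H(2/3, 1/3)
print(lhs, rhs)                 # both are ~1.459 bits
assert abs(lhs - rhs) < 1e-9
```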
To see that this definition gives what we intuitively would expect from a "measure" of information, we state the following properties of $H$.
- $H = 0 \iff p_i = 1$ for some $i$ and $p_j = 0$ for all $j \neq i$.
- For every $n \in \mathbb{N}$, $H$ is maximized when $p_1 = \cdots = p_n = \frac{1}{n}$.
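To illustrate these two properties numerically, here is a tiny sketch with some made-up distributions on three outcomes (the numbers are only for illustration):

```python
import math

def H(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(H([1/3, 1/3, 1/3]))    # ~1.585 bits: the maximum, log2(3)
print(H([0.5, 0.3, 0.2]))    # ~1.485 bits
print(H([0.9, 0.05, 0.05]))  # ~0.569 bits
print(H([1.0, 0.0, 0.0]))    # 0 bits: one outcome is certain, no uncertainty
```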
Suppose $x$ and $y$ are two events with $n$ and $m$ possible outcomes respectively, and let $p(i, j)$ be the probability that $x$ and $y$ jointly take the values $i$ and $j$ (i.e. occur at the same time).
- The joint entropy is $H(x, y) = -\sum_{i, j} p(i, j) \log(p(i, j))$.
- $H(x, y) \leq H(x) + H(y)$, with equality only if the occurrences are independent.
- The conditional entropy of $y$ given $x$ is $H_x(y) = -\sum_{i, j} p(i, j) \log(p_i(j)) = H(x, y) - H(x)$, where $p_i(j)$ is the conditional probability of $j$ given $i$. This is the entropy of $y$ when $x$ is known.
- $H(y) \geq H_x(y)$: the entropy of $y$ is never increased by knowing $x$.
- Any change towards equalization of the probabilities increases $H$: greater uncertainty $\Rightarrow$ greater entropy.
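These joint and conditional entropy facts are easy to verify numerically. Here is a minimal Python sketch using an invented $2 \times 2$ joint distribution (the numbers are illustrative only):

```python
import math

def H(probs):
    """Entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented 2x2 joint distribution p(i, j) of x (rows) and y (columns); entries sum to 1.
p = [[0.30, 0.10],
     [0.20, 0.40]]

px = [sum(row) for row in p]                              # marginal distribution of x
py = [sum(p[i][j] for i in range(2)) for j in range(2)]   # marginal distribution of y

H_xy = H([p[i][j] for i in range(2) for j in range(2)])   # joint entropy H(x, y)
H_x, H_y = H(px), H(py)

# Conditional entropy H_x(y) = -sum_{i,j} p(i,j) log2 p_i(j), with p_i(j) = p(i,j) / p(i)
H_y_given_x = -sum(p[i][j] * math.log2(p[i][j] / px[i])
                   for i in range(2) for j in range(2) if p[i][j] > 0)

print(H_xy, H_x + H_y)          # H(x, y) <= H(x) + H(y)
print(H_y_given_x, H_xy - H_x)  # H_x(y) = H(x, y) - H(x)
print(H_y, H_y_given_x)         # H(y) >= H_x(y)
```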
Let me give you an intuitive (rather than mathematically rigorous) interpretation of entropy, denoted $H(X)$.
Let me state the interpretation first and then justify it.
Entropy can be viewed as the cost of encoding a specific distribution $X$.
Since I will describe it in terms of encoding messages, let me change the notation to make the description more intuitive. We want to transmit some message $(M=m)$ across some channel $C$. Intuitively, the cost of sending a message across a channel is the length of its encoding: the longer the message, the more it costs us to send, since we have to transmit more bits of information. The frequency (and the probability) of each message is dictated by the language $\mathcal{L}$ the message comes from. For example, the language could be $\mathcal{L} = \text{English}$, where the word "the" is relatively common (i.e. high frequency and high probability), and thus we should choose wisely how to encode it, since we will have to send it very often (or, in the case of English, write it pretty often!). So we want an efficient encoding for "the". By efficient, we mean choosing an encoding that uses a small amount of "stuff" (information, bits, etc.) to send through the channel. Since the messages we have to send are somewhat random, it seems reasonable to aim to send the fewest bits that we can, at least on average. i.e. intuitively, we want to minimize:
$$ E[ |M|] = \sum_m Pr[M=m]|m|$$
where $|m|$ denotes the length of the encoding of message m.
For example, we might want to encode things this way: for common (high-probability) messages, let's use fewer bits of information, since we have to send them very frequently. So we can encode them based on the relative frequencies dictated by the distribution for $\mathcal{L}$. With a little more thought, and making sure that the messages can be decoded unambiguously, you can come up with Huffman coding or some similar scheme. The main idea, in my opinion, is to encode frequent words with short code lengths and infrequent ones with longer code lengths.
It turns out that Shannon proved that the notion of entropy provides a precise lower bound on the expected number of bits required to encode instances/messages sampled from $P(M)$. That is, if we consider any proper (uniquely decodable) codebook for values of $M \in \mathcal{L}$, then the expected code length, relative to the distribution $P(M)$, cannot be less than the entropy $H(M)$:
$$H(M) \leq E[|M|]$$
Since there exist schemes that make this inequality essentially tight (Huffman coding, for instance, gets within one bit of it on average), we can expect to encode the messages $M$ as efficiently as possible, on average.
Thus, returning to the interpretation I suggested: since the cost of encoding something can be thought of as the number of bits we need to send through a channel, and the optimum value (the entropy) can essentially be achieved, entropy becomes the expected cost of encoding a distribution of messages.
(Or, if you want to view it from the inequality's perspective, it's the best/minimum expected cost you can have for encoding any known distribution $P(M)$.)
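To make the coding picture concrete, here is a small Python sketch (the toy "language" and the `huffman_code` helper are my own illustration, not part of the answer above): it builds a Huffman code for a handful of words and compares the resulting expected code length $E[|M|]$ with the entropy $H(M)$.

```python
import heapq
import math

# Toy message distribution: the words and probabilities are invented for illustration.
probs = {"the": 0.40, "of": 0.25, "and": 0.20, "entropy": 0.10, "xylophone": 0.05}

def huffman_code(probs):
    """Return a dict message -> binary codeword built with Huffman's algorithm."""
    # Each heap entry is (total probability, tie-breaker, partial codebook).
    heap = [(p, i, {m: ""}) for i, (m, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, code2 = heapq.heappop(heap)
        merged = {m: "0" + c for m, c in code1.items()}
        merged.update({m: "1" + c for m, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman_code(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())          # H(M)
expected_len = sum(p * len(code[m]) for m, p in probs.items())    # E[|M|]

for m, p in probs.items():
    print(f"{m!r}: P = {p}, codeword = {code[m]}")
print(f"H(M)   = {entropy:.3f} bits")
print(f"E[|M|] = {expected_len:.3f} bits  (never less than H(M))")
```

For this toy distribution the expected Huffman code length comes out to 2.1 bits per message against an entropy of roughly 2.04 bits, so the lower bound is respected and very nearly attained.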